My wc program

Discussion in 'Mac Programming' started by Bill McEnaney, Jan 31, 2011.

  1. Bill McEnaney macrumors 6502

    Joined:
    Apr 29, 2010
    #1
    Everybody,

    Here's my simplistic version of a wc command, a version that counts a text file's last line of text, even when there's no '\n' on that line. I just can't figure out how to revise the program so it won't need "in_word = !isspace(line[i])". Any suggestions? Thanks.


    Code:
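    #include <ctype.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX 1024  /* assumed buffer size; the original post does not show it */

    int
    main(void)
    {
      FILE *input;
      int i = 0, characters = 0, lines = 0, words = 0;
      char line[MAX + 1];
      bool in_word = false;  /* initialized so an empty file is handled */

      if (!(input = fopen("words.txt", "r")))
        perror("mywc");
      else
      {
        while (fgets(line, MAX, input) != NULL)
        {
          ++lines;  /* counts the last line even without a trailing '\n' */
          for (i = 0; line[i] != '\0'; ++i)
          {
            in_word = !isspace((unsigned char)line[i]);
            /* a word ends where a non-space is followed by whitespace */
            if (in_word && isspace((unsigned char)line[i + 1]))
              ++words;
          }
          characters += i;
        }
        if (in_word)  /* the last word ran to end of file with no whitespace after it */
          ++words;
        fclose(input);
      }
      printf("%d characters\n%d words\n%d lines\n", characters, words, lines);
      return 0;
    }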
     
  2. lee1210 macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #2
    My suggestion is to count the times you move from not being in a word to being in a word. So when isspace(line[i]) is true, and you're not already in whitespace, you set inWhitespace to true. When isspace(line[i]) is false, and inWhitespace is true, set inWhitespace to false and increment the word count. That way you will have to carry some state around, but you won't have to look ahead, and the count will be right for the last word of a line. At the start of the line I'd set inWhitespace to true, so if the first character is not a space you increment your word count, and if it is whitespace it's not a problem.

    -Lee
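
    The approach Lee describes can be sketched roughly as below. This is only an illustration, not code from the thread: the function name count_words is hypothetical, and it works on a single NUL-terminated buffer rather than on a file read with fgets. The state flag starts out as "in whitespace", so a word is counted each time a non-space character follows whitespace or the start of the text.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: count words in a NUL-terminated buffer by counting
   transitions from whitespace to non-whitespace. It never looks at s[i + 1]. */
size_t count_words(const char *s)
{
    size_t words = 0;
    bool in_whitespace = true;  /* treat the start of the text as whitespace */

    for (size_t i = 0; s[i] != '\0'; ++i) {
        if (isspace((unsigned char)s[i])) {
            in_whitespace = true;
        } else if (in_whitespace) {
            in_whitespace = false;  /* a new word starts here */
            ++words;
        }
    }
    return words;
}
```

    Because the count goes up at the start of a word rather than at its end, the last word is counted even when nothing follows it, so no fix-up after the loop should be needed.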
     
  3. Bill McEnaney, Jan 31, 2011
    Last edited: Jan 31, 2011

    Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #3

    Good idea, Lee. Now please excuse me while I wait for my face to turn white again. :) :eek:
     
  4. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #4
    The int type could be a problem if files are big.
     
  5. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #5
    I remember one exercise a long time ago where we were supposed to consider anything non-alphanumeric to be whitespace, so that any misplaced punctuation, etc., would not be considered an extra word.

    As it stands, both wc and this code will treat

    Code:
    $(*%$W*(% $%^#*&% @$#@(*$@
    (*#% &*%$#
    Supercalifragilisticexpialidocious 3424 !
    
    as having 8 "words", when some of them are gibberish or cartoon swear words and others are errant punctuation marks.

    It's not wc, but, like Bill's intent with counting the last line, it may be more accurate for some kinds of documents.

    B
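
    A minimal sketch of that exercise's rule, assuming a word is any maximal run of alphanumeric characters and everything else acts as a separator. The name count_alnum_words is made up for illustration.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: count runs of alphanumeric characters.
   Every other character, punctuation included, separates words. */
size_t count_alnum_words(const char *s)
{
    size_t words = 0;
    bool in_word = false;

    for (; *s != '\0'; ++s) {
        if (isalnum((unsigned char)*s)) {
            if (!in_word)
                ++words;  /* first alphanumeric character of a new run */
            in_word = true;
        } else {
            in_word = false;
        }
    }
    return words;
}
```

    On the three sample lines above, this rule would report 3 words (the lone W, Supercalifragilisticexpialidocious, and 3424) instead of 8.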
     
  6. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #6
    Then I'll declare the integer variables as long integer variables.
     
  7. lee1210 macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #7
    May as well make them unsigned, unless you can have nega-text (it's like dark matter).

    -Lee
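
    One way the wider, unsigned counters might look, as a sketch only. The struct counts and the function format_counts are hypothetical names. unsigned long long is guaranteed to be at least 64 bits in C99, and its printf conversion is %llu rather than %d.

```c
#include <stdio.h>

/* Hypothetical container for the three totals. */
struct counts {
    unsigned long long characters, words, lines;
};

/* Write the totals into buf; %llu matches unsigned long long. */
int format_counts(char *buf, size_t size, const struct counts *c)
{
    return snprintf(buf, size, "%llu characters\n%llu words\n%llu lines\n",
                    c->characters, c->words, c->lines);
}
```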
     
  8. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #8
    So what happens when text meets nega-text? Is it the end of both strings?

    B
     
  9. lee1210 macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #9
    Is there exactly the same amount of text and nega-text? If so, the result is nothing, both are destroyed. If there is more of one or the other, the difference in volume of that type is what will remain. This is why, if you have access to nega-text, you need to be sure that when you're writing a po
     
  10. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
  11. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #11
    Since my program embarrassed me when I posted it, it may deserve to collide with its antimatter twin. The program is hardly the best code I've ever written. It's not the worst either. My first version of wc crawled because it used strtok. So I needed to know that I could write something better. I think I wrote a better wc, but the better one still disappoints me.
     
  12. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #12
    I thought about unsigned ints, too.
     
  13. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #13
    Have you seen xkcd's flowchart on how to write good code?

    http://xkcd.com/844/

    We're all always looking for that question mark :p

    B
     
  14. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #14
    Maybe I need to write a boolean function that tells whether its argument can be part of a word. That function might look something like this.

    Code:
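    #include <ctype.h>
    #include <stdbool.h>
    #define APOSTROPHE '\''  /* assumed definition; not shown in the post */
    bool may_be_in_word(char character)
    {
       return isalnum((unsigned char)character) || character == '-' || character == APOSTROPHE;
    }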
     
  15. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #15
    Then you get into other issues, like whether state-of-the-art is one word or four. It all depends on what kind of input is expected.

    B
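
    To make the dependence on the definition concrete, here is a small sketch in which the "word character" test is passed in as a function pointer. All three names (alnum_only, alnum_or_hyphen, count_words_by) are hypothetical and chosen only for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Two hypothetical definitions of "word character". */
bool alnum_only(int c)      { return isalnum(c) != 0; }
bool alnum_or_hyphen(int c) { return isalnum(c) != 0 || c == '-'; }

/* Count maximal runs of characters accepted by is_word_char. */
size_t count_words_by(const char *s, bool (*is_word_char)(int))
{
    size_t words = 0;
    bool in_word = false;

    for (; *s != '\0'; ++s) {
        bool now = is_word_char((unsigned char)*s);
        if (now && !in_word)
            ++words;  /* entering a new run of word characters */
        in_word = now;
    }
    return words;
}
```

    With alnum_only, "state-of-the-art" comes out as four words; with alnum_or_hyphen it is one. The hyphen-friendly rule has its own cost, though: a lone "-" surrounded by spaces would also count as a word.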
     
  16. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #16
    I've just been reading a book about how to write prose. In the chapter about revision, the author quotes another author who exclaims, "Murder your little darlings." So, in true Unix style, I'll impale my program with the kernel's fork. :)
     
  17. Bill McEnaney thread starter macrumors 6502

    Joined:
    Apr 29, 2010
    #17
    Why don't I change my function to something like this?

    Code:
    bool can_be_in_a_word(int character)
    {
      return isalpha(character) || isdigit(character) || character == '-' || character == APOSTROPHE;
    }
    I wish I could write:

    Code:
    (define (word-count document)
       (length (filter word? document)))
    I know that a program's specs would tell me what counts as a word. Luckily, my program won't go into production. No, I'll just rewrite it for the rest of my life. ;)
     
