check if word is in english dictionary

Discussion in 'Community Discussion' started by unplugme71, Nov 8, 2013.

  1. unplugme71 macrumors 68030

    Joined:
    May 20, 2011
    Location:
    Earth
    #1
    I have a document that has about 7 million rows of text. I need a way to provide me a list of only the rows that contain actual words.

    eg.

    AAx7yss - not a word
    TodayGone - not a word
    Today - word


    It could be in excel, a website that does it, or a way to write it in PHP.
     
  2. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #2
    What kind of document is it? Excel? Word? Something else?

    Can you export the data into a plain text file, with one word per line? If not one word per line, what exportable plain-text formats can you produce?

    Exactly what do you want in the output? What order should the lines be in? Do you need any other output, such as the line-number (row number) where the non-word occurs?

    How definitive does "not English" need to be? Do proper names, such as "Aaron" or "Abel", count as English or non-English? What about names of countries or cities?


    This seems like a fairly straightforward programming problem, and there are different ways of solving it. One way is to use the Unix command-line tools, such as 'awk', to build a set of known English words, then compare each word in your plain-text list against the known-English word-set, emitting any non-matching words to another plain-text file.

    I can take a look at this later today. Or someone else might post something sooner. I plan to use the words in /usr/share/dict/words as the set of English words, unless there are some better ideas.
     
  3. mobilehaathi macrumors G3

    mobilehaathi

    Joined:
    Aug 19, 2008
    Location:
    The Anthropocene
    #3
    Pretty much a trivial scripting problem.

    Code:
    1) load a list of valid words into a hash
    2) load query file
      A) cut file into words
      B) strip punctuation
    3) check if each word exists in the hash
    
    The details of your desired output could add a few other steps, but really that is it.
     
  4. unplugme71 thread starter macrumors 68030

    Joined:
    May 20, 2011
    Location:
    Earth
    #4
    The list is one column, one word per row. It's in Excel.

    It can be any English word that is pronounced.

    I just don't know how to compare it and output only the valid words.
     
  5. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #5
    I wrote a program to do this, but it's limited in what it can deal with.

    You must export the words to a plain text file, with one word per line.
    There must be no punctuation, such as quotes, commas, etc.


    First, copy and paste the following into a plain text file named awkwords.txt, located in your Documents folder.

    File: awkwords.txt
    Code:
    # The program starts in words-building mode.
    # In this mode, input files are used to build the known_words dictionary.
    # In each file, the 1st word on a line ($1) is added to known_words.
    #
    # When the 1st word on a line is "---", the mode changes to word-matching mode.
    # The mode can also be changed by setting the mode-variable to 1, like so:
    #   MATCHING=1
    # Put that on the command-line instead of an input-file (example below).
    #
    # In word-matching mode, input files are checked against the known words,
    # and any line whose 1st word is known is output to stdout.
    # Any line whose 1st word isn't known is not output.
    #
    # At the end, a count of known words and a count of matched lines is
    # output to the stderr stream.
    #
    # Examples:
    # Test using short dictionary (test_words.txt):
    #  awk -f awkwords.txt  test_words.txt MATCHING=1 in-words.txt
    #
    # Test using full dictionary:
    #  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt
    #
    # Timed test using full dictionary for input:
    #  time awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 /usr/share/dict/words | wc
    
    
    # Begins in words-building mode, not word-matching mode.
    BEGIN {  MATCHING = 0;  DIAGNOSTICS = 0;  }
    
    
    # Pattern in input data that switches modes.
    $1 == "---"  {  MATCHING = 1;  next;  }
    
    
    # Catch-all action that builds words-array or matches against it,
    # depending on the state of the MATCHING variable.
    {
      if ( MATCHING )  {
        # In word-matching mode, check 1st word against known_words.
        # If found, output its entire line.
        if ( known_words[ tolower( $1 ) ] )  {
          print $0
    
          # Counter for END
          ++countMatched
    
        } else {
          ++countUnmatched
        }
    
      }  else  {
        # Add first word on line to known words.
        known_words[ tolower( $1 ) ] = 1
      }
    }
    
    
    END {
      if ( DIAGNOSTICS )  {
        wordCount = 0
        for ( w in known_words ) {
          ++wordCount
        }
        print "known_words: " wordCount >"/dev/stderr"
        print "    matched: " countMatched >"/dev/stderr"
        print "  unmatched: " countUnmatched >"/dev/stderr"
      }
    }
    Second, use Excel to export your data. You must export it to a plain text file in-words.txt, located in your Documents folder.

    Third, launch Terminal.app, then copy and paste the following into the Terminal window:
    Code:
    cd ~/Documents;  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt >out-words.txt
    The output will be in the new file out-words.txt in your Documents folder.

    There are ways of doing the same thing using various other languages.


    I can't really predict how long it will take to run. I suggest trying it first with an in-words.txt file of around 1000 lines. That should finish in a few seconds at most. You should then check the out-words.txt file to make sure it looks correct.

    If 1000 lines works, try it with increasingly larger files. Time it, then calculate approximately how long 100,000 or a million lines would take.

    When I did a test run where the input file was the /usr/share/dict/words dictionary of English words, it finished in about 1 second. The result contained 235886 words of output.


    Here's my test-case for in-words.txt:
    Code:
    abacus
    aberrant
    abet
    bizarre
    Candle
    Capital
    UPPER
    
    borked is not a word.
    fubar isn't a word.
    
    Example of multiple words on line, known-good English word.
    
    The lines for "borked" and "fubar" shouldn't appear in the output. The other lines should.

    Also, there won't be any blank lines in the output.


    If there's a problem, post again, and include the exact text of any error messages (copy and paste it).
     

Share This Page