check if word is in english dictionary

Discussion in 'Community Discussion' started by unplugme71, Nov 8, 2013.

  1. unplugme71 macrumors 68030

    May 20, 2011
    I have a document that has about 7 million rows of text. I need a way to provide me a list of only the rows that contain actual words.


    AAx7yss - not a word
    TodayGone - not a word
    Today - word

    It could be in excel, a website that does it, or a way to write it in PHP.
  2. chown33 macrumors 604

    Aug 9, 2009
    Sailing beyond the sunset
    What kind of document is it? Excel? Word? Something else?

    Can you export the data into a plain text file, with one word per line? If not one word per line, what exportable plain-text formats can you produce?

    Exactly what do you want in the output? What order should the lines be in? Do you need any other output, such as the line-number (row number) where the non-word occurs?

    How definitive does "not English" need to be? Do proper names, such as "Aaron" or "Abel", count as English or non-English? What about names of countries or cities?

    This seems like a fairly straightforward programming problem, and there are different ways of solving it. One way is to use the Unix command-line tools, such as 'awk', to build a set of known English words, then compare each word in your plain-text list against the known-English word-set, emitting any non-matching words to another plain-text file.

    I can take a look at this later today. Or someone else might post something sooner. I plan to use the words in /usr/share/dict/words as the set of English words, unless there are some better ideas.
  3. mobilehaathi macrumors G3


    Aug 19, 2008
    The Anthropocene
    Pretty much a trivial scripting problem.

    1) load a list of valid words into a hash
    2) load query file
      A) cut file into words
      B) strip punctuation
    3) check if each word exists in the hash
    The details of your desired output could add a few other steps, but really that is it.
  4. unplugme71 thread starter macrumors 68030

    May 20, 2011
    The list is one column, one word per row. It's in Excel.

    It can be any English word that is pronounced.

    I just don't know how to compare it and output only the valid words.
  5. chown33 macrumors 604

    Aug 9, 2009
    Sailing beyond the sunset
    I wrote a program to do this, but it's limited in what it can deal with.

    You must export the words to a plain text file, with one word per line.
    There must be no punctuation, such as quotes, commas, etc.

    First, copy and paste the following into a plain text file named awkwords.txt, located in your Documents folder.

    File: awkwords.txt
    # The program starts in words-building mode.
    # In this mode, input files are used to build the known_words dictionary.
    # In each file, the 1st word on a line ($1) is added to known_words.
    # When the 1st word on a line is "---", the mode changes to word-matching mode.
    # The mode can also be changed by setting the mode-variable to 1, like so:
    #   MATCHING=1
    # Put that on the command-line instead of an input-file (example below).
    # In word-matching mode, input files are checked against the known words,
    # and any line whose 1st word is known is output to stdout.
    # Any line whose 1st word isn't known is not output.
    # At the end, a count of known words and a count of matched lines is
    # output to the stderr stream.
    # Examples:
    # Test using short dictionary (test_words.txt):
    #  awk -f awkwords.txt  test_words.txt MATCHING=1 in-words.txt
    # Test using full dictionary:
    #  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt
    # Timed test using full dictionary for input:
    #  time awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 /usr/share/dict/words | wc
    # Begins in words-building mode, not word-matching mode.
    # Pattern in input data that switches modes.
    $1 == "---"  {  MATCHING = 1;  next;  }
    # Catch-all action that builds words-array or matches against it,
    # depending on the state of the MATCHING variable.
      if ( MATCHING )  {
        # In word-matching mode, check 1st word against known_words.
        # If found, output its entire line.
        if ( known_words[ tolower( $1 ) ] )  {
          print $0
          # Counter for END
        } else {
      }  else  {
        # Add first word on line to known words.
        known_words[ tolower( $1 ) ] = 1
    END {
      if ( DIAGNOSTICS )  {
        wordCount = 0
        for ( w in known_words ) {
        print "known_words: " wordCount >"/dev/stderr"
        print "    matched: " countMatched >"/dev/stderr"
        print "  unmatched: " countUnmatched >"/dev/stderr"
    Second, use Excel to export your data. You must export it to a plain text file in-words.txt, located in your Documents folder.

    Third, launch, then copy and paste the following into the Terminal window:
    cd ~/Documents;  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt >out-words.txt
    The output will be in the new file out-words.txt in your Documents folder.

    There are ways of doing the same thing using various other languages.

    I can't really predict how long it will take to run. I suggest trying it first with an in-words.txt file of around 1000 lines. That should finish in a few seconds at most. You should then check the out-words.txt file to make sure it looks correct.

    If 1000 lines works, try it with increasingly larger files. Time it, then calculate approximately how long 100,000 or a million lines would take.

    When I did a test run where the input file was the /usr/share/dict/words dictionary of English words, it finished in about 1 second. The result contained 235886 words of output.

    Here's my test-case for in-words.txt:
    borked is not a word.
    fubar isn't a word.
    Example of multiple words on line, known-good English word.
    The lines for "borked" and "fubar" shouldn't appear in the output. The other lines should.

    Also, there won't be any blank lines in the output.

    If there's a problem, post again, and include the exact text of any error messages (copy and paste it).

Share This Page