
unplugme71
macrumors 68030 (original poster)
I have a document with about 7 million rows of text. I need a way to get a list of only the rows that contain actual words.

eg.

AAx7yss - not a word
TodayGone - not a word
Today - word


It could be done in Excel, a website that does it, or a way to write it in PHP.
 
What kind of document is it? Excel? Word? Something else?

Can you export the data into a plain text file, with one word per line? If not one word per line, what exportable plain-text formats can you produce?

Exactly what do you want in the output? What order should the lines be in? Do you need any other output, such as the line-number (row number) where the non-word occurs?

How definitive does "not English" need to be? Do proper names, such as "Aaron" or "Abel", count as English or non-English? What about names of countries or cities?


This seems like a fairly straightforward programming problem, and there are different ways of solving it. One way is to use the Unix command-line tools, such as 'awk', to build a set of known English words, then compare each word in your plain-text list against the known-English word-set, emitting any non-matching words to another plain-text file.

I can take a look at this later today. Or someone else might post something sooner. I plan to use the words in /usr/share/dict/words as the set of English words, unless there are some better ideas.
 
Pretty much a trivial scripting problem.

Code:
1) load a list of valid words into a hash
2) load query file
  A) cut file into words
  B) strip punctuation
3) check if each word exists in the hash

The details of your desired output could add a few other steps, but really that is it.
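For example, the whole thing fits in a two-file awk one-liner. This is just a sketch, assuming the rows have already been exported to a plain-text file (called in-words.txt here) with one word per line and no punctuation:

Code:
# First file builds the hash of known words; each line of the second file
# is printed only if its first word is in that hash.
awk 'NR==FNR { known[tolower($1)] = 1; next }  tolower($1) in known' /usr/share/dict/words in-words.txt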
 
The list is one column, one word per row. It's in Excel.

It can be any English word that can be pronounced.

I just don't know how to compare it and output only the valid words.
 
I wrote a program to do this, but it's limited in what it can deal with.

You must export the words to a plain text file, with one word per line.
There must be no punctuation, such as quotes, commas, etc.
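If the exported file ends up with stray punctuation anyway, a quick cleanup pass in Terminal can strip it before running the program. This is only a sketch, assuming ordinary ASCII punctuation; raw-export.txt is a placeholder name for whatever Excel produced:

Code:
# Delete punctuation characters, writing the cleaned copy to in-words.txt.
tr -d '[:punct:]' < raw-export.txt > in-words.txt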


First, copy and paste the following into a plain text file named awkwords.txt, located in your Documents folder.

File: awkwords.txt
Code:
# The program starts in words-building mode.
# In this mode, input files are used to build the known_words dictionary.
# In each file, the 1st word on a line ($1) is added to known_words.
#
# When the 1st word on a line is "---", the mode changes to word-matching mode.
# The mode can also be changed by setting the mode-variable to 1, like so:
#   MATCHING=1
# Put that on the command-line instead of an input-file (example below).
#
# In word-matching mode, input files are checked against the known words,
# and any line whose 1st word is known is output to stdout.
# Any line whose 1st word isn't known is not output.
#
# At the end, if DIAGNOSTICS is set to 1 (e.g. put DIAGNOSTICS=1 on the
# command-line), counts of known words, matched lines, and unmatched lines
# are output to the stderr stream.
#
# Examples:
# Test using short dictionary (test_words.txt):
#  awk -f awkwords.txt  test_words.txt MATCHING=1 in-words.txt
#
# Test using full dictionary:
#  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt
#
# Timed test using full dictionary for input:
#  time awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 /usr/share/dict/words | wc


# Begins in words-building mode, not word-matching mode.
BEGIN {  MATCHING = 0;  DIAGNOSTICS = 0;  }


# Pattern in input data that switches modes.
$1 == "---"  {  MATCHING = 1;  next;  }


# Catch-all action that builds words-array or matches against it,
# depending on the state of the MATCHING variable.
{
  if ( MATCHING )  {
    # In word-matching mode, check 1st word against known_words.
    # If found, output its entire line.
    if ( tolower( $1 ) in known_words )  {
      print $0

      # Counter for END
      ++countMatched

    } else {
      ++countUnmatched
    }

  }  else  {
    # Add first word on line to known words.
    known_words[ tolower( $1 ) ] = 1
  }
}


END {
  if ( DIAGNOSTICS )  {
    wordCount = 0
    for ( w in known_words ) {
      ++wordCount
    }
    print "known_words: " wordCount >"/dev/stderr"
    print "    matched: " countMatched >"/dev/stderr"
    print "  unmatched: " countUnmatched >"/dev/stderr"
  }
}

Second, use Excel to export your data. You must export it to a plain text file named in-words.txt, located in your Documents folder.

Third, launch Terminal.app, then copy and paste the following into the Terminal window:
Code:
cd ~/Documents;  awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-words.txt >out-words.txt

The output will be in the new file out-words.txt in your Documents folder.
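For a quick sanity check on the result, the standard wc and head tools can count the input and output lines and show the first few lines of output:

Code:
cd ~/Documents;  wc -l in-words.txt out-words.txt;  head out-words.txt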

There are ways of doing the same thing using various other languages.


I can't really predict how long it will take to run. I suggest trying it first with an in-words.txt file of around 1000 lines. That should finish in a few seconds at most. You should then check the out-words.txt file to make sure it looks correct.

If 1000 lines works, try it with increasingly larger files. Time it, then calculate approximately how long 100,000 or a million lines would take.
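For example, a 1000-line sample can be cut from the full export and timed like this (in-sample.txt and out-sample.txt are just example names):

Code:
cd ~/Documents
# Take the first 1000 lines of the export as a test file, then time the run.
head -n 1000 in-words.txt > in-sample.txt
time awk -f awkwords.txt  /usr/share/dict/words MATCHING=1 in-sample.txt >out-sample.txt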

When I did a test run where the input file was the /usr/share/dict/words dictionary of English words, it finished in about 1 second. The result contained 235886 words of output.


Here's my test-case for in-words.txt:
Code:
abacus
aberrant
abet
bizarre
Candle
Capital
UPPER

borked is not a word.
fubar isn't a word.

Example of multiple words on line, known-good English word.
The lines for "borked" and "fubar" shouldn't appear in the output. The other lines should.

Also, there won't be any blank lines in the output.


If there's a problem, post again, and include the exact text of any error messages (copy and paste it).
 