I wrote a program to do this, but it's limited in the kind of input it can handle.
You must export the words to a plain text file, with one word per line.
There must be no punctuation, such as quotes, commas, etc.
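If the exported file isn't quite in that shape, a cleanup pass may help. Here's a minimal sketch, assuming the export might contain double quotes, commas, or carriage-return line endings. The file names raw-demo.txt and clean-demo.txt are made up for illustration; substitute your own.

```shell
# Sample data standing in for a raw export; raw-demo.txt is an assumed name.
# Skip this line when working on a real export.
printf '"abacus",\r\n"Candle"\r\nbizarre\r\n' > raw-demo.txt

# 1. Turn carriage returns into newlines.
# 2. Delete double quotes and commas.
# 3. Drop the empty lines left behind (awk 'NF' keeps only lines with a field).
tr '\r' '\n' < raw-demo.txt | tr -d '",' | awk 'NF' > clean-demo.txt

cat clean-demo.txt
```

With the sample data, clean-demo.txt ends up holding abacus, Candle, and bizarre, one per line. Apostrophes are left alone on purpose, since deleting them would change words like "isn't".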
First, copy and paste the following into a plain text file named awkwords.txt, located in your Documents folder.
File: awkwords.txt
Code:
# The program starts in words-building mode.
# In this mode, input files are used to build the known_words dictionary.
# In each file, the 1st word on a line ($1) is added to known_words.
#
# When the 1st word on a line is "---", the mode changes to word-matching mode.
# The mode can also be changed by setting the mode-variable to 1, like so:
# MATCHING=1
# Put that on the command-line instead of an input-file (example below).
#
# In word-matching mode, input files are checked against the known words,
# and any line whose 1st word is known is output to stdout.
# Any line whose 1st word isn't known is not output.
#
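# At the end, if DIAGNOSTICS is nonzero, counts of known words, matched lines,
# and unmatched lines are output to the stderr stream. DIAGNOSTICS is off by
# default. Turn it on the same way as MATCHING, by putting DIAGNOSTICS=1 on
# the command-line among the input files.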
#
# Examples:
# Test using short dictionary (test_words.txt):
# awk -f awkwords.txt test_words.txt MATCHING=1 in-words.txt
#
# Test using full dictionary:
# awk -f awkwords.txt /usr/share/dict/words MATCHING=1 in-words.txt
#
# Timed test using full dictionary for input:
# time awk -f awkwords.txt /usr/share/dict/words MATCHING=1 /usr/share/dict/words | wc
# Begins in words-building mode, not word-matching mode.
BEGIN { MATCHING = 0; DIAGNOSTICS = 0; }
# Pattern in input data that switches modes.
$1 == "---" { MATCHING = 1; next; }
# Catch-all action that builds words-array or matches against it,
# depending on the state of the MATCHING variable.
{
    if ( MATCHING ) {
        # In word-matching mode, check 1st word against known_words.
        # If found, output its entire line.
        # The "in" test is used because a plain known_words[...] lookup
        # would create an empty entry for every unknown word.
        if ( tolower( $1 ) in known_words ) {
            print $0
            # Counter for END
            ++countMatched
        } else {
            ++countUnmatched
        }
    } else {
        # Add first word on line to known words.
        known_words[ tolower( $1 ) ] = 1
    }
}
END {
    if ( DIAGNOSTICS ) {
        wordCount = 0
        for ( w in known_words ) {
            ++wordCount
        }
        # Adding 0 makes a counter that was never incremented print as 0.
        print "known_words: " wordCount >"/dev/stderr"
        print " matched: " ( countMatched + 0 ) >"/dev/stderr"
        print " unmatched: " ( countUnmatched + 0 ) >"/dev/stderr"
    }
}
Second, use Excel to export your data. You must export it to a plain text file named in-words.txt, located in your Documents folder.
Third, launch Terminal.app, then copy and paste the following into the Terminal window:
Code:
cd ~/Documents; awk -f awkwords.txt /usr/share/dict/words MATCHING=1 in-words.txt >out-words.txt
The output will be in the new file out-words.txt in your Documents folder.
There are ways of doing the same thing using various other languages.
I can't really predict how long it will take to run. I suggest trying it first with an in-words.txt file of around 1000 lines. That should finish in a few seconds at most. You should then check the out-words.txt file to make sure it looks correct.
If 1000 lines works, try it with increasingly larger files. Time it, then calculate approximately how long 100,000 or a million lines would take.
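The estimate is simple proportion: seconds per line in the trial, multiplied by the number of lines in the full job. Here's a small sketch of the arithmetic. The three figures are invented for illustration; replace them with your own measurements (the "real" value printed by the time command).

```shell
# Assumed figures for illustration only; substitute your own measurements.
sample_lines=1000      # lines in the trial file
sample_seconds=0.5     # elapsed ("real") time reported by `time`
target_lines=1000000   # lines in the full job

# seconds per line in the trial, scaled up to the full job
estimate=$(awk -v n="$sample_lines" -v t="$sample_seconds" -v big="$target_lines" \
    'BEGIN { printf "estimated seconds: %.0f", t / n * big }')

echo "$estimate"
```

Treat the result as an upper bound. Reading the dictionary is a fixed cost that's paid once per run no matter how big the input is, so a small trial tends to overstate the per-line time.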
When I did a test run where the input file was the /usr/share/dict/words dictionary of English words, it finished in about 1 second. The result contained 235886 words of output.
Here's my test-case for in-words.txt:
Code:
abacus
aberrant
abet
bizarre
Candle
Capital
UPPER
borked is not a word.
fubar isn't a word.
Example of multiple words on line, known-good English word.
The lines for "borked" and "fubar" shouldn't appear in the output. The other lines should.
Also, there won't be any blank lines in the output.
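To see that behavior without touching your real files, here's a self-contained sketch. It is not awkwords.txt itself: it uses a condensed one-line awk program with the same first-word lookup, plus a made-up mini dictionary in place of /usr/share/dict/words. The file names demo-dict.txt, demo-in.txt, and demo-out.txt are assumptions for illustration.

```shell
# Hypothetical mini dictionary standing in for /usr/share/dict/words.
printf '%s\n' abacus aberrant abet bizarre candle capital upper example > demo-dict.txt

# The test case from above.
cat > demo-in.txt <<'EOF'
abacus
aberrant
abet
bizarre
Candle
Capital
UPPER
borked is not a word.
fubar isn't a word.
Example of multiple words on line, known-good English word.
EOF

# Condensed stand-in for awkwords.txt:
#   while reading the first file (NR == FNR), record each lowercased 1st word;
#   for the second file, print lines whose lowercased 1st word was recorded.
awk 'NR == FNR { known[ tolower( $1 ) ] = 1; next }
     tolower( $1 ) in known' demo-dict.txt demo-in.txt > demo-out.txt

cat demo-out.txt
```

The 8 lines other than "borked" and "fubar" come through, with their original capitalization, because only the lookup is lowercased and not the printed line.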
If there's a problem, post again, and include the exact text of any error messages (copy and paste it).