HP's printer software and utilities were replaced with built-in items in OS X Snow Leopard after a few tense days where HP initially declined to support the new operating system. Many of us remember our outrage when some HP spokesdrone actually stated that users should buy new printers if they wanted Snow Leopard support. Those of us who didn't run screaming to Canon or some other brand have made do with a reduced feature-set for our HP printers. In particular, OCR functionality is no longer supported by HP.
I tried the following, and not only did it work for me, but it was a really sweet solution. At its heart is Tesseract, an open-source OCR engine that originated at HP (how's that for irony) and is currently maintained (and used) by Google. Their project is documented at http://code.google.com/p/tesseract-ocr/
I found that some of the online instructions for deploying and using Tesseract were a little bit confusing and contradictory. But here's what worked for me. As always: proceed at your own risk; I make no guarantees, and while I have tried to be careful I can't be certain that all the necessary steps are present, or safe, nor can I provide support. With that caveat out of the way:
o Be logged in to your Mac as an Administrator.
o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
o Expand it. The expansion process will create a new subfolder, "tesseract-2.04", inside your Downloads folder.
o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz --this is the English dictionary file.
o Expand it. Copy the CONTENTS of the resulting folder to the tessdata subfolder inside the aforementioned "tesseract-2.04" folder.
o Now open a terminal window (sorry). Navigate to ~/Downloads and from there into the "tesseract-2.04" folder.
o Issue the following commands. Each will take a minute or two to run:
Code:
./configure
make
sudo make install
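(You do need the "sudo" on the last command even when logged in as an Administrator, because "make install" writes into /usr/local. Being an Administrator is what allows you to use sudo; it will ask for your password.)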
o You have now installed your OCR engine. It can OCR uncompressed .tif files (only) very quickly. (However, they must be named "[something].tif". Not .tiff.)
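If you'd rather do the whole install from the terminal, the steps above can be sketched roughly like this. This is only a sketch: the function names, the scratch folder "langdata_tmp", and the use of curl and tar are my own choices, and I'm assuming the language tarball expands to a single folder whose contents belong in tessdata. Nothing runs until you call install_tesseract yourself.

```shell
#!/bin/sh
# Sketch of the install steps above. Defining these functions does nothing
# by itself; run "install_tesseract" when you are ready.
BASE_URL="http://tesseract-ocr.googlecode.com/files"
ENGINE="tesseract-2.04"
LANGDATA="tesseract-2.00.eng"

# Copy the CONTENTS of an expanded language-data folder ($1)
# into the tessdata subfolder of the engine source folder ($2).
copy_langdata() {
    cp -R "$1"/. "$2/tessdata/"
}

install_tesseract() {
    cd "$HOME/Downloads" || return 1
    curl -L -O "$BASE_URL/$ENGINE.tar.gz" || return 1
    curl -L -O "$BASE_URL/$LANGDATA.tar.gz" || return 1
    tar -xzf "$ENGINE.tar.gz" || return 1

    # Expand the language data into a scratch folder (hypothetical name),
    # then copy the contents of whatever folder it produced.
    mkdir -p langdata_tmp
    tar -xzf "$LANGDATA.tar.gz" -C langdata_tmp || return 1
    for d in langdata_tmp/*/; do
        copy_langdata "$d" "$ENGINE" || return 1
    done

    # Build and install; sudo will prompt for your password.
    cd "$ENGINE" || return 1
    ./configure && make && sudo make install
}
```

Paste it into a terminal window (or save it to a file and source it), then type install_tesseract.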
===Performing OCR===
There's a quick way with the terminal, and a cool way with Folder Actions. I'll describe both.
Terminal:
o Tesseract appends ".txt" to whatever output name you give it. So to create a converted text file named "someimage_text.txt" from an uncompressed .tif file named "someimage.tif" [note the .tif extension], go to the terminal, navigate to where the file is, and issue:
Code:
/usr/local/bin/tesseract someimage.tif someimage_text
...It's that easy.
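If you have a whole folder of scans, a small loop saves retyping the command. Here's a minimal sketch; the function name ocr_folder and the TESSERACT variable are mine, not part of Tesseract. It first renames any ".tiff" files to ".tif" (skipping any that would clash with an existing file), then writes one text file next to each image.

```shell
#!/bin/sh
# Hypothetical helper: OCR every .tif file in a folder.
# TESSERACT can be overridden if your copy lives somewhere else.
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"

ocr_folder() {
    dir="${1:-.}"

    # The engine wants ".tif", so rename any ".tiff" files first.
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue              # nothing matched the pattern
        target="${img%.tiff}.tif"
        [ -e "$target" ] || mv "$img" "$target"
    done

    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue
        base="${img%.tif}"                     # someimage.tif -> someimage
        # Tesseract adds ".txt" itself, giving someimage.txt
        "$TESSERACT" "$img" "$base" || echo "OCR failed: $img" >&2
    done
}
```

Usage would be something like: ocr_folder ~/Documents/Scans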
Now, my HP all-in-one's scanner produces a few graphics formats including .PNG, but not uncompressed .TIF. No matter: Apple's built-in Folder Action that converts images to .TIF outputs uncompressed .TIF files when fed a .PNG file from the scanner. Or, you can load your scanned image into Preview and Save As a .TIF format; just be sure to select no compression.
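A third option, which is not what I used above, is the "sips" command that comes with OS X. The sketch below converts one scan to TIFF from the terminal; the function name to_tif is made up for illustration. I haven't confirmed what compression sips applies to TIFF output by default, so try it on a test scan before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: convert one image to TIFF with sips (OS X only).
# Prints the path of the new .tif file on success.
to_tif() {
    src="$1"
    out="${src%.*}.tif"                        # scan.png -> scan.tif
    sips -s format tiff "$src" --out "$out" >/dev/null || return 1
    printf '%s\n' "$out"
}
```

For example, to_tif scan.png should leave a scan.tif next to the original, ready to hand to Tesseract.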
Folder Action OCR:
I made a sweet Folder Action script (my very first!) by slightly modifying an existing Folder Action script from Apple which converts file formats by calling a shell script. In this case, I modified the script to call the tesseract command shown above.
Just open the Folder Action script editor and paste the following code in. It works beautifully.
You can get fancy and create an output folder for your scanner with a Folder Action selected to convert incoming files to .TIF, and then attach another Folder action to its TIF output folder to convert incoming files to text via OCR! Then your scans are automatically OCR'd as they arrive from the scanner. Sweet!
===Folder Action script to use Tesseract to OCR uncompressed .TIF files===
Code:
(*
convert - do ocr via shell script
This Folder Action handler is triggered whenever items are added to the attached folder.
The script converts files from uncompressed .tif format to text using the open-source Tesseract OCR engine, http://code.google.com/p/tesseract-ocr/
Copyright © 2002–2007 Apple Inc. [with mods by sjinsjca]
You may incorporate this Apple sample code into your program(s) without
restriction. This Apple sample code has been provided "AS IS" and the
responsibility for its operation is yours. You are not permitted to
redistribute this Apple sample code as "Apple sample code" after having
made changes. If you're going to redistribute the code, we require
that you make it clear that the code was descended from Apple sample
code, but that you've made changes. ===> Duly noted, changes have been made. --sjinsjca
*)
property done_foldername : "OCR Files"
property originals_foldername : "Original Files"
property newimage_extension : ""
-- the list of file types which will be processed
-- eg: {"PICT", "JPEG", "TIFF", "GIFf"}
property type_list : {"TIFF"}
-- since file types are optional in Mac OS X,
-- check the name extension if there is no file type
-- NOTE: do not use periods (.) with the items in the name extensions list
-- eg: {"txt", "text", "jpg", "jpeg"}, NOT: {".txt", ".text", ".jpg", ".jpeg"}
property extension_list : {"tif"}
on adding folder items to this_folder after receiving these_items
    tell application "Finder"
        if not (exists folder done_foldername of this_folder) then
            make new folder at this_folder with properties {name:done_foldername}
        end if
        set the results_folder to (folder done_foldername of this_folder) as alias
        if not (exists folder originals_foldername of this_folder) then
            make new folder at this_folder with properties {name:originals_foldername}
            set current view of container window of this_folder to list view
        end if
        set the originals_folder to folder originals_foldername of this_folder
    end tell
    try
        repeat with i from 1 to number of items in these_items
            set this_item to item i of these_items
            set the item_info to the info for this_item
            if (alias of the item_info is false and the file type of the item_info is in the type_list) or (the name extension of the item_info is in the extension_list) then
                tell application "Finder"
                    my resolve_conflicts(this_item, originals_folder, "")
                    set the new_name to my resolve_conflicts(this_item, results_folder, newimage_extension)
                    set the source_file to (move this_item to the originals_folder with replacing) as alias
                end tell
                process_item(source_file, new_name, results_folder)
            end if
        end repeat
    on error error_message number error_number
        if the error_number is not -128 then
            tell application "Finder"
                activate
                display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
            end tell
        end if
    end try
end adding folder items to
on resolve_conflicts(this_item, target_folder, new_extension)
    tell application "Finder"
        set the file_name to the name of this_item
        set file_extension to the name extension of this_item
        if the file_extension is "" then
            set the trimmed_name to the file_name
        else
            set the trimmed_name to text 1 thru -((length of file_extension) + 2) of the file_name
        end if
        if the new_extension is "" then
            set target_name to file_name
            set target_extension to file_extension
        else
            set target_extension to new_extension
            set target_name to (the trimmed_name & "." & target_extension) as string
        end if
        if (exists document file target_name of target_folder) then
            set the name_increment to 1
            repeat
                set the new_name to (the trimmed_name & "." & (name_increment as string) & "." & target_extension) as string
                if not (exists document file new_name of the target_folder) then
                    -- rename the existing, conflicting file to the first free numbered name
                    set the name of document file target_name of the target_folder to the new_name
                    exit repeat
                else
                    set the name_increment to the name_increment + 1
                end if
            end repeat
        end if
    end tell
    return the target_name
end resolve_conflicts
-- this sub-routine processes files
on process_item(source_file, new_name, results_folder)
    -- NOTE that the variable source_file is a file reference in alias format
    -- FILE PROCESSING STATEMENTS GO HERE
    try
        set the source_item to the quoted form of the POSIX path of the source_file
        -- the target path is the destination folder and the new file name
        -- (Tesseract appends ".txt" to this name itself)
        set the target_path to the quoted form of the POSIX path of (((results_folder as string) & new_name) as string)
        with timeout of 900 seconds
            do shell script ("/usr/local/bin/tesseract " & source_item & " " & target_path)
        end timeout
    on error error_message
        tell application "Finder"
            activate
            display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
        end tell
    end try
end process_item
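For those who prefer the terminal to Folder Actions, here is a rough shell equivalent of what the script above does to a folder: move each .tif into "Original Files" and write the OCR text into "OCR Files". It's a sketch only. The function name ocr_inbox and the TESSERACT variable are mine, and unlike the AppleScript it does no conflict renaming, so a scan with the same name as an earlier one will overwrite it.

```shell
#!/bin/sh
# Rough shell equivalent of the Folder Action (no conflict renaming).
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"

ocr_inbox() {
    inbox="$1"
    mkdir -p "$inbox/OCR Files" "$inbox/Original Files" || return 1
    for img in "$inbox"/*.tif; do
        [ -e "$img" ] || continue              # nothing matched the pattern
        name=$(basename "$img")
        # Move the scan out of the way first, as the Folder Action does.
        mv "$img" "$inbox/Original Files/$name" || continue
        # Output base keeps the full file name, so scan.tif -> scan.tif.txt
        "$TESSERACT" "$inbox/Original Files/$name" "$inbox/OCR Files/$name" ||
            echo "OCR failed: $name" >&2
    done
}
```

Calling ocr_inbox on your scanner's output folder processes whatever .tif files are sitting in it at that moment; it does not watch the folder the way a Folder Action does.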
Enjoy!