How to do OCR after HP screwed up its printer software for Snow Leopard

sjinsjca · Feb 8, 2010

HP's printer software and utilities were replaced with built-in items in OS X Snow Leopard after a few tense days where HP initially declined to support the new operating system. Many of us remember our outrage when some HP spokesdrone actually stated that users should buy new printers if they wanted Snow Leopard support. Those of us who didn't run screaming to Canon or some other brand have made do with a reduced feature-set for our HP printers. In particular, OCR functionality is no longer supported by HP.

I tried the following, and not only did it work for me, but it was a really sweet solution. At its heart is Tesseract, an open-source OCR engine that originated at HP (how's that for irony) and is currently maintained (and used) by Google. Their project is documented at http://code.google.com/p/tesseract-ocr/

I found that some of the online instructions for deploying and using Tesseract were a little bit confusing and contradictory. But here's what worked for me. As always: proceed at your own risk; I make no guarantees, and while I have tried to be careful I can't be certain that all the necessary steps are present, or safe, nor can I provide support. With that caveat out of the way:

o Be logged in to your Mac as an Administrator.

o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz

o Expand it. The expansion process will create a new subfolder, "tesseract-2.04", inside your Downloads folder.

o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz (this is the English language data that the engine needs).

o Expand it. Copy the CONTENTS of the resulting folder to the tessdata subfolder inside the aforementioned "tesseract-2.04" folder.

o Now open a terminal window (sorry). Navigate to ~/Downloads and from there into the "tesseract-2.04" folder.

o Issue the following commands. Each will take a minute or two to run:

Code:

./configure
make
sudo make install

(Being logged in as an Administrator, as I recommended, is what lets you use "sudo"; it will ask for your password before installing.)

o You have now installed your OCR engine. It can OCR uncompressed .tif files (only) very quickly. (However, they must be named "[something].tif". Not .tiff.)
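
If you'd rather script the unpack-and-build steps above than click through them, something like the following should be close. Treat it as a rough sketch under a few assumptions: both archives are already downloaded, the language archive expands to a single top-level folder (as described above), and the function names unpack_tesseract and build_tesseract, plus the temporary folder name, are made up for illustration.

```shell
# Unpack both archives into a working folder and merge the English
# data into the source tree's tessdata folder.
# Arguments: source archive, language archive, working folder.
unpack_tesseract() (
    src_tar=$1
    lang_tar=$2
    work=$3
    tmp="$work/tess-langdata.tmp"   # hypothetical scratch folder name
    mkdir -p "$tmp" || exit 1
    tar -xzf "$src_tar" -C "$work" || exit 1
    tar -xzf "$lang_tar" -C "$tmp" || exit 1
    mkdir -p "$work/tesseract-2.04/tessdata" || exit 1
    # Assumption: the language archive expands to one top-level folder;
    # copy that folder's CONTENTS, not the folder itself.
    cp -R "$tmp"/*/. "$work/tesseract-2.04/tessdata/" || exit 1
    rm -rf "$tmp"
)

# Configure, compile and install from the unpacked source tree.
# Argument: the same working folder passed to unpack_tesseract.
build_tesseract() (
    cd "$1/tesseract-2.04" || exit 1
    ./configure && make && sudo make install
)
```

With both downloads sitting in ~/Downloads, you would call unpack_tesseract ~/Downloads/tesseract-2.04.tar.gz ~/Downloads/tesseract-2.00.eng.tar.gz ~/Downloads and then build_tesseract ~/Downloads. The same proceed-at-your-own-risk caveat applies.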

===Performing OCR===

There's a quick way with the terminal, and a cool way with Folder Actions. I'll describe both.

Terminal:

o To create a text file named "someimage_text.txt" from an uncompressed .tif file named "someimage.tif" [note the .tif extension], go to the terminal, navigate to where the file is, and issue the following (Tesseract adds the ".txt" to the output name itself):

Code:

/usr/local/bin/tesseract someimage.tif someimage_text

...It's that easy.
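
If you have a whole folder of scans, the same command can be wrapped in a loop. This is only a sketch: the function name ocr_folder and the TESSERACT_BIN variable are made up for illustration, and it assumes the engine was installed to /usr/local/bin as above.

```shell
# Path from the install steps; override TESSERACT_BIN if yours differs.
TESSERACT_BIN=${TESSERACT_BIN:-/usr/local/bin/tesseract}

# OCR every .tif file in the given folder, writing page.txt next to page.tif.
ocr_folder() (
    cd "$1" || exit 1
    for img in *.tif; do
        [ -e "$img" ] || continue      # no .tif files: the glob stays literal
        base="${img%.tif}"
        # Tesseract appends ".txt" to the output base name on its own.
        "$TESSERACT_BIN" "$img" "$base" || echo "OCR failed for $img" >&2
    done
)
```

Calling ocr_folder ~/Desktop/scans would then process each .tif in that folder in turn. Files ending in .tiff are deliberately ignored, since the engine wants the .tif extension.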

Now, my HP all-in-one's scanner produces a few graphics formats including .PNG, but not uncompressed .TIF. No matter: Apple's built-in Folder Action that converts images to .TIF outputs uncompressed .TIF files when fed a .PNG file from the scanner. Or you can open your scanned image in Preview and use Save As to save it in .TIF format; just be sure to select no compression.
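
The conversion can also be done from the terminal with sips, the image tool that ships with OS X. The sketch below is hedged accordingly: the function name png_to_tif and the SIPS_BIN variable are made up for illustration, and it is an assumption that sips writes its TIFF output uncompressed by default, so check the first result before trusting a batch.

```shell
# sips ships with OS X; override SIPS_BIN only if it lives elsewhere.
SIPS_BIN=${SIPS_BIN:-/usr/bin/sips}

# Convert one .png to a .tif with the same base name and print the new path.
# Assumption: the default TIFF output of sips is uncompressed.
png_to_tif() (
    png=$1
    tif="${png%.png}.tif"
    "$SIPS_BIN" -s format tiff "$png" --out "$tif" >/dev/null || exit 1
    printf '%s\n' "$tif"
)
```

For example, png_to_tif ~/Desktop/scan.png should leave scan.tif beside the original, ready to hand to the engine.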

Folder Action OCR:

I made a sweet Folder Action script (my very first!) by slightly modifying an existing Folder Action script from Apple which converts file formats by calling a shell script. In this case, I modified the script to run the tesseract command shown above.

Just open the Folder Action script editor and paste the following code in. It works beautifully.

You can get fancy and create an output folder for your scanner with a Folder Action selected to convert incoming files to .TIF, and then attach another Folder action to its TIF output folder to convert incoming files to text via OCR! Then your scans are automatically OCR'd as they arrive from the scanner. Sweet!

===Folder Action script to use Tesseract to OCR uncompressed .TIF files:===

Code:

(*
convert - do ocr via shell script

This Folder Action handler is triggered whenever items are added to the attached folder.

The script converts files from uncompressed .tif format to text using the open-source Tesseract OCR engine, http://code.google.com/p/tesseract-ocr/

Copyright © 2002–2007 Apple Inc. [with mods by sjinsjca]

You may incorporate this Apple sample code into your program(s) without
restriction.  This Apple sample code has been provided "AS IS" and the
responsibility for its operation is yours.  You are not permitted to
redistribute this Apple sample code as "Apple sample code" after having
made changes.  If you're going to redistribute the code, we require
that you make it clear that the code was descended from Apple sample
code, but that you've made changes.  ===> Duly noted, changes have been made. --sjinsjca
*)

property done_foldername : "OCR Files"
property originals_foldername : "Original Files"
property newimage_extension : ""
-- the list of file types which will be processed 
-- eg: {"PICT", "JPEG", "TIFF", "GIFf"} 
property type_list : {"TIFF"}
-- since file types are optional in Mac OS X, 
-- check the name extension if there is no file type 
-- NOTE: do not use periods (.) with the items in the name extensions list 
-- eg: {"txt", "text", "jpg", "jpeg"}, NOT: {".txt", ".text", ".jpg", ".jpeg"} 
property extension_list : {"tif"}


on adding folder items to this_folder after receiving these_items
	tell application "Finder"
		if not (exists folder done_foldername of this_folder) then
			make new folder at this_folder with properties {name:done_foldername}
		end if
		set the results_folder to (folder done_foldername of this_folder) as alias
		if not (exists folder originals_foldername of this_folder) then
			make new folder at this_folder with properties {name:originals_foldername}
			set current view of container window of this_folder to list view
		end if
		set the originals_folder to folder originals_foldername of this_folder
	end tell
	try
		repeat with i from 1 to number of items in these_items
			set this_item to item i of these_items
			set the item_info to the info for this_item
			if (alias of the item_info is false and the file type of the item_info is in the type_list) or (the name extension of the item_info is in the extension_list) then
				tell application "Finder"
					my resolve_conflicts(this_item, originals_folder, "")
					set the new_name to my resolve_conflicts(this_item, results_folder, newimage_extension)
					set the source_file to (move this_item to the originals_folder with replacing) as alias
				end tell
				process_item(source_file, new_name, results_folder)
			end if
		end repeat
	on error error_message number error_number
		if the error_number is not -128 then
			tell application "Finder"
				activate
				display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
			end tell
		end if
	end try
end adding folder items to

on resolve_conflicts(this_item, target_folder, new_extension)
	tell application "Finder"
		set the file_name to the name of this_item
		set file_extension to the name extension of this_item
		if the file_extension is "" then
			set the trimmed_name to the file_name
		else
			set the trimmed_name to text 1 thru -((length of file_extension) + 2) of the file_name
		end if
		if the new_extension is "" then
			set target_name to file_name
			set target_extension to file_extension
		else
			set target_extension to new_extension
			set target_name to (the trimmed_name & "." & target_extension) as string
		end if
		if (exists document file target_name of target_folder) then
			set the name_increment to 1
			repeat
				set the new_name to (the trimmed_name & "." & (name_increment as string) & "." & target_extension) as string
				if not (exists document file new_name of the target_folder) then
					-- rename to conflicting file
					set the name of document file target_name of the target_folder to the new_name
					exit repeat
				else
					set the name_increment to the name_increment + 1
				end if
			end repeat
		end if
	end tell
	return the target_name
end resolve_conflicts

-- this sub-routine processes files 
on process_item(source_file, new_name, results_folder)
	-- NOTE that the variable source_file is a file reference in alias format 
	-- FILE PROCESSING STATEMENTS GO HERE 
	try
		set the source_item to the quoted form of the POSIX path of the source_file
		-- the target path is the destination folder and the new file name
		set the target_path to the quoted form of the POSIX path of (((results_folder as string) & new_name) as string)
		with timeout of 900 seconds
			do shell script ("/usr/local/bin/tesseract " & source_item & " " & target_path)
		end timeout
	on error error_message
		tell application "Finder"
			activate
			display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
		end tell
	end try
end process_item
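
For anyone who wants to try the same flow from the terminal before attaching the Folder Action, here is a rough shell equivalent. It is a simplified sketch, not a translation of the script: the function name ocr_inbox and the TESSERACT_BIN variable are made up for illustration, and unlike the AppleScript it overwrites existing files instead of renaming them.

```shell
# Path used by the Folder Action script; override if yours differs.
TESSERACT_BIN=${TESSERACT_BIN:-/usr/local/bin/tesseract}

# Move each .tif in the folder into "Original Files" and write its
# OCR text into "OCR Files", using the same folder names as the script.
ocr_inbox() (
    inbox=$1
    mkdir -p "$inbox/OCR Files" "$inbox/Original Files" || exit 1
    for img in "$inbox"/*.tif; do
        [ -e "$img" ] || continue      # nothing to process
        name=$(basename "$img")
        mv "$img" "$inbox/Original Files/$name" || continue
        # As in the script, the output base keeps the .tif name, so the
        # result is "scan.tif.txt" once Tesseract appends ".txt".
        "$TESSERACT_BIN" "$inbox/Original Files/$name" "$inbox/OCR Files/$name" \
            || echo "OCR failed for $name" >&2
    done
)
```

Running ocr_inbox ~/Desktop/ScanInbox on a folder containing scan.tif should leave the image in "Original Files" and the text in "OCR Files/scan.tif.txt".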

Enjoy!

sjinsjca · Feb 8, 2010

Annnd: Here's a nice GUI to the Tesseract OCR engine:

http://download.dv8.ro/files/TesseractGUI/
