How to do OCR after HP screwed up its printer software for Snow Leopard

Discussion in 'macOS' started by sjinsjca, Feb 8, 2010.

  1. sjinsjca macrumors 68000

    sjinsjca

    Joined:
    Oct 30, 2008
    #1
    HP's printer software and utilities were replaced with built-in items in OS X Snow Leopard after a few tense days where HP initially declined to support the new operating system. Many of us remember our outrage when some HP spokesdrone actually stated that users should buy new printers if they wanted Snow Leopard support. Those of us who didn't run screaming to Canon or some other brand have made do with a reduced feature-set for our HP printers. In particular, OCR functionality is no longer supported by HP.

    I tried the following, and not only did it work for me, but it was a really sweet solution. At its heart is Tesseract, an open-source OCR engine that originated at HP (how's that for irony) and is currently maintained (and used) by Google. Their project is documented at http://code.google.com/p/tesseract-ocr/

    I found that some of the online instructions for deploying and using Tesseract were a little bit confusing and contradictory. But here's what worked for me. As always: proceed at your own risk; I make no guarantees, and while I have tried to be careful I can't be certain that all the necessary steps are present, or safe, nor can I provide support. With that caveat out of the way:

    o Be logged in to your Mac as an Administrator.

    o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz

    o Expand it. The expansion process will create a new subfolder, "tesseract-2.04", inside your Downloads folder.

    o Download http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz --this is the English dictionary file.

    o Expand it. Copy the CONTENTS of the resulting folder to the tessdata subfolder inside the aforementioned "tesseract-2.04" folder.

    o Now open a terminal window (sorry). Navigate to ~/Downloads and from there into the "tesseract-2.04" folder.

    o Issue the following commands. Each will take a minute or two to run:

    Code:
    ./configure
    make
    sudo make install 
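    (The "sudo" is needed even when you're logged in as an Administrator, because the install step writes to /usr/local. Being an Administrator is what allows you to use it, and you'll be asked for your password.)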

    o You have now installed your OCR engine. It can OCR uncompressed .tif files (only) very quickly. (However, they must be named "[something].tif". Not .tiff.)
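
    Before moving on, it can help to confirm that the install landed and that a given scan has an acceptable name. The following is only a minimal sketch: the function names are made up for illustration, the default path assumes "make install" put the binary in /usr/local/bin, and the name check is case-sensitive, which is an assumption rather than something stated above.

```shell
# Hypothetical helper: succeed only for names ending in ".tif".
# ".tiff" is rejected, per the note above. Case-sensitive by assumption.
is_tif_name() {
    case "$1" in
        *.tif) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: succeed if an executable exists at the given path.
# Defaults to /usr/local/bin/tesseract; pass another path if yours differs.
tesseract_installed() {
    [ -x "${1:-/usr/local/bin/tesseract}" ]
}
```

    For example, "is_tif_name scan.tiff || echo rename it first" prints the reminder, and "tesseract_installed && echo ready" prints "ready" once the build has been installed at the default path.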


    ===Performing OCR===

    There's a quick way with the terminal, and a cool way with Folder Actions. I'll describe both.

    Terminal:

    o To create a text file named "someimage_text.txt" from an uncompressed .tif file named "someimage.tif" [note the .tif extension], go to the terminal, navigate to where the file is, and issue the command below. Tesseract adds the ".txt" to the output name itself:
    Code:
    /usr/local/bin/tesseract someimage.tif someimage_text
    ...It's that easy.
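
    If you have a folder full of scans, the same command can be run in a loop. This is a rough sketch, not part of the steps I tested: the function names and the TESSERACT variable are made up for illustration, and it assumes the engine is at /usr/local/bin/tesseract unless you set TESSERACT to something else.

```shell
# Path to the engine; override by setting TESSERACT before calling ocr_folder.
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"

# Strip the ".tif" suffix. Tesseract appends ".txt" to whatever
# output base name it is given, so "scan.tif" becomes "scan.txt".
ocr_base() {
    printf '%s\n' "${1%.tif}"
}

# OCR every .tif file in the given folder (default: the current folder).
ocr_folder() {
    dir="${1:-.}"
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue   # nothing matched: the pattern stays literal
        "$TESSERACT" "$f" "$(ocr_base "$f")"
    done
}
```

    Running "ocr_folder ~/Documents/Scans" (a hypothetical folder) would leave a .txt file next to each .tif file in it.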

    Now, my HP all-in-one's scanner produces a few graphics formats including .PNG, but not uncompressed .TIF. No matter: Apple's built-in Folder Action that converts images to .TIF outputs uncompressed .TIF files when fed a .PNG file from the scanner. Or you can load your scanned image into Preview and use Save As to save it in .TIF format; just be sure to select no compression.
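
    A third option, if you'd rather stay in the terminal, is sips, the image tool that ships with OS X. Treat this as a sketch under assumptions: the function names are made up, and I am assuming that sips writes TIFF without compression unless told otherwise, which you should verify on your own system before relying on it.

```shell
# Name the TIFF after the source image: "scan.png" -> "scan.tif".
# Note the extension is .tif, not .tiff, as Tesseract requires.
tif_name_for() {
    printf '%s\n' "${1%.*}.tif"
}

# Convert one image to TIFF with sips, writing it next to the original.
# Assumption: the default TIFF output is uncompressed.
png_to_tif() {
    sips -s format tiff "$1" --out "$(tif_name_for "$1")"
}
```

    For example, "png_to_tif scan.png" should produce "scan.tif" in the same folder, ready for the tesseract command above.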


    Folder Action OCR:

    I made a sweet Folder Action script (my very first!) by slightly modifying an existing Folder Action script from Apple which converts file formats by calling a shell script. In this case, I modified the script to call the tesseract command shown above.

    Just open the Folder Action script editor and paste the following code in. It works beautifully.

    You can get fancy and create an output folder for your scanner with a Folder Action selected to convert incoming files to .TIF, and then attach another Folder action to its TIF output folder to convert incoming files to text via OCR! Then your scans are automatically OCR'd as they arrive from the scanner. Sweet!

    ===Folder Action script to use Tesseract to OCR uncompressed .TIF files:===
    Code:
    (*
    convert - do ocr via shell script
    
    This Folder Action handler is triggered whenever items are added to the attached folder.
    
    This script converts files from uncompressed .tif format to plain text using the open-source Tesseract OCR engine, http://code.google.com/p/tesseract-ocr/
    
    Copyright © 2002–2007 Apple Inc. [with mods by sjinsjca]
    
    You may incorporate this Apple sample code into your program(s) without
    restriction.  This Apple sample code has been provided "AS IS" and the
    responsibility for its operation is yours.  You are not permitted to
    redistribute this Apple sample code as "Apple sample code" after having
    made changes.  If you're going to redistribute the code, we require
    that you make it clear that the code was descended from Apple sample
    code, but that you've made changes.  ===> Duly noted, changes have been made. --sjinsjca
    *)
    
    property done_foldername : "OCR Files"
    property originals_foldername : "Original Files"
    property newimage_extension : ""
    -- the list of file types which will be processed 
    -- eg: {"PICT", "JPEG", "TIFF", "GIFf"} 
    property type_list : {"TIFF"}
    -- since file types are optional in Mac OS X, 
    -- check the name extension if there is no file type 
    -- NOTE: do not use periods (.) with the items in the name extensions list 
    -- eg: {"txt", "text", "jpg", "jpeg"}, NOT: {".txt", ".text", ".jpg", ".jpeg"} 
    property extension_list : {"tif"}
    
    
    on adding folder items to this_folder after receiving these_items
    	tell application "Finder"
    		if not (exists folder done_foldername of this_folder) then
    			make new folder at this_folder with properties {name:done_foldername}
    		end if
    		set the results_folder to (folder done_foldername of this_folder) as alias
    		if not (exists folder originals_foldername of this_folder) then
    			make new folder at this_folder with properties {name:originals_foldername}
    			set current view of container window of this_folder to list view
    		end if
    		set the originals_folder to folder originals_foldername of this_folder
    	end tell
    	try
    		repeat with i from 1 to number of items in these_items
    			set this_item to item i of these_items
    			set the item_info to the info for this_item
    			if (alias of the item_info is false and the file type of the item_info is in the type_list) or (the name extension of the item_info is in the extension_list) then
    				tell application "Finder"
    					my resolve_conflicts(this_item, originals_folder, "")
    					set the new_name to my resolve_conflicts(this_item, results_folder, newimage_extension)
    					set the source_file to (move this_item to the originals_folder with replacing) as alias
    				end tell
    				process_item(source_file, new_name, results_folder)
    			end if
    		end repeat
    	on error error_message number error_number
    		if the error_number is not -128 then
    			tell application "Finder"
    				activate
    				display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
    			end tell
    		end if
    	end try
    end adding folder items to
    
    on resolve_conflicts(this_item, target_folder, new_extension)
    	tell application "Finder"
    		set the file_name to the name of this_item
    		set file_extension to the name extension of this_item
    		if the file_extension is "" then
    			set the trimmed_name to the file_name
    		else
    			set the trimmed_name to text 1 thru -((length of file_extension) + 2) of the file_name
    		end if
    		if the new_extension is "" then
    			set target_name to file_name
    			set target_extension to file_extension
    		else
    			set target_extension to new_extension
    			set target_name to (the trimmed_name & "." & target_extension) as string
    		end if
    		if (exists document file target_name of target_folder) then
    			set the name_increment to 1
    			repeat
    				set the new_name to (the trimmed_name & "." & (name_increment as string) & "." & target_extension) as string
    				if not (exists document file new_name of the target_folder) then
    					-- rename to conflicting file
    					set the name of document file target_name of the target_folder to the new_name
    					exit repeat
    				else
    					set the name_increment to the name_increment + 1
    				end if
    			end repeat
    		end if
    	end tell
    	return the target_name
    end resolve_conflicts
    
    -- this sub-routine processes files 
    on process_item(source_file, new_name, results_folder)
    	-- NOTE that the variable source_file is a file reference in alias format 
    	-- FILE PROCESSING STATEMENTS GO HERE 
    	try
    		set the source_item to the quoted form of the POSIX path of the source_file
    		-- the target path is the destination folder and the new file name
    		set the target_path to the quoted form of the POSIX path of (((results_folder as string) & new_name) as string)
    		with timeout of 900 seconds
    			do shell script ("/usr/local/bin/tesseract " & source_item & " " & target_path)
    		end timeout
    	on error error_message
    		tell application "Finder"
    			activate
    			display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
    		end tell
    	end try
    end process_item
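
    For anyone who wants to see the same flow outside of Folder Actions, here is a rough shell sketch of what the script does with one file. It is not a replacement for the script: the function name and the TESSERACT variable are made up for illustration, and unlike the AppleScript it does no name-conflict handling, so an existing file of the same name is overwritten.

```shell
# Path to the engine; override by setting TESSERACT before calling.
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"

# Mirror of the Folder Action flow for a single scan:
#   1. create the two subfolders if they are missing
#   2. move the scan into "Original Files"
#   3. run tesseract, writing "<name>.txt" into "OCR Files"
ocr_and_file() {
    src="$1"
    folder=$(dirname "$src")
    name=$(basename "$src")
    mkdir -p "$folder/OCR Files" "$folder/Original Files" || return 1
    mv "$src" "$folder/Original Files/$name" || return 1
    "$TESSERACT" "$folder/Original Files/$name" "$folder/OCR Files/$name"
}
```

    As in the script, the output base name is the full file name, so "scan.tif" ends up as "scan.tif.txt" inside "OCR Files".
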
    Enjoy!
     