Which Mac for OCR of large documents?

moreforus · Jul 16, 2017

Hello everyone,

I have a lot of scanned PDF files (typical size around 200 MB) that I would like to convert to searchable PDFs with the OCR part of DEVONthink Pro Office. With my old Mac Pro 1,1 it takes forever.

I am looking at either a MacBook Pro or a 27-inch iMac. Which one would be best for this task, and how much faster would it be than my Mac Pro?

Thanks

PS. Sorry for my English, it is not my mother tongue.

ApolloBoy · Jul 16, 2017

Just looking at benchmarks, either computer will smoke the old Mac Pro. But if you're looking for something with the best performance I'd say get a 27" iMac with the i7, that should make short work of making your PDFs searchable.

jerwin · Jul 16, 2017

A lot of what I do on my Mac is OCR. I use Abbyy Finereader Pro, so I'm more or less familiar with how such things work on my machine (2014 iMac 5K, with an i5/m290x/24 GB RAM). DEVONthink uses the Abbyy Finereader engine, so I figured it would be an interesting look into how it uses my computer's resources. I'm working with a demo copy of DEVONthink, so ... I have 150 hours to test things...

Do you have an example document?

I found a document from my stash that is 193 MB, and 384 pages, but it may differ materially from the documents you are interested in.

http://resources.metmuseum.org/reso.../The_Metropolitan_Museum_Journal_v_3_1970.pdf

One thing I've noticed, besides the slowness, is that DevonScribbler uses 4 threads, 96 percent CPU, and about 211 MB--so it's not really using the resources that are available. After it's finished, I'll run it through Finereader Pro, and compare.

kohlson · Jul 16, 2017

There are several possible bottlenecks behind the slow performance you're experiencing: how fast the PDF is read from disk, how fast it is processed by the CPU, and how fast it can be written back to disk. Other factors include memory contention (not enough memory) and whether the CPU processing is single-threaded or multithreaded. You can use Utilities/Activity Monitor to help you understand where the bottlenecks are.

Offhand, I would start with the disk as a bottleneck. MP 1,1 is very slow compared to today's SSD Macs.
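To check the disk side outside of Activity Monitor, something like the following could time a plain sequential read of one of the scanned files. This is only a rough sketch: the function name and chunk size are my own choices, and a second run on the same file may be served from the OS cache and look much faster than the disk really is.

```python
import time


def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024):
    """Read a file sequentially and return the observed throughput in MB/s.

    Hypothetical helper for illustration; it measures a simple sequential
    read only, not whatever access pattern the OCR software actually uses.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    # Guard against a zero interval on tiny files or coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / 1e6 / elapsed
```

Calling `read_throughput_mb_s("scan.pdf")` on one of the 200 MB scans (the file name is just a placeholder) gives a number to compare against the speeds the drive is supposed to deliver.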

moreforus · Jul 17, 2017

Thank you all for your answers.
I really appreciated your remarks and suggestions.
Jerwin, the documents I scan are generally physics or maths textbooks. Like your document, they contain between 300 and 800 pages, and are scanned at 200 to 300 dpi depending on the quality of the print.
Concerning the speed of OCR: it takes, for example, 4 hours to convert a 775-page, 280 MB document to a searchable PDF. The converted file's size is then 410 MB.
During the OCR, DevonScribbler used 100% CPU, as can be seen in Activity Monitor.

Kohlson, on my computer the system is on an SSD with a write speed of 170 MB/s and a read speed of 240 MB/s. It seems that only one core is used during the OCR process.
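A quick back-of-the-envelope check with the numbers above, as a sketch. It assumes the input is read once and the output written once at the quoted SSD speeds, which may not match what the OCR software really does (temporary files, for example).

```python
# Figures reported in this post.
pages = 775
ocr_hours = 4
input_mb = 280
output_mb = 410
read_mb_s = 240
write_mb_s = 170

ocr_seconds = ocr_hours * 3600
seconds_per_page = ocr_seconds / pages
pages_per_minute = pages / (ocr_hours * 60)

# Assumption: one sequential read of the input, one sequential write of the output.
read_seconds = input_mb / read_mb_s
write_seconds = output_mb / write_mb_s
io_share = (read_seconds + write_seconds) / ocr_seconds

print(f"{seconds_per_page:.1f} s per page, {pages_per_minute:.1f} pages per minute")
print(f"disk I/O: {read_seconds + write_seconds:.1f} s, {io_share:.3%} of the total time")
```

That works out to roughly 18.6 seconds per page, with only a few seconds of pure disk transfer over the whole 4 hours, which would point at the CPU rather than the SSD under these assumptions.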

kohlson · Jul 17, 2017

I believe your MP has SATA-II disk interfaces, limited to 3 Gbps. Most SATA interfaces now are twice that fast (6 Gbps). Apple's MBP SSD interfaces are several times faster again; off the top of my head, either 20 or 40 Gbps.
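To put those link rates next to the MB/s figures reported earlier in the thread, here is a small conversion sketch. SATA uses 8b/10b encoding, so only 8 of every 10 bits on the wire are payload; the function name is mine, and protocol overhead lowers real-world numbers further.

```python
def sata_payload_mb_s(line_rate_gbps):
    """Upper bound on payload throughput in MB/s for a SATA link (8b/10b encoded)."""
    payload_bits_per_s = line_rate_gbps * 1e9 * 8 / 10
    return payload_bits_per_s / 8 / 1e6


sata2_mb_s = sata_payload_mb_s(3.0)  # SATA-II link
sata3_mb_s = sata_payload_mb_s(6.0)  # SATA-III link

# Read speed reported for the SSD in the Mac Pro 1,1.
measured_read_mb_s = 240
link_utilisation = measured_read_mb_s / sata2_mb_s

print(f"SATA-II: {sata2_mb_s:.0f} MB/s, SATA-III: {sata3_mb_s:.0f} MB/s")
print(f"measured read uses {link_utilisation:.0%} of the SATA-II ceiling")
```

So the 240 MB/s read speed is already fairly close to what a SATA-II link can carry at best.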

Single-threaded apps are generally limited by the speed of a single core. This means that a processor with faster cores will deliver better performance than one with more, but slower, cores. Other factors such as cache size also have an effect on performance.

How all this comes together to improve application throughput is hard to know from afar. The best way to understand this is to try a sample on a similar system.
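One common way to reason about this is Amdahl's law, sketched below. The parallel fraction is a made-up input here, since nobody in this thread has measured how much of the OCR run can actually use more than one core.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup from extra cores when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical fractions, for illustration only.
for fraction in (0.0, 0.5, 0.9):
    print(f"parallel fraction {fraction:.1f}: "
          f"4 cores give {amdahl_speedup(fraction, 4):.2f}x")
```

With a parallel fraction of 0 (fully single-threaded), extra cores give no speedup at all, so only a faster core helps.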

jerwin · Jul 17, 2017

Could you provide a more representative pdf from which to benchmark? Just so things are comparable.

dollystereo · Jul 18, 2017

OCR is something that a GPU can do very fast. I don't know if your software can use the built-in GPU, but I have seen CUDA code that can be really fast, 30x faster than CPU-based OCR.

jerwin · Jul 18, 2017

dollystereo said:
I don't know if your software can use the built-in GPU, but I have seen CUDA code that can be really fast, 30x faster than CPU-based OCR.

I'm pretty sure that Abbyy is holding back performance improvements so that it can sell its high throughput versions at a higher cost.
