
likegadgets

Original poster
I have a project that will require OCRing tens of thousands of pages scanned into PDFs. Typically an i9 Mac is relatively fast (using Acrobat Pro as the software). I am wondering whether I would benefit from setting up a couple of dedicated M1 Macs (perhaps minis) for OCR, and whether 16 GB vs 8 GB of RAM makes a difference for this application. I would welcome suggestions for faster OCR software for PDFs other than Acrobat Pro, even if a PC-based machine/software combo would achieve faster results.

Thanks in advance
 
Single page or multi page image PDFs? In what languages? You just want a searchable PDF, a text file, or a different format for each document?
Depending on your answers, I would run a first trial by compiling or installing the free Tesseract on the M1 and then scripting it.
To install Tesseract you can use Homebrew.
 
They are large files, ranging from 1,000 to 5,000 pages each. All in English. In a test, a 5,200-page file took 2 hours and 20 minutes. The PC was specced as:
  • Intel Core i5-9400 CPU @ 2.9GHz
  • 16GB RAM
  • 64-bit architecture
Uploading to Google is an interesting concept, but it is not feasible due to confidentiality. We need the result as a searchable PDF. Will look into Tesseract.
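
For planning purposes, the test figure above can be turned into a rough throughput estimate. This is only a back-of-the-envelope sketch: the 50,000-page total is a hypothetical stand-in for "tens of thousands", and it assumes the work splits evenly across identical machines.

```python
def pages_per_minute(pages: int, hours: int, minutes: int) -> float:
    """Throughput of a single OCR run, in pages per minute."""
    return pages / (hours * 60 + minutes)


def hours_needed(total_pages: int, rate_ppm: float, machines: int = 1) -> float:
    """Wall-clock hours for total_pages, assuming an even split across machines."""
    return total_pages / (rate_ppm * machines) / 60


# The test run from this post: 5,200 pages in 2 h 20 min.
rate = pages_per_minute(5200, 2, 20)

# 50_000 is a hypothetical project size, not a number from the thread.
print(f"{rate:.1f} pages/min")
print(f"{hours_needed(50_000, rate):.1f} h on one machine")
print(f"{hours_needed(50_000, rate, machines=3):.1f} h on three machines")
```

At roughly 37 pages per minute, a hypothetical 50,000 pages would take about 22 hours on one such machine, which is why adding machines (or cores) matters more than small per-core gains.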

Thank you ALL for your comments.
 
Some easy to use OCR apps that convert scanned PDFs into searchable ones. I don't know how fast these are compared to other solutions but they are very easy to use and have some other very useful features.
 
You can always buy an M1 to run benchmarks and then return it.
 
They are large files, ranging from 1,000 to 5,000 pages each. All in English. In a test, a 5,200-page file took 2 hours and 20 minutes.
Just out of curiosity (sorry for being nosy), what kind of files are they that need to be 1,000 - 5,000 pages each? You had mentioned that you can't use Google Drive due to confidentiality, so I'm guessing something in the medical, legal, or defense fields? I work in education, so this thread might be useful. Much of our documentation is printed out, but I think some of our copiers have (or at least CAN have) OCR scanning functionality.
 
They are medical and legal files. Unsorted. They scan full file boxes that need to be sorted, information extracted, duplicates eliminated, and other stuff. OCR is just the first step, but an important one. I am trying to accelerate this step, so I am testing software and processing power. I ran the same job on an i3 Windows machine with 16 GB RAM and on an 8-core i9 MBP with 32 GB RAM. Results were very close. I suspect the limiting factor is Adobe Acrobat's OCR function rather than the processing power.
 
Single page or multi page image PDFs? In what languages? You just want a searchable PDF, a text file, or a different format for each document?
Depending on your answers, I would run a first trial by compiling or installing the free Tesseract on the M1 and then scripting it.
To install Tesseract you can use Homebrew.
Tesseract doesn't seem to directly convert PDF files. I can't find a convenient way to convert a PDF to multiple PNG files. There is pdf2image on brew but it seems not to work. It errors out without finding Ghostscript even though brew installs it as a dependency.
 
I'm assuming OP has images and has been using Adobe Acrobat Pro to convert them into PDFs while simultaneously performing OCR, rather than having PDFs already, running OCR on those and making new searchable PDFs? Tesseract is perfect for the former, the latter would need a preprocessing step, as you discovered.

I'm pretty sure that pdf2image on brew is a project that has been abandoned like 10? years ago and was Windows-native anyway, so I'm not surprised it's not working for you. There's a pdf2image python library that is newer and should work, though, if you can handle a tiny bit of coding.
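
For anyone who wants to try the Python route, here is a minimal sketch of that preprocessing step. It assumes the `pdf2image` package (which needs poppler installed) and that you already know the page count; the function names, the batch size, and the `page-00001.png` naming are my own choices for illustration. Converting in batches keeps a 5,000-page file from being loaded into memory all at once.

```python
from pathlib import Path


def page_batches(total_pages: int, batch_size: int):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_to_pngs(pdf_path: str, out_dir: str, total_pages: int,
                dpi: int = 300, batch_size: int = 50) -> None:
    """Render a PDF to one PNG per page, a batch of pages at a time."""
    # Imported here so page_batches() stays usable without pdf2image installed.
    from pdf2image import convert_from_path  # pip install pdf2image

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(pdf_path, dpi=dpi,
                                   first_page=first, last_page=last)
        for offset, image in enumerate(images):
            # Zero-padded names keep the pages in order when sorted.
            image.save(out / f"page-{first + offset:05d}.png")
```

Usage would be something like `pdf_to_pngs("scan.pdf", "pages", total_pages=5200)`, where `scan.pdf` is a placeholder file name.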
 
@OP, why are you looking into M1 Macs for this at all? Is it because you've read about the good CPU performance? That's mostly just benchmarks right now; very little actual software is optimized for, or even exists for, M1 at this point, and the performance is only really strong in single-threaded tasks. Anything that can be parallelized or needs significant RAM bandwidth (like OCR) will never be strong on M1, firstly because nobody will optimize the code, and secondly because 4 high-performance, memory-starved M1 threads will never be able to compete with the up to 128 threads you can get in a Threadripper-equipped workstation right now, where each thread is at least as strong as an M1 thread. The GPU is also not good on M1, nor is it supported anywhere. M1 is a nice low-power laptop SoC that can compete with hungrier mobile Intel CPUs in many tasks, but it's not really a competitive workstation processor or anything like that. You'd also need to use an alpha version of Tesseract on M1.

Maybe take a look at this article. I'm assuming you aren't super techy, considering your posts so far, so the Google Cloud Platform Vision API definitely seems like a really nice choice to me. Considering this is the B2B GCP side of Google, not the consumer Drive stuff, the confidentiality should be more than fine, and you'll also get access to support.
When you send an image to Vision API, we must store that image for a short period of time in order to perform the analysis and return the results to you. For asynchronous offline batch operations, the stored image is typically deleted right after the processing is done, with a failsafe Time to live (TTL) of a few hours. For online (immediate response) operations, the image data is processed in memory and not persisted to disk.
Pricing also seems surprisingly good: the first 1,000 images per month are free, with $1.50 for every following 1,000 images. That'll run you about $7.50 for one of your 5,000-page scans, meaning you can do about 100 5,000-page scans for the price of a single M1 mini, but way faster and without needing any setting up or messing with OCR settings.
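
The arithmetic behind that estimate, as a small sketch. It uses only the prices quoted in this post (which may have changed since) and assumes billing is proportional per page, which is a simplification on my part.

```python
def vision_ocr_cost(pages: int, free_pages: int = 1000,
                    usd_per_1000: float = 1.50) -> float:
    """Estimated monthly cost in USD for OCRing `pages` images."""
    billable = max(pages - free_pages, 0)
    return billable / 1000 * usd_per_1000


# One 5,000-page scan, ignoring the free tier (the ~$7.50 figure above).
print(vision_ocr_cost(5000, free_pages=0))
# The same scan when it is the first job of the month and the free tier applies.
print(vision_ocr_cost(5000))
# 100 such scans in one month.
print(vision_ocr_cost(100 * 5000))
```

So the $7.50 per file is the steady-state price; the first file in a month comes out a bit cheaper because of the free 1,000 images.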
 
M1 makes sense for the OCR task due to its unbeatable single core power - and last time I checked Adobe Acrobat is still not multi-core optimized for heavy tasks.

On the other hand, Acrobat still runs under Rosetta 2, and I am unaware of other OCR software that has a native Apple Silicon binary.

If Google really does OCR during upload, I would temporarily buy extra GDrive storage for 1-2 months and do the conversion there...

You shouldn't worry about privacy, unless it's illegally obtained copyrighted material.

P.S. Google does get to know its customers via the various services it provides, but that data is used to offer you products (sometimes intrusive ads, I agree); they do not and cannot sell it. In 15 years of use I have had 0 security breaches with Google.
The bad ad company here is Facebook, which has a similar business model but much weaker protection of user data and numerous breaches over the years, despite 2FA being used for logins... I wouldn't scan even my cat food barcode if Facebook offered it...
 
They are medical and legal files. Unsorted. They scan full file boxes that need to be sorted, information extracted, duplicates eliminated, and other stuff. OCR is just the first step, but an important one. I am trying to accelerate this step, so I am testing software and processing power. I ran the same job on an i3 Windows machine with 16 GB RAM and on an 8-core i9 MBP with 32 GB RAM. Results were very close. I suspect the limiting factor is Adobe Acrobat's OCR function rather than the processing power.
Okay, never had to deal with that much myself, hence why I asked.

At my job, we have Konica Minolta copiers. They offer scanning of documents to various locations (Scan-to-Email, Scan-to-SMB, Scan-to-FTP, Scan-to-Box, Scan-to-USB, Scan-to-WebDAV, Scan-to-DPWS, Network TWAIN scan). I checked the online documentation, and while my company didn't get it, you can get an add-on that enables OCR into searchable PDF, DOCX, and XLSX. We also have print server software, PaperCut MF, that provides an OCR service too. Just something for you to consider. Should save a bit of work.
 
Tesseract doesn't seem to directly convert PDF files. I can't find a convenient way to convert a PDF to multiple PNG files. There is pdf2image on brew but it seems not to work. It errors out without finding Ghostscript even though brew installs it as a dependency.
well...

Step 1:

convert -density 300 in.pdf -depth 1 -strip -background white -alpha off out.tiff

(or, if you want to convert specific pages only:
convert -density 300 'in.pdf[3-6]' -depth 1 -strip -background white -alpha off out.tiff
Please be aware that the page counter in a PDF starts at “0” for page 1, and that the quotes keep the shell from treating the brackets as a glob pattern.)

Step 2:

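tesseract out.tiff out pdf

(The second argument is the output base name; the trailing "pdf" selects searchable-PDF output, and Tesseract adds the .pdf extension itself.)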

In the case described by the OP, this can easily be scripted. I would delete the TIFFs immediately after running the OCR, create a multi-page searchable PDF (or plain text), etc. I suggest a look at the Tesseract manual.
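
As a rough idea of what such a script could look like, here is a minimal Python sketch that wraps the two steps and cleans up the intermediate TIFF. It assumes `convert` (ImageMagick) and `tesseract` are on the PATH; the function names and the `-ocr` suffix are made up for illustration, and I have not tried it on files with thousands of pages.

```python
import subprocess
from pathlib import Path
from typing import List


def convert_cmd(pdf: Path, tiff: Path, density: int = 300) -> List[str]:
    """Step 1: rasterize the PDF into a (multi-page) 1-bit TIFF."""
    return ["convert", "-density", str(density), str(pdf), "-depth", "1",
            "-strip", "-background", "white", "-alpha", "off", str(tiff)]


def tesseract_cmd(tiff: Path, out_base: Path) -> List[str]:
    """Step 2: OCR the TIFF; "pdf" asks Tesseract for a searchable PDF."""
    # Tesseract takes an output base name and appends ".pdf" itself.
    return ["tesseract", str(tiff), str(out_base), "pdf"]


def ocr_pdf(pdf: Path, out_dir: Path) -> Path:
    """Run both steps for one PDF and return the path of the searchable PDF."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tiff = out_dir / (pdf.stem + ".tiff")
    out_base = out_dir / (pdf.stem + "-ocr")
    try:
        subprocess.run(convert_cmd(pdf, tiff), check=True)
        subprocess.run(tesseract_cmd(tiff, out_base), check=True)
    finally:
        # Remove the intermediate TIFF even if one of the steps failed.
        tiff.unlink(missing_ok=True)
    return Path(str(out_base) + ".pdf")
```

Looping `ocr_pdf` over a folder with `Path("scans").glob("*.pdf")` (folder name hypothetical) would handle a whole batch.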

EDIT: I just checked and thanks to homebrew offering a native M1-version for imagemagick and Tesseract this is... well, surprisingly fast. 😃 I do not have access to PDFs which are composed of thousands of pixel pages though...
 
We use Intel Mac minis in our medical office to OCR hundreds of pages of PDFs each week. FineReader for Mac uses multiple cores on the Intel CPU but wasn't compatible with M1 Macs via Rosetta. They just released an update to fix that issue.

Based on this thread, it seems there are no native M1 OCR solutions yet that don't involve Homebrew.
 
it seems there are no native M1 OCR solutions yet that don't involve Homebrew.
"Yet". macOS 12 adds a new feature to the Vision framework that is designed exactly for recognizing text in document images. It runs on the Neural Engine and is super fast (it can even do document recognition in real time on a camera/video feed).
 
why are you looking into M1 Macs for this at all? Is it because you've read about the good CPU performance? That's mostly just benchmarks right now, very little actual software is optimized or even exists for M1 at this point and the performance is only really strong in single threaded tasks.
A passively cooled M1 MBA, for example, demolishes every gaming laptop with the latest AMD/Intel CPUs in a PDF export task.
It is not just a benchmark; it is a very real task in many office scenarios.
But I'm not an expert in OCR and don't know whether those tasks are also optimized for M1.

 
There's one thing that Acrobat Pro does very well: not just simple OCR, but the "Editable text and images" function (formerly ClearScan in Acrobat Pro 11).

This function produces small PDFs with very high-quality text. No other software does the same job (at least I have never found one).
 
I know the OP's posts were from some time ago, but with Adobe's recent ARM-optimized release of their flagship apps, hopefully an ARM-optimized version of Acrobat is coming soon.
 
I have a project that will require OCRing tens of thousands of pages scanned into PDFs. Typically an i9 Mac is relatively fast (using Acrobat Pro as the software). I am wondering whether I would benefit from setting up a couple of dedicated M1 Macs (perhaps minis) for OCR, and whether 16 GB vs 8 GB of RAM makes a difference for this application. I would welcome suggestions for faster OCR software for PDFs other than Acrobat Pro, even if a PC-based machine/software combo would achieve faster results.

Thanks in advance
How did this project work out for you in the end?

I am kind of cringing at the whole conversation in light of your software choice being Acrobat Pro. Last I checked, Adobe still hadn't updated that app to take advantage of multiple cores. I found this out when I upgraded from a dual core to a quad core back in 2018, only to find that my OCR performance in Acrobat Pro didn't change at all.

Soon after, I switched to a command-line tool called OCRmyPDF, which does utilize all cores, and my OCR speed went up by a factor of around three. It's not actually as efficient as Adobe in single-core performance, but the fact that it put the rest of the cores to work meant it ultimately went a good deal faster.

I just upgraded myself from the quad-core i5 to a 10-core M1 Pro, but haven't had the chance to do a head-to-head performance test yet. When I get the chance, I will run the same file through the program on my M1 Pro and on my six-core i5 Mac Mini. I expect the M1 to be a bit over twice as fast, in OCRmyPDF, but in Acrobat, I expect the M1 to be about the same or possibly even slower, if it has to run through Rosetta. But I'm curious what results you found.
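
For reference, a minimal sketch of how one might drive OCRmyPDF from a script so that such a comparison is repeatable. It assumes the `ocrmypdf` command is installed and on the PATH; the wrapper functions are hypothetical helpers, not part of OCRmyPDF. As far as I know the tool already uses all cores by default, so `--jobs` here mainly makes the core count explicit for benchmarking.

```python
import os
import subprocess
import time
from typing import List, Optional


def ocrmypdf_cmd(src: str, dst: str, jobs: Optional[int] = None,
                 language: str = "eng") -> List[str]:
    """Build an ocrmypdf command line that uses `jobs` CPU cores."""
    jobs = jobs or os.cpu_count() or 1
    return ["ocrmypdf", "--jobs", str(jobs), "-l", language, src, dst]


def timed_ocr(src: str, dst: str, jobs: Optional[int] = None) -> float:
    """Run OCRmyPDF once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(ocrmypdf_cmd(src, dst, jobs), check=True)
    return time.perf_counter() - start
```

Running `timed_ocr("in.pdf", "out.pdf", jobs=1)` and then again with all cores on each machine would separate the single-core difference from the multi-core one.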
 
We ran some tests on M1s and there was no perceptible benefit. We resorted to using multiple Windows/Intel machines. Maybe when there is a recompiled version of Acrobat, we will try it out on an M1 Max.
 