
vloryan

macrumors member
Original poster
Jan 11, 2014
57
0
Hi guys,

I hope someone can help me. In an Automator workflow, I would like to search a readable PDF for a specific term and save this expression in an Automator variable. The PDF is found using an Automator action.

The term is built from the prefix "VAX" followed by eight digits (like 00002285), so I need to find "VAX" plus eight more characters.

Any idea how to solve this using AppleScript within Automator?

Thanks so much for your help!
 

chown33

Moderator
Staff member
Aug 9, 2009
9,891
6,872
Beyond the pale
Start by experimenting with the 'grep' command on the Terminal.app command line.

First, make a folder and put some example PDF files in it. For example, make the folder TEST on your Desktop. Then do this in Terminal:
Code:
cd ~/Desktop/TEST
grep -E -h -o 'VAX[0-9]{8}' *.pdf
The first command simply sets the working directory to TEST.

The 'grep' command searches for the pattern in every file with the ".pdf" extension. Make sure the .pdf extension of the test files is lower-case, because the shell's filename matching is case-sensitive by default.

You can copy and paste the grep command into a Terminal window, rather than typing it in.

The option -E tells grep to use extended patterns, which enables the {n} repetition syntax.

The -h option tells grep to NOT output the filename. If you omit this, then each found item is preceded by the filename it was found in.

The option -o tells grep to only output the exact text that matches the pattern. Otherwise grep would output the entire "line" containing the pattern, and since PDF files aren't line-oriented, you'd get a bunch of crap you'd have to remove.

The pattern is quoted so the shell won't try expanding it. The pattern VAX[0-9]{8} means:
  • VAX means the three letters "VAX" literally.
  • [0-9] means the digits 0 thru 9.
  • {8} means the digits are repeated exactly 8 times.
Try this command on several different PDFs. Be sure to make some test PDFs that have near-matching patterns such as VAX123456 and VAX1234567 to make sure that those patterns are NOT found (too few digits).
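Before making test PDFs, the pattern itself can be checked on plain text piped into grep. This is only a sketch: the function name find_vax and the sample strings are made up for illustration, and the grep options are the same ones used above.

```shell
# Hypothetical helper: same grep options and pattern as above,
# reading from stdin when no file names are given.
find_vax() {
  grep -E -h -o 'VAX[0-9]{8}' "$@"
}

# Two near-misses (6 and 7 digits) and one real match.
printf '%s\n' 'VAX123456' 'VAX1234567' 'ref VAX00002285 end' | find_vax
```

Only VAX00002285 should be printed. The two shorter strings have too few digits and produce no output.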

Also make a PDF with "VAX" followed by 9 or 10 digits, and observe what happens. If you can't accept this, then clearly say so in a reply post, so another pattern can be given.
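For reference, with the pattern above a 9- or 10-digit number still matches, and -o prints only "VAX" plus the first 8 digits. If that isn't acceptable, one possible stricter variant is sketched below. It is an assumption about what "stricter" should mean here, not the only way to do it, and the name find_vax_exact is made up. It first extracts "VAX" plus every digit that follows, then keeps only the results that are exactly 8 digits long.

```shell
# Hypothetical helper: extract every run of "VAX" plus digits, then keep
# only the runs that are exactly "VAX" + 8 digits (-x = match whole line).
find_vax_exact() {
  grep -E -h -o 'VAX[0-9]+' "$@" | grep -E -x 'VAX[0-9]{8}'
}

# A 9-digit and a 10-digit near-miss, then one exact match.
printf '%s\n' 'VAX123456789' 'VAX0000228512' 'VAX00002285' | find_vax_exact
```

Only VAX00002285 should be printed. This sketch does not check what comes before "VAX", so text like "XVAX00002285" would still match.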

Here's the man page for grep:
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/grep.1.html

Once you have the output working correctly, post again and provide some example PDFs and a list of exactly what the correct output should be.

Don't try putting the grep command into a workflow, unless you understand exactly how to add shell scripts to workflows. The grep command itself is just one step in a multi-step process, but it's the all-important first step. Take one step at a time.

Also post the Automator workflow you have now, so we don't have to guess about how it finds PDFs.


Finally, it's possible that this grep command won't find anything in your PDFs, because of how PDFs can be produced. The data may not be an actual string "VAX_8_digits", so it's possible that grep won't find any patterns. If that happens, post an example of the failing PDF, so we have something that demonstrates the problem.
 

vloryan

macrumors member
Original poster
Jan 11, 2014
57
0
Thank you so much for your long answer! Unfortunately nothing happens when I run
Code:
grep -E -h -o 'VAX[0-9]{8}' *.pdf
in Terminal :(
 

superscape

macrumors 6502a
Feb 12, 2008
933
222
East Riding of Yorkshire, UK
Extracting text from a PDF is not as straightforward as you'd think/hope. A lot depends on how your PDF is made. You may find it impossible. Or you may be lucky. :)

I'd suggest you do some Googling and hunting around. Try a few things, and feel free to ask for help if you get stuck. However, I'm not going to write a complete solution for you piece by piece. You should at least have a fair crack at it yourself.

That said, since you don't know where to start I'm happy to help there. Firstly, you might want to look at the "Extract PDF Text" Automator action from Preview. If you're reasonably confident with the command line, you might want to look at pdftotext too:

http://www.foolabs.com/xpdf/download.html
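As a starting point only, here is a rough sketch of how pdftotext could be combined with the grep pattern from earlier in the thread. It assumes pdftotext is installed and on your PATH, and that the PDF really contains extractable text. The function name vax_from_pdf and the example path are made up.

```shell
# Hypothetical helper: convert one PDF to plain text on stdout ("-"),
# then print the first "VAX" + 8 digits found in that text.
vax_from_pdf() {
  pdftotext "$1" - | grep -E -o 'VAX[0-9]{8}' | head -n 1
}

# Usage (assumes pdftotext is installed and the file exists):
# vax_from_pdf ~/Desktop/TEST/example.pdf
```

If this prints the right value in Terminal, the same commands could probably go into a "Run Shell Script" action in Automator, with the result stored in a variable by a following action. That part is untested here.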

Regards
Rob
 

vloryan

macrumors member
Original poster
Jan 11, 2014
57
0
I thought it wouldn't be that big a problem for an expert, as the PDF is searchable for that string. I'll check that out. Thanks for your help!
 