macOS Search PDF for specific term and save as variable

vloryan · Sep 30, 2014

Hi guys,

I hope someone can help me. In an Automator workflow, I would like to search a readable PDF for a specific term and save this expression in an Automator variable. The PDF is found using an Automator action.

The term is built from the prefix "VAX" followed by eight digits (like 00002285), so I need to find "VAX" plus eight more characters.

Any idea how to solve this using AppleScript within Automator?

Thanks so much for your help!

superscape · Sep 30, 2014

What have you tried so far?

vloryan · Sep 30, 2014

Sorry, I have no idea where and how to start this.

chown33 · Sep 30, 2014

Start by experimenting with the 'grep' command on the Terminal.app command line.

First, make a folder and put some example PDF files in it. For example, make the folder TEST on your Desktop. Then do this in Terminal:

Code:

cd ~/Desktop/TEST
grep -E -h -o 'VAX[0-9]{8}' *.pdf

The first command simply sets the working directory to TEST.

The 'grep' command searches for the pattern in every file with the ".pdf" extension. Make sure the .pdf extension of the test files is lower-case, because the shell's filename matching is case-sensitive by default.

You can copy and paste the grep command into a Terminal window, rather than typing it in.

The option -E tells grep to use extended patterns, which enables {n}.

The -h option tells grep to NOT output the filename. If you omit this, then each found item is preceded by the filename it was found in.

The option -o tells grep to only output the exact text that matches the pattern. Otherwise grep would output the entire "line" containing the pattern, and since PDF files aren't line-oriented, you'd get a bunch of crap you'd have to remove.
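
To see what -h and -o change, here is a small sketch that runs on plain-text stand-ins rather than real PDFs. The temporary folder and the file names a.pdf and b.pdf are made up for the demonstration; real PDFs may behave differently.

```shell
# Make a throwaway folder with two fake ".pdf" files (plain text, for illustration only).
demo=$(mktemp -d)
printf 'invoice VAX00002285 paid\n' > "$demo/a.pdf"
printf 'order VAX00004711 open\n' > "$demo/b.pdf"

# Without -h and -o: each result is the file name, a colon, and the whole matching line.
grep -E 'VAX[0-9]{8}' "$demo"/*.pdf

# With -h and -o: only the matched text, one match per line, no file names.
grep -E -h -o 'VAX[0-9]{8}' "$demo"/*.pdf
```

The second command is the one from the post above, just pointed at the demo folder.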

The pattern is quoted so the shell won't try expanding it. The pattern VAX[0-9]{8} means:

VAX means the three letters "VAX" literally.
[0-9] means the digits 0 thru 9.
{8} means the digits are repeated exactly 8 times.
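
The pattern can be tried without any PDF at all by feeding sample strings to grep on standard input. This is only a sketch; the sample strings are invented.

```shell
# Four sample lines: one valid term, two with too few digits, one in lower case.
printf '%s\n' 'ref VAX00002285 ok' 'VAX123456' 'VAX1234567' 'vax00002285' |
  grep -E -o 'VAX[0-9]{8}'
# prints only: VAX00002285
```

The lower-case line is skipped because grep matches case-sensitively unless -i is given.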

Try this command on several different PDFs. Be sure to make some test PDFs that have near-matching patterns such as VAX123456 and VAX1234567 to make sure that those patterns are NOT found (too few digits).

Also make a PDF with "VAX" followed by 9 or 10 digits, and observe what happens. If you can't accept this, then clearly say so in a reply post, so another pattern can be given.
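
The 9-or-10-digit case can be previewed on plain text. The second pipeline below is one possible stricter variant, not something anyone in this thread has settled on: it first grabs the whole digit run, then keeps only runs of exactly eight digits.

```shell
# Plain {8} pattern: a 10-digit run still yields a match on its first 8 digits.
printf 'VAX1234567890\n' | grep -E -o 'VAX[0-9]{8}'
# prints: VAX12345678

# Stricter sketch: take the full digit run, then let -x require the whole line to be VAX + 8 digits.
printf '%s\n' 'VAX1234567890' 'VAX00002285' |
  grep -E -o 'VAX[0-9]+' |
  grep -E -x 'VAX[0-9]{8}'
# prints: VAX00002285
```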

Here's the man page for grep:
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/grep.1.html

Once you have the output working correctly, post again and provide some example PDFs and a list of exactly what the correct output should be.

Don't try putting the grep command into a workflow, unless you understand exactly how to add shell scripts to workflows. The grep command itself is just one step in a multi-step process, but it's the all-important first step. Take one step at a time.

Also post the Automator workflow you have now, so we don't have to guess about how it finds PDFs.

Finally, it's possible that this grep command won't find anything in your PDFs, because of how PDFs can be produced. The data may not be an actual string "VAX_8_digits", so it's possible that grep won't find any patterns. If that happens, post an example of the failing PDF, so we have something that demonstrates the problem.

vloryan · Oct 1, 2014

Thank you so much for your long answer! Unfortunately nothing happens when I run

Code:

grep -E -h -o 'VAX[0-9]{8}' *.pdf

in Terminal

superscape · Oct 1, 2014

Extracting text from a PDF is not as straightforward as you'd think/hope. A lot depends on how your PDF is made. You may find it impossible. Or you may be lucky.

I'd suggest you do some Googling and hunting around. Try a few things, and feel free to ask for help if you get stuck. However, I'm not going to write a complete solution for you piece by piece. You should at least have a fair crack at it yourself.

That said, since you don't know where to start I'm happy to help there. Firstly, you might want to look at the "Extract PDF Text" Automator action from Preview. If you're reasonably confident with the command line, you might want to look at pdftotext too:

http://www.foolabs.com/xpdf/download.html
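
As a rough sketch of how pdftotext could be combined with the grep pattern from earlier in the thread: pdftotext writes to standard output when the output file is given as "-", so the extracted text can be piped straight into grep. The function name first_vax_id and the file example.pdf are placeholders, and this assumes pdftotext is installed and on the PATH.

```shell
# Hypothetical helper: read text on stdin, print the first VAX term found (if any).
first_vax_id() {
  grep -E -o 'VAX[0-9]{8}' | head -n 1
}

# Assumed usage: example.pdf is a placeholder; skipped if pdftotext or the file is missing.
if command -v pdftotext >/dev/null 2>&1 && [ -f example.pdf ]; then
  vax_id=$(pdftotext example.pdf - | first_vax_id)
  echo "$vax_id"
fi
```

In Automator, a "Run Shell Script" action passes whatever the script prints to the next action, so a following "Set Value of Variable" action should be able to store the result. The action may run with a shorter PATH than Terminal, so the full path to pdftotext might be needed there.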

Regards
Rob

vloryan · Oct 1, 2014

I thought it wouldn't be that big a problem for an expert, as the PDF is searchable for that string. I'll check that out. Thanks for your help!

superscape · Oct 1, 2014

vloryan said:
the PDF is searchable for that string.

..then you may be lucky. Fingers crossed!

vloryan · Oct 1, 2014

Well, it doesn't look that way.
