Search PDF for specific term and save as variable

Discussion in 'Mac Programming' started by vloryan, Sep 30, 2014.

  1. vloryan macrumors member

    Joined:
    Jan 11, 2014
    #1
    Hi guys,

    I hope someone can help me. In an Automator workflow, I would like to search a readable PDF for a specific term and save this expression in an Automator variable. The PDF is found using an Automator action.

    The term is built from the prefix "VAX" followed by eight digits (like 00002285), so I need to find "VAX" plus eight more characters.

    Any idea how to solve this using AppleScript within Automator?

    Thanks so much for your help!
     
  2. superscape macrumors 6502a

    superscape

    Joined:
    Feb 12, 2008
    Location:
    East Riding of Yorkshire, UK
  3. vloryan thread starter macrumors member

    Joined:
    Jan 11, 2014
    #3
    Sorry, I have no idea where and how to start this.
     
  4. chown33, Sep 30, 2014
    Last edited: Sep 30, 2014

    chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #4
    Start by experimenting with the 'grep' command on the Terminal.app command line.

    First, make a folder and put some example PDF files in it. For example, make the folder TEST on your Desktop. Then do this in Terminal:
    Code:
    cd ~/Desktop/TEST
    grep -E -h -o 'VAX[0-9]{8}' *.pdf
    
    The first command simply sets the working directory to TEST.

    The 'grep' command searches for the pattern in every file with the ".pdf" extension. Make sure the .pdf extension of the test files is lower-case, because the shell's filename matching is case-sensitive by default.

    You can copy and paste the grep command into a Terminal window, rather than typing it in.

    The option -E tells grep to use extended patterns, which enables {n}.

    The -h option tells grep to NOT output the filename. If you omit this, then each found item is preceded by the filename it was found in.

    The option -o tells grep to only output the exact text that matches the pattern. Otherwise grep would output the entire "line" containing the pattern, and since PDF files aren't line-oriented, you'd get a bunch of crap you'd have to remove.
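
    To see what -h and -o each change, here is a minimal sketch that uses two small plain-text files as stand-ins for PDFs. The scratch folder and the names a.pdf and b.pdf are made up for illustration; real PDFs may behave differently.

    ```shell
    # Two plain-text stand-ins for PDFs (hypothetical names) in a scratch folder.
    demo=$(mktemp -d)
    printf 'Order VAX00002285 shipped\n' > "$demo/a.pdf"
    printf 'Ref VAX00001234 pending\n'   > "$demo/b.pdf"
    cd "$demo"

    # Without -h: each match is prefixed with the name of the file it came from.
    grep -E -o 'VAX[0-9]{8}' *.pdf

    # With -h: only the matched text is printed.
    grep -E -h -o 'VAX[0-9]{8}' *.pdf

    # Without -o: the whole line containing the match is printed.
    grep -E -h 'VAX[0-9]{8}' *.pdf
    ```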

    The pattern is quoted so the shell won't try expanding it. The pattern VAX[0-9]{8} means:
    • VAX means the three letters "VAX" literally.
    • [0-9] means the digits 0 thru 9.
    • {8} means the digits are repeated exactly 8 times.
    Try this command on several different PDFs. Be sure to make some test PDFs that have near-matching patterns such as VAX123456 and VAX1234567 to make sure that those patterns are NOT found (too few digits).

    Also make a PDF with "VAX" followed by 9 or 10 digits, and observe what happens. If you can't accept this, then clearly say so in a reply post, so another pattern can be given.
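
    If you want to check the pattern before involving any PDFs, you can feed grep a few sample strings directly. This is only a sketch; find_vax is a made-up helper name, and the sample strings are invented.

    ```shell
    # Hypothetical helper: reads text on stdin, prints every "VAX" + 8 digits match.
    find_vax() {
        grep -E -o 'VAX[0-9]{8}'
    }

    # Samples with 6, 7, 8 and 9 digits after "VAX".
    printf '%s\n' 'VAX123456' 'VAX1234567' 'VAX00002285' 'VAX123456789' | find_vax
    ```

    The 6- and 7-digit samples print nothing. The 8-digit sample prints as-is, and the 9-digit sample prints VAX12345678, i.e. the pattern matches the first eight digits and ignores the rest.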

    Here's the man page for grep:
    https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/grep.1.html

    Once you have the output working correctly, post again and provide some example PDFs and a list of exactly what the correct output should be.

    Don't try putting the grep command into a workflow, unless you understand exactly how to add shell scripts to workflows. The grep command itself is just one step in a multi-step process, but it's the all-important first step. Take one step at a time.

    Also post the Automator workflow you have now, so we don't have to guess about how it finds PDFs.


    Finally, it's possible that this grep command won't find anything in your PDFs, because of how PDFs can be produced. The data may not be an actual string "VAX_8_digits", so it's possible that grep won't find any patterns. If that happens, post an example of the failing PDF, so we have something that demonstrates the problem.
     
  5. vloryan thread starter macrumors member

    Joined:
    Jan 11, 2014
    #5
    Thank you so much for your long answer! Unfortunately nothing happens when I run
    Code:
    grep -E -h -o 'VAX[0-9]{8}' *.pdf
    in Terminal :(
     
  6. superscape macrumors 6502a

    superscape

    Joined:
    Feb 12, 2008
    Location:
    East Riding of Yorkshire, UK
    #6
    Extracting text from a PDF is not as straightforward as you'd think/hope. A lot depends on how your PDF is made. You may find it impossible. Or you may be lucky. :)

    I'd suggest you do some Googling and hunting around. Try a few things, and feel free to ask for help if you get stuck. However, I'm not going to write a complete solution for you piece by piece. You should at least have a fair crack at it yourself.

    That said, since you don't know where to start I'm happy to help there. Firstly, you might want to look at the "Extract PDF Text" Automator action from Preview. If you're reasonably confident with the command line, you might want to look at pdftotext too:

    http://www.foolabs.com/xpdf/download.html
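
    As a rough starting point, and assuming pdftotext is installed and on your PATH, the extracted text can be piped into the grep pattern from earlier in the thread. The function name extract_vax is made up, and keeping only the first match is my assumption about what you want.

    ```shell
    # Hypothetical helper: print the first "VAX" + 8 digits term found in a PDF.
    # Assumes pdftotext is installed; "-" as the output file sends the
    # extracted text to standard output instead of a .txt file.
    extract_vax() {
        pdftotext "$1" - | grep -E -o 'VAX[0-9]{8}' | head -n 1
    }
    ```

    You'd call it as extract_vax followed by the path to the PDF. In Automator, something along these lines might go in a Run Shell Script action with the input passed as arguments, followed by Set Value of Variable. I haven't tested that against your workflow, and Automator's shell may need the full path to pdftotext.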

    Regards
    Rob
     
  7. vloryan thread starter macrumors member

    Joined:
    Jan 11, 2014
    #7
    I thought it wouldn't be that big a problem for an expert, as the PDF is searchable for that string. I'll check that out. Thanks for your help!
     
  8. superscape macrumors 6502a

    superscape

    Joined:
    Feb 12, 2008
    Location:
    East Riding of Yorkshire, UK
    #8
    ..then you may be lucky. Fingers crossed!
     
  9. vloryan thread starter macrumors member

    Joined:
    Jan 11, 2014
