Automator searching PDF contents

Discussion in 'Mac Apps and Mac App Store' started by MonM, Sep 11, 2013.

  1. MonM macrumors newbie

    Joined:
    Oct 15, 2007
    #1
    I feel like there has to be some way to accomplish this, but I've run up against a wall trying to figure it out.

    I have a batch of about 8,000 automatically created PDFs that need to be filtered into separate folders depending on one number variable within the documents.

    The PDFs are in essence form letters that all follow the same format and have a line that reads "Total Acres: 5.00" (without quotes), where the number can run into the hundreds and always has 2 decimal places. I need to separate the documents into 2 groups, one with total acres >10 and one <10. With the "Search PDFs" action I can make Automator search for a specific number and do exactly what I need without issue, but I cannot find a way to search for a number range like "Total Acres: (>10)".

    Is this even possible to do? Any help would be appreciated.
     
  2. numlock macrumors 68000

    Joined:
    Mar 13, 2006
    #2
    Is this a one-time thing or something you will be doing regularly?

    Couldn't you just create 10 instances of the "Search PDFs" Automator action and have Automator apply a color label to the files that have less than 10 acres?

    You then follow that up with Automator moving the files with a certain color label to one folder and the files without a color label to another.

    You can even run a final action to remove the color label.
     
  3. MonM thread starter macrumors newbie

    Joined:
    Oct 15, 2007
    #3
    I'll only be doing this a limited number of times, but unfortunately a different instance for each number just isn't feasible. The numbers I've seen so far have ranged from .09 all the way up to 2538.95 with .01 increments all the way. That would be a little over a quarter million instances and I still wouldn't be sure I got all of them if there are higher numbers hidden in there.
     
  4. numlock macrumors 68000

    Joined:
    Mar 13, 2006
    #4
    If the increments are that small, then we are obviously talking about roughly 1,000 unique actions just for the values under 10 acres, which even I wouldn't do.

    Unless you can identify something else that differentiates the files, you would have to look at something like an AppleScript solution.
     
  5. Cassady macrumors 6502a

    Joined:
    Jul 7, 2012
    Location:
    Sqornshellous
  6. numlock macrumors 68000

    Joined:
    Mar 13, 2006
    #6
    Does Hazel know that the text in question is a number? Can it find and filter all the files with a value less than 10 at a certain place in the file?

    Mechanical Turk or something like that could assist you. Or perhaps you have someone around with nothing to do who is willing to go through those files and trigger a keyboard shortcut (AppleScript) on each one to sort them.

    Unless you are really into scripting or programming, I think it could take you almost as long to find a solution as to actually sort the files.

    For future reference, can you not put some identifier on the PDFs that would make this easier next time?
     
  7. Cassady macrumors 6502a

    Joined:
    Jul 7, 2012
    Location:
    Sqornshellous
    #7
    If you set things up right - and the line is consistently in the same place - it can. The use of tokens can get quite complex, but search the forums over at Noodlesoft - some peeps have managed the same scenario. They extract dates from utility bills using tokens, to rename the files.
     
  8. MonM thread starter macrumors newbie

    Joined:
    Oct 15, 2007
    #8
    Hazel looks like it could potentially do it, but apparently it needs 10.7+, and this work laptop is still living in the past on 10.6. So no go there either.

    I was afraid it was a little too complicated for Automator, and I doubt learning AppleScript for this task will save me any time overall. C'est la vie.

    On the plus side, I did work out a method of scanning through them in Bridge, applying a label through keyboard shortcuts, and then sorting based on those labels. It is going much quicker than I anticipated, but it is still tedious, manual, and error-prone if I'm not 100% focused. I suppose it will just have to do.

    Thanks!
     
  9. onekerato macrumors regular

    Joined:
    Jun 6, 2011
    #9
    There's a command-line utility called pdftotext that dumps the text inside a PDF to a text file, which can then be parsed for the acres number with a regex. You'll need a geek with Bash/Perl scripting knowledge to hack the script together. Any Linux geek will do.
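
A minimal sketch of that approach, written in Python rather than Bash or Perl. It assumes pdftotext (from Xpdf or Poppler) is installed and on the PATH and that Python 3.7 or later is available. The function names, the folder names small, large and unmatched, and the choice to file exactly 10.00 acres under large are illustrative assumptions, not something specified in the thread. The regex also tolerates a thousands separator, in case values such as 2,538.95 occur.

```python
import re
import shutil
import subprocess
import sys
from pathlib import Path

# Matches the line described in post #1, e.g. "Total Acres: 5.00".
# Assumption: the value may lack a leading zero (".09") or contain commas.
ACRES_RE = re.compile(r"Total Acres:\s*([\d,]*\.\d{2})")

THRESHOLD = 10.0  # assumption: exactly 10.00 is filed with the "large" group


def parse_acres(text):
    """Return the acreage found in the extracted text, or None if absent."""
    match = ACRES_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def bucket_for(acres):
    """Name the destination folder for a parsed acreage."""
    if acres is None:
        return "unmatched"
    return "large" if acres >= THRESHOLD else "small"


def extract_text(pdf_path):
    """Run pdftotext on one file; '-' sends the text to stdout."""
    try:
        result = subprocess.run(
            ["pdftotext", str(pdf_path), "-"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return ""  # unreadable PDF: ends up in "unmatched"
    return result.stdout


def sort_pdfs(source_dir, dest_dir):
    """Move each PDF into dest_dir/small, dest_dir/large or dest_dir/unmatched."""
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        target = Path(dest_dir) / bucket_for(parse_acres(extract_text(pdf)))
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        sort_pdfs(sys.argv[1], sys.argv[2])
    else:
        print("usage: sort_acres.py SOURCE_DIR DEST_DIR")
```

Run it as `python3 sort_acres.py SOURCE_DIR DEST_DIR` (the script name is hypothetical), ideally on a copy of the files first. Anything that lands in unmatched had no readable "Total Acres" line and still needs a manual look.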
     
