Script to name a file based on text inside the document

Discussion in 'Mac Programming' started by NikNakPAdyWak, Jun 28, 2011.

  1. macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #1
    Hi all

    I am trying to set up a paperless office process that renames a recently scanned OCR PDF file based on some content (a date) inside the document. I understand that there are so many variables to this that it would be almost impossible to do in general, but the file I would like to rename is a document I receive weekly, and the info within it is almost always the same.

    The first 5 lines contain information that never changes (it's always the location it comes from). The 7th word on the 6th line is always "period ending: DD/MM/YYYY"

    I would like the file to be named YYYY/MM/DD - DocumentName

    Is this possible?

    Thank you for reading.
    Chris

    PS. I am also using Hazel in this paperless office process, if it helps.
     
  2. Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #2
    Where is "DocumentName" coming from? Is that also within the PDF? For now I'm ignoring that.

    Here's a script that extracts the date, and renames the file. I wouldn't try to use "/" in the name - that would cause problems since "/" is the path delimiter, so I replaced it with "-".

    Code:
    require 'osx/cocoa'
    require 'fileutils'
    include OSX
    OSX.require_framework('PDFKit')
    path = ARGV[0]
    if not path
       puts "Missing path."
       exit 1
    end
    pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
    exit 1 if not pdfDoc
    pdfText = pdfDoc.string.to_s
    match = pdfText.match(/period ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/)
    exit 1 if not match
    day, month, year = match[1], match[2], match[3]
    parentFolder = File.dirname(path)
    newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
    FileUtils.mv(path, newPath)
    exit 0
    
    In Hazel, create a "Run shell script" action and set the Shell to /usr/bin/ruby

    You can also call this from Terminal to run it manually. Save it to the Desktop as pdf_date.rb, then from within Terminal you can call it like so:
    Code:
    ruby ~/Desktop/pdf_date.rb <pdf path>
    where <pdf path> is the path to your PDF.

    If you're going to play around with this you should ensure you're working on copies of your PDFs or at least have backups. This script isn't thoroughly tested.
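
    The date handling can be tried on its own, without a PDF or PDFKit. This is a minimal sketch, assuming the text has already been extracted; the name dated_filename and the sample text are made up for illustration and are not part of the script above.

```ruby
# Hypothetical helper (not part of the script above): turns extracted PDF
# text into the new file name, or returns nil when no date is found.
def dated_filename(pdf_text)
  # Same pattern as the script: "period ending:" followed by DD/MM/YYYY.
  match = pdf_text.match(/period ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/)
  return nil unless match
  day, month, year = match[1], match[2], match[3]
  # "/" separates folders in a path, so the parts are joined with "-".
  "#{year}-#{month}-#{day}.pdf"
end

sample = "Rent statement\nperiod ending: 30/06/2011\n"  # made-up sample text
puts dated_filename(sample)                  # 2011-06-30.pdf
puts dated_filename("no date here").inspect  # nil
```

    A nil result corresponds to the point where the script exits with status 1 because the pattern did not match.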
     
  3. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #3
    WOW.... thank you for your script. I can see that it is very powerful and way over my head. I am having a problem with it though. Hazel is just returning an error when processing the script.

    Hazel's logs show:
    hazelfolderwatch[2244] [Error] Shell script failed: Error processing shell script on file /Users/iMac/Dropbox/FileAway/Rent Invoice.pdf.

    Is there a way I can debug it?

    Thanks again
     
  4. kainjow, Jun 28, 2011
    Last edited: Jun 28, 2011

    Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #4
    Do the part I mentioned above about running it from the Terminal and see if that works, or if it gives an error. It might also be helpful to post a screenshot of your Hazel config window where the script is.

    Also what version of OS X are you running?
     
  5. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #5
    Hi again
    I tried that too, but no luck. I'm not a big scripter and hopefully I followed your instructions. In a TextEdit doc, I copied your script and saved it as plain text to the desktop. I changed the extension to .rb, then opened Terminal, copied your command and replaced the <pdf path> with the location of my pdf file. I pressed return and the command did nothing.

    [IMG]

    Here is the line from my frequent statement that I am trying to turn into a filename.

    [IMG]

    I hope this makes it a little clearer. Perhaps I have done something wrong.

    Thanks again

    PS. To answer your earlier question, the document name would be "Rent" and I would be getting Hazel to add that.
     
  6. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #6
    I don't know if kainjow is around, so if he doesn't mind, I'll modify his script slightly. (edits marked with "# changed" comments)

    Code:
    require 'osx/cocoa'
    require 'fileutils'
    include OSX
    OSX.require_framework('PDFKit')
    path = ARGV[0]
    if not path
       puts "Missing path."
       exit 1
    end
    pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
    if not pdfDoc                        # changed: report the failure
       puts "Failed to open pdf"
       exit 1
    end
    pdfText = pdfDoc.string.to_s
    # changed: \s+ between the two words, and /i to ignore case
    match = pdfText.match(/period\s+ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/i)
    if not match                         # changed: report the failure
       puts "Didn't find period ending"
       exit 1
    end
    day, month, year = match[1], match[2], match[3]
    parentFolder = File.dirname(path)
    newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
    FileUtils.mv(path, newPath)
    exit 0
    
    This will tell us why it's not doing anything. It will also match period ending: regardless of capitalisation.
     
  7. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #7
    Thanks for picking this up jiminaus. My error message now says:

    Didn't find period ending

    Cheers
     
  8. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #8
    Oh, and I'm running Snow Leopard 10.6.7.
    Thanks
     
  9. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #9
    Can you run the following script?

    Code:
    require 'osx/cocoa'
    require 'fileutils'
    include OSX
    OSX.require_framework('PDFKit')
    path = ARGV[0]
    if not path
       puts "Missing path."
       exit 1
    end
    pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
    if not pdfDoc
       puts "Failed to open pdf"
       exit 1
    end
    pdfText = pdfDoc.string.to_s
    File.open('pdftext.txt', 'w') do |file|
       file.write(pdfText)
    end
    exit 0
    
    This will output a file called pdftext.txt containing the text in the PDF. Please strip down the file so it just contains the line(s) related to period ending, and then reply by copying and pasting those lines between [code] and [/code] tags.

    We need to work out a regular expression which will match the period ending date.
     
  10. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #10
    Hi Jiminaus

    I'm so sorry for the newbie replies. Until today, I had never done anything like this with shell script, and I'm not too sure how to run your script.

    I copied your script over the old one and typed the full path to pdf_date.rb in Terminal, hit return, and it returned Permission Denied.
    I briefly remembered something about being a super user, so I typed sudo -s then return, entered my password, entered my pdf path again, and it still said permission denied.

    [IMG]

    I had even read to add .command as the extension and tried that (inside a separate folder with the pdf in it) but nothing happened.

    What am I doing wrong?
    Thanks again.
     
  11. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #11
    Run it just like the other script.

    Code:
    ruby ~/Desktop/pdf_date.rb <pdf path>
    
    BTW The permission denied errors are because the script isn't executable. But making it executable wouldn't have helped in this case, because the script isn't set up to be run that way.
     
  12. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
  13. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #13
    Code:
    #!/usr/bin/ruby
    require 'osx/cocoa'
    require 'fileutils'
    include OSX
    OSX.require_framework('PDFKit')
    path = ARGV[0]
    if not path
       puts "Missing path."
       exit 1
    end
    pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
    if not pdfDoc
       puts "Failed to open pdf"
       exit 1
    end
    pdfText = pdfDoc.string.to_s
    match = pdfText.match(/Period\s+Ending:\s*(\d+)\/(\d+)\/(\d+)/i)
    if not match
       puts "Didn't find period ending"
       exit 1
    end
    day, month, year = match[1], match[2], match[3]
    year = year.to_i
    year += 2000 if year < 50
    year += 1900 if year < 100
    parentFolder = File.dirname(path)
    newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
    FileUtils.mv(path, newPath)
    puts "Moved #{path} to #{newPath}"
    exit 0
    
    The trouble ended up being that the year in the period ending is only 2 digits. This is not good. Did you learn nothing from Y2K?!

    Anyway, I've had to make an arbitrary decision about how to map your 2-digit years into proper 4-digit years. The script will map 50-99 into 1950-1999 and 00-49 into 2000-2049. If that doesn't work for you just let me know.
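
    The year mapping can be checked in isolation. This is a minimal sketch that mirrors the script's arithmetic; the name expand_year is made up for illustration. Note that with the script's year < 50 comparison, the value 50 itself becomes 1950.

```ruby
# Hypothetical helper mirroring the script's year arithmetic.
def expand_year(year)
  year = year.to_i
  if year < 50
    year + 2000   # 00-49 -> 2000-2049
  elsif year < 100
    year + 1900   # 50-99 -> 1950-1999
  else
    year          # already has a century, e.g. 2011
  end
end

# Print the mapping for a few boundary values.
%w[11 49 50 99 2011].each do |y|
  puts "#{y} -> #{expand_year(y)}"
end
```

    Writing the two cases as if/elsif makes it explicit that only one adjustment is ever applied, which the script achieves implicitly because a year that has had 2000 added is no longer below 100.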

    If the reference is always one word, and the reference is what you want for the document name in the filename, it's easy to extend the script to do that. Just let me know if you want it to do that.

    BTW This isn't a shell script. It's Ruby. The most elegant of programming languages, IMHO.
     
  14. Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #14
    Nice work jiminaus.

    Agreed, great language plus with Cocoa it's incredibly powerful.
     
  15. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #15
    Thank you both for helping me on this. I really appreciate it.

    If I wanted to modify this, say another document has a different phrase than "Period Ending" and has 3 characters for the month instead of 2 numerical figures, is this script easy to modify to accommodate that?

    Does the \s represent a space?

    Thank you once again.
     
  16. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #16
    Yes the \s does represent a space. The * after the \s means zero or more. The + means one or more.

    \d represents a digit. [a-z] represents any of the letters from a to z. So you'd need to replace (\d+)\/(\d+)\/(\d+) with (\d+)\/([a-z]+)\/(\d+).

    Now without more work, the filenames would still have the 3 character months instead of the month number (e.g. 2011-JUN-30 instead of 2011-06-30).
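
    One possible way to do that extra work is to look the abbreviation up in Ruby's standard Date::ABBR_MONTHNAMES list. This is only a sketch, assuming English three-letter abbreviations; the name month_number and the sample line are made up for illustration.

```ruby
require 'date'

# Hypothetical helper: "JUN", "Jun" or "jun" -> "06"; nil if unrecognised.
def month_number(abbr)
  # Date::ABBR_MONTHNAMES is [nil, "Jan", "Feb", ..., "Dec"], so the index
  # of an abbreviation is its month number.
  index = Date::ABBR_MONTHNAMES.index(abbr.to_s.capitalize)
  index && format('%02d', index)
end

# Made-up sample line using the modified pattern with a letter month.
match = "Period Ending: 30/JUN/2011".match(/Period\s+Ending:\s*(\d+)\/([a-z]+)\/(\d+)/i)
day, month, year = match[1], month_number(match[2]), match[3]
puts "#{year}-#{month}-#{day}.pdf"   # 2011-06-30.pdf
```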
     
  17. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #17
    Hi again.

    So in the context of this post, would adding the \r in the script below mean that the file name will be taken from the date that is on the next line below Due Date?
    Code:
    match = pdfText.match(/Due\s+Date\s+\r*(\d+)\/([a-z]+)\/(\d+)/i)
    Eg. Due Date
    01 Jan 2011

    On the pdf, there is nothing else on each of the lines that the text occupies.

    Thanks very much.
     
  18. macrumors 65816

    jiminaus

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #18
    Actually don't use \r, use \s. \s doesn't represent just a space, it represents whitespace. Whitespace includes space, tab, newline and return.
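
    A quick way to see this is to run a pattern against text with the line break in it. This is a minimal sketch, assuming the date is written with spaces as in the example ("01 Jan 2011"); the names DUE_DATE and due_date_parts and the sample strings are made up for illustration.

```ruby
# Hypothetical pattern: \s+ covers spaces, tabs, newlines and returns,
# so it spans the line break between the label and the date.
DUE_DATE = /Due\s+Date\s+(\d+)\s+([a-z]+)\s+(\d+)/i

# Returns [day, month, year] as strings, or nil when nothing matches.
def due_date_parts(text)
  match = text.match(DUE_DATE)
  match && match.captures
end

# Label and date on separate lines, then on the same line.
puts due_date_parts("Due Date\r\n01 Jan 2011\n").inspect
puts due_date_parts("Due Date 01 Jan 2011").inspect
```

    Both calls find the same three parts, because \s+ does not care whether the gap is a space or a line break.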

    Unfortunately regular expressions won't tell you why they didn't match. So the best thing to do is to build them up piece by piece.

    Add the following lines (marked with comments) to the script temporarily while you're building up the regular expression.

    Code:
    day, month, year = match[1], match[2], match[3]
    puts "matched: day=#{day}, month=#{month}, year=#{year}"  # added
    exit 0                                                     # added
    parentFolder = File.dirname(path)
    parentFolder = File.dirname(path)
    
    Then take the regular expression right back to the beginning:
    Code:
    match = pdfText.match(/due/i);
    
    Test the script. If the regular expression matches, add another piece and test again. Keep adding and testing until either you match everything you need to, or the regular expression stops matching. If the regular expression stops matching, then you know that's where the regular expression is failing.
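
    The same build-it-up approach can be automated for a quick check. This is a rough sketch, not part of the script; the name first_failing_step, the sample text and the list of steps are made up for illustration.

```ruby
# Hypothetical helper: tries each pattern in order and returns the index of
# the first one that fails to match, or nil if they all match.
def first_failing_step(text, steps)
  steps.index { |pattern| text !~ pattern }
end

text = "Due Date\n01 Jan 2011\n"   # made-up sample text
steps = [
  /due/i,
  /due\s+date/i,
  /due\s+date\s+(\d+)/i,
  /due\s+date\s+(\d+)\/([a-z]+)/i   # expects "/" but the sample has a space
]

failed = first_failing_step(text, steps)
if failed
  puts "Stopped matching at step #{failed + 1}"
else
  puts "All steps matched"
end
```

    With this sample the first three steps match and the fourth does not, which points at the separator after the day as the piece to fix.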
     
  19. thread starter macrumors newbie

    Joined:
    Jun 28, 2011
    Location:
    Sydney
    #19
    Hi Jiminaus

    Thanks for your great advice on how to debug this script. I have got the result I was after and worked out where I was going wrong.

    I used the \w command instead of a-z and also learned that the application that scans the document has a heck of a lot to do with the end result. I was running an AppleScript to OCR my docs with PDFPen pro. I converted the same failing documents with Adobe Acrobat Pro instead and got the results I was after.

    I can now put this to bed. I am really happy with the end result and can't thank you enough.
    Take care and thanks again
     
