Script to name a file based on text inside the document




NikNakPAdyWak
Jun 28, 2011, 03:03 AM
Hi all

I am trying to set up a paperless office process that renames a recently scanned, OCR'd PDF file based on some content (a date) inside the document. I understand that there are many variables and that doing this in general would be almost impossible, but the file I have in mind is a document I receive weekly, and the information in it is almost always the same.

The first 5 lines contain information that never changes (it's always the location it comes from). On the 6th line, starting at the 7th word, the text is always "period ending: DD/MM/YYYY".

I would like the file to be named YYYY/MM/DD - DocumentName

Is this possible?

Thank you for reading.
Chris

PS. I am also using Hazel in this paperless office process, if it helps.



kainjow
Jun 28, 2011, 10:21 PM
Where is "DocumentName" coming from? Is that also within the PDF? For now I'm ignoring that.

Here's a script that extracts the date, and renames the file. I wouldn't try to use "/" in the name - that would cause problems since "/" is the path delimiter, so I replaced it with "-".

require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')

# The PDF path is the first command-line argument.
path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end

# Open the PDF and pull out all of its text.
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
exit 1 if not pdfDoc
pdfText = pdfDoc.string.to_s

# Capture the day, month and year that follow "period ending:".
match = pdfText.match(/period ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/)
exit 1 if not match
day, month, year = match[1], match[2], match[3]

# Rename the file in place to YYYY-MM-DD.pdf.
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
exit 0
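
The point about avoiding "/" in the name can be sketched in plain Ruby without PDFKit. This is only an illustration: date_stamp is a hypothetical helper that is not part of the script above, and /tmp/scans is a made-up folder. It also zero-pads the parts so the names sort correctly.

```ruby
# Build a filename-safe date stamp from day/month/year strings.
# date_stamp is a hypothetical helper, not part of the original script.
def date_stamp(day, month, year)
  # Integer(x, 10) rejects non-numeric input and parses "08" as decimal 8.
  format('%04d-%02d-%02d', Integer(year, 10), Integer(month, 10), Integer(day, 10))
end

stamp = date_stamp('5', '6', '2011')
puts stamp

# File.join only builds a path string; nothing is touched on disk.
# '/tmp/scans' is an invented folder used for illustration.
puts File.join('/tmp/scans', "#{stamp}.pdf")
```

Using "-" keeps the whole date inside a single path component, whereas each "/" would be read as a folder boundary.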


In Hazel, create a "Run shell script" action and set the Shell to /usr/bin/ruby

You can also call this from Terminal to run it manually. Save it to the Desktop as pdf_date.rb, then call it from within Terminal like so:
ruby ~/Desktop/pdf_date.rb <pdf path>
where <pdf path> is the path to your PDF.

If you're going to play around with this you should ensure you're working on copies of your PDFs or at least have backups. This script isn't thoroughly tested.

NikNakPAdyWak
Jun 28, 2011, 10:53 PM
Wow, thank you for your script. I can see that it is very powerful and way over my head. I am having a problem with it, though: Hazel just reports an error processing the script.

Hazel's logs show:
hazelfolderwatch[2244] [Error] Shell script failed: Error processing shell script on file /Users/iMac/Dropbox/FileAway/Rent Invoice.pdf.

Is there a way I can debug it?

Thanks again

kainjow
Jun 29, 2011, 12:51 AM
Do the part I mentioned above about running it from the Terminal and see if that works or gives an error. It might also be helpful to post a screenshot of your Hazel config window where the script is.

Also what version of OS X are you running?

NikNakPAdyWak
Jun 29, 2011, 03:01 AM
Hi again
I tried that too, but no luck. I'm not a big scripter, but hopefully I followed your instructions. In a TextEdit document, I pasted your script and saved it as plain text to the Desktop. I changed the extension to .rb, then opened Terminal, copied your command and replaced <pdf path> with the location of my PDF file. I pressed return and the command did nothing.

http://dl.dropbox.com/u/8505987/screen.png

Here is the line from my frequent statement that i am trying to turn into a filename.

http://dl.dropbox.com/u/8505987/screen2.png

I hope this makes it a little clearer. Perhaps I have done something wrong.

Thanks again

PS. To answer your earlier question, the document name would be "Rent", and I would be getting Hazel to add that.

jiminaus
Jun 29, 2011, 03:35 AM
I don't know if kainjow is around, so if he doesn't mind, I'll modify his script slightly. The edits add a message before the two silent exits and loosen the regular expression.


require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end
pdfText = pdfDoc.string.to_s
match = pdfText.match(/period\s+ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/i)
if not match
  puts "Didn't find period ending"
  exit 1
end
day, month, year = match[1], match[2], match[3]
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
exit 0


This will tell us why it's not doing anything. It will also match period ending: regardless of capitalisation.
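
To see what the loosened pattern accepts, here is a small sketch that runs it over a few strings in plain Ruby. The sample lines are invented for illustration and are not taken from the actual PDF.

```ruby
# Same pattern as in the script above: \s+ allows any run of whitespace
# and the trailing /i makes the match case-insensitive.
PATTERN = /period\s+ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/i

# Invented sample lines; real OCR output will differ.
samples = [
  'period ending: 30/06/2011',
  'Period  Ending:   30/06/2011',
  'PERIOD ENDING: 30/06/11'
]

results = samples.map do |line|
  m = line.match(PATTERN)
  m ? m.captures : nil
end

samples.zip(results).each do |line, captures|
  puts "#{line.inspect} => #{captures.inspect}"
end
```

The last sample gives nil because (\d{4}) insists on exactly four digits for the year.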

NikNakPAdyWak
Jun 29, 2011, 03:43 AM
Thanks for picking this up jiminaus. My error message now says:

Didn't find period ending

Cheers

NikNakPAdyWak
Jun 29, 2011, 04:14 AM
Oh, and I'm running Snow Leopard 10.6.7.
Thanks

jiminaus
Jun 29, 2011, 05:41 AM
Can you run the following script?


require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end
pdfText = pdfDoc.string.to_s
File.open('pdftext.txt', 'w') do |file|
  file.write(pdfText)
end
exit 0


This will output a file called pdftext.txt, in the folder you run the command from, containing the text in the PDF. Please strip down the file so it just contains the line(s) related to period ending, and then reply by copying and pasting those lines between tags.

We need to work out a regular expression which will match the period ending date.

NikNakPAdyWak
Jun 29, 2011, 06:28 AM
Hi Jiminaus

I'm so sorry for the newbie replies. Until today, I had never done anything like this with shell scripts, and I'm not too sure how to run your script.

I copied your script over the old one, typed the full path to pdf_date.rb in Terminal, hit return, and it returned Permission Denied.
I briefly remembered something about being a superuser, so I typed sudo -s, hit return, entered my password, entered the path again, and it still said Permission Denied.

http://dl.dropbox.com/u/8505987/screen3.png

I even read that I should add .command as the extension and tried that (inside a separate folder with the PDF in it), but nothing happened.

What am I doing wrong?
Thanks again.

jiminaus
Jun 29, 2011, 06:33 AM
Run it just like the other script.


ruby ~/Desktop/pdf_date.rb <pdf path>


BTW, the permission denied errors are because the script isn't executable. But making it executable wouldn't have helped in this case, because the script isn't set up to be run that way.

NikNakPAdyWak
Jun 29, 2011, 06:55 AM
Hi Jiminaus

I have just sent you a PM.

jiminaus
Jun 29, 2011, 07:14 AM
#!/usr/bin/ruby
require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end
pdfText = pdfDoc.string.to_s
match = pdfText.match(/Period\s+Ending:\s*(\d+)\/(\d+)\/(\d+)/i)
if not match
  puts "Didn't find period ending"
  exit 1
end
day, month, year = match[1], match[2], match[3]
# Expand a 2-digit year: 00-49 become 2000-2049, 50-99 become 1950-1999.
# A year that already has 4 digits passes through unchanged.
year = year.to_i
year += 2000 if year < 50
year += 1900 if year < 100
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
puts "Moved #{path} to #{newPath}"
exit 0


The trouble ended up being that the year in the period ending is only 2 digits. This is not good. Did you learn nothing from Y2K?!

Anyway, I've had to make an arbitrary decision about how to map your 2-digit years into proper 4-digit years. The script maps 50-99 to 1950-1999 and 00-49 to 2000-2049. If that doesn't work for you, just let me know.
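
For anyone who wants to try the year mapping on its own, here is a minimal sketch that follows the script's comparisons (a value below 50 gets 2000 added, anything else below 100 gets 1900 added). expand_year is a hypothetical helper name, not something in the script.

```ruby
# Expand a 2-digit year using a pivot of 50, mirroring the script's
# `year < 50` and `year < 100` comparisons.
# expand_year is a hypothetical helper name used for illustration.
def expand_year(year)
  year = Integer(year.to_s, 10) # base 10 so "08" is not read as octal
  return year if year >= 100    # already a full year, leave it alone
  year < 50 ? 2000 + year : 1900 + year
end

%w[11 49 50 99 2011].each do |raw|
  puts "#{raw} -> #{expand_year(raw)}"
end
```

Any pivot is a guess about the data. If your documents never predate 2000, adding 2000 unconditionally would be simpler.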

If the reference is always one word, and the reference is what you want for the document name in the filename, it's easy to extend the script to do that. Just let me know if you want it to do that.

BTW This isn't shell script here. It's ruby. The most elegant of programming languages, IMHO.

kainjow
Jun 29, 2011, 10:21 AM
Nice work jiminaus.

It's ruby. The most elegant of programming languages, IMHO.

Agreed, great language plus with Cocoa it's incredibly powerful.

NikNakPAdyWak
Jun 29, 2011, 05:47 PM
Thank you both for helping me with this. I really appreciate it.

If I wanted to modify this, say for another document that has a different phrase than "Period Ending" and a 3-character month instead of a 2-digit number, is this script easy to modify to accommodate that?

Does the \s represent a space?

Thank you once again.

jiminaus
Jun 29, 2011, 06:45 PM
Yes, \s does represent a space. The * after the \s means zero or more of it. The + means one or more.

\d represents a digit. [a-z] represents any of the letters from a to z. So you'd need to replace (\d+)\/(\d+)\/(\d+) with (\d+)\/([a-z]+)\/(\d+).

Without more work, the filenames would still contain the 3-character month instead of the month number (e.g. 2011-JUN-30 instead of 2011-06-30).
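
One possible way to do that extra work is sketched below, assuming English three-letter abbreviations. month_number is a hypothetical helper. Date::ABBR_MONTHNAMES comes from Ruby's standard date library and holds nil followed by "Jan" through "Dec", so a name's index is its month number.

```ruby
require 'date'

# Convert a 3-letter English month abbreviation to a 2-digit month number.
# month_number is a hypothetical helper name used for illustration.
def month_number(abbr)
  # capitalize turns "JUN" or "jun" into "Jun" to match the table's casing.
  index = Date::ABBR_MONTHNAMES.index(abbr.capitalize)
  raise ArgumentError, "unknown month: #{abbr}" unless index
  format('%02d', index)
end

%w[JUN jan Dec].each do |abbr|
  puts "#{abbr} -> #{month_number(abbr)}"
end
```

In the script, this would be applied to the captured month before building the new filename.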

NikNakPAdyWak
Jun 29, 2011, 11:40 PM
Hi again.

So in the context of this post, would adding the \r in the script below mean that the file name will be taken from the date on the line below Due Date?
match = pdfText.match(/Due\s+Date\s+\r*(\d+)\/([a-z]+)\/(\d+)/i)

Eg. Due Date
01 Jan 2011

On the pdf, there is nothing else on each of the lines that the text occupies.

Thanks very much.

jiminaus
Jun 30, 2011, 02:39 AM
Actually, don't use \r, use \s. \s doesn't represent just a space; it represents whitespace. Whitespace includes space, tab, newline and return.
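
Here is a small sketch of \s spanning a line break. It assumes the date parts are separated by spaces, as in the "01 Jan 2011" example, so the pattern uses \s+ between them rather than \/. The sample strings are invented, not taken from a real PDF.

```ruby
# Pattern assumes whitespace-separated date parts, as in "01 Jan 2011".
DUE_DATE = /Due\s+Date\s+(\d+)\s+([a-z]+)\s+(\d+)/i

# Invented sample text: one with \n line endings, one with \r\n.
unix_text    = "Due Date\n01 Jan 2011\n"
windows_text = "Due Date\r\n01 Jan 2011\r\n"

[unix_text, windows_text].each do |text|
  match = text.match(DUE_DATE)
  puts match ? match.captures.inspect : 'no match'
end
```

Because \s+ already covers both newline and return, a separate \r* is not needed.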

Unfortunately, regular expressions won't tell you why they didn't match. So the best thing to do is to build them up piece by piece.

Add the following two lines (the puts and the exit 0) to the script temporarily while you're building up the regular expression. They go between the two existing lines shown around them.


day, month, year = match[1], match[2], match[3]
puts "matched: day=#{day}, month=#{month}, year=#{year}"
exit 0
parentFolder = File.dirname(path)


Then take the regular expression right back to the beginning:

match = pdfText.match(/due/i)


Test the script. If the regular expression matches, add another piece and test again. Keep adding and testing until either you match everything you need to, or the regular expression stops matching. If it stops matching, then you know the piece you just added is where the regular expression is failing.
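
The same build-it-up approach can be automated in a rough way. The sketch below is only an illustration: first_failing_step is a hypothetical helper, and the sample text is invented to look like the Due Date example. It joins the pieces one at a time and reports the first piece that stops the pattern from matching.

```ruby
# Return the index of the first piece that makes the pattern stop matching,
# or nil if the full pattern matches.
# first_failing_step is a hypothetical helper name used for illustration.
def first_failing_step(text, pieces)
  pattern = ''
  pieces.each_with_index do |piece, i|
    pattern += piece
    return i unless text.match?(Regexp.new(pattern, Regexp::IGNORECASE))
  end
  nil
end

# Invented sample text in the shape of the Due Date example.
text = "Due Date\n01 Jan 2011\n"

# Single-quoted strings keep the backslashes for the regular expression.
pieces = ['due', '\s+date', '\s+(\d+)', '\/([a-z]+)', '\/(\d+)']

failed = first_failing_step(text, pieces)
if failed
  puts "Stopped matching at piece #{failed + 1}: #{pieces[failed]}"
else
  puts 'All pieces matched'
end
```

Each piece has to be a complete fragment, for example with balanced parentheses, so that every intermediate pattern is valid.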

NikNakPAdyWak
Jun 30, 2011, 04:09 AM
Hi Jiminaus

Thanks for your great advice on how to debug this script. I have got the result I was after and worked out where I was going wrong.

I used \w instead of a-z, and also learned that the application that scans the document has a heck of a lot to do with the end result. I was running an AppleScript to OCR my docs with PDFPen Pro. I converted the same failing documents with Adobe Acrobat Pro instead and got the results I was after.
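
For reference, here is a small sketch of how \w differs from [a-z]. \w matches letters, digits and the underscore, so it accepts tokens that a letters-only class rejects. The tokens are invented and are not meant to explain what was in the actual scans.

```ruby
# \A and \z anchor the pattern so the whole token must match.
LETTERS_ONLY = /\A[a-z]+\z/i
WORD_CHARS   = /\A\w+\z/

# Invented tokens for illustration.
%w[Jan JAN J4n Jan_1].each do |token|
  letters = token.match?(LETTERS_ONLY)
  word    = token.match?(WORD_CHARS)
  puts "#{token}: [a-z]+ #{letters}, \\w+ #{word}"
end
```

The trade-off is that \w+ is more forgiving of OCR quirks but will also capture things that are not month names.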

I can now put this to bed. I am really happy with the end result and can't thank you enough.
Take care and thanks again.