macOS Script to name a file based on text inside the document

NikNakPAdyWak · Jun 28, 2011

Hi all

I am trying to set up a paperless office process that renames a recently scanned, OCR'd PDF file based on some content (a date) inside the document. I understand that there are many variables and that doing this in general would be almost impossible, but the file I have in mind is a document I receive weekly, and the information in it is almost always the same.

The first 5 lines contain information that never changes (it's always the location the document comes from). On the 6th line, starting at the 7th word, the text is always "period ending: DD/MM/YYYY".

I would like the file to be named YYYY/MM/DD - DocumentName

Is this possible?

Thank you for reading.
Chris

ps. i am also using Hazel in this paperless office process if it helps.

kainjow · Jun 28, 2011

Where is "DocumentName" coming from? Is that also within the PDF? For now I'm ignoring that.

Here's a script that extracts the date, and renames the file. I wouldn't try to use "/" in the name - that would cause problems since "/" is the path delimiter, so I replaced it with "-".

Code:

require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
   puts "Missing path."
   exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
exit 1 if not pdfDoc
pdfText = pdfDoc.string.to_s
match = pdfText.match(/period ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/)
exit 1 if not match
day, month, year = match[1], match[2], match[3]
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
exit 0

In Hazel, create a "Run shell script" action and set the Shell to /usr/bin/ruby

You can also call this from Terminal to run it manually. Save it to the Desktop as pdf_date.rb, then from within Terminal you can call it like so:

Code:

ruby ~/Desktop/pdf_date.rb <pdf path>

where <pdf path> is the path to your PDF.

If you're going to play around with this you should ensure you're working on copies of your PDFs or at least have backups. This script isn't thoroughly tested.

NikNakPAdyWak · Jun 28, 2011

WOW.... thank you for your script. I can see that it is very powerful and way over my head. I am having a problem with it though. Hazel is just returning an error when processing the script.

Hazel's logs show:
hazelfolderwatch[2244] [Error] Shell script failed: Error processing shell script on file /Users/iMac/Dropbox/FileAway/Rent Invoice.pdf.

Is there a way I can debug it?

Thanks again

kainjow · Jun 28, 2011

Do the part I mentioned above about running it from the Terminal and see if that works, or if it gives an error. It might also be helpful to post a screenshot of your Hazel config window where the script is.

Also what version of OS X are you running?

NikNakPAdyWak · Jun 29, 2011

Hi again
I tried that too, but no luck. I'm not a big scripter, and hopefully I followed your instructions. In a TextEdit document, I copied your script and saved it as plain text to the Desktop. I changed the extension to .rb, then opened Terminal, copied your command and replaced the <pdf path> with the location of my PDF file. I pressed return and the command did nothing.

Here is the line from my frequent statement that i am trying to turn into a filename.

I hope this makes it a little clearer and perhaps i have done something wrong.

Thanks again

ps. to answer your earlier question, the document name would be "Rent" and i would be getting Hazel to add that.

jiminaus · Jun 29, 2011

I don't know if kainjow is around, so if he doesn't mind, I'll modify his script slightly. (My edits are marked with comments.)

Code:

require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
   puts "Missing path."
   exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
# edit: report why the script is exiting
if not pdfDoc
   puts "Failed to open pdf"
   exit 1
end
pdfText = pdfDoc.string.to_s
# edit: \s+ between the two words, and the i flag to ignore case
match = pdfText.match(/period\s+ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/i)
# edit: report why the script is exiting
if not match
   puts "Didn't find period ending"
   exit 1
end
day, month, year = match[1], match[2], match[3]
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
exit 0

This will tell us why it's not doing anything. It will also match period ending: regardless of capitalisation.
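
To see what the two regular expression changes buy, here is a minimal sketch that runs the same pattern against a few sample strings. The sample lines are made up for illustration and are not taken from the actual PDF.

```ruby
# Same pattern as the modified script: \s+ between the words and the /i flag.
PERIOD_ENDING = /period\s+ending:\s+(\d{2,})\/(\d{2,})\/(\d{4})/i

# Hypothetical sample lines, not taken from the real document.
samples = [
  "period ending: 30/06/2011",
  "Period  Ending:   30/06/2011",   # extra spaces, different capitalisation
  "PERIOD\nENDING: 30/06/2011",     # words split across a line break
  "period ending 30/06/2011"        # colon missing, so this should not match
]

# For each sample, collect the captured groups or nil when nothing matched.
results = samples.map do |line|
  m = line.match(PERIOD_ENDING)
  m ? m.captures : nil
end

samples.zip(results).each do |line, captures|
  puts "#{line.inspect} => #{captures.inspect}"
end
```

Only the last sample should come back as nil, since the pattern still requires the colon.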

NikNakPAdyWak · Jun 29, 2011

Thanks for picking this up jiminaus. My error message now says:

Didn't find period ending

Cheers

NikNakPAdyWak · Jun 29, 2011

oh..and im running Snow Leopard 10.6.7
Thanks

jiminaus · Jun 29, 2011

Can you run the following script?

Code:

require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
   puts "Missing path."
   exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
   puts "Failed to open pdf"
   exit 1
end
pdfText = pdfDoc.string.to_s
File.open('pdftext.txt', 'w') do |file|
   file.write(pdfText)
end
exit 0

This will output a file called pdftext.txt containing the text in the PDF. Please strip down the file so it just contains the line(s) related to period ending, and then reply by copying and pasting those lines between code tags.

We need to work out a regular expression which will match the period ending date.
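
One possible way to narrow down the dumped text is sketched below. It assumes the text has already been written to pdftext.txt by the script above; the helper name lines_mentioning and the sample content are invented for this example. Printing each line with inspect shows tabs and other invisible characters that OCR sometimes introduces.

```ruby
# Return the lines of +text+ that mention +word+, ignoring case.
# lines_mentioning is a hypothetical helper name, not part of any library.
def lines_mentioning(text, word)
  text.lines.select { |line| line =~ /#{Regexp.escape(word)}/i }
end

# Stand-in for File.read('pdftext.txt'); the content here is invented.
sample_text = "Some Agency\nStatement\nPeriod\tEnding: 30/06/2011\nTotal due\n"

lines_mentioning(sample_text, 'period').each do |line|
  # inspect makes tabs, returns and other invisible characters visible
  puts line.inspect
end
```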

NikNakPAdyWak · Jun 29, 2011

Hi Jiminaus

I'm so sorry for the newbie replies. Until today, I had never done anything like this with shell scripts... and I'm not too sure how to run your script.

I have copied your script over the old one and typed the full path to pdf_date.rb in Terminal, hit return, and it returned Permission Denied.
I briefly remembered something about being a super user, so I typed sudo -s then return, entered my password, entered my PDF path again, and it still said permission denied.

I had even read to add .command as the extension and tried that (inside a separate folder with the PDF in it), but nothing happened.

What am i doing wrong?
Thanks again.

jiminaus · Jun 29, 2011

Run it just like the other script.

Code:

ruby ~/Desktop/pdf_date.rb <pdf path>

BTW, the permission denied errors are because the script isn't executable. But making it executable wouldn't have helped in this case, because the script isn't set up to be run that way.

NikNakPAdyWak · Jun 29, 2011

Hi Jiminaus

I have just sent you a PM.

jiminaus · Jun 29, 2011

Code:

#!/usr/bin/ruby
require 'osx/cocoa'
require 'fileutils'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
   puts "Missing path."
   exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
   puts "Failed to open pdf"
   exit 1
end
pdfText = pdfDoc.string.to_s
match = pdfText.match(/Period\s+Ending:\s*(\d+)\/(\d+)\/(\d+)/i)
if not match
   puts "Didn't find period ending"
   exit 1
end
day, month, year = match[1], match[2], match[3]
year = year.to_i
year += 2000 if year < 50
year += 1900 if year < 100
parentFolder = File.dirname(path)
newPath = parentFolder + "/#{year}-#{month}-#{day}.pdf"
FileUtils.mv(path, newPath)
puts "Moved #{path} to #{newPath}"
exit 0

The trouble ended up being that the year in the period ending is only 2 digits. This is not good. Did you learn nothing from Y2K?!

Anyway, I've had to make an arbitrary decision about how to map your 2-digit years into proper 4-digit years. The script will map 50-99 into 1950-1999 and 00-49 into 2000-2049. If that doesn't work for you, just let me know.
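
For illustration, the year handling can be pulled out into a small method and tried on a few values. This is only a sketch that mirrors the three year lines in the script; the name expand_year is made up for the example.

```ruby
# Expand a two-digit year using the same two steps as the script above.
# expand_year is a hypothetical helper name used only for this sketch.
def expand_year(text)
  year = text.to_i
  year += 2000 if year < 50    # e.g. "11" becomes 2011
  year += 1900 if year < 100   # e.g. "99" becomes 1999
  year                         # four-digit years fall through unchanged
end

%w[05 11 99 2011].each do |y|
  puts "#{y} -> #{expand_year(y)}"
end
```

Because the second step only applies to values still below 100, a year that was already expanded by the first step is left alone.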

If the reference is always one word, and the reference is what you want for the document name in the filename, it's easy to extend the script to do that. Just let me know if you want it to do that.

BTW This isn't shell script here. It's ruby. The most elegant of programming languages, IMHO.

kainjow · Jun 29, 2011

Nice work jiminaus.

jiminaus said:
It's ruby. The most elegant of programming languages, IMHO.

Agreed, great language plus with Cocoa it's incredibly powerful.

NikNakPAdyWak · Jun 29, 2011

Thank you to you both for helping me on this. I really appreciate it.

If I wanted to modify this... say another document has a different phrase than "Period Ending", and it had 3 characters for the month instead of 2 digits, is this script easy to modify to accommodate that?

Does the \s represent a space?

Thank you once again.

jiminaus · Jun 29, 2011

Yes the \s does represent a space. The * after the \s means zero or more. The + means one or more.

\d represents a digit. [a-z] represents any of the letters from a to z. So you'd need to replace (\d+)\/(\d+)\/(\d+) with (\d+)\/([a-z]+)\/(\d+).

Now without more work, the filenames would still have the 3 character months in the filenames instead of the month number (eg 2011-JUN-30 instead of 2011-06-30).
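
The extra work could look something like the sketch below, which converts a three-letter month into a two-digit number using Ruby's standard date library. It assumes English abbreviations such as Jan or JUN; the helper name month_number and the sample line are made up for this example.

```ruby
require 'date'

# Convert an English three-letter month abbreviation to "01".."12".
# Returns nil when the abbreviation is not recognised.
# month_number is a hypothetical helper, not part of the original script.
def month_number(abbr)
  # Date::ABBR_MONTHNAMES is [nil, "Jan", "Feb", ..., "Dec"]
  index = Date::ABBR_MONTHNAMES.index(abbr.capitalize)
  index ? format('%02d', index) : nil
end

# Invented sample line using the pattern with [a-z]+ for the month.
match = "Period Ending: 30/JUN/2011".match(/(\d+)\/([a-z]+)\/(\d+)/i)
day, month, year = match[1], match[2], match[3]
puts "#{year}-#{month_number(month)}-#{day}"
```

In the real script, a nil result from month_number would need handling (for example, exiting with a message) before building the new filename.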

NikNakPAdyWak · Jun 29, 2011

Hi again.

So in the context of this post, would adding the \r in the line below mean that the file name will be taken from the date on the line below Due Date?

Code:

match = pdfText.match(/Due\s+Date\s+\r*(\d+)\/([a-z]+)\/(\d+)/i)

Eg. Due Date
01 Jan 2011

On the pdf, there is nothing else on each of the lines that the text occupies.

Thanks very much.

jiminaus · Jun 30, 2011

Actually, don't use \r, use \s. \s doesn't represent just a space, it represents whitespace. Whitespace includes space, tab, newline and return.
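
As an illustration, the sketch below matches a label and a date that sit on separate lines. The sample text is modelled on the "Due Date" layout in the question, and it assumes the date parts are separated by spaces rather than slashes.

```ruby
# Sample text modelled on the example in the question: the label is on one
# line and the date is on the next. A carriage return is included on purpose.
text = "Due Date\r\n01 Jan 2011\n"

# \s+ covers the space inside "Due Date" and the \r\n before the date.
# Assumption: the date parts are separated by whitespace, not "/".
match = text.match(/Due\s+Date\s+(\d+)\s+([a-z]+)\s+(\d+)/i)

if match
  day, month, year = match[1], match[2], match[3]
  puts "matched: day=#{day}, month=#{month}, year=#{year}"
else
  puts "no match"
end
```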

Unfortunately, regular expressions won't tell you why they didn't match. So the best thing to do is to build them up piece by piece.

Add the two lines marked "temporary" below to the script while you're building up the regular expression.

Code:

day, month, year = match[1], match[2], match[3]
puts "matched: day=#{day}, month=#{month}, year=#{year}"   # temporary
exit 0                                                      # temporary
parentFolder = File.dirname(path)

Then take the regular expression right back to the beginning:

Code:

match = pdfText.match(/due/i);

Test the script. If the regular expression matches, add another piece and test again. Keep adding and testing until either you match everything you need to, or the regular expression stops matching. If the regular expression stops matching, then you know that's where the regular expression is failing.
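
The same build-up can be tried outside the script. The sketch below is one way to do it, assuming an invented sample text; the helper name first_failing_step is made up. It tries progressively longer patterns and reports the first one that stops matching.

```ruby
# Try progressively longer patterns and report the first one that fails.
# first_failing_step is a hypothetical helper name used for this sketch.
def first_failing_step(text, patterns)
  patterns.find { |pattern| text !~ pattern }
end

# Invented sample text; in practice this would be pdfText from the script.
text = "Due Date\n01 Jan 2011\n"

# Each entry adds one more piece to the previous pattern.
steps = [
  /due/i,
  /due\s+date/i,
  /due\s+date\s+(\d+)/i,
  /due\s+date\s+(\d+)\/([a-z]+)/i,        # expects "/" but the text has a space
  /due\s+date\s+(\d+)\/([a-z]+)\/(\d+)/i
]

failed = first_failing_step(text, steps)
if failed
  puts "First pattern that stops matching: #{failed.inspect}"
else
  puts "All patterns matched"
end
```

With this sample text, the first failure is the step that introduces the slash, which points at the separator as the piece to fix.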

NikNakPAdyWak · Jun 30, 2011

Hi Jiminaus

Thanks for your great advice on how to debug this script. I have worked it out, got the result I was after, and worked out where I was going wrong.

I used \w instead of a-z and also learned that the application that scans the document has a heck of a lot to do with the end result. I was running an AppleScript to OCR my docs with PDFPen pro. I converted the same failing documents with Adobe Acrobat Pro instead and got the results I was after.

I can now put this to bed. I am really happy with the end result and cant thank you enough.
Take care and thanks again
