Automator/script to find matching sequential files?




sarahw97
Jun 8, 2011, 12:36 AM
Hi - hoping someone can point me in the right direction on this.. I've been searching all over but cannot seem to find an answer.

Problem: I have tens of thousands of files. Most of them are single page .tif files. They are named like:
Lease_1_2004_Smith.tif
Lease_2_2004_Smith.tif
Lease_3_2004_Smith.tif
Contract_1_2004_Smith.tif
Letter_1_2003_Smith.tif
Letter_2_2003_Smith.tif

GOAL:
1. Select folder
2. Find files that have matching names (except for the sequential number that always appears after the first _ ). So using the example above, it would return:
Lease_1_2004_Smith.tif
Lease_2_2004_Smith.tif
Lease_3_2004_Smith.tif
3. Convert the found set into a single PDF file (this is working well in automator).
4. Save the new PDF file as Lease_2004_Smith.PDF into the same folder.
5. Repeat for the next set of found matching files in this folder.

Any suggestions please? As the PDF conversion is already working well in Automator I'd be happy to try and do the whole process in Automator/AppleScript, but I'm open to ideas or third-party applications. Items 2 and 4 are what I am really stuck on..

Thank you in advance for any suggestions.



sero
Jun 8, 2011, 03:42 PM
You'll probably want to script this if you have that many files. You could write a simple script to do this with built-in tools. Try reading some man pages in Terminal:
man ls
man awk
man sort
man grep
man sips
man gs

For example, to get a list of names that you will search for (loop through, actually), you could type:
ls /PATH/TO/DIRECTORY|awk -F"_" '{print $1}'|sort -u
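
One thing to watch with the one-liner above: it keys on the first field only, so Lease files from different years would end up under the same name. A minimal Python sketch of keying on every field except the sequence number, assuming names always look like Type_N_rest.ext (group_key is a made-up helper name, not a built-in tool):

```python
def group_key(filename):
    """Drop the second underscore-separated field (the sequence number)."""
    parts = filename.split("_")
    # Expect at least Type, number and one more field; otherwise no key.
    if len(parts) < 3 or not parts[1].isdigit():
        return None
    return "_".join(parts[:1] + parts[2:])


names = [
    "Lease_1_2004_Smith.tif",
    "Lease_2_2004_Smith.tif",
    "Contract_1_2004_Smith.tif",
    "Letter_1_2003_Smith.tif",
]
# Unique keys, sorted, much like `sort -u` in the pipeline above.
keys = sorted({k for k in map(group_key, names) if k})
print(keys)
```

Each key is the name the combined file would get once the extension is swapped for .pdf.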

good luck

sarahw97
Jun 8, 2011, 05:03 PM
Thanks Sero I will see if I can figure any of that out.. sadly it sounds more than a little out of my depth (I have not ever used terminal so that might give you an idea of where I'm starting from..). But I will see if I can make any sense out of it tonight.. thanks.

subsonix
Jun 8, 2011, 05:31 PM
Not really helping you in this specific case I guess but, you should pick a different naming convention. For example, if you named your above files like this instead:

2004_Smith_Lease_1.tif
2004_Smith_Lease_2.tif
2004_Smith_Lease_3.tif

You could easily just sort by name, even in Finder.


But in any case, you need to identify some pattern that is common among all files, and be very specific. After that you can use awk and/or a regex depending on the pattern and sort it like in the above example.
Integrating that with Automator should be easy, just use selected files and folders as input, then run shell script.

sarahw97
Jun 8, 2011, 05:49 PM
Hi, thanks for looking at my problem :)

I suppose I should clarify by explaining that the files are already arranged into folders by name. Finder does display the files in order based on the current naming format. I realize the "Smith" at the end is kind of redundant but I left it on there in case files got into the wrong place by accident or something..

//FILES/SMITH/
Lease_1_2004_Smith.tif
Lease_2_2004_Smith.tif
Lease_3_2004_Smith.tif
Contract_1_2004_Smith.tif
Letter_1_2003_Smith.tif
Letter_2_2003_Smith.tif

So I want to be able to choose the Smith folder, and then identify files where all parts of the name match EXCEPT for the sequential number that always appears after the first _ .
Thanks

andmr
Jun 9, 2011, 07:31 PM
The actions below represent a portion of the kind of Automator workflow you might put together. In this example the PDF files have already been converted from TIFF Image files.

Note that the first 3 actions are the critical combination:

1) Get Specified Finder Items -- Click the "+" sign and select the desired folder (Smith folder, e.g.).

2) Get Folder Contents

3) Filter Finder Items -- Click each popup button and select the desired criteria, e.g., Whose: File Type - Is: PDF File. Now click the "+" sign in the Filter action's window and add criteria: Whose: Name - Begins with - and enter Lease in the text field.

4) Now process the items, e.g., Combine PDF Pages.

Now repeat steps 1 through 3:

5) Get Specified Finder Items -- Select the same (Smith) folder.

6) Get Folder Contents

7) Filter Finder Items -- Whose: File Type - Is: PDF File. Click the "+" sign in the Filter action's window and add: Whose: Name - Begins with - and enter Contract in the text field.

8) Process these items as in step 4.

Continue adding actions 1 through 3 to the workflow, with the next cycle filtering Letter, and then process those items.

You should be able to use this same scheme in order to convert TIFF Images to PDFs, if necessary; rename the items, copy or move the items, etc.

You'll need to experiment until you find the workflow which does the job. Until then, it's probably best to work on copies of the target folders.

Good luck.

subsonix
Jun 9, 2011, 08:58 PM
Frankly I don't think it's possible with Automator alone; you would need a scripting language. The reason is that you need to group files based on their name (minus _1_, _2_ and so on), so you need some kind of data structure to keep track of them. Then you would need to convert each group. Most things in Automator are linear only, and the Loop action seems way too limited.

Here's a Python script that does the grouping. I tried using os.system() with sips but it only does one file at a time.


#!/usr/bin/python -tt

import os, sys, re, collections

def find_replace(pat, replace, text):
    # return text with the first match of pat replaced, or None if no match
    if re.search(pat, text):
        return re.sub(pat, replace, text, count=1)

def main():

    collection = collections.defaultdict(list)

    # group files in collection (file name minus _number_ is the key)
    # sys.argv[0] is the script itself, so skip it
    for file in sys.argv[1:]:
        # \d+ so that multi-digit numbers such as _13_ are matched too
        found = find_replace(r"_\d+_", "_", file)
        if found:
            collection[found].append(file)

    # print result to stdout
    for group in collection:
        print("\nOutput file name: " + re.sub(r"(\.tif|\.tiff)$", ".pdf", group))
        # plain alphabetical sort: _10_ sorts before _2_
        collection[group].sort()
        print(collection[group])


if __name__ == '__main__':
    main()


run it with: ./script_name.py $(ls /path/to/files)
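
If the $(ls ...) form gives trouble (the shell splits file names containing spaces into several arguments), one alternative is to let Python list the folder itself. A rough sketch, not part of the script above; list_tiffs is a hypothetical helper name and the demo files are made up:

```python
import os
import tempfile


def list_tiffs(folder):
    """Return sorted names of .tif/.tiff files directly inside folder."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith((".tif", ".tiff"))
        and os.path.isfile(os.path.join(folder, name))
    )


# Small demonstration in a throwaway folder.
with tempfile.TemporaryDirectory() as demo:
    for name in ["Lease_1_2004_Smith.tif", "My Lease_2_2004_Smith.TIF", "notes.txt"]:
        open(os.path.join(demo, name), "w").close()
    found = list_tiffs(demo)

print(found)
```

The returned names could then be fed to the grouping loop in place of the command-line arguments.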

andmr
Jun 9, 2011, 09:29 PM
subsonix wrote: "Frankly I don't think it's possible with Automator alone..."

I've tested it using Automator. It works.

subsonix
Jun 9, 2011, 09:31 PM
Great, post it here then!

sarahw97
Jun 10, 2011, 01:56 AM
Hi, thanks both for your suggestions. Subsonix, I think you are right. I spent quite a bit of time messing around with the different Automator filters but I don't think it can do what I need.. unless I want to spend the next 3 days inputting every possible filename (there are many different possible names, so it's kind of defeating the purpose of a script if I have to do that.. it would be quicker to just drag the files in).. but.. not to be defeated yet.. I am trying your suggestion re the Python script.

Stumbling block is I'm having difficulty running the script.. trying to figure out how to run it, I am getting endless 'cannot find file' and 'no such directory' errors in Terminal. Is there a Python-for-morons way to run the script via an application? Or can someone point me to a good reference for Python.. I have read quite a bit from various sources but some of it seems to be outdated/conflicting..

MasConejos
Jun 10, 2011, 09:31 AM
I had hoped to have this done sooner, but work has kept me pretty busy. I have written a Ruby script for you and tested it on a small batch of dummy files. It will go recursively through all of your files and group them using a regular expression, like the Python script above. It will then sort them numerically (slightly different from alphabetically) and write out the file collections to separate files.

For example
Lease_1_2004_Smith.tif...Lease_13_2004_Smith.tif will be written to an output file called 'Lease_2004_Smith.txt', whose contents are the full paths of all the Lease_#_2004_Smith files. You should be able to open each output file with Automator, get the file names, and merge to a PDF.
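
For anyone reading along without the Ruby script, here is a rough Python sketch of the same idea (numeric sort, then one list file per group). It is not the linked script; the pattern, the helper names and the sample paths are assumptions made for illustration:

```python
import os
import re
import tempfile
from collections import defaultdict

# Type, sequence number, remainder (e.g. "2004_Smith"), extension.
NAME_RE = re.compile(r"^([^_]+)_(\d+)_(.+)\.tiff?$", re.IGNORECASE)


def group_numerically(paths):
    """Map 'Type_rest' -> full paths ordered by sequence number."""
    groups = defaultdict(list)
    for path in paths:
        match = NAME_RE.match(os.path.basename(path))
        if match:
            kind, number, rest = match.groups()
            groups[kind + "_" + rest].append((int(number), path))
    # Sorting the (number, path) pairs puts 2 before 13.
    return {key: [p for _, p in sorted(items)] for key, items in groups.items()}


def write_lists(groups, out_dir):
    """Write one text file per group, one path per line."""
    for key, paths in groups.items():
        with open(os.path.join(out_dir, key + ".txt"), "w") as handle:
            handle.write("\n".join(paths) + "\n")


paths = ["/FILES/SMITH/Lease_%d_2004_Smith.tif" % n for n in (13, 2, 1)]
groups = group_numerically(paths)

with tempfile.TemporaryDirectory() as out_dir:
    write_lists(groups, out_dir)
    with open(os.path.join(out_dir, "Lease_2004_Smith.txt")) as handle:
        listing = handle.read().splitlines()

print(listing)
```

A plain alphabetical sort would put Lease_13 ahead of Lease_2, which is why the number is converted to an integer before sorting.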

There are a couple of configuration lines at the top of the file, where you can point the script to the top level folder containing all your files and the folder where you want the output files written.

You can edit the script with any text editor, then double click it to run it (or if double clicking doesn't work, just launch it from the console: navigate to its directory and then type './sequential.rb')

If you have any questions, let me know

Script link (http://roguepenguin.net/tif_sorting/sequential.rb) and test files (http://roguepenguin.net/tif_sorting/) here

subsonix
Jun 10, 2011, 10:30 AM
Have you tried dragging the folder into Terminal to get its path? It sounds like you either have a name that does not match or an incorrect path. Anyway, the reason for using argv here is that this is how a script plugs into Automator: you would choose select files and folders from Finder, then "Run Shell Script" with arguments as the input method, and Automator would then feed the Finder selection into the script.

But to test it, copy paste the code, save it, name it, chmod +x your_chosen_name.py then it should work. You might want to test it with the dragging option to get the path right, so just type it out until after $(ls then drag it in from Finder.

Edit:

When I test the script with the file names you have given here, it produces this output:



Output file name: Letter_2003_Smith.pdf
['Letter_1_2003_Smith.tif', 'Letter_2_2003_Smith.tif']

Output file name: Contract_2004_Smith.pdf
['Contract_1_2004_Smith.tif']

Output file name: Lease_2004_Smith.pdf
['Lease_1_2004_Smith.tif', 'Lease_2_2004_Smith.tif', 'Lease_3_2004_Smith.tif']



These are arguments you could just use for a convert script, or a tool that does several .tif files at once. sips is included in OS X, but I only managed to convert one .tif at a time. I guess you could do that for each file, then append to a PDF file. But for 10,000 files it's going to be a lot of temp files.
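
A hedged sketch of the one-file-at-a-time route: build one sips call per TIFF and collect the temporary PDF paths. The helper names and the /tmp output folder are made up for illustration, the append step is not shown, and I have only assumed the usual `sips -s format pdf IN --out OUT` form here:

```python
import os
import shutil
import subprocess


def sips_command(tif_path, out_dir):
    """Build the sips call that converts one TIFF to a one-page PDF."""
    base = os.path.splitext(os.path.basename(tif_path))[0]
    pdf_path = os.path.join(out_dir, base + ".pdf")
    return ["sips", "-s", "format", "pdf", tif_path, "--out", pdf_path]


def convert_group(tif_paths, out_dir):
    """Convert each file in turn; returns the temporary PDF paths."""
    pdfs = []
    for tif in tif_paths:
        cmd = sips_command(tif, out_dir)
        subprocess.check_call(cmd)  # one sips process per file
        pdfs.append(cmd[-1])
    return pdfs


cmd = sips_command("Lease_1_2004_Smith.tif", "/tmp")
print(" ".join(cmd))
if shutil.which("sips") is None:
    print("sips not found (it ships with OS X only); nothing converted")
```

Passing the command as a list rather than one string through os.system() means file names with spaces need no extra quoting.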

sarahw97
Jun 10, 2011, 07:58 PM
thanks both of you for your input- guess what I'll be doing this weekend?!
Will let you know how it goes. Thank you so much for your help so far.

subsonix
Jun 10, 2011, 08:11 PM
That's great, here's a Python video tutorial by Google that specifically addresses parsing, sorting and regular expressions that you might find useful.

http://www.youtube.com/watch?v=tKTZoB2Vjuk

sarahw97
Jun 11, 2011, 02:42 PM
OK I **think** I am getting really close.. much trial and error but I have the ruby script working.. I now know enough to be dangerous in navigating my way around Terminal :eek:

I am just stuck on how to tell Automator to take the content of a txt file, use it as a list of filenames and get those items as finder items ? hopefully this is simple but maybe more script is needed at this step?
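
On the script side, turning such a text file back into a list of file names is short. A minimal sketch, assuming one full path per line as described earlier in the thread; read_file_list is a hypothetical helper and the demo file is made up. How best to hand the result to Automator is left open here:

```python
import os
import tempfile


def read_file_list(list_path):
    """Return the non-empty lines of list_path as a list of paths."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a throwaway list file.
with tempfile.TemporaryDirectory() as demo:
    list_path = os.path.join(demo, "Lease_2004_Smith.txt")
    with open(list_path, "w") as handle:
        handle.write("/FILES/SMITH/Lease_1_2004_Smith.tif\n\n")
        handle.write("/FILES/SMITH/Lease_2_2004_Smith.tif\n")
    paths = read_file_list(list_path)
    # The list file's own name gives the name for the combined PDF.
    pdf_name = os.path.splitext(os.path.basename(list_path))[0] + ".pdf"

print(paths)
print(pdf_name)
```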

subsonix
Jun 11, 2011, 03:22 PM
To me it sounds like you will need to solve it in the script. You can use the file, as a starting point, open all files in it and convert them, then append them. But if you do this with a script there is no need to save it in a file as an intermediate step really. I have created a working Automator service that I can share if you are interested. I ended up calling the append pdf action from within the script. I have only tested it on 6 files however.

MasConejos
Jun 11, 2011, 07:05 PM
I have uploaded a workflow (http://www.roguepenguin.net/tif_sorting/Sequential.workflow.zip) that takes a list of files, gets the contents and passes it on to the 'New PDF From Images' action, and then renames the PDF to the standardized name.

This workflow uses the 'Dispense Items Incrementally (http://automator.us/leopard/downloads/index.html)' action to pass each file through the workflow, which you will need to download.

You can select all of the input text files on the first action of the workflow. The workflow currently saves all files to the desktop. If you want to change that you will need to set the save directory in the 'New PDF From Images' action and then change the path on the shell script command underneath it. Choosing a folder/path without spaces in the names will be easier for you. Also note that this will only loop for 5 minutes (Loop action at the bottom). After a small test run, you will probably want to increase the time from 5 minutes.

subsonix
Jun 11, 2011, 07:51 PM
Ok with no exercise left for the reader I might as well post the Service as well. Just drop this in: /Users/you/Library/Services

Then mark the files you want to apply the service on, right click and select it from the context menu.

(Use at your own risk, don't work on your only copy etc. ) :)

sarahw97
Jun 13, 2011, 01:00 AM
Thanks again to both Subsonix and MasConejos for your extremely awesome help on this. Both methods work and I truly appreciate your help, I didn't know where to start.. I thought this was going to be more straightforward than it turned out to be. I have learned a great deal, so thank you. MasConejos, your script/Automator workflow is easier for me to follow, both the steps and the scripting; I feel that I can work with this more in future and develop more workflows from it, so I really appreciate your help.

Subsonix; one question on your script/service - it is great and is working well on combining the multiple pages into one file etc, but it seems to be creating the new "combined" document as a .TIF file.. can't open it unless I change the file extn to pdf, and then it is perfect. If you are able to tell me which bit of code I would need to adjust in order to specify the .pdf extension, I would be grateful.


# OK all files in group are converted to temporary pdf files
# now let's append them to 1 file with the correct name
writeContext = CGPDFContextCreateWithURL(CFURLCreateFromFileSystemRepresentation(kCFAllocatorDefault, output_filename , len(output_filename), False), None, None)

# create PDFDocuments for all of the files.
docs = map(createPDFDocumentWithPath, temporary_files_list)

# find the maximum number of pages.
maxPages = 0
for doc in docs:
    if CGPDFDocumentGetNumberOfPages(doc) > maxPages:
        maxPages = CGPDFDocumentGetNumberOfPages(doc)

append(writeContext, docs, maxPages)

CGPDFContextClose(writeContext)
del writeContext
#CGContextRelease(writeContext)

subsonix
Jun 13, 2011, 01:39 AM
The problem is this line:


output_filename = re.sub("(.tif$|.tiff$)", ".pdf", group)


It only looks for names ending in either ".tif" or ".tiff". Adding cases for ".TIF" and ".TIFF" solves this; the line should look like this:

output_filename = re.sub("(.tif$|.tiff$|.TIF$|.TIFF$)", ".pdf", group)

re.sub looks for a name ending with any of (.tif$|.tiff$|.TIF$|.TIFF$) and, if found, replaces that ending with ".pdf".
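
A possible variation on that line, offered as a sketch rather than as what the service actually contains: Python's re.IGNORECASE flag covers mixed case such as ".Tif", and escaping the dot stops the pattern from matching any character in that position. pdf_name is a made-up helper name:

```python
import re

# Escaped dot, optional second "f", anchored at the end, any letter case.
TIFF_EXT = re.compile(r"\.tiff?$", re.IGNORECASE)


def pdf_name(group):
    """Swap a trailing .tif/.tiff extension (any case) for .pdf."""
    return TIFF_EXT.sub(".pdf", group)


for name in ["Lease_2004_Smith.tif", "Lease_2004_Smith.TIF", "Lease_2004_Smith.Tiff"]:
    print(pdf_name(name))
```

With the unescaped ".tif$" form, a name that merely ends in the letters "tif" with no dot before them would also be rewritten.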

sarahw97
Jun 13, 2011, 08:01 PM
Thanks again; have now processed all 26,832 of these horrible .tif files into PDFs.
Now all I have to do is read the documents.. Thanks again for your help, greatly appreciated.