File type using Terminal - Bash Script




SRossi
Oct 19, 2009, 01:23 PM
Hey all,

I am currently working on a bash script that will back up and restore a user's files of their choosing. These files may be text documents, spreadsheets or pictures. Each file type will be backed up to a different folder.

My plan is to ask them for the path to the file they want, work out what type of file it is, and then back it up to a particular folder depending on that type.

What I do not know is how to figure out the file type from the terminal. I had a look at the manual pages for the find and file commands, but they don't look like they will help me.

Does anyone have any idea how I could go about doing this?

Thanks in advance,

Stephen



LtRammstein
Oct 19, 2009, 01:57 PM
I think the easiest way for you to do this is to use the path to your advantage. What I mean is when you ask for the path of the file, it'll have the extension with it. All you need to do is look at the extension.

Example:

bash-3.2$ ./backup /Users/Anon/Documents/list.xlsx


From here you can just look at the .xlsx and know that it's an Excel file.

Hope this helps.

lee1210
Oct 19, 2009, 02:07 PM
There is a UNIX program "file" that will give you all sorts of information about a file. Not sure if its output will be sufficient for your uses, but it's worth a try.

-Lee

chown33
Oct 19, 2009, 02:23 PM
I'm puzzled as to why backing up files by type is necessary, or even a good idea. If I backup a folder containing various types of files, I expect to recover them in their original arrangement, not rearranged by file-type.

That said, the mdls command can extract file-type and a host of other metadata, and give it to you in textual form. Read the man page first, but I recommend trying it on different file-types like image, text, audio, etc. to learn how it works. Its metadata key names are the same as the documented keys for Spotlight Metadata Attributes.

Also, you can use mdfind to search for files based on metadata. It's the command-line equivalent to initiating a Spotlight search, or a smart folder.

SRossi
Oct 19, 2009, 06:01 PM
I think the easiest way for you to do this is to use the path to your advantage. What I mean is when you ask for the path of the file, it'll have the extension with it. All you need to do is look at the extension.

Example:

bash-3.2$ ./backup /Users/Anon/Documents/list.xlsx



Yeah, I was going to use the file extension; it seemed the easiest way to determine what the file type was.

There is a UNIX program "file" that will give you all sorts of information about a file. Not sure if its output will be sufficient for your uses, but it's worth a try.

Yeah I had a look at the file command but I didn't know if I could use an if statement on the result of the query?

I'm puzzled as to why backing up files by type is necessary, or even a good idea. If I backup a folder containing various types of files, I expect to recover them in their original arrangement, not rearranged by file-type

I'm not backing up entire directories, I am only backing up certain files. Yeah, I know it seems a really stupid idea, but it's my university's idea, not mine.

What I was thinking was doing a case statement like:

case "$FILE" in
*.txt)  : ;;  # Do text document backup
*.xlsx) : ;;  # Do spreadsheet backup
*.jpg)  : ;;  # Do picture backup
*)      : ;;  # Give error about file type
esac

Do you think this would be a way to go about it? Or would there be an easier way? Like as Lee says using file gives the file type but would I be able to input the output of the query into an if statement?

Thanks so far,

Stephen

lee1210
Oct 19, 2009, 06:20 PM
Here's an example I just whipped up:

# read line by line so filenames containing spaces stay intact
file * | grep text | awk -F: '{print $1}' | while IFS= read -r X
do
echo "$X"
done


Obviously echoing isn't very interesting, but this should give you each filename matching the grep. You can obviously make this much more advanced. Any plain text file should contain "text" in the output from file. For JPEGs, I get:
JPEG image data, EXIF standard

This means you can look for, say "JPEG image data" instead of .jpg, .JPG, .jpeg, .JPEG, etc.

I just saved an xlsx. I got:
"Zip archive data, at least v2.0 to extract"
which is not particularly helpful, so maybe extension is best after all (but OS X is not CP/M or DOS, so things don't need file extensions).
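
A rough sketch of how the file descriptions quoted above could drive a case statement. The function names and the keyword set are made up for illustration, and the patterns only cover the strings seen in this thread. Real output from file varies between systems and versions, so it is worth checking on the target machine first.

```shell
# Map a description string printed by `file` to a keyword (hypothetical helper).
classify_description() {
    case "$1" in
        *"JPEG image data"*) echo "image" ;;
        *text*)              echo "text" ;;
        *)                   echo "other" ;;
    esac
}

# Classify an actual file; -b asks `file` to leave out the leading filename.
classify_file() {
    classify_description "$(file -b -- "$1")"
}

# Try the classifier on the descriptions mentioned in this thread.
classify_description "JPEG image data, EXIF standard"
classify_description "ASCII text"
classify_description "Zip archive data, at least v2.0 to extract"
```

The Zip description falls through to "other", which matches the point that file alone does not tell an xlsx apart from any other zipped archive.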

-Lee

chown33
Oct 19, 2009, 06:51 PM
I just saved an xlsx. I got:
"Zip archive data, at least v2.0 to extract"
which is not particularly helpful, so maybe extension is best after all (but OS X is not CP/M or DOS, so things don't need file extensions).

A lot of different files are zipped archives. The 'file' command can only look at the data fork of a file, not any metadata or xattrs. The 'mdls' command can look at metadata, so it often has more specific information than 'file' can muster. Try the kMDItemContentType attribute-name in particular.

chown33
Oct 19, 2009, 07:03 PM
I'm not backing up entire directories, I am only backing up certain files. Yeah, I know it seems a really stupid idea, but it's my university's idea, not mine.


That can be interpreted several different ways, and my advice might change depending on which one it is.

Is this a homework assignment? A term project?

Or is this something your university's IT department intends to use for actual data backups?

Which OS versions does it have to run on?


What I was thinking was doing a case statement like:

case "$FILE" in
*.txt)  : ;;  # Do text document backup
*.xlsx) : ;;  # Do spreadsheet backup
*.jpg)  : ;;  # Do picture backup
*)      : ;;  # Give error about file type
esac

Do you think this would be a way to go about it? Or would there be an easier way? Like as Lee says using file gives the file type but would I be able to input the output of the query into an if statement?

What about text files that don't end in ".txt"? What about images that aren't JPEG, or JPEGs that don't end in ".jpg"? And is ".xlsx" really the only spreadsheet suffix? What about Numbers?

I recommend breaking this into two sub-problems:
1. Identify the type of file.
2. Backup the file based on its type.

If you modularize these two sub-problems, then you can change how files are identified without having to change how a file is backed up.

You will need a consistent set of identifiers that represent the classifications of the file, which the backup module then uses to decide how to backup the file. I recommend NOT using suffixes or extensions, but a set of plain keywords would work fine, e.g. "text", "image", "spreadsheet", and "other". The "other" category would not be backed up.
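
One possible shape for the two modules, as a minimal sketch. The names classify, backup_file and BACKUP_ROOT are invented for this example, and the suffix test inside classify is only a stand-in for whichever detection method ends up being used. The point is that backup_file sees nothing but the keyword.

```shell
# Hypothetical destination directory; override it before calling backup_file.
BACKUP_ROOT=${BACKUP_ROOT:-"$HOME/backup"}

# Sub-problem 1: identify the type and print one keyword.
# The suffix check is a placeholder; it can be replaced by `file` or `mdls`
# based detection without touching backup_file.
classify() {
    case "$1" in
        *.txt)         echo "text" ;;
        *.jpg|*.jpeg)  echo "image" ;;
        *.xlsx)        echo "spreadsheet" ;;
        *)             echo "other" ;;
    esac
}

# Sub-problem 2: back up the file using only the keyword.
backup_file() {
    local src=$1 kind
    kind=$(classify "$src")
    if [ "$kind" = "other" ]; then
        echo "not backing up: $src" >&2
        return 1
    fi
    mkdir -p "$BACKUP_ROOT/$kind" && cp -p -- "$src" "$BACKUP_ROOT/$kind/"
}
```

Calling backup_file "$FILE" would then copy a text file into "$BACKUP_ROOT/text", an image into "$BACKUP_ROOT/image", and so on.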

SRossi
Oct 20, 2009, 06:33 AM
That can be interpreted several different ways, and my advice might change depending on which one it is.

Is this a homework assignment? A term project?

Or is this something your university's IT department intends to use for actual data backups?

Which OS versions does it have to run on?



It is a homework assignment, and it has to be run on any type of Linux system.


What about text files that don't end in ".txt"? What about images that aren't JPEG, or JPEGs that don't end in ".jpg"? And is ".xlsx" really the only spreadsheet suffix? What about Numbers?

I recommend breaking this into two sub-problems:
1. Identify the type of file.
2. Backup the file based on its type.

If you modularize these two sub-problems, then you can change how files are identified without having to change how a file is backed up.

You will need a consistent set of identifiers that represent the classifications of the file, which the backup module then uses to decide how to backup the file. I recommend NOT using suffixes or extensions, but a set of plain keywords would work fine, e.g. "text", "image", "spreadsheet", and "other". The "other" category would not be backed up.

I had broken the problem down into roughly the same sub-problems and was thinking along the same lines, but how would I set up the keywords like "text", etc.?

I know it would be something like (using your mdls command):

mdls $FILE

Which would give me the metadata including the file type, but how would I go about using the kMDItemContentType in a validation?

And lastly Lee thanks for the command and if I cannot get the script to run this way I am going to change my design to incorporate your example.

Thanks so far,

Stephen

pilotError
Oct 20, 2009, 07:38 AM
Yeah, I was going to use the file extension; it seemed the easiest way to determine what the file type was.



Yeah I had a look at the file command but I didn't know if I could use an if statement on the result of the query?



I'm not backing up entire directories, I am only backing up certain files. Yeah, I know it seems a really stupid idea, but it's my university's idea, not mine.

What I was thinking was doing a case statement like:

case "$FILE" in
*.txt)  : ;;  # Do text document backup
*.xlsx) : ;;  # Do spreadsheet backup
*.jpg)  : ;;  # Do picture backup
*)      : ;;  # Give error about file type
esac

Do you think this would be a way to go about it? Or would there be an easier way? Like as Lee says using file gives the file type but would I be able to input the output of the query into an if statement?

Thanks so far,

Stephen

You can just parse the filename for everything after the last "." and get the extension. It makes the case statement a little easier to deal with.


file_type=$(printf '%s\n' "$input_file" | sed -e 's/.*\.//')


You can also use if statements and change the case on the fly. That's the beauty of the shell, lots of options!
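
For comparison, the same extraction can be done with bash parameter expansion instead of sed, with the case folded to lowercase along the way. This is only a sketch: get_ext is a made-up name, and treating a name without a dot as having no extension is a choice made for this example.

```shell
# Print the lowercased extension of a path, or an empty line if there is none.
get_ext() {
    local name=${1##*/}          # drop any leading directories
    case "$name" in
        *.*) printf '%s\n' "${name##*.}" | tr '[:upper:]' '[:lower:]' ;;
        *)   printf '\n' ;;
    esac
}

get_ext "/Users/Anon/Documents/list.xlsx"
get_ext "Holiday Photo.JPG"
```

The explicit check for a dot matters because stripping "everything up to the last dot" from a name such as README leaves the whole name, which would then be mistaken for an extension.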

chown33
Oct 20, 2009, 08:56 AM
It is a homework assignment, and it has to be run on any type of Linux system.

Mac OS X is not a Linux system.

Linux doesn't have the mdls command.

Linux has the file command, but its output may not be the same as the output from the Mac OS X file command.

I don't think Linux has Excel, either, so ".xlsx" as the extension identifier for spreadsheets may not work.

You really need to look at the available commands and their output on an actual Linux system. Look for a 'basename' command (which is on Mac OS X, and may be on Linux, because it's a POSIX command).

Oh, and the:

mdls $FILE

would have to be:

mdls "$FILE"

in case there are spaces embedded in the pathname.
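
A small demonstration of what the quotes change, using a throwaway helper (count_args) and an invented path that contains a space:

```shell
# Print how many arguments were received.
count_args() { echo "$#"; }

FILE="/Users/Anon/My Documents/list.xlsx"

count_args $FILE      # unquoted: the shell splits the value at the space
count_args "$FILE"    # quoted: the value is passed as a single argument
```

With the unquoted form a command such as mdls would be handed two pathnames, neither of which exists.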

SRossi
Oct 20, 2009, 09:22 AM
Mac OS X is not a Linux system.

Linux doesn't have the mdls command.

Linux has the file command, but its output may not be the same as the output from the Mac OS X file command.

I don't think Linux has Excel, either, so ".xlsx" as the extension identifier for spreadsheets may not work.

You really need to look at the available commands and their output on an actual Linux system. Look for a 'basename' command (which is on Mac OS X, and may be on Linux, because it's a POSIX command).

Oh, and the:

mdls $FILE

would have to be:

mdls "$FILE"

in case there are spaces embedded in the pathname.

After I wrote that post I emailed a lecturer and asked whether it must only run on a Linux machine. He said no; as long as it is not a Windows script it will suffice. So, sorry, it just has to run on a Mac, most likely 10.6.

Ah I thought it would have quotes around the filename, but could I use mdls to validate it was a text file, or a spreadsheet etc? Like what I mean is:

1: Ask for text file
2: check if file exists - [ -f "$FILE" ]
3: validate that file is text file - This is the part I am trying to do
4: if validated, back up
4.1: otherwise give error and ask for another file.
5: Quit back to main screen

So how would I use the mdls command to validate it? Is there any way to take the output of the query and use it in an if statement or a case statement?

And pilotError, thanks for that command; I'm just really confused right now.

Thanks again,

Stephen

chown33
Oct 20, 2009, 12:22 PM
So how would I use the mdls command to validate it? Is there any way to take the output of the query and use it in an if statement or a case statement?

What you seem to be missing is the bash syntax for using the output of a command as a value for other commands, or assigning command output to a shell variable. This is called command substitution. You should read about it on the bash man page, by finding the Command Substitution section:

http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/bash.1.html

Basically, `command` or $(command) runs the command (with parameters if given) then strips off any trailing newlines and puts the output in place of the command-substitution expression. You really need to read this on the man page, rather than me paraphrasing it. Note the first form uses back-quotes (grave accents), not single-quotes/apostrophes.
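
A self-contained illustration of both forms, using printf as a stand-in for a real command so it can be pasted anywhere. The variable names are arbitrary.

```shell
# Both forms capture standard output; trailing newlines are removed.
new_style=$(printf 'plain text\n\n')
old_style=`printf 'plain text\n'`

# The captured value can be tested in an if statement...
if [ "$new_style" = "plain text" ] && [ "$old_style" = "plain text" ]; then
    echo "both captured: $new_style"
fi

# ...or fed straight into a case statement.
case "$(printf 'JPEG image data\n')" in
    *image*) kind="image" ;;
    *)       kind="other" ;;
esac
echo "$kind"
```

The $(...) form nests more easily than back-quotes, which is one reason it is usually preferred in new scripts.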

You also need to experiment with this on a manual command-line before adding it to a script. You can build up command-lines in any text editor, then copy and paste them into Terminal.app. This avoids having to retype lines manually.

For example, to assign the output of an mdls command to a variable:

someVar=`mdls -name kMDItemContentType "$FILE"`


If you echo "$someVar", it will contain the entire string output from the mdls command. You will have to parse this output further if you want only the value, because mdls generally outputs a distinctive header for the file-name, followed by name=value pairs for each metadata value.

You could do some parsing in the backquoted expression:

someVar=`mdls -name kMDItemContentType "$FILE" | grep kMDItemContentType `


You could use 'awk' to parse the output instead of grep, and this would permit a more precise output that only contained the value of kMDItemContentType. You might also use the -raw and -nullMarker options of mdls to produce more precise output. You should play with those first.
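
As a sketch of the awk approach: this assumes mdls prints the attribute as a line of the form kMDItemContentType = "public.jpeg", which should be confirmed on the target system. The helper names are invented, and content_type_of only works where mdls exists.

```shell
# Extract the quoted value from a line such as:
#   kMDItemContentType = "public.jpeg"
# Yields an empty string when the value is missing or not quoted.
parse_content_type() {
    printf '%s\n' "$1" |
        awk -F'"' '/kMDItemContentType *=/ { print $2; exit }'
}

# Mac OS X only: run mdls on a file and parse what it prints.
content_type_of() {
    parse_content_type "$(mdls -name kMDItemContentType "$1")"
}

# Try the parser on a sample line rather than on real mdls output.
parse_content_type 'kMDItemContentType = "public.jpeg"'
```

If the -raw option behaves as described in the man page, mdls -raw -name kMDItemContentType "$FILE" may make the parsing step unnecessary.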

So now that the content-type value is in a variable, you can proceed to classify it as image, text, or spreadsheet. There are any number of ways to do that. A case statement is one way, which is relatively straightforward to understand and code.
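
In bash this kind of multi-way branch is written with case. A minimal sketch, assuming the variable already holds a bare content-type value; the function name is made up, and the listed type identifiers are examples to be checked against what mdls reports for your own sample files.

```shell
# Map a content-type value to one of the keywords used by the backup step.
classify_content_type() {
    case "$1" in
        public.plain-text|public.rtf)
            echo "text" ;;
        public.jpeg|public.png|public.tiff)
            echo "image" ;;
        org.openxmlformats.spreadsheetml.sheet|com.microsoft.excel.xls)
            echo "spreadsheet" ;;
        *)
            echo "other" ;;
    esac
}

classify_content_type "public.jpeg"
classify_content_type "org.openxmlformats.spreadsheetml.sheet"
```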

Choosing a good classifying approach depends on how flexible the classifying has to be. Since it's a homework assignment, it probably doesn't need a lot of flexibility, so I'd go with 'case' unless or until it proves to be unwieldy. If this were an actual IT program, then separating the classification data (i.e. the mapping of mdls output values to file-class) from the shell-script might be useful, so it wouldn't be necessary to edit a shell-script every time a classification changes.
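
To illustrate what keeping the classification data outside the script could look like, here is a sketch that reads the mapping from a plain text file. The file format (one "type keyword" pair per line), the function name and the temporary file are all assumptions made for this example.

```shell
# Look up a content type in a map of "<type> <keyword>" lines.
# Prints "other" when the type is not listed.
lookup_class() {
    local wanted=$1 map=$2 utype class
    while read -r utype class; do
        if [ "$utype" = "$wanted" ]; then
            echo "$class"
            return 0
        fi
    done < "$map"
    echo "other"
}

# Build a small example map in a temporary file.
map_file=$(mktemp "${TMPDIR:-/tmp}/typemap.XXXXXX")
cat > "$map_file" <<'EOF'
public.plain-text text
public.jpeg image
org.openxmlformats.spreadsheetml.sheet spreadsheet
EOF

lookup_class "public.jpeg" "$map_file"
lookup_class "com.example.unknown" "$map_file"
rm -f "$map_file"
```

Adding a new file type then means adding a line to the map rather than editing the script.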

You might also look at other useful mdls values, such as kMDItemContentTypeTree. You can see their values by applying mdls to sample classifiable files, and you can understand more about what each value means by reading the Spotlight Metadata docs. You can also try the output from the 'file' command and see if it's any easier to parse or classify.
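
A sketch of how the type tree might be used: because it lists a file's type together with its parent types, one pattern per broad class can be enough. The function name is invented, and the sample text below only illustrates the general shape of the output; it is not captured from a real run. Spreadsheets are left out because it is not clear which parent type they would share on 10.6.

```shell
# Classify from the text printed for kMDItemContentTypeTree.
classify_type_tree() {
    case "$1" in
        *'"public.image"'*) echo "image" ;;
        *'"public.text"'*)  echo "text" ;;
        *)                  echo "other" ;;
    esac
}

# Illustrative sample of the output shape for a JPEG file.
sample='kMDItemContentTypeTree = (
    "public.jpeg",
    "public.image",
    "public.data",
    "public.item",
    "public.content"
)'
classify_type_tree "$sample"
```

This avoids having to list every individual image format, since any type that descends from public.image matches the first pattern.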