Okay so, there are lots of people out there that will swear by rsync as a replacement to Time Machine for a variety of reasons, but I'm not really one of them. However, my main gripe with Time Machine is that I want to backup to a NAS, then sync the backup to a cloud backup service (probably CrashPlan), in which case the sparsebundle that Time Machine creates is useless, as restoring a single file from the online backup would mean downloading a whole sparsebundle!
So I'm looking at using rsync, but the first thing anyone will likely notice when making the switch is that rsync is really slow; it can take a long time to find a changed file before it actually starts copying data. For smaller use cases this isn't an issue, but if you're backing up a lot (in my case around a million files coming to roughly 3 TB) the impact is massive.
Now, I believe Time Machine achieves a big part of its speed by using Spotlight indexes to quickly find changed files so it can start copying as soon as possible.
As a result I've been tinkering with using mdfind to try to replicate this, so that after an initial backup I can do something like the following:
Code:
SRC="/foo/bar"
# mdfind prints absolute paths, and rsync's --files-from expects paths relative to the source,
# so strip the source prefix from each line
mdfind -onlyin "$SRC" "kMDItemContentModificationDate >= \$time.iso($LAST_UPDATE)" | sed "s|^$SRC/||" > "/tmp/backup_files"
rsync -a --files-from="/tmp/backup_files" "$SRC" "server:/foo/bar"
This is a bit simplified: mdfind sometimes returns strange files (usually ones the current user doesn't have access to) that need to be filtered out first to stop rsync from failing, some extra rsync options are usually useful, and you need some logic to provide the LAST_UPDATE time in ISO format, but you get the idea.
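Roughly, the extra glue I have in mind looks something like this; the ~/.last_backup state file and the readability check are just placeholders for whatever I end up using:
Code:
# previous run's timestamp (ISO 8601, UTC); fall back to the epoch for the first run
LAST_UPDATE=$(cat "$HOME/.last_backup" 2>/dev/null || echo "1970-01-01T00:00:00Z")
date -u +"%Y-%m-%dT%H:%M:%SZ" > "$HOME/.last_backup.tmp"   # promote to ~/.last_backup once the rsync succeeds
mdfind -onlyin "$SRC" "kMDItemContentModificationDate >= \$time.iso($LAST_UPDATE)" \
  | while IFS= read -r f; do
      [ -r "$f" ] && printf '%s\n' "${f#$SRC/}"   # drop unreadable paths, make the rest relative to $SRC
    done > "/tmp/backup_files"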
The problem with this is that it's no good for finding files that have been removed, as Spotlight doesn't hold any useful information about deletions. So with my current backup scheme I've resorted to running something like the above for most backups, then once a fortnight (or if disk space drops below a certain threshold) I run a full (slow) rsync pass with the --delete parameter to clear out missing files.
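That occasional cleanup pass is nothing clever, just a plain full rsync along these lines:
Code:
# slow: rsync walks the whole tree itself; --delete removes anything on the server that no longer exists locally
rsync -a --delete "$SRC/" "server:/foo/bar/"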
It works, but I'm interested to know if anyone has investigated anything similar, or if anyone has ideas on how I can more efficiently detect files to be deleted from my backup?
My aim is to produce a script that simulates Time Machine quite accurately using rsync, using mdfind or similar features to accelerate backups where possible. Currently I just use it to create a mirror copy with no history, with online backup hopefully providing the file history for my use case, but I want to speed up the whole process as much as possible for it to be a genuine replacement.
I was thinking of a possible solution along these lines:
- Every hour, the machine uses mdfind to look up changes and rsync to send them to an hourly backup folder on the rsync receiver (see the sketch after this list). Each folder will thus only contain new and changed files.
- Periodically, hourly backup folders older than 24 hours are rsynced together into a daily backup, which is then synchronised against the previous daily backup using --link-dest to hard-link in file history. The command would look something like:
Code:
rsync -au --link-dest="/foo/bar/yesterday" "/foo/bar/yesterday/" "/foo/bar/today/"
- A similar process occurs to flatten older daily backups into monthly ones.
- If space is low then the oldest backups are discarded, though the most recent daily backup will remain. Hard-links should prevent files from being lost.
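For the hourly step mentioned above, I'm picturing something like the following; the backups/hourly layout and timestamp format are placeholders, and I'm assuming the parent folder already exists on the receiver:
Code:
# push only the changed files (from the mdfind list) into a timestamped hourly folder
HOUR=$(date +"%Y-%m-%d_%H")
rsync -a --files-from="/tmp/backup_files" "$SRC" "server:/backups/hourly/$HOUR/"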
What I'm thinking is that for each hourly backup, a hidden file would be generated containing the full file list for that backup. When a daily backup is created from flattened hourlies, only the latest file list would be retained; it can then be used during the hard-link step to avoid creating links to files that have since been deleted.
It still seems a bit of a hacky way to do things, but creating that file list with mdfind is very quick, and it should compress very easily when copied over scp, or if I can include it in the rsync operation somehow.
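As a rough sketch of that idea (the .backup_filelist name and the match-everything mdfind query are placeholders I haven't settled on):
Code:
# on the client: snapshot the full file list alongside the backup data
mdfind -onlyin "$SRC" "kMDItemFSName == '*'" | sed "s|^$SRC/||" > "$SRC/.backup_filelist"
# on the receiver: only hard-link files that still appear in the newest list,
# so anything deleted since yesterday never gets linked into today
rsync -au --files-from="/foo/bar/today/.backup_filelist" --link-dest="/foo/bar/yesterday" \
      "/foo/bar/yesterday/" "/foo/bar/today/"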