Using mdfind to accelerate rsync




haravikk
Apr 20, 2013, 05:43 AM
Okay so, there are lots of people out there that will swear by rsync as a replacement for Time Machine for a variety of reasons, but I'm not really one of them. However, my main gripe with Time Machine is that I want to back up to a NAS, then sync the backup to a cloud backup service (probably CrashPlan), in which case the sparsebundle that Time Machine creates is useless, as restoring a single file from the online backup would mean downloading a whole sparsebundle!

So I'm looking at using rsync, but the first thing that anyone will likely notice when making the switch is that rsync is actually really slow; it can take a long time for it to find a changed file so that it can actually start copying data. For smaller use cases this isn't an issue, but if you're backing up a lot (in my case around a million files totalling about 3 TB) then the impact is massive.

Now, I believe Time Machine achieves a big part of its speed by using Spotlight indexes to quickly find changed files so it can start copying as soon as possible.

As a result I've been tinkering with using mdfind to try to replicate this, so that after an initial backup I can do something like the following:

mdfind -onlyin "foo/bar" "kMDItemContentModificationDate >= \$time.iso($LAST_UPDATE)" > "/tmp/backup_files"
rsync -a --files-from "/tmp/backup_files" "foo/bar" "server:/foo/bar"

That's a bit simplified: mdfind sometimes returns strange files (usually ones the current user doesn't have access to) that need to be filtered out first to stop rsync from failing, some extra rsync options are usually useful, and you need some logic to provide the LAST_UPDATE time in ISO format. But you get the idea.
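A minimal sketch of the filtering and timestamp logic mentioned above. The function names (iso_now, mdfind_query, to_files_from) are made up for illustration, and the timestamp format is simply the one used with $time.iso() elsewhere in this thread. One extra assumption worth flagging: mdfind prints absolute paths, while rsync's --files-from expects paths relative to the source directory, so the filter also strips the source prefix. It keeps only readable regular files, which is a simplification.

```shell
#!/bin/sh
# Print the current UTC time in the format passed to $time.iso() in this thread.
iso_now() {
    date -u "+%Y-%m-%d %H:%M:%S +0000"
}

# Build the Spotlight query for items modified at or after the time in "$1".
mdfind_query() {
    printf 'kMDItemContentModificationDate >= $time.iso(%s)\n' "$1"
}

# Read absolute paths on stdin and print, relative to the source directory "$1",
# only those that lie under it and are readable regular files.
to_files_from() {
    src=${1%/}
    while IFS= read -r path; do
        case $path in
            "$src"/*) ;;          # inside the source directory: keep going
            *) continue ;;        # anything else is skipped
        esac
        [ -f "$path" ] && [ -r "$path" ] || continue
        printf '%s\n' "${path#"$src"/}"
    done
}
```

Usage would look something like: mdfind -onlyin "$SRC" "$(mdfind_query "$LAST_UPDATE")" | to_files_from "$SRC" > /tmp/backup_files, with SRC given as an absolute path, followed by the rsync --files-from call. File names containing newlines are not handled.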


The problem with this is that it's no good for finding files that have been removed, as Spotlight has nothing useful to say about files that no longer exist. So currently with my backup scheme I've resorted to running something like the above for most backups, then once a fortnight (or if disk space goes below a certain threshold) I run a full (slow) rsync pass with the --delete parameter to clear out missing files.

It works, but I'm interested to know if anyone has investigated anything similar, or if anyone has ideas on how I can more efficiently detect files to be deleted from my backup?

My aim is to produce a script that will simulate Time Machine quite accurately using rsync, and use mdfind or similar features to accelerate backups where possible. Currently I just use it to create a mirror copy with no history, with online backup hopefully providing the file history for my use-case, but I want to speed up the whole process as much as possible for it to be a genuine replacement :)


I was thinking of a possible solution along these lines:
Every hour, the machine uses mdfind to look up changes and rsync to send them to an hourly backup folder on the rsync receiver. Each folder will thus only contain new files.
Periodically, hourly backup folders older than 24 hours are rsynced together into a daily backup, which is then synchronised against a previous daily backup using --link-dest to hard-link in file-history. The command would look something like:
rsync -au --link-dest="/foo/bar/yesterday" "/foo/bar/yesterday/" "/foo/bar/today/"
A similar process occurs to flatten older daily backups into monthly ones.
If space is low then the oldest backups are discarded, though the most recent daily backup will remain. Hard-links should prevent files from being lost.
The problem here is that while discarding older backups will remove file-system overhead, it won't actually free much space, as files that have been deleted from the source will still be preserved in the backups that remain.
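The flattening and linking steps in the list above could be sketched roughly as follows. The function names, the RSYNC override variable, and the idea of passing the hourly folders oldest first are all assumptions for illustration; the second function just wraps the --link-dest command quoted above.

```shell
#!/bin/sh
# Allow the rsync binary to be overridden (e.g. RSYNC=echo for a dry look
# at the commands that would run).
RSYNC="${RSYNC:-rsync}"

# flatten_hourlies DAILY HOURLY...
# Merge the hourly folders into DAILY. Pass them oldest first so that
# newer copies of a file overwrite older ones.
flatten_hourlies() {
    daily=$1
    shift
    for hourly in "$@"; do
        "$RSYNC" -a "$hourly/" "$daily/" || return 1
    done
}

# link_history YESTERDAY TODAY
# Fill the gaps in TODAY with hard links to unchanged files in YESTERDAY.
# -u leaves anything that is already newer in TODAY untouched.
# Use absolute paths: a relative --link-dest is resolved against the destination.
link_history() {
    "$RSYNC" -au --link-dest="$1" "$1/" "$2/"
}
```

For example, flatten_hourlies /foo/bar/today /foo/bar/hourly-01 /foo/bar/hourly-02 followed by link_history /foo/bar/yesterday /foo/bar/today (the hourly folder names here are hypothetical). As noted above, nothing in this sketch removes deleted files.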

What I'm thinking is that for each hourly backup I could generate a hidden file containing the full file list for that backup. When a daily backup is created from flattened hourlies, only the latest file list would be retained, and it can then be used while creating hard links to avoid linking files that have since been deleted.

It still seems a bit of a hacky way to do things, but creating that list of files from mdfind is very quick and should compress very easily while copying over scp or if I can include it in the rsync operation somehow.
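A rough sketch of the file-list idea, assuming the list is just one relative path per line, sorted. It uses plain find as a portable stand-in for mdfind, and the function names are invented for this example. Comparing an older list against the latest one with comm gives the paths that have disappeared, which is the set that should not be hard-linked forward.

```shell
#!/bin/sh
# Print a sorted list of every regular file under "$1", relative to it.
write_manifest() {
    ( cd "$1" && find . -type f -print | sed 's|^\./||' | LC_ALL=C sort )
}

# deleted_since OLD_LIST NEW_LIST
# Print the paths that appear in OLD_LIST but not in NEW_LIST.
# Both lists must be sorted in the same (C) collation order.
deleted_since() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

The output of deleted_since could then be fed to whatever prunes or skips those paths on the receiver. File names containing newlines would break the one-path-per-line format.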



ytk
Apr 20, 2013, 11:16 AM
I'm not sure I understand what exactly you're trying to do here, so just a few thoughts:

Why not just rsync to the server using link-dest? It works fine for remote destinations as well as local ones, as long as the target filesystem supports hard links.

You don't necessarily have to download the entire sparsebundle to restore a file. If you have an older copy of it, you could just rsync back the bands that have changed. Seems hacky and a pain in the butt, though.

If your NAS or backup service or whatever allows you to mount the remote folder, you can simply mount the sparsebundle remotely without downloading the entire thing. OS X will allow you to browse it as if it were a local drive, but should only download the files as you access them. I know this works over AFP, but it might work over SMB or even SFTP (using something like ExpanDrive to mount the remote server locally).

haravikk
Apr 20, 2013, 02:06 PM
ytk wrote: "Why not just rsync to the server using link-dest? It works fine for remote destinations as well as local ones, as long as the target filesystem supports hard links."
If I'm able to use mdfind to quickly grab the list of updated files since a specific date, then --link-dest won't create a whole snapshot as it'll only have a limited number of files to compare against. It's still useful for accelerating the transfer though, for files that are changed rather than entirely new.


This is why sending the file list across will be necessary, as it provides the full list for the receiver to work with in a --link-dest call. The advantage of sending the list and running the linking on the server itself is that it should eliminate the overhead that encrypted SSH can add when running rsync remotely, and even a huge file list of a million files compresses down to about 5 MB. It would be even better if I could use mdfind to generate it, though, as a regular find command can take around 10-15 minutes. It's still a lot of work to link that many files up (I'm not sure if rsync uses directory hard-links if they're available?), but since the actual transfer of new data should be complete by that point, it ought to be relatively quick as a finishing-off operation.

There's also the possibility of only handling the linking when hourly backups are to be discarded, by merging them until an end point is found then linking into a daily backup.


ytk wrote: "You don't necessarily have to download the entire sparsebundle to restore a file."
True, but for a big enough backup it's still a colossal operation: a 3 TB sparse bundle requires nearly 400,000 8 MB bands, and even relatively small updates can cause several bands to change at once (more still, it seems, if they're encrypted). Plus, rsyncing against a new sparse bundle would require you to either overwrite it or have enough space for the second bundle. --link-dest might help again, but if you go back by even just a few weeks then that can still be a lot of data.
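As a quick check of the band count, here is the arithmetic in shell, assuming binary units (3 TiB split into 8 MiB bands); the function name is just for illustration. With decimal units the figure would be 375,000 instead, so either way it's in the "nearly 400,000" ballpark.

```shell
#!/bin/sh
# band_count SIZE_TIB BAND_MIB
# Number of bands needed for SIZE_TIB tebibytes at BAND_MIB mebibytes per band.
band_count() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

band_count 3 8   # prints 393216
```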

The other problem is that a lot of online backup services don't offer things like AFP or rsync. CrashPlan for example has its own client, and I'm not sure it uses anything I could interact with directly instead.


Anyway, part of my reason for using a workaround like this is that I'm hoping to create a shell script that will be suitable for any Unix-based OS, allowing Linux machines to produce Time Machine style backups that function in exactly the same way as those created by my OS X machines. While I know mdfind is OS X only, I'm only going to use it if an appropriate argument is provided, and fall back to more compatible but slower methods, like find, as needed.
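The opt-in fallback described above might look something like this sketch. The function name, the "yes" argument, and the stamp-file convention (a file written at the end of each backup whose contents are the ISO time and whose modification time marks the same moment) are all assumptions made for the example.

```shell
#!/bin/sh
# list_changed DIR STAMP_FILE [yes|no]
# Print files under DIR changed since the last backup. mdfind is used only
# when explicitly requested AND available; otherwise plain find is used.
list_changed() {
    dir=$1
    stamp=$2
    use_mdfind=${3:-no}
    if [ "$use_mdfind" = yes ] && command -v mdfind >/dev/null 2>&1; then
        # The stamp file's contents hold the last backup time in ISO format.
        since=$(cat "$stamp")
        mdfind -onlyin "$dir" "kMDItemContentModificationDate >= \$time.iso($since)"
    else
        # Portable path: compare against the stamp file's modification time.
        find "$dir" -type f -newer "$stamp" -print
    fi
}
```

Something like date -u "+%Y-%m-%d %H:%M:%S +0000" > "$STAMP" after each successful run would maintain the stamp. The two branches aren't exactly equivalent (find -newer is strictly newer and this version lists regular files only), so the results would still need the same filtering either way.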

I just can't figure out why mdfind is only returning a handful of entries when I target a folder, though. It seems to require search criteria, so I'm providing something that should always match (kMDItemContentModificationDate >= $time.iso(1970-01-01 00:00:00 +0000)), but it doesn't return anywhere close to the number of files that find does. That's weird, since the folder I'm testing with is a past Time Machine backup (chosen because it should be static for all intents and purposes), which should guarantee that the entire contents, or at least the majority of them, are Spotlight indexed.


[edit]
Okay, it seems that mdfind, or Spotlight in general, just doesn't index Time Machine backups in the way I expected, which means it's of no use for speeding up my particular use-case (since I'm rsyncing from a Time Machine backup in order to avoid having to recreate the same exclusion list). Spotlight only seems to index changed files in each individual Time Machine backup; anything that is hard-linked is completely unsearchable if you restrict the search to a particular folder. mdfind should still be fine for regular folders that aren't part of a Time Machine backup, provided they aren't excluded in some way, otherwise I have to use regular find. If I can multi-thread it then performance should still be okay, just not great.