I have an "interesting" problem related to disks holding local copies of a web app I'm developing. For the past 5 years I've been using MAMP on my Mac for development, transferring files to a leased server with sftp as I go. The entire site consists of almost 142 million individual files (~2.8TB) on an APFS-formatted 4TB Samsung T7 (with a CCC backup on another T7).

So far, so good, everything works really well. But now I want to get away from MAMP and have set up a home Linux server. I got another 4TB T7, formatted it as exFAT and am cloning one of the APFS disks to it. But this really isn't cutting it; after 46 hours, it has only copied 23 million files (530GB). More concerning, the Finder only reports about 800GB of free space on the target exFAT disk. In an earlier test I tried cloning a 1TB disk with a subset of my site to a 2TB disk with CCC, and it stopped with a "disk full" error after a while. How do you run out of space cloning a 1TB disk to a 2TB disk? Looks like I'm headed for the same problem with the 4TB disk.

So, what's the best way to copy 142 million files from an APFS disk to an exFAT disk? This CCC clone is so slow that it seems like I could sftp the files to the Linux box just as quickly. Would rsync be a solution? I could do that from Terminal on the Mac, either directly to the exFAT disk or over the LAN to the Linux machine (using gigabit ethernet). I don't really have any experience with that, but it seems like nothing could be any slower than what I'm doing now! :)

[edit] I stopped the CCC copy, since that was clearly not going to work, but I would like to preserve the files I've already copied before switching to a different method. The way things stand, though, there's just not going to be enough space on the disk. Here's the output from df in the terminal: /dev/disk3s1 is the source APFS-formatted 4TB T7, /dev/disk4s2 is the destination exFAT-formatted 4TB T7. Any idea what's going on here? I checked, and there are no snapshots.

[Screenshot of df output]
 
It's the allocation size specified when the disk was formatted. Think of allocation size as blocks: an individual file takes up a whole number of blocks. If your files are very small and your allocation block size is large, a lot of space gets wasted. Disk Utility picks the exFAT allocation block size based on the size of the volume, and on a disk this large it can be as big as 128KB, which is fine for general use. exFAT on the Mac is a poor fit for small files, especially millions of them, because of that default allocation block size. Plus, exFAT has no journaling to protect against data loss.
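As a back-of-the-envelope check, assuming a 128KiB allocation block size (which is what diskutil ends up reporting for this 4TB exFAT volume): every file occupies at least one full allocation block, so the minimum footprint of 142 million mostly-tiny files is far larger than the ~2.8TB of actual data.

Code:
# minimum on-disk footprint of 142 million files at one 128 KiB allocation block each
echo $(( 142000000 * 131072 ))   # 18612224000000 bytes, roughly 18.6 TB -- far more than a 4 TB disk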

Why not tar and gzip all these files into one or a few .tar.gz archives? Text compresses very well. Then copy the big archive files to the disk; one or a few large archives will copy much faster than millions of individual small files. You can then unpack the archives on the target system.
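A minimal sketch of that approach, assuming the site lives under ~/Sites/mysite on the Mac (the paths here are placeholders):

Code:
# on the Mac: pack one directory per archive so no single file gets unmanageably large
tar -czf mysite-maps.tar.gz -C ~/Sites/mysite maps
# on the Linux box, after copying the archive over: unpack into the web root
tar -xzf mysite-maps.tar.gz -C /var/www/mysite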
 
Thanks, I was thinking of trying something like that. When I had to set up a new server a couple of years ago after the hosting company messed things up, I just uploaded the zip files I already had for each directory and unzipped them, which was much faster than copying individual files.

But should I reformat the 4TB exFAT disk and specify a different block size? This is the 4TB disk containing the full site from the remote server.

Would I need to use diskutil from Terminal, or is there a way to do that with Disk Utility in macOS? Or should I reformat the disk on the local Linux box that I'm setting up (a 2018 Mini running Ubuntu natively)? Thanks for your insights!

[edit] If I run diskutil info disk4s2 in Terminal for the exFAT disk on my Mac, it says:

Code:
Disk Size:             4.0 TB (4000769376256 Bytes) (exactly 7814002688 512-Byte-Units)
Device Block Size:           512 Bytes
Volume Total Space:        4.0 TB (4000645644288 Bytes) (exactly 7813761024 512-Byte-Units)
Volume Used Space:        3.3 TB (3251197313024 Bytes) (exactly 6349994752 512-Byte-Units) (81.3%)
Volume Free Space:         749.4 GB (749448331264 Bytes) (exactly 1463766272 512-Byte-Units) (18.7%)
Allocation Block Size:       131072 Bytes
 
The Allocation Block Size is 131072 (i.e. 128KiB). This means the smallest file (1 byte) will actually consume 128KiB, which is a single allocation block.

Calculating from the disk capacity values, that gives you a total of 30523448 allocation blocks. That's the max number of files you could possibly have on that disk. It would actually be less, because directory space comes out of that (I think), and any file > 128KiB would need more allocation blocks.

Since 30523448 (about 30 million) is less than 142 million, you'd need a smaller Allocation Block size in order to fit that many files on that disk.

This calculation uses the count of 512-byte blocks:
7814002688 / 2 / 128

First, divide by 2 to get the count of 1KiB blocks. Then divide by 128, which is the Allocation Block size measured in KiB. The result is the count of 128KiB allocation blocks.
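To double-check that arithmetic in Terminal (plain shell arithmetic, nothing disk-specific):

Code:
# 512-byte units -> KiB -> 128 KiB allocation blocks
echo $(( 7814002688 / 2 / 128 ))   # prints 30523448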
 
Thanks! It will take a bit of time to wrap my mind around this. That 142 million files number comes from Carbon Copy Cloner, but it seems to agree with the number of inodes on the 4TB disk on my leased server. If I run df -i on the remote server, this is what I get:

Code:
Filesystem                     Inodes     IUsed      IFree IUse% Mounted on
/dev/mapper/almalinux-root 2144552960 141599798 2002953162    7% /

This is the df -h output:

Code:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/almalinux-root  4.0T  2.7T  1.4T  66% /

blkid says this: BLOCK_SIZE="512" TYPE="xfs"

I was hoping to format this in a way that would also be readable on macOS, but maybe that's not going to work? I tried connecting the 4TB T7 to my Ubuntu server and reformatting it using the defaults in the GUI "Disks" app, and got this with df -i:

Code:
Filesystem       Inodes  IUsed    IFree    IUse%
/dev/sdb1      244146176     11 244146165    1%

blkid says this: BLOCK_SIZE="4096" TYPE="ext4"

That suggests it could hold as many as ~244 million files, while the leased server could handle ~2 billion (I think?). It's not inconceivable that I could exceed 244 million files as my site grows, so would I regret this default formatting later? And it's not quite clear to me whether this will be Mac-readable, or if that even matters (since I also have an APFS version of the disk).

Tempted to just go with this.... what do you think? Should I reformat with a smaller block size?
 
Have you considered using multiple partitions (volumes), and setting up mount-points in such a way that the total number of files is more evenly distributed across multiple dir sub-trees? It could still reside on one physical disk device, it just breaks up the monolithic structure into multiple "sub-lithic" parts.

It may seem like ancient history, but that's the kind of trick we had to do when max i-nodes per volume was 65536 (hooray for 32-bit i-node numbers).

Barring that, it does look like the /dev/sdb1 i-node count would work.
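(As an aside: if you ever did need more inodes, ext4's inode density is chosen at mkfs time via bytes-per-inode rather than by block size. A minimal sketch, assuming the T7 still shows up as /dev/sdb1 on the Linux box; the device name and label are placeholders, and this of course erases the volume.)

Code:
# default is one inode per 16 KiB of space (~244M inodes on 4 TB); halving that doubles the inode count
sudo mkfs.ext4 -i 8192 -L maptiles /dev/sdb1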
 
Yeah, I've considered that in the past, but it really makes things easier to have one big partition. But yeah, I started out with BSD Unix (on a VAX 11/750) in 1985 and things have come a long way. We got a Mac that same year and a Macintosh II when it came out, with a 1GB external SCSI hard disk. The director of the Computing Center stopped by and said "One gigabyte??? Before now, I don't think there was a total of one gigabyte on this whole campus!" 🤣
 
If all the data on mag tape was counted, there probably was more than 1GiB. That's only 1024 MiB. Not fast, and not random-access, but capacity-wise tape was fairly good.


Another idea to consider is using read-only mounts if a lot of the data is read-only. One reason for this is it greatly reduces the potential for data corruption if something goes awry, like an unexpected shutdown or an accidental cable disconnect.

It might even be practical to use a compressed read-only disk-image for the read-only data. A friend of mine did that for data originally on several dozen DVDs, and the compressed data was significantly smaller, which helped a lot for both disk space and I/O speed. It took a fair amount of time to compress the data, but it's a one-time cost.
 
Haha, yeah, Ed might have been exaggerating about the gigabyte, but we were clearly only talking about disk drives. Their VAX 11/750 had two washing-machine-sized hard drives that I'm pretty sure didn't even add up to a full gigabyte. 😀

Anyway, those are all good suggestions and worth considering as I grow. In the past 5 years I thankfully haven't had issues with data corruption (of course, it could happen anytime). It's a cloud server with RAID-10 SSDs and redundant hardware.

But this home machine is something completely different (a 2018 Mini); I'm just trying to create a similar Linux environment and transition away from using MAMP on my Mac. I'm just going to use this default Ubuntu formatting; it should be fine until I grow beyond 4TB (the disk is about 3/4 full now using 142M inodes, so I should be able to fill it and still stay well under the 244M limit).

And like everything else, it will be a good "learning experience" for me. Thanks, let's see how this goes...
 
Doesn't the ZFS file system have extra tools for data integrity? It's a bit RAM-intensive, though; BTRFS might be a better candidate for home use.
 
I was hoping to format this in a way that would also be readable on macOS, but maybe that's not going to work?
Paragon sells an ext3/ext4 file system implementation for macOS (ext4 being the default for Linux) that lets you exchange disks fairly seamlessly with Linux. It's generally useful if you have Linux boxes as well as Macs, and avoids messing around with exFAT…


However, I'd probably just use rsync -s over Ethernet; it will take an age, but it's quick to resume a failed transfer.
 
Thanks for all the good ideas! But I already have .zip files for everything, and I'm transferring them over ftp now, which is fast but involves a fair amount of work for me. Interesting: I was using sftp and only getting 17MB/sec, which seemed very slow over gigabit ethernet. I realized that plain ftp would be faster, since no encryption is involved, and I'm not worried about security since this is all on my LAN and my little server will never be accessible from outside. Wow... the difference was much more dramatic than I expected. I'm getting 119MB/sec with ftp, which completely saturates my LAN; I can't even stream music or watch video on my Apple TVs while a transfer is running! 🤣 Also interesting that unzipping big archives is much faster on Linux than with unzip in Terminal on macOS, even though the Linux box is an identical 2018 Mini.

My current thinking is that I don't really need this disk to be Mac-accessible, because I already have two APFS disks with everything. Once I get things set up the way I want, I'll probably just get another 4TB disk to back up the server (or re-purpose one of the APFS disks I currently have).
 
Doesn't the ZFS file system have extra tools for data integrity? It's a bit RAM-intensive, though; BTRFS might be a better candidate for home use.
One of the selling points for ZFS was that it would monitor the error data from the disks and move data away from sectors that were showing recoverable read errors. ZFS also supports several forms of RAID, as well as "Time Machine"-like file recovery.

I've noticed that copying files to an exFAT-formatted disk can take a lot longer than copying to an HFS+ disk.
 
I'm a major rsync fan, especially since you can restart copies when interrupted. It's been a long time since I used it, but the subset of the man page options that I find useful is:

(This is my typical restartable command)

rsync -axHvhPE source destination

a = archive
x = don't cross filesystem boundaries
H = preserve hard links
v = verbose
h = human readable
P = save intermediate Partial files (--partial) and show Progress (--progress)
E = extended attributes, resource forks

Other options worth knowing:

n = dry run
--no-owner
--no-group

A trailing slash on the source changes this behavior to avoid creating an additional directory level at the destination. You can think of a trailing / on a source as meaning "copy the contents of this directory" as opposed to "copy the directory by name"; in both cases the attributes of the containing directory are transferred to the containing directory on the destination. The first two commands below copy the files in the same way, including setting the attributes of /dest/foo; the third behaves differently:

rsync -avhP /src/foo /dest
copies "foo" itself into /dest, creating a /dest/foo subdirectory

rsync -avhP /src/foo/ /dest/foo
copies the contents of "foo" into /dest/foo, implicitly creating "foo" at the destination

rsync -avhP /src/foo/ /dest
puts the contents of "foo" directly into /dest, with no "foo" directory

For reference, -a expands to -rlptgoD:

-r, --recursive    recurse into directories
-l, --links        copy symlinks as symlinks
-p, --perms        preserve permissions
-t, --times        preserve modification times
-g, --group        preserve group
-o, --owner        preserve owner (super-user only)
-D                 same as --devices --specials
--devices          preserve device files (super-user only)
--specials         preserve special files
 
Those are some excellent rsync pointers - I know the basics but have never seriously used it, so this should be helpful in the future! I don't know whether this would have worked for me, but I suspect it would have been problematic.

I think the problem would be the overhead of copying more than 140 million files individually over my LAN. For the sake of argument, if it could copy 1,000 files per second, that would take about 40 hours, which wouldn't be bad. But my guess is that it couldn't copy anywhere near that many files per second; it might be more like 10 per second, which would be 4,000 hours. I even question whether it could manage 10 per second, considering all the per-file transfer overhead.

So I used the technique I mentioned earlier and sftp'ed zipped copies of full directories. These are all large files; the largest held 18 million files and was more than 300GB. I thought I was clever, as I posted above, and used regular ftp, which was much faster on my LAN than sftp. I had two streams going simultaneously and was also unzipping multiple files at the same time. Things were really flying... until I got a string of I/O errors ending in a corrupt disk. I wasted some time trying to recover but decided I wouldn't trust it even if I succeeded, so I bailed and started over.

I had been doing this with Ubuntu on a 2018 Mini as the server, but when I started over I moved to AlmaLinux on a 2012 Mini, just to satisfy myself it wasn't related to some weird T2 chip issue on the 2018 (actually, I think I just pushed it too hard). Anyway, it took a few days, but now I have an ext4-formatted 4TB T7 with my full site. Now that I've got this all out of APFS and off my primary Mac, I'll have time to find the best way to back this up and maintain a current copy on the Linux box without affecting the Mac I use for everything else. :)
 
I would still encourage you to try the rsync method. Again, the big win is being able to simply restart the process when it fails for some reason, without worrying about screwed-up files. You can break the process up by selecting some upper-level subdirectories rather than THE top-level one. There's no need to zip things up yourself; let the -z option compress during the copy. Experiment with the timing: 15 years ago I found it faster to simply rely on the fast network, but today's CPU speeds might mean compressing and decompressing on the fly is faster. rsync can work on local disks, but if you have SSH profiles set up, you can make the source or destination look like user@system:/upper/src /local/dest, or vice versa, as in the sketch below.
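A minimal example of that kind of invocation, built from the flags discussed earlier in the thread; the host name, user and paths are placeholders:

Code:
# restartable, compressed transfer of one subdirectory from the Mac to the Linux box
rsync -axHvhPz /Volumes/T7-APFS/site/maps/ boyd@linuxbox:/srv/site/maps/

Re-running the same command later only sends files that have changed.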

Additionally, if you want simple backups with easily traversable directories, check out https://rsnapshot.org/ . That's how I used to maintain versions of the websites I was responsible for.
 
Thanks. I will definitely be looking at backup options now that I have everything on a dedicated Linux box with 800MB/sec USB disks, instead of on my primary Mac over a 120MB/sec local network. :)
 
I also use a basic rsync line (and thank @AJACs3 for so many invocations) for backup to an external SSD or to a Hetzner Storage Box with ssh set up. My needs aren't nearly as demanding as yours, but it's worth a try. rsync also has a powerful but complicated include and exclude syntax, and those directives can be placed in files that are called from the command line. I installed rclone but decided to stick with rsync, since I felt I would be optimizing the wrong thing for my use.

In the past, I used cpio as recommended by a friend in the storage business:

cd source_dir
find . | cpio -pdmv destination_dir

-p, --pass-through                  Run in copy-pass mode (see 'Copy-pass mode').

-d, --make-directories              Create leading directories where needed.

-m, --preserve-modification-time    Retain previous file modification times when creating files.

-v, --verbose                       List the files processed, or with -t, give an 'ls -l' style table of contents listing. In a verbose table of contents of a ustar archive, user and group names in the archive that do not exist on the local system are replaced by the names that correspond locally to the numeric UID and GID stored in the archive.
 
Thanks. I will need to give some consideration to all this. But one factor is the nature of my site/server: I make maps, which consist of millions of small .png or .jpg files (map tiles). And the thing is, once I complete a map, it's done; there will be no more changes to those files. My site is basically "read only" for the public. There's no registration, no user accounts, no ads, no e-mail, and no ability for anyone to add their own content.

That's why I question the advantage of sifting through hundreds of millions of files every time I do a backup; for the most part it would be wasteful, as the only changes will be new maps that I have added. And when I add one, it's in the form of a zip file that I upload. Aside from that, the changes will be minor stuff like operating system updates, log files, etc.
 
That sounds like using a read-only DMG might be useful. They can be mounted like any other disk, and probably the only limit is the size of the mount table.

For the backup process, a read-only DMG would be a large file that isn't modified.

A sparse-bundle format should also work. It would be a directory with a bunch of 8MB files (or some other size). If it's mounted read-only, or the constituent files & dirs have read-only permission, then backup won't find any changes.

A read-only DMG can be compressed, but if the actual data files are mainly PNG or JPEG, I don't think that would save much space overall.

I think the rough equivalent in Linux-land is an ISO file. I think that can be made on a Mac using 'dd' to read the raw blocks of a file-system and write directly to a file. I've done that with Pi SD cards, and it makes duplicates or backups just fine.
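A quick sketch of the read-only DMG idea, assuming a finished map set lives in a folder like /Volumes/T7/maps/pinelands (the paths and names are placeholders):

Code:
# build a compressed, read-only disk image from a finished map directory
hdiutil create -srcfolder /Volumes/T7/maps/pinelands -format UDZO -o pinelands.dmg
# mount it read-only; the tiles appear under /Volumes like any other disk
hdiutil attach -readonly pinelands.dmg

Since the image never changes once it's built, a backup run only has to consider one large file per map set.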
 
Yeah, those sound like good options. But things have changed here, since I now have a 4TB Linux-formatted ext4 SSD with all ~142 million files on it. My thinking is that I'll no longer need any Mac disk compatibility going forward. The disk will plug into any Linux box with a USB port, and I've been impressed with how well it runs on modest (cheap) hardware.

I still have two APFS-formatted 4TB SSDs with the whole site; they work great with MAMP on macOS, and CCC easily clones them (takes all night, but no big deal). Now I want to get it all off my Mac, onto a local Linux box running the same operating system as my leased server.

The next step will be finding a good Linux backup strategy, but before I even do that, I need to sort out some more basic issues, like whether I should use the 2012 or 2018 Mini as the server. Both now run AlmaLinux, and the 2018 is a far nicer machine, but I'm concerned about T2 fan control (it can probably be handled with software, but I need to figure out how to install it).

Sorry, this is getting pretty far afield for a thread in the "Mac Basics" forum! :) But I no longer need to copy a massive number of files from an APFS disk to a Linux disk. The discussion is very interesting though and might help someone else.
 
That's why I question the advantage of sifting through hundreds of millions of files every time I do a backup...
I believe rsync will be pretty fast at weeding out the things that don't need to be updated. That's one of the great things about rsync: it only moves things that have changed. And you can also simply point it at subdirectories to update, so it doesn't need to crawl the whole tree.

Scenario: you've done a backup, but then go back and change a few things in the original. Just rerun the rsync command and only the updated items get moved/replaced/added. This also lets you back up partial work and simply rerun the command at different points to add your changes to the backup.
 
Oh, this is an X-Y problem. You're talking about Y, but what is the X you are doing?

Some questions:

1. Why 142 million files? We're a 25-year-old, very very large fintech SaaS and we don't have 142 million source files. It sounds like something is seriously off in how you are doing things.
2. Are you moving to Linux as a one-shot migration?

Assuming 1 is some crazy process, clean up first! If not, ignore this.

Assuming 2 is the destination, forget the following entirely:
  • Forget exFAT. Its data structures do not support writing that volume of files and are lock-heavy. It'll take forever even if it works.
  • Forget the LAN. Round-trip latency on anything per-file will knacker you and it'll take forever.
  • Forget rsync. It doesn't scale to workloads this large, despite everyone telling you it will. It's also completely broken on macOS when it comes to things like Unicode paths; I've had no end of problems with that.
If you need ext4 as the destination for Linux, wipe the target disk on the Linux box and format it ext4. Then go back to the Mac, install Homebrew and macFUSE, and mount the disk [see 1]. Use a plain old "cp -Rf source dest" in Terminal on the Mac and wait for it to copy the files. It's reliable and will run the whole thing end to end. Keep the Mac awake with caffeinate in another terminal window so it won't go to sleep and feck up the mount. Unmount it, plug it into your Linux box and you're done.

[1] https://www.jeffgeerling.com/blog/2024/mounting-ext4-linux-usb-drive-on-macos-2024
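A minimal sketch of the copy step, assuming the ext4 volume is already mounted read-write at /Volumes/ext4-t7 and the APFS source is at /Volumes/apfs-t7 (both mount points are placeholders):

Code:
# in one terminal window: keep the Mac awake (display, idle, disk and system sleep)
caffeinate -dims
# in another window: plain recursive copy from the APFS source to the mounted ext4 target
cp -Rf /Volumes/apfs-t7/site /Volumes/ext4-t7/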

----

I noticed a RAID cloud server was mentioned. Depending on outbound traffic, it's probably cheaper and more reliable to put static content on a cloud CDN provider.
 