Better way to traverse folders?




MorphingDragon
Jun 15, 2011, 07:04 AM
I'm writing an application that will crawl through a folder or disk drive for media files, sort them, remove duplicates and fix metadata. Well... performance is becoming a bit of an issue. I'm wondering if the way I'm getting subdirectories and filenames is inefficient, because it seems to be my bottleneck. It's taking an hour to crawl through the 100GB of data I'm using for testing, without any processing.

Pseudo Code

Create string stack
Create filenames array
Push initial folder onto stack
while (stack has more than 0 items)
    Pop folder off stack
    Get subdirectory paths array
    Add array items to stack

    Get current folder's files array
    Add items to filenames array


Actual C# Code

//Get background worker ref (BackgroundWorker is a wrapper for multithreading)
BackgroundWorker worker = sender as BackgroundWorker;

//Create set for filenames
ListSet<string> filenames = new ListSet<string>();

//Get initial folder
string first = e.Argument as string;

//Create folder stack
Stack<string> folders = new Stack<string>();

//Push initial folder onto stack
folders.Push(first);

//Traverse subdirectories until the stack is empty or the work is cancelled
while (folders.Count > 0 && !e.Cancel)
{
    //Get current path
    string current = folders.Pop();

    //Get folder info
    DirectoryInfo info = new DirectoryInfo(current);

    //Add subfolders to stack; enumerating a protected folder is what throws,
    //so the try block wraps the enumeration itself
    try
    {
        foreach (DirectoryInfo d in info.GetDirectories())
        {
            //Check if cancellation is in progress
            if (worker.CancellationPending)
            {
                //Update cancel status; the while condition stops the traversal
                e.Cancel = true;
                break;
            }

            //Add directory
            folders.Push(d.FullName);
        }
    }
    catch (SecurityException)
    {
        //Show error message
        MessageBox.Show("Cannot read " + current + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }

    //Add files to the filenames list
    try
    {
        foreach (FileInfo f in info.GetFiles())
        {
            //Check if cancellation is in progress
            if (worker.CancellationPending)
            {
                //Update cancel status; the while condition stops the traversal
                e.Cancel = true;
                break;
            }

            //Check if media file
            if (this.CheckFile(f.FullName))
            {
                //Add file
                filenames.Add(f.FullName);
            }
        }
    }
    catch (SecurityException)
    {
        //Show error message
        MessageBox.Show("Cannot read " + current + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}

//Clear the stack
folders.Clear();
folders = null;


Sample Terminal Log Text

Starting crawl of /Volumes/ROBIPOD. 10:31AM
*...Printout of Scanned Files...*
Found 5892 Media Files
Finished at 11:20AM



chown33
Jun 15, 2011, 11:29 AM
You need to profile your current code, measuring the time taken by each part, and then figure out which part to improve first.

For example, you didn't post any code for this.CheckFile(), which is called for every file encountered in the traversal. If it's doing something time-consuming for every file, then it could benefit greatly from doing quick tests for early rejection, and only doing the more costly accept/reject tests after the quick tests have passed.
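
Something in this shape, say (a sketch only, since CheckFile wasn't posted; the extension list and the slower CheckFileContents helper are made up for illustration):

//Cheap early-rejection test: look at the extension before touching the file
private static readonly HashSet<string> MediaExtensions =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { ".mp3", ".m4a", ".aac", ".mp4", ".mov", ".avi" };

private bool CheckFile(string path)
{
    //Quick test first: no I/O at all
    if (!MediaExtensions.Contains(Path.GetExtension(path)))
        return false;

    //Only now do the costly test, e.g. open the file and sniff its header
    return this.CheckFileContents(path); //hypothetical slower check
}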

Or it could be the large number of try/catch setups and teardowns. Is there really a reason to continue after each error? Imagine 130 different times you get a SecurityException and a subsequent alert. How effective is that really going to be? Why not just stop on the first error? That would move the try/catch outside the whole loop, so the entire traversal is inside one try block.
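
That is, something like this (a sketch, assuming stopping on the first error is acceptable):

try
{
    //The entire traversal lives inside one try block
    while (folders.Count > 0)
    {
        DirectoryInfo info = new DirectoryInfo(folders.Pop());

        foreach (DirectoryInfo d in info.GetDirectories())
            folders.Push(d.FullName);

        foreach (FileInfo f in info.GetFiles())
            if (this.CheckFile(f.FullName))
                filenames.Add(f.FullName);
    }
}
catch (SecurityException ex)
{
    //One alert, then stop, instead of an alert per unreadable folder
    MessageBox.Show(ex.Message, "Access Denied",
        MessageBoxButtons.OK, MessageBoxIcon.Error);
}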

The posted code is a classic depth-first traversal using a LIFO stack of directory paths. If you changed it to a FIFO queue of directories, you'd get breadth-first traversal, which may have some slight performance advantage, given the typical layout of directories on disk. Or FIFO might not make any real difference, if it's not the main contributor.
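
In C# that's nearly a one-type swap, since Queue<string> has the same shape as Stack<string> (sketch):

//Breadth-first: a FIFO Queue<string> instead of a LIFO Stack<string>
Queue<string> folders = new Queue<string>();
folders.Enqueue(first);

while (folders.Count > 0)
{
    //Dequeue returns the oldest entry, so siblings are visited before children
    string current = folders.Dequeue();

    DirectoryInfo info = new DirectoryInfo(current);
    foreach (DirectoryInfo d in info.GetDirectories())
        folders.Enqueue(d.FullName);
}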

If traversal alone accounts for 15% of the total elapsed time, then even if you manage to make it take zero time, you still only gained 15%. You always need to ask yourself this question: If I make it take zero time, what is the overall benefit?

100GB of data could be a whole lot of files (takes longer to traverse) or only a few files (much faster to traverse). Traversal time is usually proportional to the number of nodes, not the size of the data at each node. But if you're reading in all the data, then it WILL be proportional to the size of the data.

These are all just guesses. You need to measure first, before attempting any kind of improvement.
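
To put numbers on it, System.Diagnostics.Stopwatch around each phase of the loop body would do (a sketch; the two timers and where they go are illustrative):

//Accumulate time per phase across the whole traversal
Stopwatch listTimer = new Stopwatch();
Stopwatch checkTimer = new Stopwatch();

//...inside the while loop:
listTimer.Start();
FileInfo[] files = info.GetFiles();
listTimer.Stop();

foreach (FileInfo f in files)
{
    checkTimer.Start();
    bool isMedia = this.CheckFile(f.FullName);
    checkTimer.Stop();

    if (isMedia)
        filenames.Add(f.FullName);
}

//...after the loop:
Console.WriteLine("Listing: " + listTimer.Elapsed +
    "  Checking: " + checkTimer.Elapsed);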

gnasher729
Jun 15, 2011, 12:40 PM
Run it under Shark.

It looks like you are scanning the contents of an iPod. I'd guess that file system calls are what takes the time. Shark should tell you.

Try ls -R /Volumes/ROBIPOD in Terminal to check how long that takes (prefixing it with time will print the elapsed time). If that is a lot quicker than your code, then at least you know you can do something about it. If that takes ages as well, then there may not be much you can do.

Or, very easy: run your code with Activity Monitor open and see what happens: how much CPU it uses, how many reads/writes in total, and how many reads/writes per second.

mobilehaathi
Jun 15, 2011, 06:52 PM
Try ls -R /Volumes/ROBIPOD in Terminal to check how long that takes (prefixing it with time will print the elapsed time). If that is a lot quicker than your code, then at least you know you can do something about it. If that takes ages as well, then there may not be much you can do.

This is a good suggestion. You should definitely do it.

MorphingDragon
Jun 16, 2011, 06:52 AM
I took out the try/catch statements for now (I know I have security privileges) and changed to a FIFO queue, and they certainly aren't the issue. I also did as gnasher suggested, and 'ls' is much, much faster.

I ran the program under Shark and Visual Studio's performance tools; the issue seems to be creating all the objects that hold the filesystem information. I say that because it can take a few seconds to get all the subfolders or files in larger folder trees.
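
One thing I'll try first: the static Directory class hands back plain path strings, so no DirectoryInfo/FileInfo object gets built per entry (just a sketch, I haven't profiled whether it actually helps):

//String-based listing: no FileInfo/DirectoryInfo object per entry
foreach (string d in Directory.GetDirectories(current))
    folders.Push(d);

foreach (string f in Directory.GetFiles(current))
    if (this.CheckFile(f))
        filenames.Add(f);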

I was going to rewrite this part as a C library anyway (I'm dropping Mono for various reasons) and just use C#.NET/Obj-C for the respective OS GUIs. But I was going to wait until I'd figured out how to get the traversal performing well. I may do it now anyway, just so I'm forced to think about it properly.

gnasher729
Jun 16, 2011, 09:45 AM
You could either use contentsOfDirectoryAtPath: in NSFileManager and recursion, or go really low level: use "man ftw" to get more information about the "ftw" function, a low-level POSIX function that traverses a directory tree and calls a callback for each entry.

#include <ftw.h>
#include <stdio.h>

static long sFileCount;

//Called once for each entry found; 'flags' says what kind of entry it is
static int ftw_callback (const char* path, const struct stat* info, int flags) {
    if (flags == FTW_F) {   //regular file
        ++sFileCount;
        if (sFileCount % 10000 == 0)
            printf ("Path %d = %s\n", (int) sFileCount, path);
    }
    return 0;
}

int main (void) {
    //Third argument: maximum number of file descriptors ftw may keep open
    (void) ftw ("/System", &ftw_callback, 50);
    return 0;
}


That function scans the system folder on my MacBook (180,000 files) in less than ten seconds.

MorphingDragon
Jun 16, 2011, 11:16 AM
#include <ftw.h>
#include <stdio.h>

static long sFileCount;

//Called once for each entry found; 'flags' says what kind of entry it is
static int ftw_callback (const char* path, const struct stat* info, int flags) {
    if (flags == FTW_F) {   //regular file
        ++sFileCount;
        if (sFileCount % 10000 == 0)
            printf ("Path %d = %s\n", (int) sFileCount, path);
    }
    return 0;
}

int main (void) {
    //Third argument: maximum number of file descriptors ftw may keep open
    (void) ftw ("/System", &ftw_callback, 50);
    return 0;
}


That function scans the system folder on my MacBook (180,000 files) in less than ten seconds.

This is perfect, thank you. I can even use ftw on Windows with a self-contained Cygwin. Though shouldn't we be using fts now, not ftw?

gnasher729
Jun 16, 2011, 12:58 PM
This is perfect, thank you. I can even use ftw on Windows with a self-contained Cygwin. Though shouldn't we be using fts now, not ftw?

Maybe yes. They both worked for me; ftw () was slightly simpler.
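
For comparison, here is roughly what the fts version of the same count looks like (a sketch using fts_open/fts_read/fts_close from fts(3); not tuned at all):

#include <fts.h>
#include <stdio.h>

int main (void) {
    char* const paths[] = { "/System", NULL };
    long fileCount = 0;

    //FTS_PHYSICAL: report symlinks rather than following them
    FTS* tree = fts_open (paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (tree == NULL)
        return 1;

    FTSENT* node;
    while ((node = fts_read (tree)) != NULL) {
        if (node->fts_info == FTS_F) {   //regular file
            ++fileCount;
            if (fileCount % 10000 == 0)
                printf ("Path %ld = %s\n", fileCount, node->fts_path);
        }
    }

    fts_close (tree);
    printf ("Found %ld files\n", fileCount);
    return 0;
}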