Better way to traverse folders?

Discussion in 'Mac Programming' started by MorphingDragon, Jun 15, 2011.

  1. MorphingDragon, Jun 15, 2011
    Last edited: Jun 15, 2011

    macrumors 603

    MorphingDragon

    Joined:
    Mar 27, 2009
    Location:
    The World Inbetween
    #1
    I'm writing an application that will crawl through a folder or disk drive for media files, sort them, remove duplicates, and fix metadata. Well... performance is becoming a bit of an issue. I'm just wondering if the way I'm getting subdirectories and filenames is inefficient, because it seems to be my bottleneck. It's taking an hour to crawl through the 100GB of data I'm using for testing, without any processing.

    Pseudo Code
    Code:
    Create String Stack
    Create filenames array
    Push Initial Folder onto Stack
    while (Stack has more than 0 items)
      Pop folder off stack
      Get subdirectory paths array
        Add array items to stack
    
      Get current folder's files array
        Add items to filenames array
    
    Actual C# Code
    Code:
                //Get background worker ref (BackgroundWorker is a .NET wrapper for multithreading)
                BackgroundWorker worker = sender as BackgroundWorker;
    
                //Create set for filenames
                ListSet<string> filenames = new ListSet<string>();
     
                //Get folder
                string first = e.Argument as string;
    
                //Create folder stack
                Stack<string> folders = new Stack<string>();
    
                //Push initial folder onto stack
                folders.Push(first);
    
                //Start traversing through subdirectories
                while (folders.Count > 0)
                {
                    //Get current path
                    string current = folders.Pop();
    
                    //Get folder info
                    DirectoryInfo info = new DirectoryInfo(current);
                    
                    //Add subfolders to stack
                    foreach (DirectoryInfo d in info.GetDirectories())
                    {
                        //Check if cancellation in progress
                        if (worker.CancellationPending)
                        {
                            //Update cancel status
                            e.Cancel = true;
    
                            //Break from loop
                            break;
                        }
                        else
                        {
                            try
                            {
                                //Add directory
                                folders.Push(d.FullName);
                            }
                            catch (SecurityException ex)
                            {
                                //Show error message
                                MessageBox.Show("Cannot read " + d.FullName + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
                            }
                        }
                    }
    
                    //Add files to files list
                    foreach (FileInfo f in info.GetFiles())
                    {
                        //Check if cancellation in progress
                        if (worker.CancellationPending)
                        {
                            //Update cancel status
                            e.Cancel = true;
    
                            //Break from loop
                            break;
                        }
                        else
                        {
                            try
                            {
                                //Check if media file
                                if (this.CheckFile(f.FullName))
                                {
                                  //Add file
                                  filenames.Add(f.FullName);
                                }
                            }
                            catch (SecurityException ex)
                            {
                                //Show error message
                                MessageBox.Show("Cannot read " + f.FullName + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
                            }
                        }
                    }
    
                }

                //Traversal finished; release the now-empty stack
                folders.Clear();
                folders = null;
    
    Sample Terminal Log Text
    Code:
     
    Starting crawl of /Volumes/ROBIPOD. 10:31AM
    *...Printout of Scanned Files...*
    Found 5892 Media Files
    Finished at 11:20AM 
    
     
  2. macrumors 603

    Joined:
    Aug 9, 2009
    #2
    You need to profile your current code, measuring the time taken by each part, and then figure out which part to improve first.

    For example, you didn't post any code for this.CheckFile(), which is called for every file encountered in the traversal. If it's doing something time-consuming for every file, it could benefit greatly from quick tests for early rejection, running the more costly accept/reject tests only after the quick tests have passed.
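
    A minimal sketch of that idea, assuming CheckFile is currently doing its expensive test on every file. The extension list and the CheckFileContents helper are made-up names for illustration, and the usual System.IO / System.Collections.Generic usings are assumed:

    Code:
    private static readonly HashSet<string> MediaExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".mp3", ".m4a", ".mp4", ".mov", ".avi" };

    private bool CheckFile(string path)
    {
        //Quick rejection first: just a string comparison, no disk I/O
        if (!MediaExtensions.Contains(Path.GetExtension(path)))
            return false;

        //Only now run the costly accept/reject tests (open the file, parse headers, ...)
        return this.CheckFileContents(path); //hypothetical expensive test
    }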

    Or it could be the large number of try/catch setups and teardowns. Is there really a reason to continue after each error? Imagine 130 different times you get a SecurityException and subsequent alert. How effective is that really going to be? Why not just stop on the first error? That would move the try/catch outside the whole loop, so the entire traversal is inside one try block.
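
    Hoisting the try outside the loop might look something like this sketch, reusing the names from the posted code:

    Code:
    try
    {
        while (folders.Count > 0)
        {
            //...pop a folder, push its subdirectories, collect its files...
        }
    }
    catch (SecurityException ex)
    {
        //One alert, then the traversal simply ends
        MessageBox.Show("Cannot continue: " + ex.Message, "Access Denied",
                        MessageBoxButtons.OK, MessageBoxIcon.Error);
    }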

    The posted code is a classic depth-first traversal using a LIFO stack of directory paths. If you changed it to a FIFO queue of directories, you'd get breadth-first traversal, which may have some slight performance advantage, given the typical layout of directories on disk. Or FIFO might not make any real difference, if it's not the main contributor.
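
    In C# that's a one-type change; a rough sketch of the same loop with a Queue<string> in place of the Stack<string>:

    Code:
    //Same traversal, breadth-first: a FIFO queue instead of a LIFO stack
    Queue<string> folders = new Queue<string>();
    folders.Enqueue(first);

    while (folders.Count > 0)
    {
        string current = folders.Dequeue();
        DirectoryInfo info = new DirectoryInfo(current);

        foreach (DirectoryInfo d in info.GetDirectories())
            folders.Enqueue(d.FullName);

        //...collect info.GetFiles() into filenames as before...
    }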

    If traversal alone accounts for 15% of the total elapsed time, then even if you manage to make it take zero time, you still only gained 15%. You always need to ask yourself this question: If I make it take zero time, what is the overall benefit?

    100GB of data could be a whole lot of files (takes longer to traverse) or it could be only a few files (much faster to traverse). Traversal time is usually proportional to the number of nodes, not the size of the data at each node. If you're reading in all the data, though, then it WILL be proportional to the size of the data.

    These are all just guesses, though. You need to measure first, before attempting any kind of improvement.
     
  3. macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #3
    Run it under Shark.

    It looks like you are scanning the contents of an iPod. I'd guess that file system calls are what takes the time. Shark should tell you.

    Try ls -R /Volumes/ROBIPOD in Terminal to check how long that takes. If that is a lot quicker than your code, then at least you know you can do something about it. If it takes ages as well, then there may not be much you can do.

    Or, even easier: run your code with Activity Monitor open and see what happens: how much CPU usage, how many reads/writes total, how many reads/writes per second.
     
  4. macrumors 603

    mobilehaathi

    Joined:
    Aug 19, 2008
    Location:
    The Anthropocene
    #4
    This is a good suggestion. You should definitely do it.
     
  5. thread starter macrumors 603

    MorphingDragon

    Joined:
    Mar 27, 2009
    Location:
    The World Inbetween
    #5
    I took out the try/catch statements for now (I know I have the security privileges) and changed to a FIFO queue, and they certainly aren't the issue. I also did as gnasher729 suggested, and 'ls' is much, much faster.

    I ran the program under Shark and Visual Studio's performance tools; the issue seems to be creating all the objects to get the filesystem information. I say that because it can take a few seconds to get all the subfolders or files in larger folder trees.

    I was going to rewrite this part as a C library anyway (I'm dropping Mono for various reasons) and just use C#/.NET and Obj-C for the respective OS GUIs, but I was going to wait until I'd figured out how to get the traversal performing well. I may do it now anyway, just so I'm forced to think about it properly.
     
  6. gnasher729, Jun 16, 2011
    Last edited: Jun 16, 2011

    macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #6
    You could either use contentsOfDirectoryAtPath: in NSFileManager and recursion, or go really low-level: see "man ftw" for more information about the ftw function, a low-level POSIX function that traverses a directory tree and calls a callback function for each entry.

    Code:
    #include <ftw.h>
    #include <stdio.h>
    
    static long sFileCount;
    
    //Called once per directory entry; flags says whether it is a file, directory, etc.
    static int ftw_callback (const char* path, const struct stat* info, int flags) {
    	if (flags == FTW_F)	{
    		++sFileCount;
    		if (sFileCount % 10000 == 0)
    			printf ("Path %d = %s\n", (int) sFileCount, path);
    	}
    	return 0;
    }
    
    int main (void) {
    	//Third argument: the maximum number of file descriptors ftw may keep open
    	(void) ftw ("/System", &ftw_callback, 50);
    	printf ("%ld files total\n", sFileCount);
    	return 0;
    }
    
    That function scans the system folder on my MacBook (180,000 files) in less than ten seconds.
     
  7. thread starter macrumors 603

    MorphingDragon

    Joined:
    Mar 27, 2009
    Location:
    The World Inbetween
    #7
    This is perfect, thank you. I can even use ftw on Windows with a self-contained Cygwin. Though shouldn't we be using fts now, not ftw?
     
  8. macrumors G5

    gnasher729

    Joined:
    Nov 25, 2005
    #8
    Maybe, yes. They both worked for me; ftw() was slightly simpler.
     
