Better way to traverse folders?




MorphingDragon
Jun 15, 2011, 07:04 AM
I'm writing an application that will crawl through a folder or disk drive for media files, sort them, remove duplicates and fix metadata. Well... performance is becoming a bit of an issue. I'm wondering if the way I'm getting subdirectories and filenames is inefficient, because it seems to be my bottleneck. It's taking an hour to crawl through the 100GB of data I'm using for testing, without any processing.

Pseudo Code

Create string stack
Create filenames array
Push initial folder onto stack
while (stack has more than 0 items)
    Pop folder off stack
    Get subdirectory paths array
    Add array items to stack

    Get current folder's files array
    Add items to filenames array


Actual C# Code

//Get background worker ref (BackgroundWorker is a wrapper for multithreading)
BackgroundWorker worker = sender as BackgroundWorker;

//Create set for filenames
ListSet<string> filenames = new ListSet<string>();

//Get initial folder
string first = e.Argument as string;

//Create folder stack
Stack<string> folders = new Stack<string>();

//Push initial folder onto stack
folders.Push(first);

//Traverse subdirectories until the stack is empty or the work is cancelled
while (folders.Count > 0 && !e.Cancel)
{
    //Get current path
    string current = folders.Pop();

    //Get folder info
    DirectoryInfo info = new DirectoryInfo(current);

    //Add subfolders to stack; enumerating a protected folder is what throws,
    //so the try block wraps the enumeration itself
    try
    {
        foreach (DirectoryInfo d in info.GetDirectories())
        {
            //Check if cancellation is in progress
            if (worker.CancellationPending)
            {
                //Update cancel status; the while condition stops the traversal
                e.Cancel = true;
                break;
            }

            //Add directory
            folders.Push(d.FullName);
        }
    }
    catch (SecurityException)
    {
        //Show error message
        MessageBox.Show("Cannot read " + current + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }

    //Add files to the filenames list
    try
    {
        foreach (FileInfo f in info.GetFiles())
        {
            //Check if cancellation is in progress
            if (worker.CancellationPending)
            {
                //Update cancel status; the while condition stops the traversal
                e.Cancel = true;
                break;
            }

            //Check if media file
            if (this.CheckFile(f.FullName))
            {
                //Add file
                filenames.Add(f.FullName);
            }
        }
    }
    catch (SecurityException)
    {
        //Show error message
        MessageBox.Show("Cannot read " + current + ". Access Denied.", "Access Denied", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}

//Clear the stack
folders.Clear();
folders = null;


Sample Terminal Log Text

Starting crawl of /Volumes/ROBIPOD. 10:31AM
*...Printout of Scanned Files...*
Found 5892 Media Files
Finished at 11:20AM



chown33
Jun 15, 2011, 11:29 AM
You need to profile your current code, measuring the time taken by each part, and then figure out which part to improve first.

For example, you didn't post any code for this.CheckFile(), which is called for every file encountered in the traversal. If it's doing something time-consuming for every file, then it could benefit greatly from doing quick tests for early rejection, and only doing the more costly accept/reject tests after the quick tests have passed.
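
Something in this shape, say (a sketch only, since CheckFile wasn't posted; the extension list and the slower CheckFileContents helper are made up for illustration):

//Cheap early-rejection test: look at the extension before touching the file
private static readonly HashSet<string> MediaExtensions =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { ".mp3", ".m4a", ".aac", ".mp4", ".mov", ".avi" };

private bool CheckFile(string path)
{
    //Quick test first: no I/O at all
    if (!MediaExtensions.Contains(Path.GetExtension(path)))
        return false;

    //Only now do the costly test, e.g. open the file and sniff its header
    return this.CheckFileContents(path); //hypothetical slower check
}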

Or it could be the large number of try/catch setups and teardowns. Is there really a reason to continue after each error? Imagine 130 different times you get a SecurityException and a subsequent alert. How effective is that really going to be? Why not just stop on the first error? That would move the try/catch outside the whole loop, so the entire traversal is inside one try block.
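
That is, something like this (a sketch, assuming stopping on the first error is acceptable):

try
{
    //The entire traversal lives inside one try block
    while (folders.Count > 0)
    {
        DirectoryInfo info = new DirectoryInfo(folders.Pop());

        foreach (DirectoryInfo d in info.GetDirectories())
            folders.Push(d.FullName);

        foreach (FileInfo f in info.GetFiles())
            if (this.CheckFile(f.FullName))
                filenames.Add(f.FullName);
    }
}
catch (SecurityException ex)
{
    //One alert, then stop, instead of an alert per unreadable folder
    MessageBox.Show(ex.Message, "Access Denied",
        MessageBoxButtons.OK, MessageBoxIcon.Error);
}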

The posted code is a classic depth-first traversal using a LIFO stack of directory paths. If you changed it to a FIFO queue of directories, you'd get breadth-first traversal, which may have some slight performance advantage, given the typical layout of directories on disk. Or FIFO might not make any real difference, if it's not the main contributor.
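
In C# that's nearly a one-type swap, since Queue<string> has the same shape as Stack<string> (sketch):

//Breadth-first: a FIFO Queue<string> instead of a LIFO Stack<string>
Queue<string> folders = new Queue<string>();
folders.Enqueue(first);

while (folders.Count > 0)
{
    //Dequeue returns the oldest entry, so siblings are visited before children
    string current = folders.Dequeue();

    DirectoryInfo info = new DirectoryInfo(current);
    foreach (DirectoryInfo d in info.GetDirectories())
        folders.Enqueue(d.FullName);
}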

If traversal alone accounts for 15% of the total elapsed time, then even if you manage to make it take zero time, you still only gained 15%. You always need to ask yourself this question: If I make it take zero time, what is the overall benefit?

100GB of data could be a whole lot of files (takes longer to traverse) or only a few files (much faster to traverse). Traversal time is usually proportional to the number of nodes, not the size of the data at each node. But if you're reading in all the data, then it WILL be proportional to the size of the data.

These are all just guesses. You need to measure first, before attempting any kind of improvement.
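
To put numbers on it, System.Diagnostics.Stopwatch around each phase of the loop body would do (a sketch; the two timers and where they go are illustrative):

//Accumulate time per phase across the whole traversal
Stopwatch listTimer = new Stopwatch();
Stopwatch checkTimer = new Stopwatch();

//...inside the while loop:
listTimer.Start();
FileInfo[] files = info.GetFiles();
listTimer.Stop();

foreach (FileInfo f in files)
{
    checkTimer.Start();
    bool isMedia = this.CheckFile(f.FullName);
    checkTimer.Stop();

    if (isMedia)
        filenames.Add(f.FullName);
}

//...after the loop:
Console.WriteLine("Listing: " + listTimer.Elapsed +
    "  Checking: " + checkTimer.Elapsed);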

gnasher729
Jun 15, 2011, 12:40 PM
Run it under Shark.

It looks like you are scanning the contents of an iPod. I'd guess that file system calls are what takes the time. Shark should tell you.

Try ls -R /Volumes/ROBIPOD in Terminal to check how long that takes (prefixing it with time will print the elapsed time). If that is a lot quicker than your code, then at least you know you can do something about it. If that takes ages as well, then there may not be much you can do.

Or, very easy: run your code with Activity Monitor open and see what happens: how much CPU it uses, how many reads/writes in total, and how many reads/writes per second.

mobilehaathi
Jun 15, 2011, 06:52 PM
Try ls -R /Volumes/ROBIPOD in Terminal to check how long that takes (prefixing it with time will print the elapsed time). If that is a lot quicker than your code, then at least you know you can do something about it. If that takes ages as well, then there may not be much you can do.

This is a good suggestion. You should definitely do it.

MorphingDragon
Jun 16, 2011, 06:52 AM
I took out the try/catch statements for now (I know I have security privileges) and changed to a FIFO queue, and they certainly aren't the issue. I also did as gnasher suggested, and 'ls' is much, much faster.

I ran the program under Shark and Visual Studio's performance tools; the issue seems to be creating all the objects that hold the filesystem information. I say that because it can take a few seconds to get all the subfolders or files in larger folder trees.
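
One thing I'll try first: the static Directory class hands back plain path strings, so no DirectoryInfo/FileInfo object gets built per entry (just a sketch, I haven't profiled whether it actually helps):

//String-based listing: no FileInfo/DirectoryInfo object per entry
foreach (string d in Directory.GetDirectories(current))
    folders.Push(d);

foreach (string f in Directory.GetFiles(current))
    if (this.CheckFile(f))
        filenames.Add(f);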

I was going to rewrite this part as a C library anyway (I'm dropping Mono for various reasons) and just use C#.NET/Obj-C for the respective OS GUIs. But I was going to wait until I'd figured out how to get the traversal performing well. I may do it now anyway, just so I'm forced to think about it properly.

gnasher729
Jun 16, 2011, 09:45 AM
You could either use contentsOfDirectoryAtPath: in NSFileManager and recursion, or go really low level: use "man ftw" to get more information about the "ftw" function, a low-level POSIX function that traverses a directory tree and calls a callback for each entry.

#include <ftw.h>
#include <stdio.h>

static long sFileCount;

//Called once for each entry found; 'flags' says what kind of entry it is
static int ftw_callback (const char* path, const struct stat* info, int flags) {
    if (flags == FTW_F) {   //regular file
        ++sFileCount;
        if (sFileCount % 10000 == 0)
            printf ("Path %d = %s\n", (int) sFileCount, path);
    }
    return 0;
}

int main (void) {
    //Third argument: maximum number of file descriptors ftw may keep open
    (void) ftw ("/System", &ftw_callback, 50);
    return 0;
}


That function scans the system folder on my MacBook (180,000 files) in less than ten seconds.

MorphingDragon
Jun 16, 2011, 11:16 AM
#include <ftw.h>
#include <stdio.h>

static long sFileCount;

//Called once for each entry found; 'flags' says what kind of entry it is
static int ftw_callback (const char* path, const struct stat* info, int flags) {
    if (flags == FTW_F) {   //regular file
        ++sFileCount;
        if (sFileCount % 10000 == 0)
            printf ("Path %d = %s\n", (int) sFileCount, path);
    }
    return 0;
}

int main (void) {
    //Third argument: maximum number of file descriptors ftw may keep open
    (void) ftw ("/System", &ftw_callback, 50);
    return 0;
}


That function scans the system folder on my MacBook (180,000 files) in less than ten seconds.

This is perfect, thank you. I can even use ftw on Windows with a self-contained Cygwin. Though shouldn't we be using fts now, not ftw?

gnasher729
Jun 16, 2011, 12:58 PM
This is perfect, thank you. I can even use ftw on Windows with a self-contained Cygwin. Though shouldn't we be using fts now, not ftw?

Maybe yes. They both worked for me; ftw () was slightly simpler.
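
For comparison, here is roughly what the fts version of the same count looks like (a sketch using fts_open/fts_read/fts_close from fts(3); not tuned at all):

#include <fts.h>
#include <stdio.h>

int main (void) {
    char* const paths[] = { "/System", NULL };
    long fileCount = 0;

    //FTS_PHYSICAL: report symlinks rather than following them
    FTS* tree = fts_open (paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (tree == NULL)
        return 1;

    FTSENT* node;
    while ((node = fts_read (tree)) != NULL) {
        if (node->fts_info == FTS_F) {   //regular file
            ++fileCount;
            if (fileCount % 10000 == 0)
                printf ("Path %ld = %s\n", fileCount, node->fts_path);
        }
    }

    fts_close (tree);
    printf ("Found %ld files\n", fileCount);
    return 0;
}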