macOS OS I/O buffering, C++, and who dat data?

toddburch · May 11, 2007

I'm getting into C++. I just wrote a Ruby script that takes over 20 minutes to process a 136MB file. Too slow. And, I have bigger files to process. So, it's off to C++ I go!

For a test, I wrote a small C++ app that simply reads in the file I need to process and writes it back out. It reads from a flash drive and writes to my hard drive.

The first time I run it, it takes about 2-3 minutes to run. This is the same 136MB file I used with Ruby. Kinda slow I thought. Running it immediately again, it runs in about 3 seconds. Perfectly acceptable!

If I unplug the flash drive and plug it back in, I'm back to the glacially slow process. (OK, 2-3 minutes is better than 20...) I'll attribute this to the poor data transfer rate of the flash drive plugged into my keyboard.

Running hard-drive to same hard-drive is very fast - just a few seconds. Not sure if I'm getting any benefits of operating system I/O buffering or not, but I could reboot and test again, I guess.

I've only one book on C++ and it doesn't get into multithreading or advanced I/O buffering, so I'll be looking into these topics as well. For this file, and larger ones still to come, I would like a dedicated process reading as fast as it can, and then have a data-conversion process, and then finally a third output process for the new file.

Reader process - read as fast as it can, not stopping for anything except out-of-memory conditions, and then if that happens, start reusing buffers that have been emptied.

Converter process - when data exists in a buffer, convert it and place it into a second buffer for output.

Writer process - when data starts showing up in the output buffer, start writing. Or, wait for larger chunks of data then go to sleep waiting for the next chunk of data.

One last comment / question.

In my first tests with this homemade file-copier, I simply output the first 3 records (128 bytes each) of non-displayable binary data to the Xcode console.

I then tweaked my program to write these same 3 records to a second file on my hard drive.

Changing my program again to read from the file I just created and output to the console, the non-displayable binary data was different. The newly created file appears to have extra data prefixed to the original data. Does anyone have an explanation for this?

No rocket science here:

Code:

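#include <fstream>
#include <iostream>
using namespace std ;

// IFILE and OFILE are assumed to be #defined elsewhere as the input and output file paths.

int main (int argc, char * const argv[]) {
	int i ;  // running count of bytes copied
	char n[4096*10] ;  // buffer 
	
	ofstream out(OFILE, ios::out | ios::binary ) ; 
	if (!out) { 
		cout << "Cannot open output file " << OFILE << ".\n" ; 
		return -1 ; 
	} 
	 
	ifstream in(IFILE, ios::in | ios::binary) ; 
	if (!in) { 
		cout << "Cannot open input file " << IFILE << ".\n" ; 
		return -1 ; 
	} 
	
	i = 0 ; 
	while (!in.eof() ) {  
		in.read( (char *) &n, sizeof n ) ;  // read up to one buffer's worth
		i += in.gcount() ;  // gcount() is the number of bytes actually read
		out.write((char *) &n, in.gcount() ) ;  // write only what was read
	}
	return 0 ; 
}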

Thanks, Todd

iSee · May 11, 2007

I think you want to change the way you are passing n into read and write:

Code:

	while (!in.eof() ) {  
		in.read( n, sizeof n ) ; 
		i += in.gcount() ; 
		out.write( n, in.gcount() ) ; 
	}

I've removed the "(char*)&" before n in both calls. n is a pointer to characters already. When you take the address of that you have a pointer to a pointer to characters, which isn't what you want.

Edit: I meant to say, this probably accounts for the prefixed data you noticed (by the way, props for using that nice "reality-check" test case).

Also regarding some of the other stuff you commented on:
* The contents of the flash drive are definitely getting buffered which is why the second run goes so fast. Your keyboard USB port may not be USB 2.0--many aren't. If so, it would run at a max of 12Mbps rather than 480Mbps.
* By default ifstream and ofstream use buffered I/O. Because you have your own buffering mechanism, turning that off will most likely improve performance. I can't remember the exact call, but look up "pubsetbuf".
* Your asynchronous I/O design seems sound (of course, the devil is in the details!).
* Your asynchronous process may not improve performance a lot. You've got three processes. Without multi-threading/asynchronous I/O, the total process will take as long as all three of them would take running separately. With multi-threading, the best you can do is have the whole process take only as long as the longest process. If the time to run each of the processes is this:
1. Read data from flash drive: 2 min, 30 sec
2. Process data: 10 sec
3. Write data to internal hard drive: 5 sec.
Then the multi-threaded implementation runs only 15 sec faster (in the best case) than the simple implementation. Also, if your input device is the same as your output device, or if they are different devices on the same bus and that bus doesn't have sufficient bandwidth, then steps 1. and 3. can't run at full speed at the same time anyway.
So benchmark these three steps independently in an environment as much like the production environment as possible before you spend a lot of time on the multi-threading (unless this is just for educational purposes, I guess).
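In iSee's example the sequential total is 150 + 10 + 5 = 165 seconds against a pipelined best case of 150 seconds, so measuring each step first is worthwhile. The sketch below times only the read step and optionally switches off the stream's own buffer with `pubsetbuf`. It assumes C++11 (`std::chrono`), and the names `ReadTiming` and `time_read` are hypothetical. The standard only promises an unbuffered stream when `pubsetbuf(nullptr, 0)` is called before any I/O; anything else is implementation-defined, and implementations differ on whether the call must come before `open()`.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Result of timing one read-only pass over a file. Names are illustrative.
struct ReadTiming {
    std::size_t bytes;
    double seconds;
};

// Reads the whole file into a local 40 KB buffer and discards the data, so
// only the input step is measured.
inline ReadTiming time_read(const std::string& path, bool unbuffered) {
    std::ifstream in;
    if (unbuffered) {
        // Request an unbuffered stream before the file is opened and before
        // any I/O has taken place.
        in.rdbuf()->pubsetbuf(nullptr, 0);
    }
    in.open(path.c_str(), std::ios::in | std::ios::binary);

    char n[4096 * 10];
    std::size_t total = 0;
    const auto start = std::chrono::steady_clock::now();
    while (in.read(n, sizeof n) || in.gcount() > 0) {
        total += static_cast<std::size_t>(in.gcount());
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return ReadTiming{total, elapsed.count()};
}
```

Comparing `time_read(path, false)` with `time_read(path, true)` on a freshly mounted drive would show whether the stream's buffer matters at all next to the device's transfer rate. A second run on the same file mostly measures the operating system's cache.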

toddburch · May 11, 2007

Thanks for the feedback. I removed the "pointer-to-the-address-of" stuff. I copied that verbatim from a Herb Schildt book (C++ from the ground up). When I was typing it in, I questioned it, as I knew "char *" meant "pointer to" and I knew "&n" was "address of n", and I knew that n was a pointer anyway, but I didn't think to cancel them both out. Just another level of indirection that is not needed.

I rebooted my machine and the first run hard-drive-to-same-hard-drive took about 4 seconds. The second run took about 2 seconds. So, I suppose Tiger is doing me some favors in the buffering-recent-data department, and that is fine.

I'm sure the process will run "fast enough" regardless of the synchronicity/asynchronicity (is that a word?) coded.

As the input records are converted, some may fail data validation, so these failing records will be kicked out to two separate files instead of the primary output file. One destination will be just another binary file with no changes to the data (AKA a "bad data" file). The other file will get the record converted to hex format (dump format, with eyecatcher) so it can be diagnosed by a human as to what is wrong, along with a descriptive message for what failed validation, and why.
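A hex dump of one record might be produced along these lines. This is a sketch under assumptions: the thread does not show Todd's dump layout, so the offset column, the 16 bytes per row, and the treatment of the "eyecatcher" as a printable-text column on the right are all guesses, and the name `hex_dump` is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats `len` bytes as dump rows: offset, up to 16 hex bytes, then the same
// bytes as text with non-printable characters shown as '.'.
inline std::string hex_dump(const unsigned char* data, std::size_t len) {
    std::string out;
    char field[32];
    for (std::size_t row = 0; row < len; row += 16) {
        std::snprintf(field, sizeof field, "%08zx  ", row);
        out += field;
        std::string text;
        for (std::size_t i = 0; i < 16; ++i) {
            if (row + i < len) {
                const unsigned char c = data[row + i];
                std::snprintf(field, sizeof field, "%02x ",
                              static_cast<unsigned>(c));
                out += field;
                text += std::isprint(c) ? static_cast<char>(c) : '.';
            } else {
                out += "   ";  // pad a short final row so the text column aligns
            }
        }
        out += " |" + text + "|\n";
    }
    return out;
}
```

With this layout a 128-byte record becomes 8 rows of 78 characters each, so the dump file is several times larger than the binary input.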

It should be a fun process.

My next step is to do the dump formatting stuff.

Thanks again. Todd

toddburch · May 13, 2007

As it turns out, there was no bogus or extra data prepended to the file data. It looked bogus because of the way the non-displayable characters were being handled by the OS.

iSee · May 14, 2007

toddburch said:
As it turns out, there was no bogus or extra data prepended to the file data. It looked bogus because of the way the non-displayable characters were being handled by the OS.

That's interesting. For the previous version of the code to work, then n would have to equal &n.

When you think about it you can see why this might be true. n is clearly a pointer to the first character of the buffer. But where would n itself be in memory--that is, the address of n, or &n ? There doesn't need to be an actual pointer somewhere in memory pointing to the first character of the array because n can't change value to point somewhere else, and the value of n is known at compile time. But &n has to return something. It's reasonable that a compiler would simply make &n equal to n. I would guess that this behavior isn't guaranteed by the C specification, but I'm not checking it to find out. 😀
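For reference, C and C++ do guarantee the equality: `n` is an array rather than a pointer, it converts to a pointer to its first element in most expressions, and the array object starts at the same address as that first element. Only the types differ (`&n` is a `char (*)[40960]`), which is why the cast in the original code was unnecessary but harmless. A minimal sketch that checks both points, with hypothetical function names:

```cpp
#include <cstddef>

// True when the array object and its first element start at the same address.
inline bool same_address() {
    char n[4096 * 10];
    const void* whole = &n;     // type char (*)[40960]: pointer to the array
    const void* first = &n[0];  // type char *: pointer to the first element
    return whole == first;
}

// The types differ even though the addresses match: stepping a pointer to the
// array skips the entire array, while stepping a char pointer skips one byte.
inline std::size_t step_of_array_pointer() {
    char n[4096 * 10];
    return static_cast<std::size_t>(
        reinterpret_cast<char*>(&n + 1) - reinterpret_cast<char*>(&n));
}
```

`same_address()` returns true and `step_of_array_pointer()` returns 40960, the size of the whole buffer.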

toddburch · May 17, 2007

I hit a milestone this evening. My C++ program to replace the Ruby script is well underway:

1) It reads in an arbitrary amount of data, or the whole file, so it can (eventually) be controlled through a GUI for the number of records to process

2) When an invalid record is encountered, it successfully outputs a hex formatted dump (with eyecatcher) of the record to a .txt file for later analysis

3) When an invalid record is encountered, it successfully writes the bad record to a "bad record file"

4) It accounts for all converted records, invalid records, and makes sure the input file is a multiple of the record length

5) It reports the elapsed time it took to run when complete.
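The record-length check in item 4 can be sketched as follows. It assumes the 128-byte record length mentioned earlier in the thread; the names `kRecordLength`, `file_size`, and `count_records` are hypothetical and not taken from Todd's program.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

const std::size_t kRecordLength = 128;  // record size stated earlier in the thread

// Returns the file size in bytes, or -1 if the file cannot be opened.
inline long long file_size(const std::string& path) {
    // ios::ate positions the stream at the end, so tellg() yields the size.
    std::ifstream in(path.c_str(),
                     std::ios::in | std::ios::binary | std::ios::ate);
    if (!in) return -1;
    const std::streamoff size = in.tellg();
    return static_cast<long long>(size);
}

// Sets `records` to the number of whole records in `bytes`. Returns false if
// the size is negative or is not an exact multiple of the record length.
inline bool count_records(long long bytes, std::size_t& records) {
    if (bytes < 0 || bytes % static_cast<long long>(kRecordLength) != 0) {
        return false;
    }
    records = static_cast<std::size_t>(bytes) / kRecordLength;
    return true;
}
```

For the 136546048-byte file reported in the run log, `count_records` yields 1066766 records with nothing left over, matching the log.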

I still have to add record conversion and validation. The Ruby script takes 20+ minutes to process the 136MB file.

The way the app is functioning right now, every record is treated as invalid, so it writes an entire 136MB "bad record" file, a 528MB hex dump file, and a temporary (placeholder) output data file (~28MB, but it will be a lot larger). The process takes about 21 seconds.

Rounding up to 30 seconds, that's a 40X improvement in performance.

Niiiiiiiiice. 😀

Todd

Code:

Here in Record Constructor...
File size is 136546048
There are 1066766 records to process in the file...
Processing records now...
File Copied. 136546048 bytes copied.
   1066766 records processed.
         0 good records processed.
   1066766 records failed validation.
End time was   00:37:45
Start time was 00:37:24
Elapsed time is 21 seconds.
Elapsed time was 00:00:21

filecopytest has exited with status 0.
