too much data

Discussion in 'Mac Programming' started by dougphd, May 17, 2016.

  1. dougphd macrumors member

    Joined:
    Apr 28, 2016
    #1
    I have almost 2000 files with 4000 lines each. I have to access the files in pairs, i.e. I have to run my code about 4 million times. Each time I run it, I have to read a char[14] and a float from each line in the pair of files.
    How do I do this? Do I read all the data into 2000 structs or do I open and close each pair of files as I go?
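
    A minimal sketch of the read-into-structs option: with ~20 bytes per line, a whole 4000-line file is tiny, so loading one file into an array of records is cheap. This assumes each line is a whitespace-separated 14-char name followed by a number; adjust the fscanf format to the real layout.

    ```c
    #include <stdio.h>

    #define MAX_LINES 4000

    typedef struct {
        char name[15];   /* 14 chars + terminating NUL */
        float value;
    } Record;

    /* Reads up to MAX_LINES records from path into recs;
       returns the count read, or -1 if the file can't be opened. */
    int load_file(const char *path, Record *recs)
    {
        FILE *fp = fopen(path, "r");
        if (!fp)
            return -1;
        int n = 0;
        while (n < MAX_LINES &&
               fscanf(fp, "%14s %f", recs[n].name, &recs[n].value) == 2)
            n++;
        fclose(fp);
        return n;
    }
    ```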
     
  2. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #2
    Describe the data handling in more detail.


    Take a pair of files AAA.data and BBB.data. After they've been read, will either file ever be part of another pairing? Or is it like a deck of cards in Poker, and after dealing and playing once, the cards are discarded.

    If the files are only read once, then I'd open/read/close them once and go on.


    Are the paired lines cumulative in some way? That is, do the lines read from the AAA and BBB pairing have any effect on the pairing of files CCC and DDD?

    If the effect is cumulative in some way, then you'll need to keep the cumulative data in memory. Unless you're extremely wasteful, I don't see any major obstacles to that. It's going to cost about 20 bytes per line, times 8 million lines, and that's still well under a gigabyte.

    If the effect is non-cumulative, then don't accumulate anything. Simply process each pair of files, produce the output, free the memory, and go to the next pair.
     
  3. dougphd thread starter macrumors member

    Joined:
    Apr 28, 2016
    #3
    Neither. Each file has to be paired with every other file, so there are 1999 + 1998 + 1997 + ... + 1 pairs (2000*1999/2 in total).
     
  4. lee1210 macrumors 68040

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #4
    What is important is the interaction or lack thereof between file pairs, line pairs, etc. If you only need to process a pair of files at a time, and that doesn't affect any other data, write the program to take two filenames and just process those and generate your output, then you just need to generate a shell script to run all your permutations, which would be much easier than trying to do it all in your code.
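
    One way to sketch that: a small generator program that prints one shell command per unordered pair, which you redirect into a script and run. The file-naming pattern ("file0000.dat") and the per-pair tool name ("pairprog") are placeholders for whatever the real names are.

    ```c
    #include <stdio.h>

    /* Writes one "./pairprog A B" line for every unordered pair of the
       nfiles input files; returns how many commands were written. */
    long emit_commands(FILE *out, int nfiles)
    {
        long pairs = 0;
        for (int i = 0; i < nfiles; i++)
            for (int j = i + 1; j < nfiles; j++) {
                fprintf(out, "./pairprog file%04d.dat file%04d.dat\n", i, j);
                pairs++;
            }
        return pairs;
    }

    int main(void)
    {
        /* 2000 files -> 2000*1999/2 = 1999000 commands */
        emit_commands(stdout, 2000);
        return 0;
    }
    ```

    Redirect the output to a file, mark it executable, and the shell does the orchestration with no pairing logic inside the per-pair program.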

    Did you lose the password to the farmerdoug account?

    -Lee
     
  5. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #5
    What do you want to minimize? Speed, memory usage, or development time?

    An all-in-RAM implementation may be faster, but then again it may not. Files can be cached in RAM, which negates much of the cost of reading them repeatedly. Threading also figures into speed, and then it's difficult to predict where the limiting bottleneck really is.

    Memory usage will be worst in an all-in-RAM implementation, and best when only 2 files are processed.

    Development time will probably be least if you follow lee1210's advice and make a program that only processes 2 files. That's almost certainly the simplest thing possible. Then again, you'd have to then make a shell script, but you could ask here about how to approach that.
     
  6. dougphd thread starter macrumors member

    Joined:
    Apr 28, 2016
    #6
    Hi Lee. I think the problem was that I couldn't remember farmerdoug but these days having retired I am truly farmerdoug. How are you doing?
     
  7. dougphd thread starter macrumors member

    Joined:
    Apr 28, 2016
    #7
    A shell script would have to be embedded in the program. Calculations using data in the files are combined with other results. I'm interested in speed; development time is not an issue.
     
  8. jasnw macrumors 6502

    Joined:
    Nov 15, 2013
    Location:
    Seattle Area (NOT! Microsoft)
    #8
    Without chewing up a lot of time running "what if" tests on various configurations it's probably impossible to know a priori what the fastest solution is. If this is a one-off, you might do some simple things like run an initial pass to convert the text-format files to unformatted binary files. This should cut down some on I/O time, even with OS caching. Then I'd just brute-force the sucker, doing not much more than making sure that the last "#2" file you read in for comparing with the current "#1" is the #1 for the next series of comparisons. I don't know how much unformatted input saves you over formatted reads these days, but in the Bad Old Days it could save a bunch.
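
    A sketch of that preprocessing pass, assuming the same 14-char-name-plus-float line format as above: convert each text file once into a flat binary file of fixed-size records, so every later pass can fread() a whole file in one call instead of parsing 4000 lines of text.

    ```c
    #include <stdio.h>

    typedef struct {
        char name[15];   /* 14 chars + terminating NUL */
        float value;
    } Record;

    /* Converts one text file of "name value" lines into a binary file of
       Record structs.  Returns the number of records written, or -1 on
       I/O error.  Note: the binary layout includes struct padding, so it
       is only portable between builds with the same struct layout. */
    long text_to_binary(const char *txt_path, const char *bin_path)
    {
        FILE *in = fopen(txt_path, "r");
        FILE *out = fopen(bin_path, "wb");
        if (!in || !out) {
            if (in) fclose(in);
            if (out) fclose(out);
            return -1;
        }
        Record r;
        long n = 0;
        while (fscanf(in, "%14s %f", r.name, &r.value) == 2) {
            fwrite(&r, sizeof r, 1, out);
            n++;
        }
        fclose(in);
        fclose(out);
        return n;
    }
    ```

    After conversion, a later pass reads a whole file back with a single `fread(recs, sizeof(Record), 4000, fp)`.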
     
  9. jared_kipe, May 19, 2016
    Last edited: May 19, 2016

    jared_kipe macrumors 68030

    Joined:
    Dec 8, 2003
    Location:
    Seattle
    #9
    This sounds like MAYBE a problem a database could solve. Not knowing the data, but knowing there is some kind of _relationship_ between it implies that if you DID know the data the _relationship_ part would be trivial.

    E.g. I hear 2k files with 4k lines, where the titles of the files are important. That translates to me as either 2k tables with 4k rows each and a bunch of joins, OR 1 table with a 'file_name' column and 8 million rows. Either would be fine for a reasonable database like Postgres (with varchar(14) and real columns as described above). (EDIT: and probably an int line_num column to join on; if you put a key on 'file_name,line_num' then the joins you've described are stupid fast.)

    I realize that none of this matters, you just want to do it with file pointers and byte arrays... just a little outside input from someone who does stuff like this a lot.
     
