
Dookieman

macrumors 6502
Original poster
Oct 12, 2009
I'm working on parsing two large databases and merging them into one. Both databases contain similar data, but there are no references between them. They've been converted to JSON, so I'm iterating one array over the other. I'm matching strings, and since the data is very similar, when a match is found I insert the ID from one into the other, so we now have a reference point in the final database.

The parsing is very CPU intensive since I'm using fuzzy string matching in the hope of finding matching strings even when they have slight variations. Both databases have 50,000+ entries, so running this takes a long time: roughly 8 hours on my MBP.
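
To give an idea of the shape of it, the core loop is essentially this (heavily simplified Swift; the record layout and the similarity function here are just stand-ins for what I actually have):

```swift
// Placeholder similarity metric; the real code uses a proper fuzzy-matching
// score (e.g. a normalized edit distance), which is the expensive part.
func fuzzyScore(_ lhs: String, _ rhs: String) -> Double {
    let a = Set(lhs.lowercased()), b = Set(rhs.lowercased())
    guard !a.isEmpty || !b.isEmpty else { return 1.0 }
    return Double(a.intersection(b).count) / Double(a.union(b).count)
}

struct Record {
    let id: Int
    let name: String
    var matchedID: Int? = nil
}

// For every entry in A, scan every entry in B and keep the best match above a
// threshold: this is the O(N*M) loop that currently takes hours.
func mergeIDs(into a: inout [Record], from b: [Record], threshold: Double) {
    for i in a.indices {
        var bestScore = 0.0
        var bestID: Int?
        for candidate in b {
            let score = fuzzyScore(a[i].name, candidate.name)
            if score > bestScore {
                bestScore = score
                bestID = candidate.id
            }
        }
        if bestScore >= threshold {
            a[i].matchedID = bestID
        }
    }
}
```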

Normally I would just let it crunch the data overnight, but I want to make some changes to my algorithm and try multiple things to find the best results. So having to wait gets annoying...

So, is it possible to offload some of the work to the GPU? Fuzzy string matching is matrix based, so I thought it might be a good candidate for the GPU. Would adding Metal help reduce the amount of time it takes?

Thanks!
 
First things first - data structures and caching. Without knowing the details of your problem I can't give you anything concrete, but is there some data structure you could use to speed up access? Do you continually reparse one or both JSON documents? Could you parse once and keep everything in memory in some indexable structure? Could you at least build an in-memory index to speed up access to parts of the JSON doc?
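
For example, something along these lines (rough Swift sketch; the Entry type and the idea of keying on a name field are assumptions about your schema): parse each file once, keep everything in memory, and index one side by a normalized key so most lookups are cheap dictionary hits rather than full scans.

```swift
import Foundation

struct Entry: Codable {
    let id: Int
    let name: String
}

// One parse per file, kept in memory for the rest of the run.
func loadEntries(_ url: URL) throws -> [Entry] {
    try JSONDecoder().decode([Entry].self, from: Data(contentsOf: url))
}

// Cheap normalized key used purely for indexing.
func normalizedKey(_ s: String) -> String {
    s.lowercased().filter { $0.isLetter || $0.isNumber }
}

// Build the index once; exact (normalized) matches become O(1) lookups, and
// the expensive fuzzy comparison only runs for entries that miss the index.
func buildIndex(_ entries: [Entry]) -> [String: Entry] {
    Dictionary(entries.map { (normalizedKey($0.name), $0) },
               uniquingKeysWith: { first, _ in first })
}
```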

Any of that is likely to speed you up much more than using the GPU. Of course, if your problem doesn't allow for that...
 
You say that you are comparing arrays, so I assume you are doing this in memory: reading these databases into memory only once in their entirety and finally writing out the merged data as a single file write. If not, do it this way. Don't limit your speed by reading from and writing to the disk; your disk has a limit on transactions per second, and doing things like this in memory is much faster. There might be some way to optimize your code as well. Is it the fuzzy string comparison itself that is so processor intensive and slow? You should also be able to run this as a background task while you work on other things.
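
In code, the I/O side of that is essentially this (Swift sketch; Record and the merge closure are placeholders for whatever your schema and matching step actually are): one read per input file, all the CPU work on the in-memory arrays, one write at the end.

```swift
import Foundation

struct Record: Codable {
    let id: Int
    let name: String
    var matchedID: Int? = nil
}

func mergeDatabases(inputA: URL, inputB: URL, output: URL,
                    merge: ([Record], [Record]) -> [Record]) throws {
    // Read each database exactly once.
    let a = try JSONDecoder().decode([Record].self, from: Data(contentsOf: inputA))
    let b = try JSONDecoder().decode([Record].self, from: Data(contentsOf: inputB))

    // All the matching work happens here, purely in memory.
    let merged = merge(a, b)

    // Single write of the merged result at the end.
    let data = try JSONEncoder().encode(merged)
    try data.write(to: output, options: .atomic)
}
```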
 
There may be some benefit to plain old threading. It depends at least on the data structures involved, the number of parallel threads you can run (i.e. the number of real and virtual cores), and whether things can run without contention.

If there isn't any contention or locking, then it may be as simple as firing up N threads (for however many cores you have), each one working on 1/N of the problem.
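
With GCD that can be as little as this (rough sketch; it assumes the per-chunk work only reads shared data, e.g. scans database B, and produces its own results):

```swift
import Foundation

// Split the input into one chunk per core and run the chunks in parallel.
// Each chunk produces its own result array; writes to the shared `partial`
// array are funneled through a small serial queue, so no locks are needed
// inside the work itself.
func processInParallel<T, R>(_ items: [T], work: ([T]) -> [R]) -> [R] {
    guard !items.isEmpty else { return [] }
    let cores = max(1, ProcessInfo.processInfo.activeProcessorCount)
    let chunkSize = (items.count + cores - 1) / cores
    let chunks: [[T]] = stride(from: 0, to: items.count, by: chunkSize)
        .map { Array(items[$0 ..< min($0 + chunkSize, items.count)]) }

    var partial = [[R]](repeating: [], count: chunks.count)
    let collect = DispatchQueue(label: "collect-results")

    DispatchQueue.concurrentPerform(iterations: chunks.count) { i in
        let result = work(chunks[i])          // e.g. your existing matching loop
        collect.sync { partial[i] = result }  // serialize writes to shared state
    }
    return partial.flatMap { $0 }
}
```

Usage would be something like processInParallel(databaseA) { chunk in match(chunk, against: databaseB) }, where match is your current loop restricted to one chunk.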
 
It may be viable to reduce the search space by creating two new tables with extra columns containing "normalized" versions of the strings; this would increase search/comparison speed significantly if you have some kind of costly comparison algorithm. The normalization could do things like standardizing case, stripping punctuation, removing redundant whitespace, applying some kind of spelling correction, or reducing ambiguous strings to a small set of likely ones.
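
A normalization pass along those lines might look like this (Swift sketch; it covers case, diacritics, punctuation, and whitespace, not spelling correction):

```swift
import Foundation

// Lowercase, strip diacritics, turn punctuation into spaces, and collapse runs
// of whitespace. Matching on the normalized form first means the costly fuzzy
// comparison only has to deal with the leftovers.
func normalize(_ s: String) -> String {
    let folded = s.folding(options: [.caseInsensitive, .diacriticInsensitive],
                           locale: .current)
    let cleaned = folded.unicodeScalars.map { scalar -> Character in
        CharacterSet.punctuationCharacters.contains(scalar) ? " " : Character(scalar)
    }
    return String(cleaned)
        .split(whereSeparator: { $0.isWhitespace })
        .joined(separator: " ")
}

// normalize("  The Quick,  (Brown) Fox! ")  ->  "the quick brown fox"
```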

It may also be viable to create some kind of multidimensional data structure to represent the strings. This would let you ignore strings that live in "far away" regions of that space, so you only have to compare against a few nearby neighbors.
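
A very simple one-dimensional version of that idea, assuming an edit-distance style comparison: bucket the strings by length and only compare a query against buckets within the maximum edit distance you would accept, since the length difference is already a lower bound on edit distance. N-gram profiles or other features would give you more dimensions on the same principle.

```swift
// Bucket strings by length; since edit distance is at least the difference in
// length, a query only needs to be compared against nearby buckets.
func bucketByLength(_ strings: [String]) -> [Int: [String]] {
    Dictionary(grouping: strings, by: { $0.count })
}

func candidates(for query: String,
                in buckets: [Int: [String]],
                maxDistance: Int) -> [String] {
    let n = query.count
    return (max(0, n - maxDistance) ... n + maxDistance)
        .flatMap { buckets[$0] ?? [] }
}
```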
 