
View Full Version : What is the fastest way to test for inclusion in list




rossipoo
Nov 27, 2010, 12:06 AM
I have a list of words and I want to test input to determine if it is a word included in the word list.

My first thought was to use NSSet; with 100,000 words, however, it started to show its limitations. It's for the iPhone, and I'd really like to do several lookups without a perceptible delay.

I tried splitting the words into multiple NSSets, each containing a subset based on word length, and then checking the corresponding set. To my surprise, I found this to be a significant improvement. I was surprised because I'd always read that NSSet was best for lookups, and I expected that NSSet would do a similar partitioning itself, except based on some kind of hash instead.

Short of implementing a tree or other custom structure, are there better ways that I could be performing this test?



chown33
Nov 27, 2010, 01:15 AM
Multiple subsets is a tree. If each subset contains words of the same length, then the first level of branching is by length.

You could use another criterion to produce another branching level, such as requiring all words in a subset to share the same first character, or the same hash of the first character, such as its Unicode value modulo some convenient number.

The whole idea is to rapidly eliminate large numbers of words that can't possibly be correct (rapid exclusion). What remains will be a small and quickly-searchable subset, if the branching criteria are chosen to produce reasonable distributions.

PatrickCocoa
Nov 27, 2010, 02:58 AM
with 100000 words however, it started to show it's [sic] limitations. It's for the iPhone, and I'd really like to do several lookups, without a perceivable delay.

You've got a situation where you have 100,000 words and your only bottleneck is the lookups to test inclusion? If so, that's great - once you defeat the lookup monster you're golden. But you may want to make sure there's not a long line of other monsters behind this one.

If you've already got everything figured out except this lookup problem, then great. I'm not trying to be snippy, just throwing out unasked-for advice.

sanPietro98
Nov 27, 2010, 07:17 AM
Given the small RAM footprint of an iPhone (compared to a desktop computer), it may not be the best idea to keep 100,000 items in heap space. Have you thought about using Core Data or SQLite to persist your data? You can use their built-in indexing and querying capabilities more efficiently than brute-force searches.

gnasher729
Nov 27, 2010, 10:03 AM
I have a list of words and I want to test input to determine if it is a word included in the word list.

My first thought was to use NSSet; with 100,000 words, however, it started to show its limitations. It's for the iPhone, and I'd really like to do several lookups without a perceptible delay.

I tried splitting the words into multiple NSSets, each containing a subset based on word length, and then checking the corresponding set. To my surprise, I found this to be a significant improvement. I was surprised because I'd always read that NSSet was best for lookups, and I expected that NSSet would do a similar partitioning itself, except based on some kind of hash instead.

Short of implementing a tree or other custom structure, are there better ways that I could be performing this test?

How exactly are you looking up things in this NSSet? NSSet uses a hash table internally; a tree or similar should not be able to improve on it, so I suspect something else is going on.

peterfeiner
Nov 27, 2010, 10:09 AM
Some kind of trie data structure will suit your needs. For a set of keys (the 100000 words in your case), a trie stores the keys as paths in a tree where two keys have the same path up to their common prefix. This data structure affords lookups with O(m) memory accesses for inputs of length m, regardless of the number of keys. Hence tries have better asymptotic complexity than binary search trees (which NSSet probably uses), where, for n keys, lookups use O(m * log n) memory accesses. Moreover, tries can use less space than binary search trees: if many of your words have common prefixes, then you won't have to store the 100000 words individually.

Take a look at http://en.wikipedia.org/wiki/Trie for a decent explanation.

A space efficient variation of tries, called a Patricia Trie or a Radix Tree, might be better suited for your low memory environment: http://en.wikipedia.org/wiki/Radix_tree.

gnasher729
Nov 27, 2010, 10:24 AM
Some kind of trie data structure will suit your needs. For a set of keys (the 100000 words in your case), a trie stores the keys as paths in a tree where two keys have the same path up to their common prefix. This data structure affords lookups with O(m) memory accesses for inputs of length m, regardless of the number of keys. Hence tries have better asymptotic complexity than binary search trees (which NSSet probably uses), where, for n keys, lookups use O(m * log n) memory accesses. Moreover, tries can use less space than binary search trees: if many of your words have common prefixes, then you won't have to store the 100000 words individually.

NSSet uses a hash table; hash tables have constant access time, faster than anything else. I'd like to see _exactly_ what the original poster is doing; without that, everything is guesswork. I just hope he doesn't do a search loop like "for (NSString* s in [bigset allObjects]) ..."

rossipoo
Nov 27, 2010, 10:25 AM
Thanks, I'm particularly interested in using Core Data. I didn't think to try it because I have almost no experience with Core Data. Loading the data was a huge problem at first (taking about 8 or 9 seconds), though I somewhat mitigated it by loading in another thread.

This is my current lookup method. standardWordLists is an NSArray of NSSets (split by word length), and properWordList is an NSDictionary word list for proper nouns and other words containing uppercase.

- (NSString *) wordForKey:(NSString *)key {
    // Check cache first.
    if ([key isEqualToString: lastKey]) {
        return lastWord;
    }

    // Not in cache, new lookup.
    [lastKey release];
    lastKey = nil;
    [lastWord release];
    lastWord = nil;

    if (key == nil || key.length < 3 || key.length > WORDLIST_COUNT) {
        // Look up is nil, or too short, or too long.
        // Cache nothing.
        return nil;
    }

    NSInteger listIndex = key.length - 1;
    if ([[standardWordLocks objectAtIndex: listIndex] tryLock]) {
        [[standardWordLocks objectAtIndex: listIndex] unlock];

        NSSet *wordlist = [standardWordLists objectAtIndex: listIndex];
        if ([wordlist containsObject: key]) {
            // Found word for key, cache result.
            lastKey = [key retain];
            lastWord = [key retain];
            return lastWord;
        }
    }

    if ([properWordLock tryLock]) {
        [properWordLock unlock];

        // May or may not find word for key, cache result.
        lastKey = [key retain];
        lastWord = [[properWordList objectForKey: key] retain];
        return lastWord;
    } else {
        // Unable to lookup word at this time.
        return nil;
    }
}

lee1210
Nov 27, 2010, 10:26 AM
Some kind of trie data structure will suit your needs. For a set of keys (the 100000 words in your case), a trie stores the keys as paths in a tree where two keys have the same path up to their common prefix. This data structure affords lookups with O(m) memory accesses for inputs of length m, regardless of the number of keys. Hence tries have better asymptotic complexity than binary search trees (which NSSet probably uses), where, for n keys, lookups use O(m * log n) memory accesses. Moreover, tries can use less space than binary search trees: if many of your words have common prefixes, then you won't have to store the 100000 words individually.

Take a look at http://en.wikipedia.org/wiki/Trie for a decent explanation.

A space efficient variation of tries, called a Patricia Trie or a Radix Tree, might be better suited for your low memory environment: http://en.wikipedia.org/wiki/Radix_tree.

To elaborate a bit, you could have the trie in a file, and only read up the records you need while traversing. That means m reads from "slow" storage, but can keep the memory usage way down compared to having the whole structure in RAM. Where I work this technique was used for fast lookups before RDBMS was all the rage. The biggest pitfall was keeping the structure updated as data changed, but in this case it sounds like the data will be fairly static.

-Lee

peterfeiner
Nov 27, 2010, 10:57 AM
NSSet uses a hash table; hash tables have constant access time, faster than anything else. I'd like to see _exactly_ what the original poster is doing; without that, everything is guesswork. I just hope he doesn't do a search loop like "for (NSString* s in [bigset allObjects]) ..."

The cost of the hash lookup must be at least linear in the size of the key because you have to read the key (the entire string in this case) to know what you're looking for! Next, because hash collisions are unavoidable, the key has to be compared to potential matches in the table. So, even assuming that the number of collisions is somehow bounded, the best case lookup is O(m).

Given that hash tables and tries have the same asymptotic complexity, there are attributes of tries that might make them better suited for the OP's needs. For instance, a trie may very well use less memory because each key doesn't need to be stored in its entirety. Additionally, the actual complexity may be better if traversing the trie is less expensive than computing the hash and checking for collisions.

I believe there's sufficient reason to try using a trie. However, you'd have to do an experiment to determine which data structure uses less space and is faster.

peterfeiner
Nov 27, 2010, 11:04 AM
To elaborate a bit, you could have the trie in a file, and only read up the records you need while traversing. That means m reads from "slow" storage, but can keep the memory usage way down compared to having the whole structure in RAM. Where I work this technique was used for fast lookups before RDBMS was all the rage. The biggest pitfall was keeping the structure updated as data changed, but in this case it sounds like the data will be fairly static.

-Lee

That's an interesting approach. I imagine a challenging aspect of implementing this would be locality: ideally, the nodes along a path would tend to be stored on the same page.

Do you know if they'd mmap the entire file or seek on each step?

peterfeiner
Nov 27, 2010, 11:08 AM
(Sorry, I should have posted my last three replies as one message).

rossipoo, why are you acquiring locks then immediately unlocking them?


if ([[standardWordLocks objectAtIndex: listIndex] tryLock]) {
[[standardWordLocks objectAtIndex: listIndex] unlock];

gnasher729
Nov 27, 2010, 11:26 AM
That tryLock/unlock business that you are doing is just plain wrong. It doesn't protect you in any way and just wastes time. You are basically checking whether there is a lock on the set before you proceed, but someone else could lock it immediately after the unlock and for example start modifying the NSSet, if that is why you were using the locks.

On the other hand, this seems to explain why splitting into multiple sets makes things faster: That way you will have fewer locking conflicts (it is less likely that two threads access the same NSSet when you have multiple sets). Your problem is likely not the data structure, but the fact that you are performing lots of locks, and in an incorrect way.

I suggest you get rid of the useless locks and just use @synchronized, with a single NSSet.

PS. Your code is not thread safe in the very first line. If multiple threads call wordForKey, then it is possible that one thread has looked up a word, has changed lastKey but hasn't changed lastWord yet, and a second thread comes in, finds that the word matches lastKey, and returns lastWord which hasn't been changed yet.

To peterfeiner: You'd better give Apple a call and tell them that their implementation of NSSet is far from optimal. On the other hand, NSSet and NSDictionary exist expressly for exactly the original poster's purpose, and they are used everywhere, so maybe their implementation, developed, tested and optimised over many years, is better than anything someone could hack together in a few days.

lee1210
Nov 27, 2010, 11:34 AM
That's an interesting approach. I imagine a challenging aspect of implementing this would be locality. Ideally, a path would tend to be stored on the same page.

Do you know if they'd mmap the entire file or seek on each step?

No mmap'ing. Each "next" character in a record would point to the record for the new prefix: seek, read the new prefix node, repeat. The search strings were short (most ~10 characters, hardly ever over 20 characters). The tries could represent millions of key/value pairs, so the files were too large to read in all at once. I don't recall if the locality of the "next" prefix records was taken into consideration. If the structure was built depth-first it would "just work"; if breadth-first, it would spread the nodes out quite far. I believe it was built as records were added to the structure, so the nodes were probably close to randomly distributed. Perhaps sub-optimal, but the code was already complex, the performance was quite good, and the design was developed nearly 30 years ago and lived without modification until a couple of years ago, when we ported to an RDBMS.

-Lee

rossipoo
Nov 27, 2010, 12:04 PM
I've never heard of the trie before. Does each node have 26 children? That sounds really promising. I made a basic tree once before, but I couldn't figure out balancing; the trie looks like it could be built up while loading without having to worry about balancing.


rossipoo, why are you acquiring locks then immediately unlocking them?

The locks only got used once. That was my thinking anyway. Once they get unlocked, they never lock again, so I wasn't worried about it.

I saw the performance gain from splitting into several sets before I used the locking system. I added the locking system so that I could look up words before the process of loading is complete.

peterfeiner
Nov 27, 2010, 12:15 PM
To Peterfeiner: You better give Apple a call and tell them that their implementation of NSSet is far from optimal. On the other hand, NSSet and NSDictionary are expressly there for exactly the purpose of the original poster, and they are used everywhere, so maybe their implementation, developed, tested and optimised over many years, is better than anything someone could hack together in a few days.

Per the OP's first message, NSSet wasn't doing the job. Your suggestion of using a single NSSet is not constructive: the OP clearly stated that's what he originally tried.

Which algorithm or data structure best suits a problem depends on many facets of the problem being solved. Checking for membership in a set may seem like a solved problem, and a hash set is a good approach in many situations, but it isn't necessarily the best solution for all situations. Asserting otherwise is naive. Apple's NSSet implementation, albeit well engineered and tested, could simply be a poor fit for the OP's problem, as I've argued in earlier posts.

peterfeiner
Nov 27, 2010, 12:35 PM
I've never heard of the trie before. Does each node have 26 children?

If your words only have either lowercase or uppercase letters, then this would work. However, there are a myriad of approaches you could try here that all have different tradeoffs. If you use an array of 26 pointers, then sparse nodes would waste space. If you use a list or a mapping, then each link would be more expensive to traverse. I'd try the array of 26 pointers first and see if the memory use and performance are suitable.

That sounds really promising. I made a basic tree once before, but I couldn't figure balancing out, but the trie looks like it could be built up while loading without having to worry about balancing.

That's correct. A trie does not need to be balanced.



The locks only got used once. That was my thinking anyway. Once they get unlocked, they never lock again, so I wasn't worried about it. I saw the performance gain from splitting into several sets before I used the locking system. I added the locking system so that I could look up words before the process of loading is complete.

Concurrently doing searches and loading does require locks (unless you are careful about how the data structure is built, but I recommend avoiding this as very subtle problems can arise). However, as a couple of us have pointed out, you aren't using them correctly. The general idea is to surround any access to shared data with lock / unlock. For example, in your thread that builds the trie, you might do something like this (pseudocode):


for word in word_list:
    lock(set_lock)
    add_word_set(word, global_set)
    unlock(set_lock)


In your threads that search for words in the list:


lock(set_lock)
in_set = is_word_in_set(word, global_set)
unlock(set_lock)
return in_set


I suggest reading something about concurrent programming. Although I haven't read it, Apple's guide to concurrent programming will probably be a good read: http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/.

Edit: I think I may have misunderstood how you were using the locks. If you're using the locks to prevent wordForKey from proceeding until the data has been loaded (as opposed to concurrently loading the data and doing searches on the partially-loaded data, as I assumed earlier), then what you're doing isn't incorrect (as long as those locks are initially locked and only the thread loading the data unlocks them). I'm not sure what you're trying to accomplish in the bigger picture, though; the threads calling wordForKey will block until loading finishes, so the execution of the threads is essentially serialized.

gnasher729
Nov 27, 2010, 02:31 PM
Per the OP's first message, NSSet wasn't doing the job. Your suggestion of using a single NSSet is not constructive: the OP clearly stated that's what he originally tried.

Which algorithm or data structure best suits a problem depends on many facets of the problem being solved. Checking for membership in a set may seem like a solved problem, and a hash set is a good approach in many situations, but it isn't necessarily the best solution for all situations. Asserting otherwise is naive. Apple's NSSet implementation, albeit well engineered and tested, could simply be a poor fit for the OP's problem, as I've argued in earlier posts.

You didn't read my post completely, then. The original poster performed one lock and one unlock for each access. That is in itself a fairly expensive operation, and it gets more expensive for contended locks. tryLock will not just give up when a lock is held; it will spin for a while, wasting time. If two threads keep banging on the same lock, they will lose time. By splitting this into multiple sets, competing threads will usually not compete for the same lock, making it faster.

The poster's problems most likely have nothing to do with any inefficiency in NSSet; they have everything to do with massive amounts of locking and unlocking operations.

By the way, if someone posts here that "NSSet isn't doing its job", it is much more likely that something else is going wrong. As we finally found out, the tryLock/unlock pair could be replaced with a simple BOOL initialisationFinished. There are about 15 method calls by my count, where one or two would do (the retain/release are unnecessary, length alone is called three times, and so on).

And if the problem is indeed NSSet, then there are better ways than writing your own code. For example, a dozen lines of C++ using std::map would be worth trying.