PDA

View Full Version : sorting question.




farmerdoug
Feb 5, 2013, 05:14 PM
I need the top ten values in a set of results. I could either store all the results as they are calculated and then sort; or I could test each value as it is calculated. If is larger than the largest ten up to that point, I could discard the smallest.
Looking for suggestions on the best way.
Thanks.



SnowLeopard2008
Feb 5, 2013, 05:31 PM
I need the top ten values in a set of results. I could either store all the results as they are calculated and then sort; or I could test each value as it is calculated. If is larger than the largest ten up to that point, I could discard the smallest.
Looking for suggestions on the best way.
Thanks.

Pick a sorting method. Selection, bubble, etc. I would process the top ten after calculation. Simplifies things a bit.

Giuly
Feb 5, 2013, 05:34 PM
Depending on how large your dataset is and which language you're using, you might as well make use of the higher-level sort functions of arrays, dictionaries, lists, triples, what have you and then cut the first 10 elements.

dmi
Feb 5, 2013, 09:48 PM
I'd suggest scanning through the results while maintaining a ten element heap structure with the top values seen so far.

You could also do a recursive K-median selection algorithm analogous to quicksort, but although the expected time is O(n), unfortunate choices could make it O(n^2)
A median of medians algorithm can guarantee O(n) performance, but the complication may not be worth it compared to maintaining a heap of just 10 elements.

farmerdoug
Feb 6, 2013, 06:14 AM
At the moment I have been doing what dmi has suggest, or close. I scan the top ten every time I have a new value; placing in the right position if is large enough to be included; and shifting the rest down. My guess is that the number of greater values is far less the very large number of total values.

The language is c.

dmi
Feb 6, 2013, 07:42 AM
With a min heap structure, you can make the scan of the top ten with O( log(10) ) comparisons and shifts, instead of O(10) comparisons and shifts, but 10 may be small enough that it the difference is not worth worrying about.

Efrem
Feb 6, 2013, 08:13 AM
At the moment I have been doing what dmi has suggest, or close. I scan the top ten every time I have a new value; placing in the right position if is large enough to be included; and shifting the rest down. My guess is that the number of greater values is far less the very large number of total values.

The language is c.If, as you seem to have said, you keep the top ten in order, then you don't have to scan them. Just compare the new value with the bottom one in the current Top 10 list.

dmi
Feb 6, 2013, 01:58 PM
Just compare the new value with the bottom one in the current Top 10 list.
And if it is greater than the bottom one, it would have to be inserted in order.
This is easy with a heap structure, but a top 10 list is small enough that shifting the everything down instead may not be too bad.

ytk
Feb 6, 2013, 02:02 PM
The simplest way I can think of to do this is to maintain an array of 11 items, and initially populate it with MINVAL for your data type. As you read each item, place it into a temporary variable. Iterate backwards over the array, starting at position 9 (the tenth element). If the value at that position is less than the read value, move it to the next position up and continue iterating. If it is equal or greater, place the read value at the next position up, and repeat. If you reach the bottom of the list, place the read value in position 0.

This approach is nice because it doesn't require any temporary variables to actually perform the sort other than the one holding the read value and an index variable, and it's probably pretty fast because it only needs to read and write elements that potentially need to be shifted.

If you wanted to improve performance a tiny bit (and we're getting really obsessive here :D), you could eliminate the last conditional (testing whether you've reached the bottom of the list). One way to do this would be with a goto statement (instead of a break) after you've placed the value in its place. The other would be to make your array size 12 items, and populating the the first item with MAXVAL, and all the rest with MINVAL. Then when you are done, you just take [1..10] as your top ten list.

farmerdoug
Feb 6, 2013, 02:31 PM
You guys have done yeoman service here. I thought about just looking at the last value or something similar and for some (obviously incorrect) reason rejected it. However, I should do it. Regarding some of the other suggestions, I need to think about them. I didn't grasp the ideas at first reading but will check in to them.

Thanks.

chown33
Feb 6, 2013, 02:34 PM
And if it is greater than the bottom one, it would have to be inserted in order.

His point is that it's not necessary to test the new candidate against every member of the top-ten list.

For a fast-rejection, it's only necessary to compare the new candidate with the smallest item already in the top-ten list. If the candidate value is less, then it's rejected because it will certainly be less than every other value already in the top-ten list. We can deduce this because the initial comparison is performed on the smallest value already in the list. Reject/accept is thus determined with one comparison for each new candidate value

This is in direct contrast to the OP's algorithm: "I scan the top ten every time I have a new value". I.e. ten comparisons for each new candidate value.

After reject/accept has been decided, the new candidate replaces the smallest value already in the list (it will certainly be the item removed, since it's smallest). The list is then resorted. I'd probably start with insertion-sort because the list is so small, and only worry about a faster sort if performance was measurably bad. But after eliminating 9X pointless comparisons, my first guess is the gain from that would be worth more than trying to find a faster sort than insertion-sort.

dmi
Feb 8, 2013, 01:14 AM
This is in direct contrast to the OP's algorithm: "I scan the top ten every time I have a new value".

When you scan an ordered top ten list,
(and being ordered seemed to be implied by the statement about placing it in the right position and shifting everything down)
you can stop as soon as you know that everything else in the list is greater than the value you are scanning for.
If the values come in a random order, the expected number of comparisons needed to check n new values would be about n+10*log(n).
The worst case would be 10*n with a linear scan, or log(10)*n with a heap or binary search.

If you keep the list in a heap order rather than a linear order, then instead of 10*log(10) operations to re-sort the list when you replace an element, you 'd only need log(10) operations to restore the heap order.
But, again, 10 may be small enough that even 10^2 operations to re-sort may not be an excessive burden.

farmerdoug
Feb 8, 2013, 05:45 AM
I am only comparing to the last item on list; then if I have a new item to insert I figure out where it goes; move everything down; and insert it. I also checked my numbers, I'm saving the top ten out of millions. Saving then sort would definitely be a bad idea. What would help now is if you could tell me how to move a structure from one location to another without doing it element by element.

Thanks.

gnasher729
Feb 8, 2013, 07:57 AM
I am only comparing to the last item on list; then if I have a new item to insert I figure out where it goes; move everything down; and insert it. I also checked my numbers, I'm saving the top ten out of millions. Saving then sort would definitely be a bad idea. What would help now is if you could tell me how to move a structure from one location to another without doing it element by element.

Thanks.

For example:

memmove (&array [3], &array [2], 7 * sizeof (array [2]));

The first parameter is the destination, the second parameter is the source, and the third parameter is the number of bytes. This particular call will move seven array elements starting with element #2 to the location starting with element #3, and handles overlapping source and destination correctly. In other words, it moves #2 to #8 into the places #3 to #9.

On MacOS X, memmove, memcpy and memset use hand-optimised assembler code that is actually optimised for the particular machine it is running on. There is no way to do the same thing faster. (memcpy cannot be used here because source and destination are overlapping).

chown33
Feb 8, 2013, 10:02 AM
I am only comparing to the last item on list; ...

If "last" means "the largest of the top ten", then that's inefficient. If "last" means "the smallest of the top ten", then that's much better. If the reason isn't clear, reread post #11 (http://forums.macrumors.com/showpost.php?p=16800331&postcount=11).

A sorted array with the smallest item first and the largest last is called "ascending" order. An array with the largest first and the smallest last is called "descending" order.


What would help now is if you could tell me how to move a structure from one location to another without doing it element by element.

To move a structure (i.e. a single struct), simply assign it:
struct example {
int foo;
char bar[10];
};

struct example arry[10];
arry[3] = arry[8]; // copies 8 to 3

struct example * a;
a = &arry[4];
*a = arry[9]; // copies 9 to 4


To move one or more structs in a single move, use memmove or memcpy.

The distinction between the two things -- a single structure vs. one or more -- is that one is a single item, and the other is an array of items.

For the single struct case, it doesn't matter that a single struct contains an array (of char). The struct itself is a single item.

dmi
Feb 9, 2013, 05:49 AM
In C++ you can maintain a heap with
std::pop_heap and std::push_heap
you can either reverse the Compare operator or negate the values to make it a min heap instead of a max heap

In C, you could build your own like
#define heap_size 10
int heap[heap_size];

void down_heap(int p,int root){
int child;
while( (child=root*2+1) < heap_size ){
if( child+1 < heap_size && heap[child+1] < heap[child] ){
++child;
}
if( heap[child] >= p ){ return; }
heap[root]=heap[child];
heap[child]=p;
root=child;
}
}

and add maintain it like

i=heap_size;
while( scanf("%d",&p ) ){
if( i==0?heap[0]<p:i-- ){ heap[i]=p; down_heap(p,i); }
}