PDA

View Full Version : C++ XCode v. 4.6 Shuffling an Array




thrill4rishabh
Apr 9, 2013, 02:07 PM
Hi people,

I am trying to shuffle an array of integers and the function definition I came up with is as follows:



void shuffle (int array[], int & size)
{
cout << "\nShuffling cards.... ....\n";
for (int i = 0; i < size; i++)
{
int random = abs(rand()/size);
swap (array[i], array[random]);
}
cout<<"\n cards shuffled \n";
}



It gives me the line : "Shuffling cards.... ....:"

But then there is a 'Bus error : 10'

Could anyone help me please?

Thanks



subsonix
Apr 9, 2013, 02:17 PM
You probably want to mod your result of rand() with size (%), not divide it. Example if size is 10 and rand() generates 1000, then the index you get is 100, which is out of bounds for your array.

mfram
Apr 9, 2013, 02:19 PM
Well, for one you probably want:

int random = rand() % size;

chown33
Apr 9, 2013, 02:23 PM
What do you expect the value of the variable 'random' to be? Clearly, you expect it to be random, but within what range of values?

What is the actual value of the variable? Is it within the expected range of values? If not, then break down the calculation into smaller parts (there are only two), and work out where it's going wrong.

The essence of debugging is to confirm your expectations. So write out exactly what you expect. Then confirm it by looking at the actual value, maybe using a suitable print to cout, or by using the debugger.

The above applies to all programs: confirm expectations.


You should also learn a bit about shuffle algorithms. Example:
http://en.wikipedia.org/wiki/Fisher–Yates_shuffle

Found by searching Wikipedia for search terms shuffle algorithm

thrill4rishabh
Apr 9, 2013, 02:46 PM
That was stupid of me. Thanks.

lloyddean
Apr 9, 2013, 03:40 PM
Since you're using C++ why not use what's already available in 'algorithm'?

Two versions of 'random_shuffle' the simpler one is exampled below and there is also a version which can be passed a random number generator of your own design.


#include <algorithm>
#include <random>
#include <iterator>
#include <iostream>

int main()
{
using std::cout;
using std::endl;
using std::random_shuffle;
using std::vector;

vector<int> cards = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 };

srand(time(NULL));
random_shuffle(cards.begin(), cards.end());

for ( vector<int>::iterator itr = cards.begin(); itr != cards.end(); itr++ )
{
cout << *itr << " ";
}

cout << endl;

return 0;
}

chown33
Apr 9, 2013, 04:58 PM
That was stupid of me. Thanks.

The bigger mistake is that you apparently have no debugging strategy. If you have no debugging or testing strategy, and no way to examine variables or confirm that values are as expected, then you have no way to find or fix mistakes except by having others do your debugging for you.

A programmer who can't debug or test their own code is about as useful as a cook whose food is never eaten. It may be possible to make amazing things with lots of ingredients, and it may look wonderful, but if no one tastes the result, then it's not really cooking. "The proof of a pudding is in the eating".

gnasher729
Apr 9, 2013, 06:14 PM
Hi people,

I am trying to shuffle an array of integers and the function definition I came up with is as follows:



void shuffle (int array[], int & size)
{
cout << "\nShuffling cards.... ....\n";
for (int i = 0; i < size; i++)
{
int random = abs(rand()/size);
swap (array[i], array[random]);
}
cout<<"\n cards shuffled \n";
}



It gives me the line : "Shuffling cards.... ....:"

But then there is a 'Bus error : 10'

Could anyone help me please?

Thanks

1. Why is size an int& and not an int? Doesn't make sense at all.
2. I'd recommend that things like array sizes should have type size_t.
3. I'd recommend using std:: for standard library functions, like std::swap.

4. To see what's going wrong: What is the range of values that rand () could return? If that is divided by size, what is the range of possible values? And what range does random need to be in?

5. Challenge question: If the problem is fixed, and if the variable random were set to a truly random value from 0 to size-1 in each iteration of the loop, would the result permutation of the array be perfectly random, or would some permuations be more likely than others?

ArtOfWarfare
Apr 9, 2013, 07:18 PM
A programmer who can't debug or test their own code is about as useful as a cook whose food is never eaten. It may be possible to make amazing things with lots of ingredients, and it may look wonderful, but if no one tastes the result, then it's not really cooking. "The proof of a pudding is in the eating".

This analogy struck me as odd. I think a closer one would be "a cook who can't eat their own food."

Left on their own, the cook can't ever develop new recipes - only around the lunch and dinner rush can they determine that their current techniques are yielding undesirable food.

RiskyMr
Apr 21, 2013, 08:25 AM
Challenge question: If the problem is fixed, and if the variable random were set to a truly random value from 0 to size-1 in each iteration of the loop, would the result permutation of the array be perfectly random, or would some permutations be more likely than others?

The result would be perfectly random as each element of the array would be swapped with a random value. (And I'm happy to be proven wrong as I love these kinds of questions.)

I also agree with lloyddean and suggest using the standard library whenever possible.

ArtOfWarfare
Apr 21, 2013, 09:22 AM
5. Challenge question: If the problem is fixed, and if the variable random were set to a truly random value from 0 to size-1 in each iteration of the loop, would the result permutation of the array be perfectly random, or would some permuations be more likely than others?

Imagine a set of size 2 containing A and B. We have 4 possible sequences of two random numbers being drawn:

00: BA
01: AB
10: AB
11: BA

Looks like an even distribution of permutations to me, but I'll try with a set of size 3 just so I can be a bit more confident.

Imagine a set of size 3 containing A, B, and C. We have 27 possible sequences of these 3 random numbers being drawn

000: CAB
001: BCA
002: BAC
010: CBA
011: ACB
012: ABC

So far we're perfectly distributed but an issue just occurred to me. There's 27 possible sequences of random numbers (with repeats) and 6 permutations of 3 elements. Thus, at best we'd have 3 permutations show up 4 times and 3 permutations show up 5 times

I believe this code is perfectly random:

for (int i = 0; i < size; i++) {
int rand = // get random number between i and size-1
swap (array[i], array[rand]);
}

It has size! possible sequences of random numbers and size! possible permutations... or am I missing something? Is not every permutation possible? Is some permutation repeated?

gnasher729
Apr 21, 2013, 04:22 PM
It has size! possible sequences of random numbers and size! possible permutations... or am I missing something? Is not every permutation possible? Is some permutation repeated?

The original algorithm doesn't produce a random sequence. I actually calculated one of the probabilities involved and it was quite wrong: Assume for example we are using this algorithm to shuffle 52 cards. What is the probability that card #52 ends up in position 51? It should be 1/52 (the card should be in any of 52 positions with equal probability); if not then the shuffling isn't random.

In the first 51 exchanges, card 52 will either stay where it is, or move to a different position. If it moves to a different position, it cannot move back to position 52. In each exchange, the probability that the card in position 52 stays in place is 51/52. So the probability that card 52 is in position 52 after 51 moves is (51/52)^51.

In exchange #51, each card will be moved to position 51 with equal probability 1/52. So the probability that card 52 is in position 51 after 51 moves is 1/52.

So what's the probability that card 52 is in position 51 after 52 moves? The chance is 1/52 if it was in position 52 after 51 moves, and 51/52 if it was in position 51 after 51 moves; in all other cases it is 0. So the total probability is

(1/52)(51/52)^51 + (51/52)(1/52) = (1/52) ((51/52)^51 + 51/52) which is quite a bit more than 1/52.

(For n ≥ 3 the original algorithm used n^n random sequences to produce n factorial permuations, which cannot be random because n^n is not divisible by n factorial when n ≥ 3; a much simpler argument. )

Your algorithm should be correct because every permutation can be produced by exactly one sequence of random numbers: To produce the order x0, x1, x2, ..., xn-1, the first random number must be x0, the first random number must be the index where x1 was stored after the first swap, and so on.

lloyddean
Apr 21, 2013, 08:58 PM
My experience writing large game projects in assembler on various processors a couple of decades ago and then in C/C++ would have me do the loop as -


for ( size_t i = size; i--; )
{
int rand = // get random number between i and size-1
swap (array[i], array[rand]);
}

ArtOfWarfare
Apr 21, 2013, 11:14 PM
My experience writing large game projects in assembler on various processors a couple of decades ago and then in C/C++ would have me do the loop as -


for ( size_t i = size; i--; )
{
int rand = // get random number between i and size-1
swap (array[i], array[rand]);
}

Does that save an instruction on each cycle by using BEZ (or do actual ISAs have support for actually doing BGT and BLT and BEQ? My class on computer architecture focused on MIPS - we glanced at ARM and x86 during a few assignments.)

gnasher729
Apr 22, 2013, 03:02 AM
My experience writing large game projects in assembler on various processors a couple of decades ago and then in C/C++ would have me do the loop as -


for ( size_t i = size; i--; )
{
int rand = // get random number between i and size-1
swap (array[i], array[rand]);
}

You deliberately wrote the loop in a highly unreadable style. Why? You also changed the algorithm which was quite obviously correct, and you will have a much much harder time to prove that it is actually correct.


Does that save an instruction on each cycle by using BEZ (or do actual ISAs have support for actually doing BGT and BLT and BEQ? My class on computer architecture focused on MIPS - we glanced at ARM and x86 during a few assignments.)

1. Your compiler will optimize the C code. If there are different ways to write the code, they will all produce the same executable code, except unusual coding styles might not be handled well by the compiler. Like this one.

2. Calculating execution times on a modern processor is basically impossible. You have to measure it. If you don't measure it, forget about optimising.

3. Whatever savings you get by writing a simple loop in an awful style is dwarved by the execution time of your random number generator and therefore not worth it.

ArtOfWarfare
Apr 22, 2013, 10:28 AM
You deliberately wrote the loop in a highly unreadable style. Why? You also changed the algorithm which was quite obviously correct, and you will have a much much harder time to prove that it is actually correct.




1. Your compiler will optimize the C code. If there are different ways to write the code, they will all produce the same executable code, except unusual coding styles might not be handled well by the compiler. Like this one.

2. Calculating execution times on a modern processor is basically impossible. You have to measure it. If you don't measure it, forget about optimising.

3. Whatever savings you get by writing a simple loop in an awful style is dwarved by the execution time of your random number generator and therefore not worth it.

I would use Sandy Bridge's built in random number generator, which I understand is quite quick. Also, would the compiler see the code as the same? Would it really change a forward iteration to a backwards iteration? It seems like the fact that they'll yield the samel results is the same is unobvious on account of the fact that you're moving elements about in the array that you're iterating over. It seems to me the assembly code will have one line fewer in the loop, which didn't have many lines, so it'll go somewhere around 20% faster. People like talking about how new macs are 20% faster than old macs, so I don't see why they wouldn't want their code to execute 20% faster, too. I understand its not changing the big O class, but O(n) is hard to beat anyways.

Catfish_Man
Apr 22, 2013, 12:53 PM
I would use Sandy Bridge's built in random number generator, which I understand is quite quick. Also, would the compiler see the code as the same? Would it really change a forward iteration to a backwards iteration? It seems like the fact that they'll yield the samel results is the same is unobvious on account of the fact that you're moving elements about in the array that you're iterating over. It seems to me the assembly code will have one line fewer in the loop, which didn't have many lines, so it'll go somewhere around 20% faster. People like talking about how new macs are 20% faster than old macs, so I don't see why they wouldn't want their code to execute 20% faster, too. I understand its not changing the big O class, but O(n) is hard to beat anyways.

It won't be 20% faster. Not even close. Not all lines of assembly take the same amount of time (in fact, the same line, in a different loop iteration, may take a wildly different amount of time).

lloyddean
Apr 22, 2013, 09:42 PM
I'm in no mood to argue in these forums. I'm not going to beat my chest trying to convince you of anything. I'm here to learn things myself, if not about the technologies then about the people and their way of thinking. And mostly as an observer.

But your statement ...

You deliberately wrote the loop in a highly unreadable style.

... is very subjective and in my experiences having done systems and embedded (I'm NOT promoting my experience over yours, but for me) I see C/C++ as high-level assemblers and there are times when I want the algorithms to be what I write and NOT what the optimizer mangles it into.

Not all architectures have quality, good or bug free optimizers that can be counted on.

I will agree with your statement -

Calculating execution times on a modern processor is basically impossible. You have to measure it. If you don't measure it, forget about optimising.

That's what I've done and my experiences on "some" platforms leaves me with different conclusions.

Enjoy

ghellquist
Apr 23, 2013, 09:32 AM
.... it'll go somewhere around 20% faster.


Hi.
The thing about small-scale optimizations is that it should be done when and where it makes a difference. I do find it very difficult to stay away from that kind of optimizations myself, but maybe we should be aware about the problems it can create.

I also started out as an assembler programmer doing hard real time systems software (process control and such). This is a very special area where it sometimes is necessary to write "ugly" but fast code. The problem is that it always becomes expensive to maintain the special things after a while. It is the same in a high level language like C or C++, special code is expensive to maintain.

It is much better to write well-structured code using standard, well-known coding conventions. In my experience C sort of invites you to, sometimes, write rather convoluted code lines doing a lot of things on each line. My suggestion is to avoid that at the first pass. There are many reasons for this:
- first of all, the code actually has to do what you intend for it to do. Better write readable code than optimized. You will most likely have to read the code several times to try to find the bugs.
- secondly most code has to be maintained, often by a different programmer than the one writing it from the beginning.
- yet another aspect is that is very difficult to really understand what effect a source level optimization actually has. Often on a large processor the loop variables can be stored in registers, and the loop code cached in the fastest L1 cache. The part of the code taking long time then might be accessing memory variables (and these might be cached as well). Many optimizations on "small scale" might actually make the code run slower. A typical example is that starting by accessing an array from high adresses down to low might take a lot longer time than accessing the same array starting from low adresses. The reason is that caches intervene and might prefetch data towards higher adresses only. On a different processor, the effect might be totally different.
- the important point as I see it, is to try to measure where an optimization really does a difference. Typically this is not in a "run-once" internal initialization routine like setting up the deck for a card game. A more typical place might instead be where sprites and things are move around on the screen. Doing graphic calls without thinking can create a rather sluggish user experience.


Well, my ramblings, worth about as much as the paper it is written on (what? no paper! ).

// gunnar

subsonix
Apr 23, 2013, 09:55 AM
Lloyd, this is why I don't give out anything too valuable on these internet forums. It's an effort in futility.

In this case it's nonsense. All you would have to do is to profile it or generate assembly instead of speculating.

Short for loop:


LBB3_5: ## Parent Loop BB3_3 Depth=1
## => This Inner Loop Header: Depth=2
movabsq $-1, %rax
movq -72(%rbp), %rcx
movq %rcx, -104(%rbp) ## 8-byte Spill
addq %rax, %rcx
movq %rcx, -72(%rbp)
movq -104(%rbp), %rax ## 8-byte Reload
cmpq $0, %rax
je LBB3_7
## BB#6: ## in Loop: Header=BB3_5 Depth=2
movl $1048575, %eax ## imm = 0xFFFFF
movl %eax, -108(%rbp) ## 4-byte Spill
callq _rand
movl $-2147481599, %ecx ## imm = 0xFFFFFFFF80000801
movl %eax, -112(%rbp) ## 4-byte Spill
imull %ecx
movl -112(%rbp), %eax ## 4-byte Reload
leal (%rdx,%rax), %ecx
movl %ecx, -116(%rbp) ## 4-byte Spill
shrl $31, %ecx
movl -116(%rbp), %edx ## 4-byte Reload
sarl $19, %edx
addl %edx, %ecx
imull $1048575, %ecx, %ecx ## imm = 0xFFFFF
subl %ecx, %eax
movl -112(%rbp), %ecx ## 4-byte Reload
movl %eax, -52(%rbp)
movq -72(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r8
imulq %rsi, %r8
addq %rdi, %r8
movslq -52(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r9
imulq %rsi, %r9
addq %rdi, %r9
movq %r8, %rdi
movq %r9, %rsi
movl %ecx, -120(%rbp) ## 4-byte Spill
callq _swap
jmp LBB3_5


Idiomatic for loop:


LBB3_5: ## Parent Loop BB3_3 Depth=1
## => This Inner Loop Header: Depth=2
movq -72(%rbp), %rax
cmpq $1048576, %rax ## imm = 0x100000
jae LBB3_8
## BB#6: ## in Loop: Header=BB3_5 Depth=2
movl $1048575, %eax ## imm = 0xFFFFF
movl %eax, -100(%rbp) ## 4-byte Spill
callq _rand
movl $-2147481599, %ecx ## imm = 0xFFFFFFFF80000801
movl %eax, -104(%rbp) ## 4-byte Spill
imull %ecx
movl -104(%rbp), %eax ## 4-byte Reload
leal (%rdx,%rax), %ecx
movl %ecx, -108(%rbp) ## 4-byte Spill
shrl $31, %ecx
movl -108(%rbp), %edx ## 4-byte Reload
sarl $19, %edx
addl %edx, %ecx
imull $1048575, %ecx, %ecx ## imm = 0xFFFFF
subl %ecx, %eax
movl -104(%rbp), %ecx ## 4-byte Reload
movl %eax, -52(%rbp)
movq -72(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r8
imulq %rsi, %r8
addq %rdi, %r8
movslq -52(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r9
imulq %rsi, %r9
addq %rdi, %r9
movq %r8, %rdi
movq %r9, %rsi
movl %ecx, -112(%rbp) ## 4-byte Spill
callq _swap
## BB#7: ## in Loop: Header=BB3_5 Depth=2
movabsq $1, %rax
movq -72(%rbp), %rcx
addq %rax, %rcx
movq %rcx, -72(%rbp)
jmp LBB3_5


I tested both loop types on 100.000.000 iterations of the loop, it made no difference what so ever with a result around 48 - 50ms.

softwareguy256
Apr 23, 2013, 09:13 PM
I have no idea what you are talking about or trying to prove here. I have a feeling though you are on the wrong track. Better listen to Lloyd he knows what he is talking about.

In this case it's nonsense. All you would have to do is to profile it or generate assembly instead of speculating.

Short for loop:


LBB3_5: ## Parent Loop BB3_3 Depth=1
## => This Inner Loop Header: Depth=2
movabsq $-1, %rax
movq -72(%rbp), %rcx
movq %rcx, -104(%rbp) ## 8-byte Spill
addq %rax, %rcx
movq %rcx, -72(%rbp)
movq -104(%rbp), %rax ## 8-byte Reload
cmpq $0, %rax
je LBB3_7
## BB#6: ## in Loop: Header=BB3_5 Depth=2
movl $1048575, %eax ## imm = 0xFFFFF
movl %eax, -108(%rbp) ## 4-byte Spill
callq _rand
movl $-2147481599, %ecx ## imm = 0xFFFFFFFF80000801
movl %eax, -112(%rbp) ## 4-byte Spill
imull %ecx
movl -112(%rbp), %eax ## 4-byte Reload
leal (%rdx,%rax), %ecx
movl %ecx, -116(%rbp) ## 4-byte Spill
shrl $31, %ecx
movl -116(%rbp), %edx ## 4-byte Reload
sarl $19, %edx
addl %edx, %ecx
imull $1048575, %ecx, %ecx ## imm = 0xFFFFF
subl %ecx, %eax
movl -112(%rbp), %ecx ## 4-byte Reload
movl %eax, -52(%rbp)
movq -72(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r8
imulq %rsi, %r8
addq %rdi, %r8
movslq -52(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r9
imulq %rsi, %r9
addq %rdi, %r9
movq %r8, %rdi
movq %r9, %rsi
movl %ecx, -120(%rbp) ## 4-byte Spill
callq _swap
jmp LBB3_5


Idiomatic for loop:


LBB3_5: ## Parent Loop BB3_3 Depth=1
## => This Inner Loop Header: Depth=2
movq -72(%rbp), %rax
cmpq $1048576, %rax ## imm = 0x100000
jae LBB3_8
## BB#6: ## in Loop: Header=BB3_5 Depth=2
movl $1048575, %eax ## imm = 0xFFFFF
movl %eax, -100(%rbp) ## 4-byte Spill
callq _rand
movl $-2147481599, %ecx ## imm = 0xFFFFFFFF80000801
movl %eax, -104(%rbp) ## 4-byte Spill
imull %ecx
movl -104(%rbp), %eax ## 4-byte Reload
leal (%rdx,%rax), %ecx
movl %ecx, -108(%rbp) ## 4-byte Spill
shrl $31, %ecx
movl -108(%rbp), %edx ## 4-byte Reload
sarl $19, %edx
addl %edx, %ecx
imull $1048575, %ecx, %ecx ## imm = 0xFFFFF
subl %ecx, %eax
movl -104(%rbp), %ecx ## 4-byte Reload
movl %eax, -52(%rbp)
movq -72(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r8
imulq %rsi, %r8
addq %rdi, %r8
movslq -52(%rbp), %rsi
movq -48(%rbp), %rdi
movabsq $4, %r9
imulq %rsi, %r9
addq %rdi, %r9
movq %r8, %rdi
movq %r9, %rsi
movl %ecx, -112(%rbp) ## 4-byte Spill
callq _swap
## BB#7: ## in Loop: Header=BB3_5 Depth=2
movabsq $1, %rax
movq -72(%rbp), %rcx
addq %rax, %rcx
movq %rcx, -72(%rbp)
jmp LBB3_5


I tested both loop types on 100.000.000 iterations of the loop, it made no difference what so ever with a result around 48 - 50ms.

Catfish_Man
Apr 23, 2013, 11:58 PM
I have no idea what you are talking about or trying to prove here. I have a feeling though you are on the wrong track. Better listen to Lloyd he knows what he is talking about.

That's one of the most depressing replies I've seen on this forum.

"I have no idea what you said, but you're wrong" is so aggressively ignorant that I actually hope you're just trying to provoke him, and not actually serious.

subsonix
Apr 24, 2013, 06:26 AM
I have no idea what you are talking about or trying to prove here.

We are discussing the impact of: 'for(int i = SIZE; i--; )' over: 'for(int i = 0; i < SIZE; i++)'

What I suggested was to output the assembly produced by the compiler for each loop type, and I also provided it in the post. The first example is the shorter for loop with both the evaluation and decrement step in the middle part as Lloyd suggested. This loop produces assembly with all branching code in the beginning, up until the ##BB#6 comment, (not including the jmp instruction at the very end). The idiomatic for loop does the incrementation at the end of the loop after ##BB#7 and the branching in the beginning before ##BB#6, much like the loop in C. However, the idiomatic for loop appears to actually use 1 less instruction for this.

When I tested the loop types I used an array of size 1.000.000, and shuffled it 100 times which gives 100.000.000 iterations in total. I added a timestamp with mach_absolute_time() before and after the loop and converted it to nano seconds with AbsoluteToNanoseconds(), I then divided the result with 100 to get an average time spent shuffling an array of 1.000.000. The reason for this last step is to suppress the impact the OS and scheduler may have on our process. Skipping this last step would however show a difference (if any) more clearly afaik.


Edit:

Running the shorter for loop test 5 times:
Elapsed time: 43960619ns
Elapsed time: 44968580ns
Elapsed time: 44714770ns
Elapsed time: 45216508ns
Elapsed time: 45167888ns

Running the idiomatic for loop test 5 times:
Elapsed time: 43724876ns
Elapsed time: 43464096ns
Elapsed time: 43796514ns
Elapsed time: 43297202ns
Elapsed time: 45620971ns

gnasher729
Apr 24, 2013, 10:40 AM
I have no idea what you are talking about or trying to prove here. I have a feeling though you are on the wrong track. Better listen to Lloyd he knows what he is talking about.

1. He measured. That's the most important thing of all. If you don't measure, you don't know whether the code you want to make faster is actually worth spending your time on, and you don't know whether the time you spent to make it faster is actually making it faster, so it's just a waste of time. So A says "I write the code this way because it runs faster". B says "I checked, it doesn't".

2. He looked at the assembler code. _If_ you go down to that level, then writing unreadable code in the hope that it runs faster, without actually checking, is again just a waste of time. He looked at the code, and there was no benefit from writing a loop that was harder to understand. Actually, that loop performed two branches, which are most likely to hurt performance. So A says "I write the code this way because the assembler code is better". B says "I checked, it isn't".

Micro-optimisations on that level rarely ever pay. Here's what might pay: Use vectors.

typedef unsigned char vec_uint8 __attribute__((__vector_size__(16)));
typedef int vec_int32 __attribute__((__vector_size__(16)));
typedef float vec_float __attribute__((__vector_size__(16)));


This declares types that are vectors of 16 bytes, 4 ints, or 4 floats. Both MacOS X and iOS compilers fully support these types, so you can do up to 16 operations in a single instruction. That can give you a huge factor in speed.

Use multiple threads with GCD. No problem running 8 threads on a 15" MBP, or two threads on an iPhone. 2 to 8 times less time without problems.

Make sure that you know about your caches. When you have lots of data, do all the work on a subset that fits into cache, then another subset that fits into cache, and so on. This can make a _huge_ difference.

All these things operate on a much higher level, and that's where the money is.

ArtOfWarfare
Apr 24, 2013, 11:09 AM
Here's what might pay: Use vectors.

typedef unsigned char vec_uint8 __attribute__((__vector_size__(16)));
typedef int vec_int32 __attribute__((__vector_size__(16)));
typedef float vec_float __attribute__((__vector_size__(16)));


This declares types that are vectors of 16 bytes, 4 ints, or 4 floats. Both MacOS X and iOS compilers fully support these types, so you can do up to 16 operations in a single instruction. That can give you a huge factor in speed.

How would you utilize these in an array shuffling algorithm? Or are we past the point of actually talking about shuffling arrays at all?

It's interesting that I've never heard of these types before. I can see how they'd be very helpful with performing vector operations... but it seems like I rarely need to do anything with those without an API separating me from the vector.

softwareguy256
Apr 25, 2013, 11:05 PM
No what is depressing is that no one sees the glaring mistake except Lloyd and I. I am not as charitable as Lloyd though, I think I gave you guys enough help. If you still can't figure it out, you'll be making 5 figures the rest of your life.

That's one of the most depressing replies I've seen on this forum.

"I have no idea what you said, but you're wrong" is so aggressively ignorant that I actually hope you're just trying to provoke him, and not actually serious.

----------

I would optimize code that is correct first lol. THat is what is so freaking hilarious, guy wastes all this effort for nothing.

1. He measured. That's the most important thing of all. If you don't measure, you don't know whether the code you want to make faster is actually worth spending your time on, and you don't know whether the time you spent to make it faster is actually making it faster, so it's just a waste of time. So A says "I write the code this way because it runs faster". B says "I checked, it doesn't".

2. He looked at the assembler code. _If_ you go down to that level, then writing unreadable code in the hope that it runs faster, without actually checking, is again just a waste of time. He looked at the code, and there was no benefit from writing a loop that was harder to understand. Actually, that loop performed two branches, which are most likely to hurt performance. So A says "I write the code this way because the assembler code is better". B says "I checked, it isn't".

Micro-optimisations on that level rarely ever pay. Here's what might pay: Use vectors.

typedef unsigned char vec_uint8 __attribute__((__vector_size__(16)));
typedef int vec_int32 __attribute__((__vector_size__(16)));
typedef float vec_float __attribute__((__vector_size__(16)));


This declares types that are vectors of 16 bytes, 4 ints, or 4 floats. Both MacOS X and iOS compilers fully support these types, so you can do up to 16 operations in a single instruction. That can give you a huge factor in speed.

Use multiple threads with GCD. No problem running 8 threads on a 15" MBP, or two threads on an iPhone. 2 to 8 times less time without problems.

Make sure that you know about your caches. When you have lots of data, do all the work on a subset that fits into cache, then another subset that fits into cache, and so on. This can make a _huge_ difference.

All these things operate on a much higher level, and that's where the money is.