1. He measured. That's the most important thing of all. If you don't measure, you don't know whether the code you want to make faster is actually worth spending your time on, and you don't know whether the time you spent actually made it faster. Without measuring, the whole effort is just a waste of time. So A says "I write the code this way because it runs faster". B says "I checked, it doesn't".
2. He looked at the assembler code. _If_ you go down to that level, then writing unreadable code in the hope that it runs faster, without actually checking, is again just a waste of time. He looked at the generated code, and there was no benefit from writing a loop that was harder to understand. In fact, that loop performed two branches, which is more likely to hurt performance than to help it. So A says "I write the code this way because the assembler code is better". B says "I checked, it isn't".
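To make "he measured" concrete, here is a minimal timing sketch. It assumes a POSIX system with `clock_gettime`; the names `now_seconds`, `best_time` and `example_work` are made up for illustration and are not from the code being discussed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* assumption: POSIX clock_gettime is available */
#endif
#include <time.h>

/* Current monotonic time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs fn(arg) `repeats` times and returns the fastest single run in seconds.
   Taking the minimum filters out runs disturbed by other processes. */
static double best_time(void (*fn)(void *), void *arg, int repeats)
{
    double best = -1.0;
    for (int i = 0; i < repeats; i++) {
        double start = now_seconds();
        fn(arg);
        double elapsed = now_seconds() - start;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical workload: sums 0..n-1. The result goes into a volatile
   variable so the compiler cannot throw the whole computation away. */
static volatile unsigned long sink;

static void example_work(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i;
    sink = sum;
}
```

Time both versions of the code with something like `best_time(example_work, &n, 20)`, on realistic input sizes and with optimisation turned on, and only keep the "clever" version if the numbers say so.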
Micro-optimisations at that level rarely pay off. Here's what might pay off: Use vectors.
    typedef unsigned char vec_uint8 __attribute__((__vector_size__(16)));
    typedef int vec_int32 __attribute__((__vector_size__(16)));
    typedef float vec_float __attribute__((__vector_size__(16)));
This declares types that are vectors of 16 bytes, 4 ints, or 4 floats. The compilers for both Mac OS X and iOS fully support these types, so you can do up to 16 operations in a single instruction. That can give you a huge speedup.
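A rough sketch of how such a vector type might be used, assuming GCC or Clang (the vector extension is not standard C). The function `scale_add` is a made-up example, and `memcpy` is used for loads and stores so the arrays don't have to be 16-byte aligned.

```c
#include <stddef.h>
#include <string.h>

typedef float vec_float __attribute__((__vector_size__(16)));

/* Hypothetical example: dst[i] = a[i] * scale + b[i] for n floats. */
static void scale_add(float *dst, const float *a, const float *b,
                      float scale, size_t n)
{
    const vec_float vscale = { scale, scale, scale, scale };
    size_t i = 0;

    /* Main loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        vec_float va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load that works for any alignment */
        memcpy(&vb, b + i, sizeof vb);
        vr = va * vscale + vb;           /* element-wise multiply and add */
        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; i++)
        dst[i] = a[i] * scale + b[i];
}
```

The compiler may well vectorise the plain scalar loop on its own, so measure this version against the simple one before keeping it.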
Use multiple threads with GCD. Running 8 threads on a 15" MBP, or two threads on an iPhone, is no problem. That cuts the run time by a factor of 2 to 8 without problems.
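GCD needs Apple's libdispatch, so the sketch below swaps in plain POSIX threads to show the same idea in portable C: split the work into independent chunks and give each chunk to its own thread. With GCD, `dispatch_apply` would take care of handing out the chunks. `NUM_THREADS`, `struct chunk`, `sum_chunk` and `parallel_sum` are names made up for this example.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: choose this to match the number of cores */

struct chunk {
    const float *data;   /* start of this thread's part of the array */
    size_t count;        /* number of elements in this part */
    double sum;          /* result, written once by the worker */
};

/* Worker: sums one chunk. */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = 0; i < c->count; i++)
        s += c->data[i];
    c->sum = s;
    return NULL;
}

/* Sums n floats using NUM_THREADS threads. */
static double parallel_sum(const float *data, size_t n)
{
    pthread_t threads[NUM_THREADS];
    struct chunk chunks[NUM_THREADS];
    int started[NUM_THREADS];
    size_t per = n / NUM_THREADS;
    double total = 0.0;

    for (int t = 0; t < NUM_THREADS; t++) {
        size_t begin = (size_t)t * per;
        chunks[t].data = data + begin;
        /* The last chunk also takes the leftover elements. */
        chunks[t].count = (t == NUM_THREADS - 1) ? n - begin : per;
        chunks[t].sum = 0.0;
        started[t] = pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]) == 0;
        if (!started[t])
            sum_chunk(&chunks[t]);   /* could not start a thread: do the work here */
    }

    for (int t = 0; t < NUM_THREADS; t++) {
        if (started[t])
            pthread_join(threads[t], NULL);
        total += chunks[t].sum;
    }
    return total;
}
```

This only pays when each chunk is large enough to outweigh the cost of starting and joining the threads, which is again something to measure.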
Make sure that you know about your caches. When you have lots of data, do all the work on a subset that fits into cache, then another subset that fits into cache, and so on. This can make a _huge_ difference.
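A small sketch of that pattern, assuming the work consists of several passes over a large array. The block size and the three passes (`pass_scale`, `pass_offset`, `pass_clamp`) are invented for illustration; the right block size depends on the actual cache sizes of the machine.

```c
#include <stddef.h>

#define BLOCK 4096   /* assumption: 4096 floats = 16 KB, small enough for a typical L1 data cache */

/* Three hypothetical passes over an array. */
static void pass_scale(float *x, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}

static void pass_offset(float *x, size_t n, float o)
{
    for (size_t i = 0; i < n; i++)
        x[i] += o;
}

static void pass_clamp(float *x, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] < lo) x[i] = lo;
        if (x[i] > hi) x[i] = hi;
    }
}

/* Runs all three passes on one block while it is still in cache,
   then moves on to the next block. Running each pass over the whole
   array instead would pull a large array through the cache three times. */
static void process_blocked(float *x, size_t n)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        pass_scale(x + start, len, 2.0f);
        pass_offset(x + start, len, 1.0f);
        pass_clamp(x + start, len, 0.0f, 100.0f);
    }
}
```

With passes this simple you could also merge them into a single loop; the blocked form matters most when the passes are separate functions or library calls that you can't easily fuse.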
All these things operate on a much higher level, and that's where the money is.