Even if Apple jacked up their own chips in secret, and announced them next week, people would freak. Talk about Beta testing. It's bad enough with their reckless disregard for port standards :D
 
I don't see any theoretical or practical reason why this should not be doable. You take the x86 binary, translate it to LLVM IR, then let LLVM optimise and output the ARM binary. This challenge is equivalent to "classical" compiling and optimising (sans, of course, the type meta-information you have in a high-level language). AFAIK, there are projects that attempt to do just that (I don't know how successful they are, though). As to vector instructions, they might be slightly different between the architectures, but NEON and SSE/AVX are quite similar in spirit — as far as I know (it has been some time since I last did any hand-coded vector assembly), they have comparable types of instructions — and I see no fundamental problem in mapping one onto the other. Now, stuff like AVX-512 with its mask registers etc. is another beast altogether, and emulating those would probably be very slow, but that's not a problem one has to deal with within the next few years, I guess.

To sum it up, I am quite optimistic about binary recompilation of this sort. Again, this is purely theoretical. I am not advocating that Apple move to ARM. I am merely interested in discussing if/how this could be accomplished and what the practical implications of such a move might be.
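
To illustrate what I mean by NEON and SSE being similar in spirit, here is a minimal sketch using compiler intrinsics (each half compiles only on its own architecture, and the function names are made up, of course):

// x86 SSE: add four packed single-precision floats at once
#include <xmmintrin.h>
__m128 add4_sse(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);   /* a single addps instruction */
}

// ARM NEON: the very same operation
#include <arm_neon.h>
float32x4_t add4_neon(float32x4_t a, float32x4_t b) {
    return vaddq_f32(a, b);    /* a single fadd v.4s instruction */
}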

There is no x86 binary to LLVM IR; compilation is a one-way operation. You don't go from machine-specific code back to any intermediary that can be recompiled without some form of emulation, not to mention you would need translations for ALL the libraries referenced by a particular program (think WINE for Linux, and that's for binaries on the same architecture). And like I already said, ARM doesn't even have equivalent super-scalar instructions used by many professional tools (video editing, computational)
 
There is no x86 binary to LLVM IR; compilation is a one-way operation. You don't go from machine-specific code back to any intermediary that can be recompiled without some form of emulation

Ever heard of disassemblers? Why would you even say that going from x86 to LLVM IR is impossible? It's just another compiler, and a fairly straightforward one (it's just the details that are tricky to get right). You take one well-defined language and transform it into another well-defined language.

And anyway, just the first two Google search results on the topic:

https://github.com/trailofbits/mcsema
http://llvm.org/devmtg/2013-04/bougacha-slides.pdf

Edit: and an x86 to C/C++ decompiler:

http://derevenets.com/examples.html
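
To give a flavour of what such tools aim to produce, here is a hand-made sketch for a trivial function (purely illustrative, not the actual output of any of the tools above; the names would be invented by the decompiler):

#include <stdint.h>

/* x86-64 input:
 *   mov eax, [rdi]   ; load the 32-bit value rdi points to
 *   add eax, 1
 *   ret              ; result is returned in eax
 * plausible reconstructed C: */
int32_t deref_inc(const int32_t *p) {
    return *p + 1;
}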

not to mention you would need translations for ALL the libraries referenced by a particular program (think WINE for Linux, and that's for binaries on the same architecture).

Third-party libraries could be compiled the same way I have described above. System libraries are simply provided by the OS. The compilation to a different architecture would be completely transparent to the application.

And like I already said, ARM doesn't even have equivalent super-scalar instructions used by many professional tools (video editing, computational)

Could you explain what you mean by "super-scalar instructions"? I am not familiar with the term. I know what a superscalar CPU is, but I have never heard the term applied to instructions or instruction sets.
 
Could you explain what you mean by "super-scalar instructions"? I am not familiar with the term. I know what a superscalar CPU is, but I have never heard the term applied to instructions or instruction sets.

AVX instructions. SSE4.1+ (probably?)
 
Ever heard of disassemblers? Why would you even say that going from x86 to LLVM IR is impossible? It's just another compiler, and a fairly straightforward one (it's just the details that are tricky to get right). You take one well-defined language and transform it into another well-defined language.

And anyway, just the first two Google search results on the topic:

https://github.com/trailofbits/mcsema
http://llvm.org/devmtg/2013-04/bougacha-slides.pdf

Edit: and an x86 to C/C++ decompiler:

http://derevenets.com/examples.html



Third-party libraries could be compiled the same way I have described above. System libraries are simply provided by the OS. The compilation to a different architecture would be completely transparent to the application.



Could you explain what you mean by "super-scalar instructions"? I am not familiar with the term. I know what a superscalar CPU is, but I have never heard the term applied to instructions or instruction sets.

Superscalar instructions are superscalar CPU architectures with additional instructions that support superscalar operations such as Intel's VT-d, VT-x, SSE, SIMD. (VT-x/d are not directly superscalar but allow virtualization of superscalar instructions)

Disassemblers go from machine code to ASM of that architecture, i.e. you will get x86/64 ASM, not cross-compatible C/C++. That alone does not let you convert the binary with a 1:1 conversion to a new architecture, because you can't guarantee the same memory access patterns on the new host. Suppose the host code had N bytes reserved for instruction memory and Y bytes reserved for data. You would have to analyze that; simply allocating the same amount in LLVM IR would break, because the code size for IR is not 1:1 and the memory usage is not 1:1.

No such product exists even for like architectures, let alone different architectures. Super-scalar instructions utilize multiple hardware datapaths in parallel. For example, if you are doing video encoding, since many things from frame to frame can be done in parallel, you can do things like super-scalar addition operations which, instead of doing a single register + register = register, will do a full vector of values. This is a very basic case, but instructions like that, which are the cornerstone of most high-performance applications, just won't exist under an ARM ISA.
 
Superscalar instructions are superscalar CPU architectures with additional instructions that support superscalar operations such as Intel's VT-d, VT-x, SSE.

1. Sigh... Superscalar is a term applied to CPUs which reorder/reoptimise the instruction stream and are thus able to execute multiple instructions in parallel. This is also often referred to as instruction level parallelism. Both modern Intel x86 and Apple ARM CPUs are superscalar (exception: Intel Atom, which is not superscalar).

2. SSE/AVX are SIMD/vector instruction set extensions. They have nothing to do with being superscalar. ARM also has its own SIMD instructions, which are very similar to SSE/AVX. It's called NEON.

Disassemblers go from machine code to ASM of that architecture, i.e. you will get x86/64 ASM, not cross-compatible C/C++.

3. In my previous post I linked a number of tools that decompile x86 into different representations, such as LLVM IR and C/C++. I find it mildly amusing that you still deny the possibility of such a tool even after it was presented to you.

That alone does not let you convert the binary with a 1:1 conversion to a new architecture, because you can't guarantee the same memory access patterns on the new host. Suppose the host code had N bytes reserved for instruction memory and Y bytes reserved for data. You would have to analyze that; simply allocating the same amount in LLVM IR would break, because the code size for IR is not 1:1 and the memory usage is not 1:1.

4. I don't see why it matters. Of course one would recompute the offsets appropriately. That's trivial. LLVM has very flexible support for data types and can match all native data types used by x86 and ARM CPUs. The only tricky part is when it comes to alignment of data and when code makes assumptions about data sizes. Which, luckily, is not an issue in this particular case, because x86-64 and A64 have the exact same data size and alignment specs.
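
If you want to convince yourself of the size/alignment claim, a sketch like this compiles cleanly as C11 on both platforms, since both use the LP64 data model:

#include <stdalign.h>

/* LP64 model: identical results on x86-64 and A64 (AArch64) */
_Static_assert(sizeof(int)     == 4, "int is 4 bytes");
_Static_assert(sizeof(long)    == 8, "long is 8 bytes");
_Static_assert(sizeof(void *)  == 8, "pointers are 8 bytes");
_Static_assert(alignof(double) == 8, "double is 8-byte aligned");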

Super-scalar instructions utilize multiple hardware datapaths in parallel. For example, if you are doing video encoding, since many things from frame to frame can be done in parallel, you can do things like super-scalar addition operations which, instead of doing a single register + register = register, will do a full vector of values. This is a very basic case, but instructions like that, which are the cornerstone of most high-performance applications, just won't exist under an ARM ISA.

5. Again, this is called SIMD and has nothing to do with superscalar execution. Please don't invent new terminology in order to make a point. Intel Atom is not superscalar, but it supports SSE.

6. ARM has had vector instructions for years: https://www.arm.com/products/processors/technologies/neon.php

7. All Apple ARM chips fully support advanced ARM vector instructions, and Apple even offers devs a number of well-implemented numeric libraries that take full advantage of such instructions, on both the Intel and the ARM side.

------------

Final note: I don't think I want to continue this discussion any longer. You seem to have some experience with programming and you seem to have read some tech articles here and there, for all the good it did you. But it's also clear that you are very adamant about ignoring your lack of basic education, like the fact that you don't know what superscalar means or that ARM has SIMD instructions. I already have to explain some of this stuff in my day job as a university lecturer and programmer, so you'll have to excuse me if I get bored quickly when I also have to do it on an internet forum.
 
5. Again, this is called SIMD and has nothing to do with superscalar execution. Please don't invent new terminology in order to make a point. Intel Atom is not superscalar, but it supports SSE.
Intel Atom supports out-of-order execution since Silvermont.
 
Intel Atom supports out-of-order execution since Silvermont.

Ah, true, thanks, I forgot about this one :) Still, my main point was to show that instruction-level parallelism and SIMD instructions are orthogonal features.
 
4. I don't see why it matters. Of course one would recompute the offsets appropriately. That's trivial. LLVM has very flexible support for data types and can match all native data types used by x86 and ARM CPUs. The only tricky part is when it comes to alignment of data and when code makes assumptions about data sizes. Which, luckily, is not an issue in this particular case, because x86-64 and A64 have the exact same data size and alignment specs.
What about platform-specific ABIs? Backend-specific behaviors? Differences in consistency models?
 
You are confusing out-of-order execution and superscalar, i.e. if you have a multiply op and then an add op that reference completely separate registers, but the multiply is waiting for a previous instruction, you can do the add instruction first and later retire the op in order.

SIMD, and all vector operations, are superscalar execution... SIMD - Single Instruction Multiple Data - can't process multiple data in a single pipeline stage without being superscalar. Maybe you should take a basic computer architecture class.

ARM superscalar arch is not for high performance; it's for better efficiency in a particular power envelope. A basic VLSI class would tell you that completing a particular task faster saves more power than lowering the clock speed or power-gating more transistors. x86-64 doesn't have the same instruction data size; 64-bit only refers to memory ALIGNMENT. x86-64 has variable-size instructions, ARM does not. LLVM is not meant to reverse engineer and recompile for different architectures; that is a much more complicated problem, and there is no way Apple will move to any ARM-based processors in a MacBook Pro in the foreseeable future. End of story.

Lastly, the tools you mentioned are not for what you think they are lol. Feel free to take a lib written for x86-64, decompile it to C, and then try to recompile it with LLVM for AArch64: 0% chance of the program working. There is a reason why cross-compilers don't support all libraries by default.
 
You are confusing out-of-order execution and superscalar, i.e. if you have a multiply op and then an add op that reference completely separate registers, but the multiply is waiting for a previous instruction, you can do the add instruction first and later retire the op in order.

Again, the common definition of superscalar is the ability to execute multiple instructions at once. Which is why most relevant superscalar CPUs are out of order. These two things go hand in hand in modern CPU design.

SIMD, and all vector operations, are superscalar execution... SIMD - Single Instruction Multiple Data - can't process multiple data in a single pipeline stage without being superscalar.

You don't have to be superscalar to have wide ALUs (e.g. older GPUs, I am not sure if modern ones are superscalar designs). SIMD = single instruction — a CPU can have SIMD without the ability to execute multiple instructions in parallel.

ARM superscalar arch is not for high performance; it's for better efficiency in a particular power envelope.

Which can be said about any superscalar CPU. Superscalar and out-of-order are there to maximise the utilisation of execution units.

x86-64 doesn't have the same instruction data size ... x86-64 has variable-size instructions, ARM does not

I fail to see why this is relevant. The instruction mapping won't be one to one in any case.

64-bit only refers to memory ALIGNMENT

It first and foremost refers to basic pointer and register size. Anyway, as I mentioned before, the sizes and alignments of basic data types (chars, ints, longs, void* etc.) are the same for A64 and x86-64, which means that all data structures are binary compatible.

What about platform-specific ABIs?

Those can also be translated automatically (they are regular, after all). I am not 100% sure at the moment whether the translator would need access to function signatures, or whether it can do without. I suppose signatures are not needed if the translator acts conservatively, but that might hurt performance.
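
To make the "regular" part concrete: the integer argument registers of the two ABIs line up almost mechanically, so a translator could carry a table like this (a sketch; a real translator would also have to handle stack arguments, variadics, struct passing and so on):

/* Hypothetical mapping: System V x86-64 integer/pointer argument
 * registers to their AAPCS64 (A64) counterparts. */
static const char *const arg_reg_map[][2] = {
    { "rdi", "x0" },  /* 1st argument */
    { "rsi", "x1" },  /* 2nd */
    { "rdx", "x2" },  /* 3rd */
    { "rcx", "x3" },  /* 4th */
    { "r8",  "x4" },  /* 5th */
    { "r9",  "x5" },  /* 6th; A64 also has x6/x7 before spilling to the stack */
};
/* return value: rax maps to x0 */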

Differences in consistency models?

Now, this is going to be a problem :D Admittedly, I don't know anything about the consistency model in the ARM spec. But yeah, that is potentially a big can of worms. In fact, multithreading is indeed where this entire story probably fails spectacularly. Unless Apple made sure that their CPUs have behaviour compatible with Intel's.
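
For anyone following along, this is the classic message-passing pattern where the two models differ (a deliberately racy sketch; assume the compiler emits the stores and loads in program order):

int data = 0, flag = 0;

/* thread 1 */
void producer(void) {
    data = 42;
    flag = 1;   /* x86 (TSO): the two stores become visible in this order */
}

/* thread 2 */
void consumer(void) {
    while (flag == 0) { }   /* spin until flag is set */
    int r = data;  /* x86: guaranteed to read 42; ARM: may still read 0,
                    * because ARM allows these accesses to be reordered
                    * unless explicit barriers (dmb) are inserted, which
                    * a translator would have to add conservatively */
    (void)r;
}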

I can think of a number of additional scenarios where the x86 to ARM translation can't work: any kind of software that generates code (JITers), any kind of software that is carefully crafted in assembler (interpreters and friends), any kind of self-modifying software, etc.
 
Again, the common definition of superscalar is the ability to execute multiple instructions at once. Which is why most relevant superscalar CPUs are out of order. These two things go hand in hand in modern CPU design.

Superscalar is completely independent from out-of-order, e.g. Intel's Xeon Phi. A completely different optimization is at play.

You don't have to be superscalar to have wide ALUs (e.g. older GPUs, I am not sure if modern ones are superscalar designs). SIMD = single instruction — a CPU can have SIMD without the ability to execute multiple instructions in parallel.

Please provide an example of any general-purpose processor that has wide functional units without being superscalar. How would you utilize it in a load/store machine?

Superscalar instructions are used when ILP can replace single-issue instructions. For example, two arrays being added together with the result put in a third array: instead of incrementing the pointer by one data value, doing the add, and writing to memory, the normal instructions would be replaced by a single SIMD instruction that loads N bytes, does N adds in parallel, and then writes the N results to memory, where N is the width of that particular instruction.
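
In code, that transformation looks roughly like this (a sketch assuming SSE2 and n divisible by four):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* scalar: one 32-bit add per loop iteration */
void add_scalar(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* vectorised: four 32-bit adds per instruction */
void add_sse2(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
}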

Which can be said about any superscalar CPU. Superscalar and out-of-order are there to maximise the utilisation of execution units.

No, it really can't. All superscalar/out-of-order/VLIW archs are not equal; design tradeoffs are based on the requirements for power and the eventual application (mobile, embedded, server, etc.). The ARM ISA is not suitable for the needs that current Intel chips serve.

I fail to see why this is relevant. The instruction mapping won't be one to one in any case.

Instruction mapping won't work, period. If the original program uses some memory-mapped feature in one instruction set, the new instruction set wouldn't know how to reimplement it unless it was manually patched. It would treat it like a regular memory access and the program would fail. Example: run the x86-64 version of perf, convert it to an arm64 binary -> kernel panic when trying to access memory locations that map to counter data that doesn't exist.

It first and foremost refers to basic pointer and register size. Anyway, as I mentioned before, the sizes and alignments of basic data types (chars, ints, longs, void* etc.) are the same for A64 and x86-64, which means that all data structures are binary compatible.

No, chars, ints, longs, and void*s don't exist to the hardware. These are just software abstractions interpreted by the compiler. Also, pointers don't point to 64-bit data types; both x86-64 and AArch64 are byte-addressable, not to mention only the lower 48 bits are used for addresses.
 
Superscalar is completely independent from out-of-order, e.g. Intel's Xeon Phi. A completely different optimization is at play.

That's what I said, yes. I also said that superscalar plays nicely with out-of-order; that's why most modern CPUs are both.


Please provide an example of any general-purpose processor that has wide functional units without being superscalar. How would you utilize it in a load/store machine?

I did: Intel Atom before Silvermont.

Superscalar instructions are used when ILP can replace single-issue instructions. For example, two arrays being added together with the result put in a third array: instead of incrementing the pointer by one data value, doing the add, and writing to memory, the normal instructions would be replaced by a single SIMD instruction that loads N bytes, does N adds in parallel, and then writes the N results to memory, where N is the width of that particular instruction.

This is not an optimisation that the CPU usually does by itself, at least not something I am aware of. Auto-vectorization of the kind you talk about is usually performed by the compiler, and it has nothing to do with ILP.



Instruction mapping won't work, period. If the original program uses some memory-mapped feature in one instruction set, the new instruction set wouldn't know how to reimplement it unless it was manually patched. It would treat it like a regular memory access and the program would fail. Example: run the x86-64 version of perf, convert it to an arm64 binary -> kernel panic when trying to access memory locations that map to counter data that doesn't exist.

That's why one would fix up addresses while translating. It's not difficult to do.

No, chars, ints, longs, and void*s don't exist to the hardware.

They kind of do, because it's something that the CPU can directly operate on. As long as the CPU has instructions that can specifically manipulate bytes, words (of different sizes), etc., these are basic data types that the CPU supports. The C language datatypes are a very low-level abstraction and they map more or less directly to what the CPU can do. So it makes a big difference whether a pointer on a platform is four, eight, or 16 bytes.

Also, pointers don't point to 64-bit data types; both x86-64 and AArch64 are byte-addressable, not to mention only the lower 48 bits are used for addresses.

Where did I claim that pointers point to 64-bit data types? The relevant part is that pointers themselves are 64-bit.
 
I am a bit concerned that the only true leak for the MacBook Pro was the picture of it which showed an empty hole for the magic function keys.

I am worried that the production and distribution chains are far from ready (so no leaks, because there is nothing to show) and we are going to have to wait a couple more months before seeing anything in the stores.

Am I the only one?
In comparison, a month before the iPhone 7 event, we had countless pictures, both of the final product and of several hardware components, plus the full final spec pages...
Why are you so worried and concerned? Take a chill pill and enjoy the sun.
 
P.S. BTW, I am really sorry for derailing this thread, I guess we got too carried away :) To make things clear, I am not trivialising the problem, and I certainly don't claim that x86 to ARM is a solved problem or that Apple should just release those ARM Macs. I am merely curious about whether and how such a transition could be done. Translation from x86 to ARM sounds very enticing here. And I think that Apple could pull it off if they make sure that their CPUs have the same behaviour (e.g. reads/writes in a multiprocessor system etc.) as Intel ones.
 
I did: Intel Atom before Silvermont.
?
Intel Atom was an in-order 2-wide superscalar arch prior to Silvermont; out-of-order was added with Silvermont...

This is not an optimisation that the CPU usually does by itself, at least not something I am aware of. Auto-vectorization of the kind you talk about is usually performed by the compiler, and it has nothing to do with ILP.

Apple, Qualcomm, and NVIDIA do it. They do not implement ARM's stock microarchitectures; VLIW implementations of ARM combine multiple ARM instructions into a single instruction which is decomposed into micro-ops. Directly related to ILP.

That's why one would fix up addresses while translating. It's not difficult to do.

You can't translate memory-mapped addresses from one architecture to another (or even within the same architecture, unless it's from the same family); memory-mapped addresses refer to an SoC peripheral, not a real memory address. You would have to recompile from source with code changes. See the perf example above.

They kind of do, because it's something that the CPU can directly operate on. As long as the CPU has instructions that can specifically manipulate bytes, words (of different sizes), etc., these are basic data types that the CPU supports. The C language datatypes are a very low-level abstraction and they map more or less directly to what the CPU can do. So it makes a big difference whether a pointer on a platform is four, eight, or 16 bytes.

The C language is not at all a low-level abstraction. There is no direct reference to the stack or to specific registers. You can't modify or view the current PC. There is no such thing as a basic data type, only addressability and, in the x86-64 case, op input size, as far as the hardware is concerned. (Processors that support bit banding can address single bits if needed.)

Example x86-64:
83C0FF add eax,byte +0xff ; 8-bit immediate, sign-extended by the CPU

You can't do add eax,byte +0xffffffffffffffff; you would have to load the value into a register, or have it at a memory offset and use a different instruction mode.

Where did I claim that pointers point to 64-bit data types? The relevant part is that pointers themselves are 64-bit.

"It first and foremost refers to basic pointer and register size" - false

Pointers themselves are 48-bit in most modern 64-bit architectures.
Register sizes are not tied to the arch; you can have 64-bit with 8-bit registers, 32-bit registers, or 128-bit registers.

Go through a basic comp arch class on YouTube or something; your knowledge is limited to high-level software.
 
?
Intel Atom was an in-order 2-wide superscalar arch prior to Silvermont; out-of-order was added with Silvermont...

Sorry, my mistake. True, Atom is limited dual-issue. OK, then let's take the ARM Cortex-A5: single-issue, but with SIMD instructions.


Apple, Qualcomm, and NVIDIA do it. They do not implement ARM's stock microarchitectures; VLIW implementations of ARM combine multiple ARM instructions into a single instruction which is decomposed into micro-ops. Directly related to ILP.

Why are we talking about VLIW now? I thought we were talking about auto-vectorisation, like in your example. What CPU can autovectorize loops and turn non-SIMD instructions into SIMD instructions?

You can't translate memory-mapped addresses from one architecture to another (or even within the same architecture, unless it's from the same family); memory-mapped addresses refer to an SoC peripheral, not a real memory address. You would have to recompile from source with code changes. See the perf example above.

Is it something a user space application needs to do? Sounds more like driver-side stuff to me.


There is no such thing as a basic data type

Most CPUs can't natively operate on 3-byte words. If they could, then the C int on that platform would most likely be 3 bytes. That's what I mean by "C types being a low-level abstraction".


Example x86-64:
83C0FF add eax,byte +0xff ; 8-bit immediate, sign-extended by the CPU

You can't do add eax,byte +0xffffffffffffffff; you would have to load the value into a register, or have it at a memory offset and use a different instruction mode.

I fail to see how this is relevant. It's just implementation specifics. The fact remains that the CPU can operate on bytes, 16-bit words, 32-bit words, and 64-bit words. That's it. And SIMD datatypes, of course.

Pointers themselves are 48-bit in most modern 64-bit architectures.

I don't even know how to reply to this. True enough, only 48 bits are used. But the pointers themselves are still 64-bit. Good luck storing a 48-bit pointer.
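
A quick sketch of the point:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    void *p = &p;
    printf("sizeof(void *) = %zu\n", sizeof p);   /* 8 on x86-64 and on A64 */
    /* x86-64 requires "canonical" addresses: bits 63..48 must be copies of
     * bit 47. Forging the top bits of a typical user-space pointer makes it
     * non-canonical; any attempt to use it as an address faults. */
    uintptr_t forged = (uintptr_t)p | 0xFFFF000000000000u;
    printf("forged value (never dereference!): %#llx\n",
           (unsigned long long)forged);
    return 0;
}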
 
Sorry, my mistake. True, Atom is limited dual-issue. OK, then let's take the ARM Cortex-A5: single-issue, but with SIMD instructions.

Cortex-A5 is 32-bit SIMD... which is not considered wide...

I asked you: "Please provide an example of any general-purpose processor that has wide functional units"
Cortex-A5 does not have wide functional units. It is single-issue.

Regarding superscalar, let me quote what you said:
"Superscalar is a term applied to CPUs which reorder/reoptimise the instruction stream"
- This is wrong by ANY definition of the word superscalar; it has nothing to do with instruction reordering

Why are we talking about VLIW now? I thought we were talking about auto-vectorisation, like in your example. What CPU can autovectorize loops and turn non-SIMD instructions into SIMD instructions?

When did I say auto-vectorization is done at the CPU level? Compiler optimization to SIMD is exploitation of ILP; the VLIW example is to show how ARM instructions not optimized by the compiler will still be optimized by the CPU in most modern architectures.

Is it something a user space application needs to do? Sounds more like driver-side stuff to me.

So are you saying user space applications don't need drivers or use libs that rely on memory-mapped SoC features? According to you, they could all be converted to the new architecture. How would you make the user space program perf for x86-64 work on ARM without recompiling the perf kernel driver? And you are expecting all third-party drivers to port their code to the new architecture, hence proving my initial point.

Most CPUs can't natively operate on 3-byte words. If they could, then the C int on that platform would most likely be 3 bytes. That's what I mean by "C types being a low-level abstraction".

What does this have to do with 64-bit vs 32-bit? C int is completely defined by the compiler and has no bearing on arch size. That's why C99 has specific types like int16_t, int32_t, and int64_t.
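
Concretely (a small C11 sketch):

#include <stdint.h>

/* Plain 'int'/'long' depend on the data model (ILP32 vs LP64 vs LLP64);
 * the C99 <stdint.h> types do not: */
_Static_assert(sizeof(int16_t) == 2, "always 16 bits");
_Static_assert(sizeof(int32_t) == 4, "always 32 bits");
_Static_assert(sizeof(int64_t) == 8, "always 64 bits");
/* e.g. sizeof(long) is 8 on LP64 (macOS, Linux) but 4 on LLP64 (Win64) */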

I fail to see how this is relevant. It's just implementation specifics. The fact remains that the CPU can operate on bytes, 16-bit words, 32-bit words, and 64-bit words. That's it. And SIMD datatypes, of course.

There is no relevance, apart from you saying it has to do with address size and basic data size, which is wrong.

I don't even know how to reply to this. True enough, only 48 bits are used. But the pointers themselves are still 64-bit. Good luck storing a 48-bit pointer.

In the architecture they are represented in 48 bits. You can't do operations on an address which has the high bits set; it will likely result in an exception or a trap call...

I'm still not sure why you are trying to defend your original position that x86-64 code can be natively recompiled to ARM. Not only would it be performance suicide, it's also not possible, since the ARM spec doesn't even come close to supporting all the features on an Intel SoC.

There will not be an ARM Macbook Pro which runs recompiled from binary native x86-64 code, ever.

Translation from x86-64 is plain stupid, not enticing at all.
 