This reminds me that at some point I should write up my exploration of the x86 encoding scheme, because a lot of the traditional explanations tend to be overly focused on how the 8086 would have decoded instructions, which isn't exactly the same way you look at them on a modern processor.
I actually have a tool I wrote to automatically derive an x86 decoder from observing hardware execution (based in part on sandsifter, so please don't ask me if I've heard of it), and it turns out to be a lot simpler than people make it out to be... if you take a step back and ignore some of what people have said about the role of various instruction prefixes (they're not prefixes, they're extra opcode bits).
(FWIW, this is fairly dated in that it doesn't cover the three-byte opcodes, or the 64-bit prefixes that were added, like the REX and VEX prefixes).
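A concrete example of the "extra opcode bits" view, in case it helps (a minimal sketch, not the parent's tool; the mnemonics are the standard SSE ones): for the 0F 58 opcode, the 66/F2/F3 bytes don't modify the instruction, they select a different one, so a decoder can just fold them into the lookup key.

    /* Minimal sketch of "prefixes as opcode bits": the mandatory prefix is
       part of the opcode lookup key, not a modifier applied afterwards. */
    #include <stdint.h>
    #include <stdio.h>

    static const char *decode_0f58(uint8_t mandatory_prefix) {
        switch (mandatory_prefix) {
        case 0x00: return "addps";   /* 0F 58, no prefix */
        case 0x66: return "addpd";   /* 66 0F 58         */
        case 0xF3: return "addss";   /* F3 0F 58         */
        case 0xF2: return "addsd";   /* F2 0F 58         */
        default:   return "?";
        }
    }

    int main(void) {
        printf("%s\n", decode_0f58(0x66));  /* 66 0F 58 C1 -> addpd xmm0, xmm1 */
        return 0;
    }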
I handwrote some C code (years ago) to do parallel x86 decode in a realistic manner (= similar to how a modern CPU would do it). Was a lot easier than I feared. I didn't complete it, it was just exploratory to see what it would look like.
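For anyone curious what that looks like, here's a rough sketch of the idea (mine, not the parent's code, and the length function is a toy that only knows a handful of opcodes): compute a length at every byte offset of a fetch window independently, which is the part hardware does in parallel, then walk the chain to pick out the real instruction starts.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy length decoder: only enough opcodes for the demo, with none of the
       prefix/ModRM/SIB handling a real one would need. */
    static int toy_len(uint8_t op) {
        if (op == 0x90) return 1;                 /* nop            */
        if (op >= 0x50 && op <= 0x5F) return 1;   /* push/pop reg   */
        if (op == 0xC3) return 1;                 /* ret            */
        if (op >= 0xB8 && op <= 0xBF) return 5;   /* mov reg, imm32 */
        return 1;
    }

    int main(void) {
        /* push rbp; mov eax, 1; pop rbp; ret; nop padding */
        uint8_t win[16] = {0x55, 0xB8, 1, 0, 0, 0, 0x5D, 0xC3,
                           0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90};
        int len[16];
        for (int i = 0; i < 16; i++)          /* independent per offset */
            len[i] = toy_len(win[i]);
        for (int i = 0; i < 16; i += len[i])  /* serial pick of starts  */
            printf("insn at offset %2d, %d byte(s)\n", i, len[i]);
        return 0;
    }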
This is so relevant for me!
I spent some time last weekend on a small side project which involves JIT encoding ARM64 instructions to run them on Apple Silicon.
I’ve written assembly before, but encoding was always kind of black magic.
How surprised was I to learn how simple instruction encoding is on arm64! Arguably simpler than encoding wasm to bytecode, which I played with a while ago.
If you want to play with this, based on my very very limited experience so far, I’d suggest starting with arm: fixed-length 4-byte instructions, a nice register naming scheme, and straightforward encoding of arguments make it very friendly.
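To give a flavour of how friendly it is (a minimal sketch, not the poster's JIT; I checked the two encodings against an assembler): every A64 instruction is one 32-bit word and the operands are fixed bit fields, so an encoder is mostly shifts and ORs.

    #include <stdint.h>
    #include <stdio.h>

    /* ADD Xd, Xn, Xm (shifted register, LSL #0) */
    static uint32_t enc_add_xxx(unsigned rd, unsigned rn, unsigned rm) {
        return 0x8B000000u | (rm << 16) | (rn << 5) | rd;
    }

    /* MOVZ Xd, #imm16 (LSL #0) */
    static uint32_t enc_movz_x(unsigned rd, unsigned imm16) {
        return 0xD2800000u | (imm16 << 5) | rd;
    }

    int main(void) {
        printf("%08x\n", enc_add_xxx(0, 1, 2)); /* add x0, x1, x2 -> 8b020020 */
        printf("%08x\n", enc_movz_x(0, 1));     /* mov x0, #1     -> d2800020 */
        return 0;
    }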
> I’d suggest starting with arm
I agree: AArch64 is a nice instruction set to learn. (Source: I taught ARMv7, AArch64, x86-64 to first-year students in the past.)
> how simple instruction encoding is on arm64
Having written encoders, decoders, and compilers for AArch64 and x86-64, I disagree. While AArch64 is, in my opinion, very well designed (also better than RISC-V), it's certainly not simple. Here are some of my favorite complexities:
- Many instructions have (sometimes very) different encodings. While x86 has a more complex encoding structure, most encodings follow the same structure and are therefore remarkably similar.
- Huge number of instruction operand types: memory + register, memory + unsigned scaled offset, memory + signed offset, optionally with pre/post-increment, but every instruction supports a different subset; vector, vector element, vector table, vector table element; sometimes a general-purpose register encodes the stack pointer, sometimes the zero register; various immediate encodings; ...
- Logical immediate encoding. Clever, but also very complex; see the decoding sketch after this list. (To be sure that I implemented the decoding correctly, I brute-force test all inputs...)
- Register constraints: MUL (by element) with 16-bit integers has a register constraint on the lowest 16 registers. CASP requires an even-numbered register. LD64B requires an even-numbered register less than 24 (it writes Xt..Xt+7).
- Many more instructions: AArch64 SIMD (even excluding SVE) has more instructions than x86 including up to AVX-512. SVE/SME takes this to another level.
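Since the logical immediates come up a lot, here's roughly what the decode looks like (a sketch in the spirit of DecodeBitMasks from the Arm ARM, 64-bit only; the helper name and test value are mine):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the N:immr:imms fields of a logical immediate into a 64-bit
       mask. Returns 0 for reserved encodings. */
    static int decode_logical_imm(unsigned n, unsigned immr, unsigned imms,
                                  uint64_t *out) {
        unsigned combined = (n << 6) | (~imms & 0x3F);
        if (combined == 0) return 0;                  /* reserved           */
        unsigned len = 31 - __builtin_clz(combined);  /* highest set bit    */
        unsigned esize = 1u << len;                   /* element size 2..64 */
        unsigned levels = esize - 1;
        unsigned s = imms & levels, r = immr & levels;
        if (s == levels) return 0;                    /* all-ones: reserved */
        uint64_t elem = (1ull << (s + 1)) - 1;        /* s+1 consecutive 1s */
        if (r)                                        /* rotate right by r  */
            elem = ((elem >> r) | (elem << (esize - r))) &
                   (esize == 64 ? ~0ull : (1ull << esize) - 1);
        uint64_t mask = 0;                            /* replicate element  */
        for (unsigned i = 0; i < 64; i += esize) mask |= elem << i;
        *out = mask;
        return 1;
    }

    int main(void) {
        uint64_t m;
        if (decode_logical_imm(0, 0, 0x33, &m))       /* esize 8, 4 ones, ror 0 */
            printf("%016llx\n", (unsigned long long)m);  /* 0f0f0f0f0f0f0f0f */
        return 0;
    }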
A32 is simpler, but looking at A64 instructions certainly begs the question "they still call this a RISC?"
x86 is an octal machine (1995): https://gist.github.com/seanjensengrey/f971c20d05d4d0efc0781...
Discussed here: https://news.ycombinator.com/item?id=30409100
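For anyone who hasn't read it: the short version is that the ModRM byte is a literal 2-3-3 split and the classic one-byte opcodes group nicely by octal digit. A tiny illustration (example bytes are mine):

    #include <stdio.h>

    int main(void) {
        unsigned char insn[2] = {0x89, 0xC3};  /* mov ebx, eax */
        unsigned op = insn[0], modrm = insn[1];
        /* 0x89 = 0o211, one of the 0o21x MOV forms (store register into r/m) */
        printf("opcode %03o, mod=%o reg=%o rm=%o\n",
               op, modrm >> 6, (modrm >> 3) & 7, modrm & 7);
        /* prints: opcode 211, mod=3 reg=0 rm=3  (reg 0 = eax, rm 3 = ebx) */
        return 0;
    }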
I've memorised most of the 1st-page instructions - in octal - and it's easier than it sounds.
Originally, yes. These days, not so much. It's "whatever bit confetti was necessary to squeeze in all the opcode/operand bits".
By "these days" you mean "since AMD messed it up".
This uses frames and the link is to the inner frame; maybe it should rather link to https://www-user.tu-chemnitz.de/~heha/hs/chm/x86.chm/ . Nevertheless, this is a nice looking website.
I was recently working on some x86 emulation code. This is one of the best links that I found to summarize how it works, skipping the giant Intel instruction set references.
It’s been a little bit since I watched it, but I recall this playlist being useful/interesting:
https://youtube.com/playlist?list=PLJRRppeFlVGIvcTQNISPTxvNm...
Here is some explanation from the source plus some code that can encode/decode x86 instructions in software
It would be really nice to have something like this for the x86-64 variant.
The same site hosts [1], but that's not nearly as nice as the 32-bit version. It's also a bit outdated.
[1]: https://www-user.tu-chemnitz.de/~heha/hs/chm/x86.chm/x64.htm
Thanks. Looks like the original now has some clarifications, including more detail regarding the REX prefixes: https://wiki.osdev.org/X86-64_Instruction_Encoding
sandpile.org is your friend.
X86 feels painful. So many instructions wind up being decoded by physical hardware, eating up unnecessary die space and electricity, all to save RAM and storage space, which is now abundant and cheap compared to when x86 was designed.
The instruction decoder was a large part of the die in 1985. Today you won't be able to identify it in a die photo. In a world with gigantic vector register files, the area used by decode simply is not relevant. Anyway, x86 does not save storage space: x86_64 code tends to be larger than ARMv8 code.
All the various bits that get tacked on for prefetch and branch prediction are fairly large too, given the amount of caching involved, and I think that's often what people are actually accounting for when they measure decode power usage. That’s going to be the case in any arch besides something like a DSP without any kind of dynamic dispatch.
I think it's safe to say that a modern x86 branch predictor with its BTBs is significantly larger than the decode block.
Sure, but branch prediction is (as far as we know) a necessary evil. Decode complexity simply isn't.
Right, but decode complexity doesn't matter because of the giant BTB and such. At least that's what I understand.
For the cores working hardest to achieve the absolute lowest CPI running user code, this is true. But these days the computers have computers in them to manage the system, and these kinds of statements aren’t necessarily true for those “inner cores” that aren’t user accessible.
“ RTKit: Apple's proprietary real-time operating system. Most of the accelerators (AGX, ANE, AOP, DCP, AVE, PMP) run RTKit on an internal processor. The string "RTKSTACKRTKSTACK" is characteristic of a firmware containing RTKit.”
https://asahilinux.org/docs/project/glossary/#r
And those cores do not run x86.
You say it's irrelevant, but that's not the same as being necessary. These decode components are simply not necessary, whereas branch prediction actually makes the processor faster.
I asked Grok to write recursive Fibonacci in assembly. It worked OK, but the C version was faster. I asked about it, but the explanation was non-trivial and beyond my comprehension.
Anyways. This means that learning ANY computer language is a waste of time. Soon.
You have the proof right there that it will never be a waste of time. You can't understand why the C one is faster, and someone who does will be superior to a machine, because they can apply this learning, context, and much more, to solve really tough problems.
If it were a real superhuman AI, it would use the closed-form expression:
https://en.wikipedia.org/wiki/Fibonacci_sequence
Writing Fibonacci in assembly as a recursive function with C-like calling conventions is like asking Superman to mainline Kryptonite; I'd expect an assembly implementation to look like a BASIC version that calculates iteratively.
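For comparison, the iterative version is just two values and a loop; in assembly that's a couple of registers, with no stack frames or calling convention at all (minimal sketch, my code):

    #include <stdio.h>

    static unsigned long long fib_iter(unsigned n) {
        unsigned long long a = 0, b = 1;  /* fib(0), fib(1) */
        while (n--) {
            unsigned long long t = a + b;
            a = b;
            b = t;
        }
        return a;
    }

    int main(void) {
        printf("%llu\n", fib_iter(40));   /* 102334155 */
        return 0;
    }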
You have a really sad view on what constitutes a waste of time.