Ask HN: Favourite Assembly Instructions?

5 points | by RetroTechie 3 days ago ago

10 comments

gblargg 3 days ago

I always liked rlwimi on PowerPC. It rotates the source n bits, then writes any contiguous section of bits over the corresponding bits in the destination register. This allows copying any bitfield from any position in one register into another. Basically either of these:

  out = (out & ~mask) | (in << shift & mask)
  out = (out & ~mask) | (in >> shift & mask)

Z80's EXX to swap with the shadow registers was interesting (meant for fast interrupt response so you didn't have to save registers to memory).

[-]

pulvinar 3 days ago

rlwimi was a nice one, especially for emulators.

And it also had eieio, Enforce In-Order Execution of I/O.

brucehoult 3 days ago

> rlwimi / rlwinm

Definitely a nice and pretty much pioneering feature on PowerPC in 1994 (and I guess RS/6000 before that, but I never used one).

Today's Arm64 BFM does both those jobs in one, minus the ability to create a split mask via rotating, but plus adding a choice of sign or zero extension to extracted fields (including extracted to the same place they already were, for pure sign/zero extension). As a result it's got about 100 aliases.

It would be nice to have these in RISC-V but they seriously violate the quite strict "Stanford Standard RISC" 2R1W principle that keeps the RISC-V integer pipeline simple (smaller, faster, cheaper).

When working in the "B" extension working group I suggested adopting the M88000 bitfield instructions which follow the 2R1W principle. Someone had an objection to encoding both field width and offset into a single constant (or `Rs2`), though I think it's well worth it. M88k as a 32 bit ISA used 5 bits for each, but 6 bits for each for RV64 fits RISC-V's 12 bit immediates perfectly.

- ext / extu: Extract signed or unsigned bit field from a register. You specify offset (starting bit position) and width. The extracted field is right-justified (shifted to the low bits) in the destination, with sign-extension or zero-extension.

- mak: Make (insert) a bit field. Takes a value, shifts it left by the offset, and inserts it into the destination while clearing the target field first (or combining in specific ways).

- set: Set (force to 1) a contiguous bit field in a register.

- clr: Clear (force to 0) a contiguous bit field in a register.

All take `Rd`, `Rs1` and a field size:offset as either a literal or as `Rs2`.

Unfortunately, the R-type `mak` violates 2R1W because the `Rd` is also a source, which complicates OoO implementations making them 3R1W. RISC-V could use an alternative formulation in which `mak` (or some other name` masks off the source field and shifts it into place, and then the insert is completed using `clr` and `or`.

On the other hand the forms with 12 bit literals are expensive in encoding space, but even including just the `Rs2` versions would be great, especially as often several instructions in a row can use the same field specification, which fits `addi Rd,zero,imm12` (aka `li`) perfectly.

On the gripping hand, while the immediate version of `mak` violates RISC-V convention by making the `Rd` also a source, any real pipeline is going to have fields for all of `Rd`, `Rs1`, `Rs2`, and `imm32` so only the decoder is affected.

Also, `ext` / `extu` are not needed as a pair of C-extension shifts do the same job with the same code size, and can be decoded into a single µop on a higher end CPU if desired.

As an example: take a 10 bit field at offset 21 and insert into a destination at offset 1 (this is part of decoding RISC-V J/JAL instructions).

PowerPC:

    rlwimi  r4, r3, 11, 1, 10

Arm64:

    ubfx   x2, x0, #21, #10      # extract bits[30:21] → low 10 bits of x2 (unsigned)
    bfi    x1, x2, #1, #10       # insert those 10 bits into x1 starting at bit 1

Alternatively, using `bfm` directly without aliases (exactly the same instructions, just trickier to get right)

    bfm    x2, x0, #21, #30
    bfm    x1, x2, #63-1, #9

M88k:

    extu   r3, r1, 21, 10        # extract 10-bit field starting at bit 21 → low bits of r3
    mak    r2, r3, 1, 10         # make/insert the field at bit 1 in destination

RISC-V:

    srli   x12, x10, 21          # shift field down to low bits
    andi   x12, x12, 0x3FF       # mask to 10 bits
    slli   x12, x12, 1           # position at bit 1 (for imm[10:1])
    li     x13, ~0x7FE           # mask to clear bits [10:1] only
    and    x11, x11, x13
    or     x11, x11, x12         # insert the field

RISC-V with some M88k inspiration:

    extui  r3, r1, 21, 10        # extract 10-bit field starting at bit 21 → low bits of r3
    maki   r4, r3, 1, 10         # modified mak: masks + shifts field to bits [10:1] (others 0)
    clri   r2, 1, 10             # clear the target field in destination
    or     r2, r2, r4            # insert the prepared field

Alternatively

    li     t0, (1<<6) | 10       # specification for insertion bit field
    srli   a3, a1, 21            # shift 10-bit field starting at bit 21 → low bits of r3
    mak    a4, a3, t0            # modified mak: masks + shifts field to bits [10:1] (others 0)
    clr    a2, t0                # clear the target field in destination
    or     a2, a2, r4            # insert the prepared field

Alternatively:

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`

Again, this last formulation of `maki` violates RISC-V instruction format convention in making `a2` both src and dst, BUT if the decoder handles that then the expanded form does NOT cause any issues with the pipeline implementation.

[-]

camel-cdr 2 days ago

bitfield insert/extract was also looked at by the scalar efficiency SIG: https://lists.riscv.org/g/sig-scalar-efficiency/topic/115060...

IIRC it didn't go anywere, because it wasn't worth the encoding space.

But a rlwimi sounds like a good candidate for >32b encoding.

[-]

brucehoult 2 days ago

Both the PowerPC and Arm64 instructions do grab a lot of encoding space.

rlwimi uses 26 bits of opcode space (i.e. 2^26 = 64M code points). In a RISC-V context you can drop the Rc (set status flags) bit, but for RV64 you need to expand the shift/start/end fields from 5 to 6 bits, so you end up needing 28 bits of encoding space, 18 for the field spec and 5 each for Rd1 and Rd/Rs2.

A RISC-V major opcode, such as OP-IMM (which this effectively is, but with a R/W Rd/Rs2) only has 2^25 bits of encoding space for all instructions in total!

PPC64's rldimi expands shift and size to 6 bits each but drops the ability to take the source field from an arbitrary position but only from the LSBs, and so uses 23 encoding bits. i.e. exactly my proposed RISC-V instruction (except for the set flags bit, so 22 bits).

Arm64's BFM/SBFM effectively uses 24 bits to provide both 32 bit and 64 bit operations — there are 25 bits but `sf` and `N` must be the same, potentially allowing the other half of the code points (plus the ones for 32 bit with the MSBs of `immr` and `imms` set) to be used for something else in future. Note that BFM leaves all other bits in the dst unchanged, while SBFM both sign-extends into the higher bits of dst AND zeros the lower bits of DST.

So BFM/SBFM *could* be fit into RISC-V, taking up half of a major opcode, of which there aren't many left. That is a pretty huge amount — the enormous V extension takes 1 1/2 major opcodes, for far more functionality. It would free up various immediate shifts and sign/zero extension instructions, but those don't take much encoding space, no more than 16 bits each.

As nice as they are, it's hard to avoid a conclusion that both (32 bit) PowerPC and Arm64 spend too much opcode space on these.

I think PPC64's `rldimi` and M88K's `mak` (extended to 64 bits) and my last RISC-V suggestion — which are all effectively the same thing — hit the right tradeoff, not using excessive encoding space but allowing a 2-instruction sequence for that bit field move):

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`

That's 22 bits of opcode space, the same as any one of `addi`, `andi`, `ori`, `xori`, `slti`, `sltiu` (OP-IMM) or `addiw` (OP-IMM-32).

The original RV64GC has 5/8 funct3 encodings in OP-IMM-32 unused, which `maki` (or call it `bfi` or whatever) could have used one of. It has a combined `Rd`/`Rs2` field which is unusual in full size 4-byte RISC-V instructions, but not unprecedented: the V extension does that for multiply-add instructions.

I don't immediately see any ratified or currently-proposed extension using this space.

[-]

gblargg 2 days ago

What would justify using this significant space for them these days? Video encoding/decoding in software seems like the most likely candidate, since there's a lot of bitfield packing and high data volume.

(Thanks for your elaboration on various architectures. It's an interesting glimpse into what goes in in allocating opcode space on fixed-length instruction machines.)

[-]

brucehoult 2 days ago

My example is applicable to compiler / assembler / JIT / emulator.

The performance of conventional compilers and assemblers is not important to anyone but developers, but everyone uses JavaScript / WebAsm all the time. And QEMU can be important too (e.g. in docker for non-native ISAs, using binfmt_misc).

I guess I should point out in the proposed RISC-V example, it's 6 bytes of code as the initial shift can be a 2-byte "C" extension instruction. So that's slightly smaller code than everything except 32 bit PowerPC, which is another important aspect. Arm64 and M68k use 8 bytes of code.

Oh! I just realised standard RISC-V can be improved in this case (but not by so much in the general case).

    srli   x12, x10, 20          # shift field down to correct position
    andi   x12, x12, 0x7FE       # mask to 10 bits
    andi   x11, x11, ~0x7FE      # clear space in the destination
    or     x11, x11, x12         # insert the field

That's just 12 bytes of code.

In the more general case you need a `lui` or `lui;andi` pair to load the mask into a register, and then register to register ops, for 14 bytes total.

Note that x86_64 needs four instructions and 14 bytes of code, so no better than RISC-V.

camel-cdr 2 days ago

pext/pdep are incredible, I'm hoping to see them in more SIMD ISAs in the future.

But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, buy I care about the tranapose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the invers of a permutation, very fast histograming/binning

iefbr14 2 days ago

BR 14 (IBM assembler)

Because it has a utility named after it. Hence my name.

absynth 3 days ago

HCF - Halt and Catch Fire.