RISC-V Conditional Moves

(corsix.org)

32 points | by gok 3 days ago

11 comments

sylware 3 days ago

This is implemented with instruction fusion. What's needed is to properly document and publish what will end up being the "standard instruction fusion patterns" (like the div/rem one).

Adding more instructions is kind of counterproductive for a R(educed)ISC ISA. It has to be weighed with extreme care. Compressed instructions went through for the sake of code density (marketing vs. Arm Thumb instructions).

In the end, programs will probably want to stay conservative and will implement only the core ISA, at best giving some love to some instruction fusion patterns and that's it, unless they are knowingly built for a specific RISC-V hardware implementation.

mort96 5 minutes ago

> In the end, programs will probably want to stay conservative and will implement only the core ISA

This is probably not the case. The core ISA doesn't include floating point, it doesn't include integer multiply, it doesn't include integer divide, it doesn't include atomic instructions or `fence.i` (the plain `fence` is in the base ISA, but the instruction-fetch fence lives in Zifencei).

What has happened is that most compilers and programs for "normal desktop/laptop/server/phone class systems" all have some baseline set of extensions. Today, this is more or less what we call the "G" extension collection (which is short-hand for IMAFD_Zicsr_Zifencei). Though what we consider "baseline" in "normal systems" will obviously evolve over time (just like how SSE is considered a part of "baseline amd64" these days whereas it was once a new exotic extension).

vardump 3 hours ago

Instruction fusion still means lower code density. You can go overboard, but the newer ARM instruction set(s) are pretty good.

duskwuff 28 minutes ago

As an aside: it's only relevant on microcontrollers nowadays, but ARM T32 (Thumb) code density is really good. Most instructions are 2 bytes, and it's got some clever ways to represent commonly used 32-bit values in 12 bits:

https://developer.arm.com/documentation/ddi0403/d/Applicatio...
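
For reference, here is a rough Python sketch of that 12-bit "modified immediate" expansion, written from memory of the ThumbExpandImm pseudocode in the Arm manuals, so check it against the linked documentation before relying on it. The function name `thumb_expand_imm` is mine, and the carry-out that the real pseudocode also produces is ignored here.

```python
def thumb_expand_imm(imm12):
    """Expand a 12-bit T32 modified immediate into a 32-bit constant.

    Sketch only: models the value, not the carry flag.
    """
    if not 0 <= imm12 < (1 << 12):
        raise ValueError("imm12 must fit in 12 bits")
    imm8 = imm12 & 0xFF
    if (imm12 >> 10) == 0:
        # Top two bits are 00: bits 9:8 pick a byte-replication pattern.
        mode = (imm12 >> 8) & 0b11
        if mode == 0:
            return imm8                        # 0x000000XY
        if imm8 == 0:
            # The architecture treats these encodings as UNPREDICTABLE.
            raise ValueError("replicated pattern with imm8 == 0")
        if mode == 1:
            return (imm8 << 16) | imm8         # 0x00XY00XY
        if mode == 2:
            return (imm8 << 24) | (imm8 << 8)  # 0xXY00XY00
        return imm8 * 0x01010101               # 0xXYXYXYXY
    # Otherwise: an 8-bit value with its top bit forced to 1,
    # rotated right by the 5-bit amount in bits 11:7 (always >= 8 here).
    unrotated = 0x80 | (imm12 & 0x7F)
    rot = imm12 >> 7
    return ((unrotated >> rot) | (unrotated << (32 - rot))) & 0xFFFFFFFF
```

So `thumb_expand_imm(0x3AB)` gives `0xABABABAB`, and `thumb_expand_imm(0x400)` gives `0x80000000`.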

Findecanor 2 hours ago

Not necessarily lower density. On ARM you would often need cmp and csel, which are two instructions, eight bytes.

RISC-V has cmp-and-branch in a single instruction, which with c.mv normally makes six bytes. If the cmp-and-branch instruction tests one of x8..x15 against zero then that could also be a compressed instruction: making four bytes in total.
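
Spelling out that byte arithmetic as a quick Python tally. The size table is hand-written for just these mnemonics, with `beq` standing in for any uncompressed compare-and-branch and `c.beqz` for the compressed test-against-zero form:

```python
# Encoding sizes in bytes. AArch64 instructions are always 4 bytes;
# RISC-V base instructions are 4 bytes and "C" extension ones are 2.
SIZE = {
    "cmp": 4, "csel": 4,   # AArch64
    "beq": 4,              # RISC-V compare-and-branch, uncompressed
    "c.beqz": 2,           # RISC-V compressed branch-if-zero (x8..x15 only)
    "c.mv": 2,             # RISC-V compressed register move
}

def code_size(mnemonics):
    """Total encoded size in bytes of a sequence of mnemonics."""
    return sum(SIZE[m] for m in mnemonics)

arm64 = code_size(["cmp", "csel"])         # 8
riscv = code_size(["beq", "c.mv"])         # 6
riscv_best = code_size(["c.beqz", "c.mv"]) # 4
```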

astrange 2 hours ago

ARMv8.7 added some new instructions for int min/max to replace cmp+csel. (I'm surprised it took them so long to add popcnt.)

https://www.corsix.org/content/arm-cssc

Pet_Ant 3 hours ago

Compressed instructions are also for microcontroller use. RISC-V, rightly or wrongly, is trying to be an ISA that can handle the whole stack from embedded microcontrollers to top-end servers.

As such, there are compromises for both aims.

Dwedit 2 hours ago

In 32-bit ARM, nearly every instruction could be made conditional.

brucehoult 3 days ago

> some SiFive cores implement exactly this fusion.

I was not able to open the given link, but it's not true, at least for the U74.

Fusion means that two or more instructions are converted to one internal instruction (µop).
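
To make the definition concrete, here is a minimal Python sketch of what a fusing decoder does in that sense. The pattern table and the mnemonics in it are illustrative assumptions, not any real core's list, and a real decoder would also have to check operands (register overlap, immediates) before fusing.

```python
# Hypothetical table of adjacent instruction pairs the decoder may fuse.
FUSIBLE_PAIRS = {("cmp", "bcc"), ("div", "rem")}

def fuse(instrs):
    """Turn a list of mnemonics into a list of µops (tuples).

    A matching adjacent pair becomes ONE µop holding both instructions;
    everything else becomes a single-instruction µop.
    """
    uops = []
    i = 0
    while i < len(instrs):
        pair = tuple(instrs[i:i + 2])
        if len(pair) == 2 and pair in FUSIBLE_PAIRS:
            uops.append(pair)          # two instructions, one µop
            i += 2
        else:
            uops.append((instrs[i],))  # unfused
            i += 1
    return uops
```

Here `fuse(["add", "cmp", "bcc", "sub"])` yields three µops for four instructions. The U74 behaviour described next is different: both instructions still occupy their own pipeline slot.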

SiFive's optimisation [1] of a short forward conditional branch over exactly one instruction has both instructions executing as normal, the branch in pipe A and the other instruction simultaneously in pipe B. At the final stage if the branch turns out to be taken then it is not in fact physically taken, but is instead implemented by suppressing the register write-back of the 2nd instruction.

There is only a limited set of instructions that can be the 2nd instruction in this optimisation, and loads and stores do not qualify. Only simple register-register or register-immediate ALU operations are allowed, including `lui` and `auipc` as well as C aliases such as `c.mv` and `c.li`.

> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.

The presented code ...

      mv rd, x0
      beq rs2, x0, skip_next
      mv rd, rs1
    skip_next:

... vs ...

    czero.eqz rd, rs1, rs2

... requires that not only rd != rs2 (as stated) but also that rd != rs1. A better implementation is ...

      mv rd, rs1 // safe even if they are the same register
      bne rs2, x0, skip
      mv rd, x0
    skip:

The RISC-V memory consistency model does not come into it, because there are no loads or stores.
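
The register-aliasing claims above are easy to check mechanically. Below is a small Python model, assuming a register file held in a list indexed by register number; the function names and the test values are mine, and only x1..x3 are used as operands so x0 never needs special handling.

```python
from itertools import product

def czero_eqz(regs, rd, rs1, rs2):
    """Reference: czero.eqz rd, rs1, rs2 -> rd = 0 if rs2 == 0 else rs1."""
    out = list(regs)
    out[rd] = 0 if regs[rs2] == 0 else regs[rs1]
    return out

def seq_article(regs, rd, rs1, rs2):
    """mv rd, x0 ; beq rs2, x0, skip_next ; mv rd, rs1"""
    r = list(regs)
    r[rd] = 0               # mv rd, x0
    if r[rs2] != 0:         # beq not taken
        r[rd] = r[rs1]      # mv rd, rs1
    return r

def seq_better(regs, rd, rs1, rs2):
    """mv rd, rs1 ; bne rs2, x0, skip ; mv rd, x0"""
    r = list(regs)
    r[rd] = r[rs1]          # mv rd, rs1
    if r[rs2] == 0:         # bne not taken
        r[rd] = 0           # mv rd, x0
    return r

def agrees(seq, rd, rs1, rs2):
    """True if seq matches czero.eqz on every test register file."""
    for x1, x2, x3 in product((0, 5), (0, 7), (0, 9)):
        regs = [0, x1, x2, x3]  # x0..x3
        if seq(regs, rd, rs1, rs2) != czero_eqz(regs, rd, rs1, rs2):
            return False
    return True
```

With this model, `agrees(seq_article, 1, 1, 3)` is False while `agrees(seq_better, 1, 1, 3)` is True. Both sequences still disagree with `czero.eqz` when rd is the same register as rs2, e.g. `agrees(seq_better, 1, 2, 1)` is False.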

Then switching to code involving loads and stores is completely irrelevant:

      lw x1, 0(x2)
      bne x1, x0, next
    next:
      sw x3, 0(x4)

First of all, this code is completely crazy because the `bne` is a fancy kind of `nop` and a core could convert it to a canonical `nop` (or simply drop it).

Even putting the `sw` between the `bne` and the label is ludicrous. There is no branch-free code that does the same thing -- not only in RISC-V but also in arm64 or amd64. SiFive's optimisation will not trigger with a store in that position.

[1] SiFive materials consistently describe it as an optimisation, not as fusion, e.g. in the description of the chicken bits CSR in the U74 core complex manual.

sxzygz 3 days ago

Thanks for your input. I didn’t know what to make of the article.

brucehoult 2 days ago

Having taken a second look, this article does in fact have a point, but it is actually nothing at all to do with conditional moves in the RISC-V instruction set Zicond extension -- or amd64 or arm64 style conditional moves either, if they were added at some point.

It is not even about RISC-V but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86. I'm not as familiar with the AArch64 memory model, but I think this probably also applies to it.

The point here is that if an aggressive implementation wants to implement instruction fusion that removes conditional branches (or indirect branches) to make a branch-free µop -- for example, to turn a conditional branch over a move into something similar to the `czero` instruction -- then in order to maintain memory ordering AS SEEN BY A DIFFERENT CORE the fused µop has to also have `fence r,w` properties.
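
A toy illustration of why the other core can tell the difference, using the classic "load buffering" shape: core 0 loads x then stores y, core 1 loads y then stores x, both locations start at 0. This Python sketch is NOT RVWMO; it just enumerates interleavings of the four memory operations, with a flag (`load_before_store`, my name) saying whether each core's load must stay ahead of its own store, which is what the branch dependency or a `fence r,w` would guarantee.

```python
from itertools import permutations

def lb_outcomes(load_before_store):
    """Return the set of (r0, r1) results the two cores can observe.

    r0 is what core 0 loaded from x, r1 is what core 1 loaded from y.
    """
    events = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
    outcomes = set()
    for order in permutations(events):
        if load_before_store and any(
            order.index((core, "load")) > order.index((core, "store"))
            for core in (0, 1)
        ):
            continue  # this interleaving reorders a load after its store
        mem = {"x": 0, "y": 0}
        loaded = {}
        for core, kind in order:
            # core 0: load x, store y; core 1: load y, store x
            load_addr, store_addr = ("x", "y") if core == 0 else ("y", "x")
            if kind == "load":
                loaded[core] = mem[load_addr]
            else:
                mem[store_addr] = 1
        outcomes.add((loaded[0], loaded[1]))
    return outcomes
```

`lb_outcomes(True)` never contains `(1, 1)`, while `lb_outcomes(False)` does: once the ordering between the load and the later store is dropped, each core can see the other's store "before" its own load, and that is the outcome a fused µop without `fence r,w` properties would wrongly permit.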

That is all.

It is irrelevant to this whether the actual RISC-V instruction set has a conditional move instruction, or the properties it has if it exists.

It is irrelevant to the situation where a human programmer or a compiler might choose to transform branchy code into branch-free code. They have a more global view of the program and can make sure things make sense. A CPU core implementing fusion has only a local view.

Finally, I'll note that instruction fusion is at present hypothetical in RISC-V processors that you can buy today, while it has been used in both x86 and Arm chips for a long time.

Intel's "Core" µarch had fusion of e.g. `cmp;bCC` sequences in 2006, while AMD added it with Bulldozer in 2011. Arm introduced a limited capability -- `CMP r0, #0; BEQ label` is given as an example -- in A53 in 2012 and A57, A72 etc expanded the generality.

Upcoming RISC-V cores from companies such as Ventana and Tenstorrent are believed to implement instruction fusion for some cases.

Just for completeness, I'll again repeat that SiFive's U74 optimises execution of a conditional branch and a following simple ALU instruction that execute simultaneously in two pipelines, but this is NOT fusion into a single µop.