FireStorm Xcond Extension — Conditional Execution Specification
Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companions: FireStorm Xcrisp Extension, FireStorm Xstack Extension
1. Overview
The Xcond extension adds full conditional execution ("predication") to FireStorm's R-type ALU instructions. Every standard R-type operation — ADD, SUB, AND, OR, XOR, shifts — gains an opt-in predicated form that executes only when a runtime condition holds, with the destination register additionally serving as a source.
Xcond is wide-mode-only: the encoding consumes bits from the 36-bit fetch word's extension nibble that simply do not exist in narrow-mode 32-bit fetches. The cost in wide mode is one bit (the previously-reserved spare bit in the R-type extension nibble) to signal "this instruction is predicated"; once that bit is set, the rs2 field plus the ext_rs2 bit are repurposed as a 6-bit condition specifier, giving 8 test modes × 8 conditions.
When the predicate-enable bit is set, two new operations join the standard ALU set:
- MOV-cond (rd = rs1 if condition): replaces SLT in the predicated funct3 map, enabling general conditional assignment.
- RSUB-cond (rd = rs1 - rd if condition): replaces SLTU, enabling single-instruction abs and reverse-subtract patterns.
1.1 Wins
Predication eliminates short conditional branches in inner loops. The pay-off depends on the data:
- Hot accumulators with data-dependent predicates (e.g., if (x[i] > 0) sum += x[i]) collapse the conditional update to 1 instruction per iteration with no branch. For random-sign data, the mispredict penalty on standard RV64 dominates the inner loop; Xcond removes it entirely. Typical estimated gain: ~1.5× to ~2× per iteration depending on data and pipeline depth (§6.1).
- Modulo / wrap-around counters (e.g., if (counter ≥ LIMIT) counter -= LIMIT) become 2 instructions instead of 3 including a branch (§6.2).
- Saturating decrement / increment (e.g., if (counter > 0) counter--) becomes 1 instruction instead of 2, whether branching or branchless (§6.3).
- abs / negate-if-negative becomes 1 instruction via RSUB-cond, beating both the standard xor/sub bit-twiddle (3 instructions) and the Zbb neg/max form (2 instructions).
- DSP inner loops combining Xcrisp auto-inc loads with Xcond predicated accumulators see compound wins. See §6.5 for a worked L1-norm example: up to ~33% inner-loop instruction reduction and no mispredict.
1.2 Non-Goals
- Not a vector / SIMD predicate mask. Each Xcond instruction tests one condition for one ALU op. Per-lane masking belongs in a future vector extension.
- Not general select. General y = cond ? a : b with three independent registers does not fit the predicated R-type encoding (rd doubles as test operand, leaving only rs1 for the source). The standard Zicond extension covers that case, and Xcond and Zicond are complementary (§2.1, §10.4).
- Not predicated loads/stores. The I-type format has no rs2 field to repurpose, and the S-type rs2 field carries the store data. A predicated-immediate form is possible (steal bits from imm[11:0]) and is deferred to v0.2 if usage data justifies it.
- Not narrow-mode. Xcond is unavailable when fetching from 32-bit-wide DDR3 memory because the enable bit lives in the extension nibble. Code that must run in narrow mode falls back to Zicond or standard branchy code.
2. Relationship to Standard RISC-V
Xcond does not modify any standard RISC-V instruction encoding. In narrow mode, FireStorm executes pure RV64GC; the predicate-enable bit simply does not exist in 32-bit fetches. In wide mode, the predicate-enable bit lives in a position that is otherwise reserved-zero in the parent FireStorm ISA's R-type extension nibble (§5.1 of the parent doc). Existing R-type instructions emitted by a standard RV64 compiler will have this bit zero and execute unchanged.
The mxcond CSR (§7) advertises presence and version.
2.1 Coexistence with Zicond
The ratified RISC-V Zicond extension provides czero.eqz and czero.nez. czero.eqz writes zero to rd if the test register (rs2) is zero and otherwise copies rs1; czero.nez writes zero to rd if the test register is non-zero and otherwise copies rs1. Zicond synthesises general select via two czero ops and an OR.
Xcond and Zicond differ in scope:
| Aspect | Zicond | Xcond |
|---|---|---|
| Encoding format | R-type, full registers | R-type, rd doubles as source |
| Test value | One register (vs 0) | rd, rs1, or virtual computation of both |
| Condition richness | EQ/NE only | 8 conditions × 8 test modes |
| Available in narrow mode | Yes | No |
| General c ? a : b select | Yes (3 ops) | No (rd is consumed) |
| Fused predicate-with-ALU | No (czero + separate ALU op) | Yes (predicate + ALU in one instruction) |
Zicond is implemented when narrow-mode code needs it. Xcond is the wide-mode form, with broader operation set and richer condition encoding but no general select. Code may use both; they share no encoding space.
2.2 Coexistence with Zbb
The Zbb extension provides min, max, minu, maxu and other bit-manipulation primitives. For 2-register min/max patterns, Zbb is already optimal:
min a, a, b ; Zbb: a = min(a, b)
Xcond's MOV-cond can express the same pattern (MOV.cond a, b, mode=010/cond=GT) in 1 instruction — a tie. Xcond is not preferred for min/max specifically.
Xcond wins over Zbb where Zbb has no direct equivalent: abs (via RSUB-cond), conditional accumulate (via ADD-cond with predicate on input), modulo wrap-around (via SUB-cond with predicate on result), and any pattern where the modified register is not also the comparison operand.
3. Encoding
3.1 Predicate-Enable Bit
The R-type extension nibble in wide mode is:
    35       34       33       32
+--------+--------+--------+--------+
| spare | ext_rd | ext_rs1| ext_rs2|
+--------+--------+--------+--------+
Xcond claims bit 35 (the spare bit) as PRED-EN ("predicate enable"). When PRED-EN = 0, the instruction executes as a standard R-type per the parent doc. When PRED-EN = 1, the instruction is predicated and rs2 / ext_rs2 are reinterpreted as the condition specifier:
    35       34       33       32
+--------+--------+--------+--------+
| 1 | ext_rd | ext_rs1| cond_5 |
+--------+--------+--------+--------+
- ext_rd and ext_rs1 continue to extend their respective register fields — predicated instructions retain full x0–x63 access for the destination and source-1.
- ext_rs2 (bit 32) is repurposed as cond[5], the high bit of the 6-bit condition field.
- The standard rs2 field (bits [24:20]) is repurposed as cond[4:0], the low 5 bits of the condition field. No second source register is available in predicated form.
3.2 Condition Field
The 6-bit condition is split as 3 bits of test mode and 3 bits of condition code:
 5    3 2    0
+------+------+
| mode | cond |
+------+------+
3.2.1 Test Mode (3 bits)
The test mode selects what value is compared. In mode TC the comparison is "rd <cond> rs1"; in every other mode it is "test value <cond> zero", where the test value is rd, rs1, or a virtual combination of the two.
| mode | Mnemonic | Test value | Notes |
|---|---|---|---|
| 000 | TZ_RD | rd | Compare rd against zero |
| 001 | TZ_RS1 | rs1 | Compare rs1 against zero |
| 010 | TC | rd vs rs1 | Compare rd against rs1 (signed/unsigned per condition) |
| 011 | TA | rd & rs1 | Compare bitwise-AND against zero (mask test) |
| 100 | TO | rd \| rs1 | Compare bitwise-OR against zero |
| 101 | TX | rd ^ rs1 | Compare bitwise-XOR against zero (equality probe) |
| 110 | TS | rd + rs1 | Compare virtual sum against zero (predicate on overflow direction) |
| 111 | TD | rd − rs1 | Compare virtual difference against zero (predicate on subtraction result, catches overflow that mode TC's compare path misses) |
Modes TC and TD are nearly equivalent for in-range values; they differ when the subtraction would overflow. TC performs a true signed comparison via the CPU's compare path; TD computes the difference and tests the result. Both are exposed because the choice can matter for saturating-arithmetic patterns.
The test result drives the condition code; the test value itself is not written anywhere — it is consumed by the predicate logic only.
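The mode table can be read as a small function from (mode, rd, rs1) to a test value. The following is a minimal sketch under stated assumptions: the enum and function names are hypothetical, sums and differences are assumed to wrap modulo 2^64, and the unsigned-to-signed conversion assumes a two's-complement target.

```c
#include <stdint.h>

/* Mode numbers from the test-mode table. Names are illustrative only. */
enum xcond_mode { TZ_RD = 0, TZ_RS1 = 1, TC = 2, TA = 3, TO = 4, TX = 5, TS = 6, TD = 7 };

/* Test value for every mode except TC, which compares rd with rs1 directly
 * and therefore has no single test value. Arithmetic is done on unsigned
 * operands so that overflow wraps instead of being undefined in C. */
int64_t xcond_test_value(enum xcond_mode mode, int64_t rd, int64_t rs1)
{
    uint64_t a = (uint64_t)rd, b = (uint64_t)rs1;
    switch (mode) {
    case TZ_RD:  return rd;
    case TZ_RS1: return rs1;
    case TA:     return (int64_t)(a & b);
    case TO:     return (int64_t)(a | b);
    case TX:     return (int64_t)(a ^ b);
    case TS:     return (int64_t)(a + b);   /* virtual sum, wraps */
    case TD:     return (int64_t)(a - b);   /* virtual difference, wraps */
    default:     return 0;                  /* TC: caller compares rd and rs1 */
    }
}
```

With rd = INT64_MIN and rs1 = 1, a TC compare sees rd < rs1, whereas the TD difference wraps to INT64_MAX and so tests as positive. This is the case in which the two modes diverge.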
3.2.2 Condition Code (3 bits)
| code | Mnemonic | Predicate true when |
|---|---|---|
| 000 | EQ | test value == 0 (or rd == rs1 in mode TC) |
| 001 | NE | test value != 0 |
| 010 | LT | test value < 0, signed |
| 011 | GE | test value >= 0, signed |
| 100 | LE | test value <= 0, signed |
| 101 | GT | test value > 0, signed |
| 110 | LTU | test value < 0, unsigned (equivalent to "carry set" for mode TD) |
| 111 | GEU | test value >= 0, unsigned |
In modes other than TC, the conditions LTU/GEU operate on the unsigned interpretation of the computed test value. For the unary modes (TZ_RD, TZ_RS1), LTU is always false and GEU is always true, because an unsigned value cannot be negative; these encodings are reserved and trap as illegal-instruction (§11.2).
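A sketch of the condition evaluation may help when checking encodings by hand. The names are hypothetical. For mode TC the sketch assumes each code compares rd with rs1 in the obvious way (the named pseudo-conditions in §3.4 confirm this for EQ, NE, LT, GE, LTU and GEU). For the other modes it takes the literal reading of LTU/GEU against zero and does not model the "carry set" interpretation mentioned for mode TD.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition codes from the condition-code table. Names are illustrative. */
enum xcond_code { EQ = 0, NE = 1, LT = 2, GE = 3, LE = 4, GT = 5, LTU = 6, GEU = 7 };

/* Modes other than TC: compare the test value with zero. */
bool xcond_holds_vs_zero(enum xcond_code code, int64_t tv)
{
    switch (code) {
    case EQ:  return tv == 0;
    case NE:  return tv != 0;
    case LT:  return tv < 0;
    case GE:  return tv >= 0;
    case LE:  return tv <= 0;
    case GT:  return tv > 0;
    case LTU: return false;  /* literal reading: no unsigned value is below zero */
    case GEU: return true;   /* literal reading: every unsigned value is >= zero */
    }
    return false;
}

/* Mode TC: compare rd with rs1 directly (assumed semantics for LE and GT). */
bool xcond_holds_tc(enum xcond_code code, int64_t rd, int64_t rs1)
{
    switch (code) {
    case EQ:  return rd == rs1;
    case NE:  return rd != rs1;
    case LT:  return rd <  rs1;
    case GE:  return rd >= rs1;
    case LE:  return rd <= rs1;
    case GT:  return rd >  rs1;
    case LTU: return (uint64_t)rd <  (uint64_t)rs1;
    case GEU: return (uint64_t)rd >= (uint64_t)rs1;
    }
    return false;
}
```

The signed and unsigned TC compares disagree whenever exactly one operand has its top bit set: with rd = -1 and rs1 = 1, LT holds but LTU does not.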
3.3 Operation Selection
When PRED-EN = 1, the funct3 field selects the ALU operation per a slightly modified map. SLT and SLTU are replaced with MOV-cond and RSUB-cond because predicated SLT/SLTU are of negligible use (a predicated compare-and-set is rarely the right primitive):
| funct3 | funct7[5] | Predicated mnemonic | Operation when predicate true |
|---|---|---|---|
| 000 | 0 | ADD.cond | rd = rd + rs1 |
| 000 | 1 | SUB.cond | rd = rd − rs1 |
| 001 | 0 | SLL.cond | rd = rd << (rs1 & 0x3F) |
| 010 | 0 | MOV.cond | rd = rs1 (replaces SLT) |
| 011 | 0 | RSUB.cond | rd = rs1 − rd (replaces SLTU) |
| 100 | 0 | XOR.cond | rd = rd ^ rs1 |
| 101 | 0 | SRL.cond | rd = rd >> (rs1 & 0x3F) (logical) |
| 101 | 1 | SRA.cond | rd = rd >> (rs1 & 0x3F) (arithmetic) |
| 110 | 0 | OR.cond | rd = rd \| rs1 |
| 111 | 0 | AND.cond | rd = rd & rs1 |
All other funct7 values are reserved when PRED-EN = 1 and trap as illegal-instruction. This includes funct7 = 0100000 combined with any funct3 other than 000 or 101.
Word-size variants: OP-32 opcode 0x3B is similarly predicatable, with funct3 selecting the 32-bit forms (ADDW.cond, SUBW.cond, SLLW.cond, etc.). The full table mirrors the 64-bit case with the standard 32-bit semantics. MOVW.cond and RSUBW.cond are 32-bit variants of MOV and RSUB.
Side-effect note: When the predicate is false, rd is not modified. This is observably different from a non-predicated ALU op that writes the same destination, including for purposes of subsequent forwarding and write-port allocation. Implementations must treat a false-predicate ALU op as a no-op on the writeback path.
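The operation map and the no-write rule can be captured in a short reference model. This is only a sketch for the 64-bit OP forms: the function name is hypothetical, reserved funct3/funct7 combinations are not rejected, and arithmetic wraps modulo 2^64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of rd after a predicated OP (opcode 0x33) instruction.
 * funct3 and funct7_5 (bit 5 of funct7) follow the operation table. */
int64_t xcond_alu(unsigned funct3, unsigned funct7_5,
                  int64_t rd, int64_t rs1, bool predicate)
{
    if (!predicate)
        return rd;                        /* false predicate: rd is untouched */

    uint64_t a = (uint64_t)rd, b = (uint64_t)rs1;
    unsigned sh = (unsigned)(b & 0x3F);   /* shift amount: low 6 bits of rs1 */
    uint64_t r;

    switch (funct3) {
    case 0:  r = funct7_5 ? a - b : a + b; break;   /* SUB.cond / ADD.cond */
    case 1:  r = a << sh; break;                    /* SLL.cond */
    case 2:  r = b; break;                          /* MOV.cond */
    case 3:  r = b - a; break;                      /* RSUB.cond */
    case 4:  r = a ^ b; break;                      /* XOR.cond */
    case 5:                                         /* SRA.cond / SRL.cond */
        if (funct7_5 && rd < 0)
            r = ~(~a >> sh);   /* arithmetic shift without signed >> */
        else
            r = a >> sh;
        break;
    case 6:  r = a | b; break;                      /* OR.cond */
    default: r = a & b; break;                      /* 7: AND.cond */
    }
    return (int64_t)r;
}
```

For example, xcond_alu(3, 0, x, 0, x < 0) models RSUB.cond with rs1 = x0 and a "rd < 0" predicate, which yields the absolute value of x.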
3.4 Encoding Diagram
A predicated R-type instruction in wide mode looks like this (most significant bit on the left):
36-bit fetch word:
| Bits | 35 | 34 | 33 | 32 | 31..25 | 24..20 | 19..15 | 14..12 | 11..7 | 6..0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Field | 1 | ext_rd | ext_rs1 | c5 | funct7 | c[4:0] | rs1 | funct3 | rd | 0110011 |
| Role | PRED-EN | | | cond[5] | | cond[4:0] | | | | |
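The field layout above can be exercised with a small packing routine. This is a sketch with a hypothetical function name; it assumes ext_rd and ext_rs1 hold bit 5 of the respective 6-bit register numbers, which is the natural reading of "extend their respective register fields" but is not spelled out here.

```c
#include <stdint.h>

/* Pack a predicated R-type instruction into a 36-bit wide-mode fetch word.
 * rd and rs1 are register numbers 0..63; mode and cond are 3 bits each. */
uint64_t xcond_encode(unsigned opcode, unsigned funct3, unsigned funct7,
                      unsigned rd, unsigned rs1, unsigned mode, unsigned cond)
{
    unsigned cond6 = ((mode & 7u) << 3) | (cond & 7u);
    uint64_t w = 0;
    w |= (uint64_t)1 << 35;                       /* PRED-EN */
    w |= (uint64_t)((rd  >> 5) & 1u) << 34;       /* ext_rd (assumed: bit 5 of rd) */
    w |= (uint64_t)((rs1 >> 5) & 1u) << 33;       /* ext_rs1 (assumed: bit 5 of rs1) */
    w |= (uint64_t)((cond6 >> 5) & 1u) << 32;     /* cond[5], formerly ext_rs2 */
    w |= (uint64_t)(funct7 & 0x7Fu) << 25;
    w |= (uint64_t)(cond6 & 0x1Fu) << 20;         /* cond[4:0], formerly rs2 */
    w |= (uint64_t)(rs1 & 0x1Fu) << 15;
    w |= (uint64_t)(funct3 & 7u) << 12;
    w |= (uint64_t)(rd & 0x1Fu) << 7;
    w |= (uint64_t)(opcode & 0x7Fu);
    return w;
}
```

As a usage example, RSUB.cond with rd = x10 (an arbitrary choice), rs1 = x0, mode 000 and condition 010 packs to 0x800203533 under these assumptions.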
Assembler syntax:
ADD.cond rd, rs1, mode=MMM, cond=CCC
ADD.cond rd, rs1, MMM/CCC ; shorthand
ADD.cond rd, rs1, GT_RS1 ; named pseudo for mode=001 cond=GT
Named pseudo-conditions covering the most common patterns:
| Pseudo-name | mode/cond | Meaning |
|---|---|---|
| GT_RD | 000/101 | rd > 0 |
| LT_RD | 000/010 | rd < 0 |
| EQZ_RD | 000/000 | rd == 0 |
| NEZ_RD | 000/001 | rd != 0 |
| GT_RS1 | 001/101 | rs1 > 0 |
| LT_RS1 | 001/010 | rs1 < 0 |
| EQZ_RS1 | 001/000 | rs1 == 0 |
| NEZ_RS1 | 001/001 | rs1 != 0 |
| LT | 010/010 | rd < rs1 (signed) |
| GE | 010/011 | rd >= rs1 (signed) |
| LTU | 010/110 | rd < rs1 (unsigned) |
| GEU | 010/111 | rd >= rs1 (unsigned) |
| EQ | 010/000 | rd == rs1 |
| NE | 010/001 | rd != rs1 |
| ANY | 011/001 | (rd & rs1) != 0 (any common bit set) |
| NONE | 011/000 | (rd & rs1) == 0 (no common bit set) |
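An assembler could resolve these names with a plain lookup table. The sketch below is illustrative only: the struct, array and function names are hypothetical, and the entries simply transcribe the table above.

```c
#include <string.h>

struct xcond_pseudo { const char *name; unsigned mode; unsigned cond; };

/* One entry per named pseudo-condition: {name, mode, cond}. */
const struct xcond_pseudo xcond_pseudos[] = {
    {"GT_RD", 0, 5},  {"LT_RD", 0, 2},  {"EQZ_RD", 0, 0},  {"NEZ_RD", 0, 1},
    {"GT_RS1", 1, 5}, {"LT_RS1", 1, 2}, {"EQZ_RS1", 1, 0}, {"NEZ_RS1", 1, 1},
    {"LT", 2, 2},     {"GE", 2, 3},     {"LTU", 2, 6},     {"GEU", 2, 7},
    {"EQ", 2, 0},     {"NE", 2, 1},     {"ANY", 3, 1},     {"NONE", 3, 0},
};

/* Returns the 6-bit condition field (mode << 3 | cond), or -1 if unknown. */
int xcond_pseudo_lookup(const char *name)
{
    size_t n = sizeof xcond_pseudos / sizeof xcond_pseudos[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, xcond_pseudos[i].name) == 0)
            return (int)((xcond_pseudos[i].mode << 3) | xcond_pseudos[i].cond);
    return -1;
}
```

For instance, GT_RS1 resolves to mode 001 and condition 101, i.e. the 6-bit field 001101.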
4. Detection
Software detects Xcond by reading the mxcond CSR (§7). A simpler probe: execute a known-encoded predicated instruction (with a never-true condition, so it has no architectural effect) and catch the illegal-instruction trap if not supported. The CSR-based approach is preferred for cross-implementation portability.
5. Instruction Reference
This section gives the full semantics of each predicated form. All operate only when PRED-EN = 1 in the R-type extension nibble. All are wide-mode-only.
5.1 ADD.cond, SUB.cond
ADD.cond rd, rs1, mode/cond ; if (test passes) rd = rd + rs1
SUB.cond rd, rs1, mode/cond ; if (test passes) rd = rd - rs1
The 64-bit signed add/subtract semantics match standard ADD/SUB. ADDW.cond/SUBW.cond (opcode 0x3B) operate on 32-bit values with sign extension to 64 bits, as standard ADDW/SUBW.
5.2 MOV.cond
MOV.cond rd, rs1, mode/cond ; if (test passes) rd = rs1
Single-instruction conditional assignment. The test typically operates on rd (the current value) and the assignment replaces it with rs1. Common usage patterns:
- Saturating clamp to an upper bound: MOV.cond x, max_reg, mode=010, cond=GT (if x > max_reg, x = max_reg; otherwise x is unchanged).
- State replacement on threshold: MOV.cond state, new_state, mode/cond.
MOVW.cond is the 32-bit variant.
5.3 RSUB.cond
RSUB.cond rd, rs1, mode/cond ; if (test passes) rd = rs1 - rd
Reverse subtract — the order of operands is the reverse of SUB.cond. Most useful with rs1 = x0, which yields negation: if (test passes) rd = -rd. Single-instruction abs is the canonical use (§6.4).
RSUBW.cond is the 32-bit variant.
5.4 AND.cond, OR.cond, XOR.cond
AND.cond rd, rs1, mode/cond ; if (test passes) rd = rd & rs1
OR.cond rd, rs1, mode/cond ; if (test passes) rd = rd | rs1
XOR.cond rd, rs1, mode/cond ; if (test passes) rd = rd ^ rs1
Bitwise operations with predication. Most useful in combination with the bitmask test modes (TA = mode 011): "if any/no bit of (rd & rs1) is set, modify rd by op with rs1." See §6.6 for self-modifying flag patterns.
5.5 SLL.cond, SRL.cond, SRA.cond
SLL.cond rd, rs1, mode/cond ; if (test passes) rd = rd << (rs1 & 0x3F)
SRL.cond rd, rs1, mode/cond ; if (test passes) rd = rd >> (rs1 & 0x3F) [logical]
SRA.cond rd, rs1, mode/cond ; if (test passes) rd = rd >> (rs1 & 0x3F) [arithmetic]
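Predicated shifts. Because rs1 supplies the shift amount, the predicate can test only rd, rs1, or a combination of the two. For example, SRA.cond val, sh, LT_RD implements if (val < 0) val >>= sh. A step such as if (val >= midpoint) val >>= 1 needs a third operand (the constant shift amount) and does not fit a single predicated shift.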
The 32-bit variants SLLW.cond, SRLW.cond, SRAW.cond use the low 5 bits of rs1 (matching standard *W semantics).
6. Examples
All examples in this section have been hand-verified against the encoding tables in §3 and §5 and the standard RV64 reference. Cycle counts assume an in-order issue pipeline with a 4-cycle branch mispredict penalty; the actual penalty on FireStorm depends on the final pipeline depth (open item in parent doc §13).
6.1 Saturating Accumulator (Positive Samples)
The motivating example. Sum the positive elements of an array:
int sum = 0;
for (int i = 0; i < n; i++) {
if (x[i] > 0) sum += x[i];
}
Standard RV64 (with Xcrisp auto-inc):
li sum, 0
mv p, x_ptr
mv cnt, n
loop:
LWPI t0, 0(p)+ ; t0 = *p++
blez t0, skip ; branch if t0 <= 0
add sum, sum, t0 ; sum += t0
skip:
addi cnt, cnt, -1
bnez cnt, loop
Inner body: 4 to 5 instructions depending on branch taken. For random-sign input, the blez mispredicts roughly half the time, costing approximately 2 cycles per iteration on average (4-cycle mispredict × 50% rate, amortised). Effective cost: ~6.5 cycles/iteration.
With Xcond:
li sum, 0
mv p, x_ptr
mv cnt, n
loop:
LWPI t0, 0(p)+
ADD.cond sum, t0, GT_RS1 ; if (t0 > 0) sum += t0
addi cnt, cnt, -1
bnez cnt, loop
Inner body: 4 instructions, no inner conditional branch. The trailing bnez cnt, loop is the loop branch (highly predictable, predicted-taken). Effective cost: ~4 cycles/iteration.
Speedup: ~1.6× for random-sign data, larger for high-mispredict-cost pipelines.
6.2 Modulo Counter (Wrap-Around)
A counter that wraps when it crosses a limit, e.g., a phase accumulator or circular buffer index:
counter += step;
if (counter >= LIMIT) counter -= LIMIT;
Standard RV64:
add counter, counter, step
blt counter, limit, done
sub counter, counter, limit
done:
3 instructions, 1 branch. The branch is data-dependent and may mispredict; for a uniform-step counter the wrap (branch not taken) occurs on roughly step/LIMIT of the iterations.
With Xcond:
add counter, counter, step
SUB.cond counter, limit, GE ; if (counter >= limit) counter -= limit
2 instructions, no branch. Saves 1 instruction and removes a mispredict-prone branch.
6.3 Saturating Decrement
Decrement a counter, clamped at zero:
if (counter > 0) counter--;
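Standard RV64 (branchless; snez is a base-ISA pseudo-instruction, so Zbb is not needed):
snez t0, counter ; t0 = (counter != 0) ? 1 : 0
sub counter, counter, t0
2 instructions. This matches the source only when counter is never negative, because snez tests counter != 0 rather than counter > 0.
Standard RV64 (branching):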
beqz counter, done
addi counter, counter, -1
done:
2 instructions, with branch.
With Xcond (requires register holding 1):
SUB.cond counter, one_reg, GT_RD ; if (counter > 0) counter -= 1
1 instruction. Saves 1 instruction. Note the trade-off: a register is consumed to hold the constant 1; this is acceptable in tight loops where the register can be hoisted out of the loop body.
6.4 Branchless Absolute Value
x = abs(x); // if (x < 0) x = -x
Standard RV64 (bit-twiddle, no Zbb):
sraiw t0, x, 31 ; t0 = -1 if negative, 0 if non-negative
xor x, x, t0
sub x, x, t0
3 instructions, branchless, but consumes a temporary.
Standard RV64 (Zbb):
neg t0, x
max x, x, t0
2 instructions.
With Xcond:
RSUB.cond x, x0, LT_RD ; if (x < 0) x = 0 - x = -x
1 instruction. No temporary. Saves 1–2 instructions over standard forms.
Worked encoding:
PRED-EN = 1
ext_rd = 0 (x is in x0-x31)
ext_rs1 = 0 (x0)
cond[5:0] = 000010 (cond[5:3] = mode = 000, TZ_RD; cond[2:0] = condition = 010, LT)
cond[5] (bit 32) = 0
funct7 = 0000000
cond[4:0] (bits 24:20) = 00010
rs1 = 00000 (x0)
funct3 = 011 (RSUB)
rd = x's register number
opcode = 0110011
6.5 L1 Norm (Compound: Xcrisp + Xcond)
Sum of absolute values over an array — a common DSP primitive (e.g., for signal energy estimation):
int sum_abs = 0;
for (int i = 0; i < n; i++) {
int v = x[i];
if (v < 0) v = -v;
sum_abs += v;
}
Standard RV64 (with Xcrisp auto-inc, no Zbb):
loop:
LWPI t0, 0(p)+
sraiw t1, t0, 31
xor t0, t0, t1
sub t0, t0, t1
add sum, sum, t0
addi cnt, cnt, -1
bnez cnt, loop
6 inner-loop instructions plus the loop branch.
Standard RV64 (Xcrisp + Zbb):
loop:
LWPI t0, 0(p)+
neg t1, t0
max t0, t0, t1
add sum, sum, t0
addi cnt, cnt, -1
bnez cnt, loop
5 inner-loop instructions.
With Xcrisp + Xcond:
loop:
LWPI t0, 0(p)+
RSUB.cond t0, x0, LT_RD ; t0 = abs(t0)
add sum, sum, t0
addi cnt, cnt, -1
bnez cnt, loop
4 inner-loop instructions. 20–33% inner-loop reduction versus Zbb / non-Zbb baselines. No mispredict in the abs step.
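To check that the predicated loop computes the same result as the source, each instruction can be modelled in C. This is a behavioural sketch only: the function name is hypothetical, LWPI is assumed to sign-extend the loaded word, the accumulator is kept at 64 bits as in the register, and the C if stands in for the predicate (it says nothing about timing).

```c
#include <stddef.h>
#include <stdint.h>

/* Behavioural model of the Xcrisp + Xcond inner loop, one step per instruction. */
int64_t l1_norm_model(const int32_t *x, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {   /* addi cnt / bnez: loop control */
        int64_t t0 = x[i];             /* LWPI t0, 0(p)+ : load, then p++ */
        if (t0 < 0)                    /* RSUB.cond t0, x0, LT_RD: predicate rd < 0 */
            t0 = 0 - t0;               /*   rd = rs1 - rd with rs1 = x0 */
        sum += t0;                     /* add sum, sum, t0 */
    }
    return sum;
}
```

For the input {3, -4, 0, -1} the model returns 8, matching the C source above.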
With Xcrisp + Xcond + load-op fusion (LWADD from Xcrisp):
loop:
LWPI t0, 0(p)+
RSUB.cond t0, x0, LT_RD ; t0 = abs(t0)
add sum, sum, t0
; (could fuse the add with the load if we could predicate the fused op — see open items)
addi cnt, cnt, -1
bnez cnt, loop
The load-op fusion path is not currently extended to predicated forms, so this loop is identical to the previous one; see §10.2 and the open items in §12.
6.6 Self-Modifying Flag Update
Toggle a bit only when it is currently set (paired with sticky-flag clearing):
if (flags & DONE_BIT) flags ^= DONE_BIT; // clear DONE bit if set
This particular example is degenerate — unconditional andi flags, flags, ~DONE_BIT produces the same result in 1 instruction with no Xcond needed.
The genuinely useful pattern is a same-mask test combined with a same-mask modify on a multi-bit mask, where the operation is something other than clear (i.e., the modification cannot be expressed as a simpler unconditional bitop). One narrow example:
if (flags & RETRY_BITS) {
flags ^= RETRY_BITS; // flip all retry bits at once if any was set
flags |= TRIED_FLAG; // (this second op is the part Xcond can't fold)
}
For the first line alone:
XOR.cond flags, retry_mask, ANY ; if (flags & retry_mask) flags ^= retry_mask
1 instruction versus 3 (and/branch/xor) for the branched form. Real win when the test mask equals the modify mask and the unconditional form is not equivalent.
In practice this combination is rare. It is tempting to use the bitmask test mode (TA) to predicate other operations on a flag's state:
if (status & READY_BIT) counter = max_value; // tests status, modifies counter
This does not fit the encoding: the modified register (counter) differs from the tested register (status), and the bitmask predicate works only when rd doubles as the tested register. Such code must either be rearranged or fall back to a branch.
Honest assessment: Bitmask predication (mode TA) is useful but narrow. It enables single-instruction self-modifying flag idioms but does not generalise to "test register A, modify register B."
6.7 Verification Summary
| Example | Pattern | Standard | With Xcond | Saving |
|---|---|---|---|---|
| 6.1 | Conditional accumulate | 4–5 instr + mispredict | 4 instr | ~1.6× speedup (data-dependent) |
| 6.2 | Modulo counter | 3 instr (incl. branch) | 2 instr | 1 instr + branch eliminated |
| 6.3 | Saturating decrement | 2 instr | 1 instr | 1 instr |
| 6.4 | Abs | 2–3 instr | 1 instr | 1–2 instr |
| 6.5 | L1 norm inner loop | 5–6 instr/iter | 4 instr/iter | 20–33% |
| 6.6 | Same-mask flag toggle | 3 instr (incl. branch) | 1 instr | Narrow but real |
Patterns that do not benefit from Xcond:
- General c ? a : b select with three independent registers: use Zicond.
- min/max: Zbb does these in 1 instruction; Xcond ties at best.
- Cross-register predicate ("test register A, modify register B", where A is neither rd nor the rs1 operand): requires branch or Zicond.
- Predicated loads/stores: not in v0.1.
7. CSR Allocation
| CSR | Address (suggested) | Privilege | Description |
|---|---|---|---|
| mxcond | 0xFC3 | MRO | Xcond version and feature bits |
Bit layout of mxcond:
| Bits | Field | Meaning |
|---|---|---|
| [0] | PRESENT | 1 if Xcond implemented |
| [7:1] | VERSION | Xcond version (1 = v0.1) |
| [8] | HAS_RSUB | 1 if RSUB-cond is implemented (mandatory in v0.1; reserved bit for future variants that omit it) |
| [9] | HAS_MOV | 1 if MOV-cond is implemented (mandatory in v0.1) |
| [10] | HAS_W_VARIANTS | 1 if 32-bit (*W.cond) variants are implemented |
| [63:11] | reserved | - |
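Software that has already read the raw mxcond value (the CSR read itself is not shown) could unpack the fields as in this sketch. The struct and function names are hypothetical; only the bit positions come from the table above.

```c
#include <stdbool.h>
#include <stdint.h>

struct mxcond_info {
    bool     present;         /* bit 0 */
    unsigned version;         /* bits 7:1, 1 = v0.1 */
    bool     has_rsub;        /* bit 8 */
    bool     has_mov;         /* bit 9 */
    bool     has_w_variants;  /* bit 10 */
};

/* Split a raw mxcond value into its documented fields. */
struct mxcond_info mxcond_decode(uint64_t csr)
{
    struct mxcond_info info;
    info.present        = (csr >> 0) & 1u;
    info.version        = (unsigned)((csr >> 1) & 0x7Fu);
    info.has_rsub       = (csr >> 8) & 1u;
    info.has_mov        = (csr >> 9) & 1u;
    info.has_w_variants = (csr >> 10) & 1u;
    return info;
}
```

A v0.1 implementation with every optional feature would report 0x703 under this layout: PRESENT set, VERSION = 1, and bits 8 to 10 set.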
8. Compiler and Toolchain Integration
8.1 Target Flags
The +xcond target feature enables Xcond emission. The full FireStorm feature set is +xfirestorm = +xwide,+xcrisp,+xstack,+xcond.
Xcond requires Xwide (wide-mode execution), so +xcond implies +xwide at the toolchain level. Code generation falls back to non-predicated forms for any function that the compiler determines may execute in narrow mode.
Per-function annotation: __attribute__((target("xcond"))) enables predicated codegen for one function, __attribute__((target("no-xcond"))) disables it.
8.2 ABI Compatibility
Xcond does not alter the calling convention or any data layout. A predicated instruction with a false condition produces a no-op on rd, observably identical to standard ABI expectations.
8.3 When to Use Xcond
The compiler should emit predicated forms when all of the following hold:
- The function is in a .text.wide section (executes in wide mode).
- The short conditional region replaces a branch whose mispredict cost exceeds the saved instructions. For very short forward branches over single-instruction bodies, predication is almost always preferable.
- The predicate fits the rd-as-test-operand model. Where it does not (general select), use Zicond or fall back to branches.
The compiler should not emit Xcond when:
- The branch is highly predictable (e.g., loop-exit branches). Predication adds work even when the condition is false; a correctly predicted branch costs only its own issue slot.
- The conditional region is long (more than ~4 instructions). Multiple predicated instructions accumulate cost; a single branch is cheaper.
- The pattern fits Zbb min/max more directly.
8.4 Inline Assembly
GCC/Clang inline-assembly support:
- New constraint modifier %c<n> formats a condition operand as mode/cond shorthand.
- The asm template emits .cond mnemonics directly; the compiler does not optimise across them.
9. Implementation Guidance
9.1 Decode Path
The predicate-enable bit is bit 35 of the 36-bit fetch word. Decoders should:
- Read bit 35 (PRED-EN) at the start of decode for R-type / OP-32 instructions.
- If PRED-EN = 0, decode normally per the parent doc.
- If PRED-EN = 1:
  - Recompute the funct3 → operation map (substitute MOV for SLT, RSUB for SLTU).
  - Suppress the rs2 register read (rs2 field is condition, not register index).
  - Schedule an rd-read in addition to the rd-write (rd is now also a source).
  - Extract the 6-bit condition field from {bit 32, bits[24:20]}.
The rd-read is the most pipeline-significant change: the issue stage must read rd in addition to rs1 for predicated R-type. Because the rs2 read is suppressed, the total is still two register reads, so a 2-read-port implementation can steer the rd index onto the port normally used for rs2, at the cost of an index mux in the read path. An implementation that cannot redirect that port needs a third read port or a structural stall. Sizing the register file port count is an implementation choice.
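The field extraction in the decode steps above can be sketched as follows. The struct and function names are hypothetical, and ext_rd / ext_rs1 are assumed to supply bit 5 of the 6-bit register numbers.

```c
#include <stdbool.h>
#include <stdint.h>

struct xcond_decoded {
    bool     pred_en;
    unsigned opcode, funct3, funct7;
    unsigned rd, rs1;       /* 6-bit register numbers (x0..x63) */
    unsigned mode, cond;    /* meaningful only when pred_en is set */
};

/* Split a 36-bit wide-mode R-type fetch word into its Xcond fields. */
struct xcond_decoded xcond_decode(uint64_t w)
{
    struct xcond_decoded d;
    unsigned cond6;

    d.pred_en = (w >> 35) & 1u;                       /* read PRED-EN first */
    d.opcode  = (unsigned)(w & 0x7Fu);
    d.funct3  = (unsigned)((w >> 12) & 7u);
    d.funct7  = (unsigned)((w >> 25) & 0x7Fu);
    d.rd      = (unsigned)((((w >> 34) & 1u) << 5) | ((w >> 7) & 0x1Fu));
    d.rs1     = (unsigned)((((w >> 33) & 1u) << 5) | ((w >> 15) & 0x1Fu));

    /* {bit 32, bits[24:20]} is the condition field, not a register index. */
    cond6  = (unsigned)((((w >> 32) & 1u) << 5) | ((w >> 20) & 0x1Fu));
    d.mode = cond6 >> 3;
    d.cond = cond6 & 7u;
    return d;
}
```

A real decoder would only interpret mode and cond when pred_en is set; with PRED-EN = 0 the same bits are ext_rs2 and rs2 as in the parent ISA.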
9.2 Condition Evaluation
The condition logic computes:
- The test value (selected by mode, computed from rd and rs1 as per §3.2.1). When mode TS is paired with ADD.cond, or mode TD with SUB.cond, the test value equals the ALU output of the operation under predicate, so implementations may share the adder; other pairings need the test value computed separately.
- The condition result (true/false) based on the test value and condition code per §3.2.2.
The condition result gates the writeback. A false-predicate instruction must not write rd; this is functionally a write-mask in the writeback stage. Implementations using register renaming should suppress the rename commit on false predicate; in-order implementations gate the GPR write-enable.
9.3 Forwarding
Because rd is read for the test in modes TZ_RD, TC, TA, TO, TX, TS, and TD, the implementation must handle forwarding for the rd-read just like any other source operand. A back-to-back sequence where instruction N+1 has its rd dependent on instruction N's rd commit may require one cycle of forwarding stall in a shallow pipeline.
For false-predicate instructions, the "rd value" that subsequent instructions see is the pre-predicate value (because the false predicate did not write). Implementations using bypass logic should bypass the original rd, not the predicated-op result.
9.4 Trap Behaviour
A predicated instruction that traps (e.g., due to an illegal funct3/funct7 combination, or due to a memory access in a future predicated-memory variant) traps regardless of whether the predicate is true or false. The architectural state on trap entry reflects no commit of the predicated op (rd unchanged).
A false-predicate ALU op does not raise integer-overflow or similar traps even if the operation, were it to execute, would have triggered such. This matches the standard RV64 behaviour (RV64 does not trap on integer overflow at all) but is worth stating explicitly for clarity.
9.5 Cycle Cost
In a typical in-order implementation:
- A true-predicate R-type op costs the same as a non-predicated op (1 cycle).
- A false-predicate R-type op also costs 1 cycle. The savings come from eliminating the branch entirely, not from skipping the predicated op's pipeline slot.
The break-even calculation versus branchy code is: predication wins when (branch + body) > (predicated op). For a single-instruction body with a 4-cycle mispredict penalty and 50% mispredict rate, the branchy form costs 1 (branch) + 0.5 (body, half the time) + 2 (mispredict, half the time) = 3.5 cycles. The predicated form costs 1 cycle. Predication wins by 2.5 cycles per such opportunity.
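The same arithmetic can be written as two small functions for trying other penalties and rates. This is a sketch of the simple cost model used in this section (one cycle per issued instruction); all names and parameters are illustrative.

```c
/* Expected cycles for a forward branch over `body_len` single-cycle
 * instructions: the branch itself, the body when it executes, and the
 * average mispredict cost. */
double branchy_cost(int body_len, double p_body_executes,
                    double p_mispredict, double mispredict_penalty)
{
    return 1.0
         + p_body_executes * body_len
         + p_mispredict * mispredict_penalty;
}

/* Predicated form: every body instruction issues whether or not it commits. */
double predicated_cost(int body_len)
{
    return (double)body_len;
}
```

With body_len = 1, both probabilities at 0.5 and a 4-cycle penalty, branchy_cost gives 3.5 and predicated_cost gives 1, reproducing the 2.5-cycle difference above.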
For longer bodies, the compiler must trade off: branch cost is fixed, predicated-body cost scales with body length. Beyond approximately 4 instructions in the body, branchy code wins.
9.6 Power
False-predicate ALU ops consume datapath energy even though they do not commit. Aggressive implementations may clock-gate the ALU on detected false predicates if the prediction is fast enough to disable the path before it activates. This is a microarchitectural choice; the architecture does not require it.
10. Interaction with Other Extensions
10.1 Xwide (Mandatory)
Xcond requires wide-mode execution. In narrow mode the predicate-enable bit does not exist (the fetch word is 32 bits, not 36). All Xcond instructions must be placed in .text.wide sections.
The extended register file (x0–x63 visible in wide mode) is fully accessible to Xcond instructions because ext_rd and ext_rs1 are preserved in the predicated encoding. Only ext_rs2 is repurposed.
10.2 Xcrisp
Xcond and Xcrisp are independent. Xcrisp's load-op fusion currently does not include predicated forms — a predicated LWADD would require both a memory-access path and predicate evaluation in one instruction, and the encoding space in Xcrisp custom-1 funct3=000 is shaped for a separate rd (load destination) and rs2 (operand). Adding predication would require either:
- Restructuring the LWADD encoding to give up the auto-inc base register (collapsing rd and rs2-as-destination), or
- Allocating a separate custom-2 sub-family for predicated load-op fusion.
Both options are deferred to v0.2 of one or both extensions.
For now, Xcrisp auto-inc loads (LWPI, etc.) compose naturally with subsequent Xcond predicated arithmetic, as shown in §6.5.
10.3 Xstack
Xcond and Xstack are independent and combinable.
10.4 Zicond
Xcond and Zicond cover overlapping but distinct cases (§2.1). The compiler may emit Zicond for narrow-mode code or for general-select patterns and Xcond for wide-mode predicated arithmetic. The two do not share encoding space.
10.5 Zbb
Xcond offers no advantage over Zbb for min/max: MOV.cond can express the 2-register pattern in 1 instruction but only ties with Zbb (§2.2). Code expecting min/max should continue to use Zbb. Xcond complements Zbb for patterns Zbb does not cover (abs, conditional accumulate, modulo wrap).
10.6 Vector Extension (Future)
The reserved RVV opcodes are unaffected by Xcond. Per-lane vector predication will use the standard RVV mask register mechanism, not Xcond. This separation is intentional: scalar predication and vector masking have different cost/benefit trade-offs.
11. Encoding Summary
11.1 At-a-Glance Map
| Mnemonic | opcode | funct3 | funct7[5] | PRED-EN | Notes |
|---|---|---|---|---|---|
| ADD.cond | 0x33 | 000 | 0 | 1 | Predicated add |
| SUB.cond | 0x33 | 000 | 1 | 1 | Predicated sub |
| SLL.cond | 0x33 | 001 | 0 | 1 | Predicated shift left |
| MOV.cond | 0x33 | 010 | 0 | 1 | Replaces SLT |
| RSUB.cond | 0x33 | 011 | 0 | 1 | Replaces SLTU |
| XOR.cond | 0x33 | 100 | 0 | 1 | Predicated XOR |
| SRL.cond | 0x33 | 101 | 0 | 1 | Predicated logical shift right |
| SRA.cond | 0x33 | 101 | 1 | 1 | Predicated arithmetic shift right |
| OR.cond | 0x33 | 110 | 0 | 1 | Predicated OR |
| AND.cond | 0x33 | 111 | 0 | 1 | Predicated AND |
| ADDW.cond | 0x3B | 000 | 0 | 1 | 32-bit predicated add |
| SUBW.cond | 0x3B | 000 | 1 | 1 | 32-bit predicated sub |
| SLLW.cond | 0x3B | 001 | 0 | 1 | 32-bit predicated shift left |
| MOVW.cond | 0x3B | 010 | 0 | 1 | 32-bit MOV (sign-extends rs1) |
| RSUBW.cond | 0x3B | 011 | 0 | 1 | 32-bit RSUB |
| SRLW.cond | 0x3B | 101 | 0 | 1 | 32-bit logical shift right |
| SRAW.cond | 0x3B | 101 | 1 | 1 | 32-bit arithmetic shift right |
11.2 Reserved
- PRED-EN = 1, funct7 = 0100000, funct3 ∈ {001, 010, 011, 100, 110, 111}: reserved (these are funct7-encoded variants that have no defined predicated meaning).
- PRED-EN = 1 on any non-R-type / non-OP-32 opcode: reserved.
- Test modes TZ_RD (000) and TZ_RS1 (001) with conditions LTU (110) or GEU (111): reserved (unsigned compare against zero is degenerate).
- Mode 010 (TC) with conditions EQ/NE produces the same result as mode 111 (TD) with the same conditions: rd − rs1 is zero exactly when rd == rs1, even when the subtraction wraps. Both encodings are valid; the implementation may treat them identically.
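The reservation rules above, together with the funct7 restriction in §3.3, can be collected into one legality check. This is a sketch with a hypothetical function name; it does not check which funct3 values have defined OP-32 forms.

```c
#include <stdbool.h>

/* True if an encoding with PRED-EN = 1 is reserved under the listed rules. */
bool xcond_is_reserved(unsigned opcode, unsigned funct3, unsigned funct7,
                       unsigned mode, unsigned cond)
{
    if (opcode != 0x33 && opcode != 0x3B)
        return true;     /* PRED-EN on a non-OP / non-OP-32 opcode */
    if (funct7 != 0x00 && funct7 != 0x20)
        return true;     /* all other funct7 values are reserved */
    if (funct7 == 0x20 && funct3 != 0 && funct3 != 5)
        return true;     /* funct7 = 0100000 is defined only for SUB and SRA */
    if (mode <= 1 && cond >= 6)
        return true;     /* TZ_RD / TZ_RS1 with LTU or GEU */
    return false;
}
```

For example, RSUB.cond with LT_RD (funct3 = 011, funct7 = 0, mode 000, condition 010) is accepted, while the same instruction with funct7 = 0100000 is reported as reserved.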
12. Open Items
- CSR address. mxcond at 0xFC3 is suggested; final assignment requires coordination with mxcrisp (0xFC1), mxstack (0xFC2), and the wide-dirty bit.
- Predicated I-type. Should v0.2 add predicated immediate ALU ops (ADDI.cond, etc.) by stealing bits from the 12-bit immediate field? Cost: smaller imm range (e.g., 8 bits instead of 12). Benefit: predicated increment/decrement against a constant without consuming a register.
- Predicated loads. Predicated load (e.g., for sparse gather patterns) would require a different encoding scheme entirely. Defer to v0.2.
- Predicated load-op (Xcrisp interaction). A predicated LWADD would be valuable for predicated DSP filter accumulation. Encoding space and complexity are both significant; defer to v0.2 of Xcrisp.
- Branch-cond fusion. A predicated branch (e.g., BEQ.cond on a flag mask) is theoretically possible but adds complexity to the front end. No clear use case yet; not pursued.
- MOV.cond sign-extension semantics for W variants. MOVW.cond sign-extends the 32-bit rs1 to 64 bits before writing rd, per standard W conventions. This is consistent but worth confirming with compiler writers.
- Test mode TC vs TD. Both are exposed and functionally equivalent for non-overflow cases. Implementation may collapse them to the same logic, or differentiate for code that relies on subtraction-result behaviour. Architectural mandate TBD.
- Per-mode condition reservation. Some mode/condition combinations are degenerate (e.g., mode TZ_RS1 cond LTU = "rs1 < 0 unsigned" is always false). Currently these are reserved-illegal; an alternative is to define them as "always false" (compiles to no-op). The illegal-trap form catches compiler bugs; the always-false form is more permissive.
13. Glossary
| Term | Meaning |
|---|---|
| Predicate | A run-time test whose result gates whether an instruction commits its writeback. |
| PRED-EN | The predicate-enable bit (bit 35 of the wide-mode R-type fetch word) that activates Xcond semantics for an instruction. |
| Test mode | A 3-bit field selecting what value the predicate compares against zero (or compares two registers). |
| Condition code | A 3-bit field selecting the comparison (EQ/NE/LT/GE/LE/GT/LTU/GEU). |
| MOV-cond | A predicated move (rd = rs1 if predicate true); replaces SLT in the predicated funct3 map. |
| RSUB-cond | A predicated reverse-subtract (rd = rs1 − rd if predicate true); enables single-instruction abs via rs1 = x0; replaces SLTU. |
| TC / TD | Test-Compare (signed compare rd vs rs1) and Test-Difference (predicate on virtual rd − rs1). Equivalent for in-range values. |
| TA | Test-AND (predicate on rd & rs1); useful for bitmask probes. |
End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension.