FireStorm Xcond Extension — Conditional Execution Specification

Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companions: FireStorm Xcrisp Extension, FireStorm Xstack Extension

1. Overview

The Xcond extension adds full conditional execution ("predication") to FireStorm's R-type ALU instructions. Every standard R-type operation — ADD, SUB, AND, OR, XOR, shifts — gains an opt-in predicated form that executes only when a runtime condition holds, with the destination register additionally serving as a source.

Xcond is wide-mode-only: the encoding consumes bits from the 36-bit fetch word's extension nibble that simply do not exist in narrow-mode 32-bit fetches. The cost in wide mode is one bit (the previously-reserved spare bit in the R-type extension nibble) to signal "this instruction is predicated"; once that bit is set, the rs2 field plus the ext_rs2 bit are repurposed as a 6-bit condition specifier, giving 8 test modes × 8 conditions.

When the predicate-enable bit is set, two new operations join the standard ALU set:

MOV-cond (rd = rs1 if condition) — replaces SLT in the predicated funct3 map, enabling general conditional assignment.
RSUB-cond (rd = rs1 - rd if condition) — replaces SLTU, enabling single-instruction abs and reverse-subtract patterns.

1.1 Wins

Predication eliminates short conditional branches in inner loops. The pay-off depends on the data:

Hot accumulators with data-dependent predicates (e.g., if (x[i] > 0) sum += x[i]) collapse the test-and-add to 1 predicated instruction per iteration with no inner branch. For random-sign data, the mispredict penalty on standard RV64 dominates the inner loop; Xcond removes it entirely. Estimated gain: ~1.5× to ~2× per iteration depending on data and pipeline depth (§6.1 works a ~1.6× case).
Modulo / wrap-around counters (e.g., if (counter ≥ LIMIT) counter -= LIMIT) become 2 instructions instead of 3 (one of them a branch).
Saturating decrement / increment (e.g., if (counter > 0) counter--) becomes 1 instruction instead of 2 (branchy or branchless).
abs / negate-if-negative becomes 1 instruction via RSUB-cond, beating both the standard xor/sub bit-twiddle (3 instructions) and the Zbb neg/max form (2 instructions).
DSP inner loops combining Xcrisp auto-inc loads with Xcond predicated accumulators see compound wins. See §6.5 for a worked L1-norm example: 20–33% inner-loop instruction reduction and no mispredict.

1.2 Non-Goals

Not a vector / SIMD predicate mask. Each Xcond instruction tests one condition for one ALU op. Per-lane masking belongs in a future vector extension.
Not general select. General y = cond ? a : b with three independent registers does not fit the predicated R-type encoding (rd doubles as test operand, leaving only rs1 for the source). The standard Zicond extension covers that case, and Xcond and Zicond are complementary (§2.1, §10.4).
Not predicated loads/stores. The I-type format has no rs2 field to repurpose, and the S-type rs2 field holds the store data. A predicated-immediate form is possible (steal bits from imm[11:0]) and is deferred to v0.2 if usage data justifies it.
Not narrow-mode. Xcond is unavailable when fetching from 32-bit-wide DDR3 memory because the enable bit lives in the extension nibble. Code that must run in narrow mode falls back to Zicond or standard branchy code.

2. Relationship to Standard RISC-V

Xcond does not modify any standard RISC-V instruction encoding. In narrow mode, FireStorm executes pure RV64GC; the predicate-enable bit simply does not exist in 32-bit fetches. In wide mode, the predicate-enable bit lives in a position that is otherwise reserved-zero in the parent FireStorm ISA's R-type extension nibble (§5.1 of the parent doc). Existing R-type instructions emitted by a standard RV64 compiler will have this bit zero and execute unchanged.

The mxcond CSR (§7) advertises presence and version.

2.1 Coexistence with Zicond

The ratified RISC-V Zicond extension provides czero.eqz and czero.nez: "if the test register is (non-)zero, write zero to rd; otherwise copy rs1." Zicond synthesises general select via two czero ops and an OR.
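
The three-operation select can be sketched as a small Python model. The function names are illustrative stand-ins for the Zicond instructions, not a toolchain API, and registers are modelled as plain non-negative Python integers.

```python
def czero_eqz(rs1, rs2):
    """czero.eqz rd, rs1, rs2: rd = 0 if rs2 == 0, else rs1."""
    return 0 if rs2 == 0 else rs1


def czero_nez(rs1, rs2):
    """czero.nez rd, rs1, rs2: rd = 0 if rs2 != 0, else rs1."""
    return 0 if rs2 != 0 else rs1


def select(c, a, b):
    """Model of `c ? a : b` built from two czero ops and an OR."""
    t1 = czero_eqz(a, c)   # t1 = a when c != 0, else 0
    t2 = czero_nez(b, c)   # t2 = b when c == 0, else 0
    return t1 | t2         # at least one of t1/t2 is zero, so OR merges them
```

All three registers (c, a, b) stay independent here, which is the case the predicated R-type encoding cannot express.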

Xcond and Zicond differ in scope:

Aspect	Zicond	Xcond
Encoding format	R-type, full registers	R-type, rd doubles as source
Test value	One register (vs 0)	rd, rs1, or virtual computation of both
Condition richness	EQ/NE only	8 conditions × 8 test modes
Available in narrow mode	Yes	No
General `c ? a : b` select	Yes (3 ops)	No (rd is consumed)
Fused predicate-with-ALU	No (czero + separate ALU op)	Yes (predicate + ALU in one instruction)

Zicond is implemented when narrow-mode code needs it. Xcond is the wide-mode form, with broader operation set and richer condition encoding but no general select. Code may use both; they share no encoding space.

2.2 Coexistence with Zbb

The Zbb extension provides min, max, minu, maxu and other bit-manipulation primitives. For 2-register min/max patterns, Zbb is already optimal:

min  a, a, b      ; Zbb: a = min(a, b)

Xcond's MOV-cond can express the same pattern (MOV.cond a, b, mode=010/cond=GT) in 1 instruction — a tie. Xcond is not preferred for min/max specifically.

Xcond wins over Zbb where Zbb has no direct equivalent: abs (via RSUB-cond), conditional accumulate (via ADD-cond with the predicate on the input rs1), modulo wrap-around (via SUB-cond with a TC compare of the counter against the limit), and other patterns where the predicate tests the source operand rs1 rather than the modified register.

3. Encoding

3.1 Predicate-Enable Bit

The R-type extension nibble in wide mode is:

 35      34       33       32
+--------+--------+--------+--------+
| spare  | ext_rd | ext_rs1| ext_rs2|
+--------+--------+--------+--------+

Xcond claims bit 35 (the spare bit) as PRED-EN ("predicate enable"). When PRED-EN = 0, the instruction executes as a standard R-type per the parent doc. When PRED-EN = 1, the instruction is predicated and rs2 / ext_rs2 are reinterpreted as the condition specifier:

 35      34       33       32
+--------+--------+--------+--------+
|   1    | ext_rd | ext_rs1| cond_5 |
+--------+--------+--------+--------+

ext_rd and ext_rs1 continue to extend their respective register fields — predicated instructions retain full x0–x63 access for the destination and source-1.
ext_rs2 (bit 32) is repurposed as cond[5], the high bit of the 6-bit condition field.
The standard rs2 field (bits [24:20]) is repurposed as cond[4:0], the low 5 bits of the condition field. No second source register is available in predicated form.

3.2 Condition Field

The 6-bit condition is split as 3 bits of test mode and 3 bits of condition code:

 5    3 2    0
+------+------+
| mode | cond |
+------+------+

3.2.1 Test Mode (3 bits)

The test mode selects what value is compared. In mode TC the comparison is "rd <cond> rs1"; in every other mode it is "test value <cond> zero".

mode	Mnemonic	Test value	Notes
`000`	TZ_RD	rd	Compare rd against zero
`001`	TZ_RS1	rs1	Compare rs1 against zero
`010`	TC	rd vs rs1	Compare rd against rs1 (signed/unsigned per condition)
`011`	TA	rd & rs1	Compare bitwise-AND against zero (mask test)
`100`	TO	rd | rs1	Compare bitwise-OR against zero
`101`	TX	rd ^ rs1	Compare bitwise-XOR against zero (equality probe)
`110`	TS	rd + rs1	Compare virtual sum against zero (predicate on overflow direction)
`111`	TD	rd − rs1	Compare virtual difference against zero (predicate on the wrapped subtraction result; differs from mode TC when the subtraction overflows)

Modes TC and TD are nearly equivalent for in-range values; they differ when the subtraction would overflow. TC performs a true signed comparison via the CPU's compare path; TD computes the difference and tests the result. Both are exposed because the choice can matter for saturating-arithmetic patterns.
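
The divergence can be illustrated with a minimal Python sketch. It assumes the TD difference wraps at 64 bits before its sign is tested, which is how the paragraph above reads; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret the low 64 bits of value as a two's-complement integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def tc_lt(rd, rs1):
    """Mode TC, condition LT: true signed comparison of rd against rs1."""
    return to_signed64(rd) < to_signed64(rs1)


def td_lt(rd, rs1):
    """Mode TD, condition LT: sign of the wrapped 64-bit difference rd - rs1."""
    return to_signed64(rd - rs1) < 0


INT64_MIN = -(1 << 63)

# In-range operands: both modes agree.
in_range = (tc_lt(3, 5), td_lt(3, 5))
# INT64_MIN - 1 wraps to INT64_MAX, so TD sees a positive difference.
overflow = (tc_lt(INT64_MIN, 1), td_lt(INT64_MIN, 1))
```

Under that assumption the in-range pair agrees, while the overflow pair disagrees: TC still reports rd < rs1, TD does not.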

The test result drives the condition code; the test value itself is not written anywhere — it is consumed by the predicate logic only.

3.2.2 Condition Code (3 bits)

code	Mnemonic	Predicate true when
`000`	EQ	test value == 0 (or rd == rs1 in mode TC)
`001`	NE	test value != 0
`010`	LT	test value < 0, signed
`011`	GE	test value >= 0, signed
`100`	LE	test value <= 0, signed
`101`	GT	test value > 0, signed
`110`	LTU	test value < 0, unsigned (equivalent to "carry set" for mode TD)
`111`	GEU	test value >= 0, unsigned

In modes other than TC, the conditions LTU/GEU operate on the unsigned interpretation of the computed test value. For the unary modes (TZ_RD, TZ_RS1), LTU is always false and GEU is always true (an unsigned value is never below zero); these encodings are reserved and trap as illegal-instruction.
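
A reference-model sketch of the predicate evaluation, in Python, may help when checking the two tables against each other. It is an illustration under stated assumptions, not the normative definition: computed test values are assumed to wrap at 64 bits, and LTU/GEU on the computed modes (TA, TO, TX, TS, TD) are deliberately left unmodelled because the text does not fully pin them down. All names are hypothetical.

```python
MASK64 = (1 << 64) - 1

MODES = {"TZ_RD": 0b000, "TZ_RS1": 0b001, "TC": 0b010, "TA": 0b011,
         "TO": 0b100, "TX": 0b101, "TS": 0b110, "TD": 0b111}
EQ, NE, LT, GE, LE, GT, LTU, GEU = range(8)


def to_signed64(value):
    """Interpret the low 64 bits of value as a two's-complement integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def predicate(mode, cond, rd, rs1):
    """Return True when the predicate holds for the given register values."""
    if cond in (LTU, GEU):
        if mode == MODES["TC"]:
            lhs, rhs = rd & MASK64, rs1 & MASK64      # unsigned compare
            return lhs < rhs if cond == LTU else lhs >= rhs
        if mode in (MODES["TZ_RD"], MODES["TZ_RS1"]):
            raise ValueError("reserved encoding: illegal-instruction trap")
        raise NotImplementedError("LTU/GEU on computed modes not modelled")
    if mode == MODES["TC"]:
        lhs, rhs = to_signed64(rd), to_signed64(rs1)  # true signed compare
    else:
        test_value = {
            MODES["TZ_RD"]: rd,
            MODES["TZ_RS1"]: rs1,
            MODES["TA"]: rd & rs1,
            MODES["TO"]: rd | rs1,
            MODES["TX"]: rd ^ rs1,
            MODES["TS"]: rd + rs1,
            MODES["TD"]: rd - rs1,
        }[mode]
        lhs, rhs = to_signed64(test_value), 0         # compare against zero
    return {EQ: lhs == rhs, NE: lhs != rhs, LT: lhs < rhs,
            GE: lhs >= rhs, LE: lhs <= rhs, GT: lhs > rhs}[cond]
```

For example, `predicate(MODES["TZ_RS1"], GT, rd, rs1)` models the test "rs1 > 0", and `predicate(MODES["TA"], NE, rd, rs1)` models the mask test "(rd & rs1) != 0".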

3.3 Operation Selection

When PRED-EN = 1, the funct3 field selects the ALU operation per a slightly modified map. SLT and SLTU are replaced with MOV-cond and RSUB-cond because predicated SLT/SLTU are of negligible use (a predicated compare-and-set is rarely the right primitive):

funct3	funct7[5]	Predicated mnemonic	Operation when predicate true
`000`	0	ADD.cond	rd = rd + rs1
`000`	1	SUB.cond	rd = rd − rs1
`001`	0	SLL.cond	rd = rd << (rs1 & 0x3F)
`010`	0	MOV.cond	rd = rs1 (replaces SLT)
`011`	0	RSUB.cond	rd = rs1 − rd (replaces SLTU)
`100`	0	XOR.cond	rd = rd ^ rs1
`101`	0	SRL.cond	rd = rd >> (rs1 & 0x3F) (logical)
`101`	1	SRA.cond	rd = rd >> (rs1 & 0x3F) (arithmetic)
`110`	0	OR.cond	rd = rd | rs1
`111`	0	AND.cond	rd = rd & rs1

All other funct7 values are reserved when PRED-EN = 1 and trap as illegal-instruction (including funct7 = 0100000 with funct3 other than 000 or 101).

Word-size variants: OP-32 opcode 0x3B is similarly predicatable, with funct3 selecting the 32-bit forms (ADDW.cond, SUBW.cond, SLLW.cond, etc.). The full table mirrors the 64-bit case with the standard 32-bit semantics. MOVW.cond and RSUBW.cond are 32-bit variants of MOV and RSUB.

Side-effect note: When the predicate is false, rd is not modified. This is observably different from a non-predicated ALU op that writes the same destination, including for purposes of subsequent forwarding and write-port allocation. Implementations must treat a false-predicate ALU op as a no-op on the writeback path.
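
The predicated operation map and the false-predicate rule can be sketched in Python as follows. Registers are modelled as unsigned 64-bit integers, only funct7[5] is modelled (the remaining funct7 bits are assumed zero), and the names `PREDICATED_OPS`, `sra64` and `execute` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sra64(value, shamt):
    """Arithmetic right shift of an unsigned 64-bit register value."""
    signed = value - (1 << 64) if value >> 63 else value
    return signed >> shamt


# (funct3, funct7[5]) -> (mnemonic, operation on rd and rs1)
PREDICATED_OPS = {
    (0b000, 0): ("ADD.cond", lambda rd, rs1: rd + rs1),
    (0b000, 1): ("SUB.cond", lambda rd, rs1: rd - rs1),
    (0b001, 0): ("SLL.cond", lambda rd, rs1: rd << (rs1 & 0x3F)),
    (0b010, 0): ("MOV.cond", lambda rd, rs1: rs1),
    (0b011, 0): ("RSUB.cond", lambda rd, rs1: rs1 - rd),
    (0b100, 0): ("XOR.cond", lambda rd, rs1: rd ^ rs1),
    (0b101, 0): ("SRL.cond", lambda rd, rs1: rd >> (rs1 & 0x3F)),
    (0b101, 1): ("SRA.cond", lambda rd, rs1: sra64(rd, rs1 & 0x3F)),
    (0b110, 0): ("OR.cond", lambda rd, rs1: rd | rs1),
    (0b111, 0): ("AND.cond", lambda rd, rs1: rd & rs1),
}


def execute(funct3, funct7_bit5, predicate_true, rd, rs1):
    """Return the new rd value (unsigned 64-bit) for a predicated R-type op."""
    key = (funct3, funct7_bit5)
    if key not in PREDICATED_OPS:
        # Reserved encodings trap whether or not the predicate holds.
        raise ValueError("reserved encoding: illegal-instruction trap")
    rd &= MASK64
    rs1 &= MASK64
    if not predicate_true:
        return rd                      # false predicate: rd is not written
    _, op = PREDICATED_OPS[key]
    return op(rd, rs1) & MASK64        # wrap the result to 64 bits
```

The model returns the old rd unchanged on a false predicate rather than recomputing it, mirroring the requirement that the writeback is suppressed.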

3.4 Encoding Diagram

A predicated R-type instruction in wide mode looks like this (most significant bit on the left):

36-bit fetch word:
|  35  |  34  |  33   | 32 | 31..25 | 24..20 | 19..15 | 14..12 | 11..7 |  6..0   |
|  1   |ext_rd|ext_rs1| c5 | funct7 | c[4:0] |  rs1   | funct3 |  rd   | 0110011 |
   ^                    ^               ^
   |                    |               |
PRED-EN              cond[5]        cond[4:0]

Assembler syntax:

ADD.cond   rd, rs1, mode=MMM, cond=CCC
ADD.cond   rd, rs1, MMM/CCC          ; shorthand
ADD.cond   rd, rs1, GT_RS1            ; named pseudo for mode=001 cond=GT

Named pseudo-conditions covering the most common patterns:

Pseudo-name	mode/cond	Meaning
`GT_RD`	000/101	rd > 0
`LT_RD`	000/010	rd < 0
`EQZ_RD`	000/000	rd == 0
`NEZ_RD`	000/001	rd != 0
`GT_RS1`	001/101	rs1 > 0
`LT_RS1`	001/010	rs1 < 0
`EQZ_RS1`	001/000	rs1 == 0
`NEZ_RS1`	001/001	rs1 != 0
`LT`	010/010	rd < rs1 (signed)
`GE`	010/011	rd >= rs1 (signed)
`LTU`	010/110	rd < rs1 (unsigned)
`GEU`	010/111	rd >= rs1 (unsigned)
`EQ`	010/000	rd == rs1
`NE`	010/001	rd != rs1
`ANY`	011/001	(rd & rs1) != 0 (any common bit set)
`NONE`	011/000	(rd & rs1) == 0 (no common bit set)
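
As an illustration of how an assembler front end might resolve the condition operand, the sketch below maps either a named pseudo-condition or the MMM/CCC shorthand to the 6-bit condition field. The table contents are copied from the rows above; the function name `parse_condition` and the error handling are assumptions, not part of any existing toolchain.

```python
# Pseudo-name -> (mode, cond), copied from the pseudo-condition table.
PSEUDO_CONDITIONS = {
    "GT_RD": (0b000, 0b101), "LT_RD": (0b000, 0b010),
    "EQZ_RD": (0b000, 0b000), "NEZ_RD": (0b000, 0b001),
    "GT_RS1": (0b001, 0b101), "LT_RS1": (0b001, 0b010),
    "EQZ_RS1": (0b001, 0b000), "NEZ_RS1": (0b001, 0b001),
    "LT": (0b010, 0b010), "GE": (0b010, 0b011),
    "LTU": (0b010, 0b110), "GEU": (0b010, 0b111),
    "EQ": (0b010, 0b000), "NE": (0b010, 0b001),
    "ANY": (0b011, 0b001), "NONE": (0b011, 0b000),
}


def parse_condition(text):
    """Turn 'GT_RS1' or 'MMM/CCC' (binary digits) into the 6-bit cond field."""
    if text in PSEUDO_CONDITIONS:
        mode, cond = PSEUDO_CONDITIONS[text]
    else:
        mode_bits, cond_bits = text.split("/")
        if len(mode_bits) != 3 or len(cond_bits) != 3:
            raise ValueError(f"expected MMM/CCC, got {text!r}")
        mode, cond = int(mode_bits, 2), int(cond_bits, 2)
    return (mode << 3) | cond          # mode in cond[5:3], code in cond[2:0]
```

With this sketch, `parse_condition("GT_RS1")` and `parse_condition("001/101")` yield the same field value.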

4. Detection

Software detects Xcond by reading the mxcond CSR (§7). A simpler probe: execute a known-encoded predicated instruction (with a never-true condition, so it has no architectural effect) and catch the illegal-instruction trap if not supported. The CSR-based approach is preferred for cross-implementation portability.

5. Instruction Reference

This section gives the full semantics of each predicated form. All operate only when PRED-EN = 1 in the R-type extension nibble. All are wide-mode-only.

5.1 ADD.cond, SUB.cond

ADD.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd + rs1
SUB.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd - rs1

The 64-bit signed add/subtract semantics match standard ADD/SUB. ADDW.cond/SUBW.cond (opcode 0x3B) operate on 32-bit values with sign extension to 64 bits, as standard ADDW/SUBW.

5.2 MOV.cond

MOV.cond  rd, rs1, mode/cond     ; if (test passes) rd = rs1

Single-instruction conditional assignment. The test typically operates on rd (the current value) and the assignment replaces it with rs1. Common usage patterns:

Saturating clamp: MOV.cond x, max_reg, 010/101 (mode TC, cond GT). If x > max_reg, x is replaced with max_reg; otherwise x is unchanged.
State replacement on threshold: MOV.cond state, new_state, mode/cond.

MOVW.cond is the 32-bit variant.

5.3 RSUB.cond

RSUB.cond rd, rs1, mode/cond     ; if (test passes) rd = rs1 - rd

Reverse subtract — the order of operands is the reverse of SUB.cond. Most useful with rs1 = x0, which yields negation: if (test passes) rd = -rd. Single-instruction abs is the canonical use (§6.4).

RSUBW.cond is the 32-bit variant.

5.4 AND.cond, OR.cond, XOR.cond

AND.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd & rs1
OR.cond   rd, rs1, mode/cond     ; if (test passes) rd = rd | rs1
XOR.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd ^ rs1

Bitwise operations with predication. Most useful in combination with the bitmask test modes (TA = mode 011): "if any/no bit of (rd & rs1) is set, modify rd by op with rs1." See §6.6 for self-modifying flag patterns.

5.5 SLL.cond, SRL.cond, SRA.cond

SLL.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd << (rs1 & 0x3F)
SRL.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd >> (rs1 & 0x3F)  [logical]
SRA.cond  rd, rs1, mode/cond     ; if (test passes) rd = rd >> (rs1 & 0x3F)  [arithmetic]

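Predicated shifts. Because rs1 supplies the shift amount, the practical predicates are the unary tests on rd or rs1 (e.g., if (val < 0) val >>= n via SRA.cond val, n, LT_RD). A pattern such as if (val >= midpoint) val >>= 1 does not fit, because rs1 cannot hold both the shift amount and the comparison operand.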

The 32-bit variants SLLW.cond, SRLW.cond, SRAW.cond use the low 5 bits of rs1 (matching standard *W semantics).

6. Examples

All examples in this section have been hand-verified against the encoding tables in §3 and §5 and the standard RV64 reference. Cycle counts assume an in-order issue pipeline with a 4-cycle branch mispredict penalty; the actual penalty on FireStorm depends on the final pipeline depth (open item in parent doc §13).

6.1 Conditional Accumulator (Positive Samples)

The motivating example. Sum the positive elements of an array:

int sum = 0;
for (int i = 0; i < n; i++) {
    if (x[i] > 0) sum += x[i];
}

Standard RV64 (with Xcrisp auto-inc):

        li      sum, 0
        mv      p, x_ptr
        mv      cnt, n
loop:
        LWPI    t0, 0(p)+               ; t0 = *p++
        blez    t0, skip                ; branch if t0 <= 0
        add     sum, sum, t0            ; sum += t0
skip:
        addi    cnt, cnt, -1
        bnez    cnt, loop

Inner body: 4 to 5 instructions depending on branch taken. For random-sign input, the blez mispredicts roughly half the time, costing approximately 2 cycles per iteration on average (4-cycle mispredict × 50% rate, amortised). Effective cost: ~6.5 cycles/iteration.

With Xcond:

        li      sum, 0
        mv      p, x_ptr
        mv      cnt, n
loop:
        LWPI    t0, 0(p)+
        ADD.cond sum, t0, GT_RS1        ; if (t0 > 0) sum += t0
        addi    cnt, cnt, -1
        bnez    cnt, loop

Inner body: 4 instructions, no inner conditional branch. The trailing bnez cnt, loop is the loop branch (highly predictable, predicted-taken). Effective cost: ~4 cycles/iteration.

Speedup: ~1.6× for random-sign data, larger for high-mispredict-cost pipelines.

6.2 Modulo Counter (Wrap-Around)

A counter that wraps when it crosses a limit, e.g., a phase accumulator or circular buffer index:

counter += step;
if (counter >= LIMIT) counter -= LIMIT;

Standard RV64:

        add     counter, counter, step
        blt     counter, limit, done
        sub     counter, counter, limit
done:

3 instructions, 1 branch. The branch is data-dependent and may mispredict; for a uniform-step counter the wrap occurs (the branch falls through) on a fraction step/LIMIT of iterations.

With Xcond:

        add        counter, counter, step
        SUB.cond   counter, limit, GE       ; if (counter >= limit) counter -= limit

2 instructions, no branch. Saves 1 instruction and removes a mispredict-prone branch.

6.3 Saturating Decrement

Decrement a counter, clamped at zero:

if (counter > 0) counter--;

Standard RV64 (branchless, base ISA):

        sgtz    t0, counter             ; t0 = (counter > 0) ? 1 : 0
        sub     counter, counter, t0

2 instructions.

Standard RV64 (branchy):

        blez    counter, done
        addi    counter, counter, -1
done:

2 instructions, with branch.

With Xcond (requires register holding 1):

        SUB.cond   counter, one_reg, GT_RD     ; if (counter > 0) counter -= 1

1 instruction. Saves 1 instruction. Note the trade-off: a register is consumed to hold the constant 1; this is acceptable in tight loops where the register can be hoisted out of the loop body.

6.4 Branchless Absolute Value

x = abs(x);   // if (x < 0) x = -x

Standard RV64 (bit-twiddle, no Zbb):

        sraiw   t0, x, 31               ; t0 = -1 if negative, 0 if non-negative
        xor     x, x, t0
        sub     x, x, t0

3 instructions, branchless, but consumes a temporary.

Standard RV64 (Zbb):

        neg     t0, x
        max     x, x, t0

2 instructions.

With Xcond:

        RSUB.cond  x, x0, LT_RD         ; if (x < 0) x = 0 - x = -x

1 instruction. No temporary. Saves 1–2 instructions over standard forms.
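
The three forms can be cross-checked with a small Python model. This is a sketch under assumptions: x is a sign-extended 32-bit value held in a 64-bit register (which is what makes the sraiw form valid), and each helper mimics the corresponding instruction sequence rather than any real simulator API.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def abs_bit_twiddle(x):
    t0 = sext(x, 32) >> 31        # sraiw t0, x, 31  -> -1 or 0
    x = sext(x ^ t0, 64)          # xor   x, x, t0
    return sext(x - t0, 64)       # sub   x, x, t0


def abs_zbb(x):
    t0 = sext(-x, 64)             # neg   t0, x
    return max(x, t0)             # max   x, x, t0 (signed)


def abs_xcond(x):
    # RSUB.cond x, x0, LT_RD: if (x < 0) x = 0 - x
    return sext(0 - x, 64) if x < 0 else x


samples = (-5, 0, 7, -(1 << 31), (1 << 31) - 1)
results = [(abs_bit_twiddle(v), abs_zbb(v), abs_xcond(v)) for v in samples]
```

For these samples all three columns of `results` agree with Python's built-in `abs`.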

Worked encoding:

PRED-EN   = 1        (bit 35)
ext_rd    = 0        (x is in x0-x31)
ext_rs1   = 0        (x0)
cond[5]   = 0        (bit 32; cond[5:0] = 000_010: mode 000 = TZ_RD, condition 010 = LT)
funct7    = 0000000
cond[4:0] = 00010    (bits 24:20)
rs1       = 00000    (x0)
funct3    = 011      (RSUB)
rd        = x's register number
opcode    = 0110011
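
The field values above can be packed mechanically. The Python sketch below follows the layout in §3.4 and assumes that ext_rd and ext_rs1 each carry bit 5 of the register number; the function name and the choice of x10 for "x" are illustrative only.

```python
def encode_predicated(funct7, funct3, rd, rs1, mode, cond, opcode=0b0110011):
    """Pack a predicated R-type instruction into a 36-bit wide-mode word."""
    if not (0 <= rd < 64 and 0 <= rs1 < 64):
        raise ValueError("rd and rs1 must be in x0-x63")
    cond6 = ((mode & 0b111) << 3) | (cond & 0b111)
    word = opcode & 0x7F
    word |= (rd & 0x1F) << 7
    word |= (funct3 & 0b111) << 12
    word |= (rs1 & 0x1F) << 15
    word |= (cond6 & 0x1F) << 20          # cond[4:0] in the rs2 field
    word |= (funct7 & 0x7F) << 25
    word |= (cond6 >> 5) << 32            # cond[5] in the ext_rs2 slot
    word |= (rs1 >> 5) << 33              # ext_rs1 (assumed: register bit 5)
    word |= (rd >> 5) << 34               # ext_rd  (assumed: register bit 5)
    word |= 1 << 35                       # PRED-EN
    return word


# RSUB.cond x10, x0, LT_RD  (x10 is an arbitrary choice for "x")
abs_word = encode_predicated(funct7=0, funct3=0b011, rd=10, rs1=0,
                             mode=0b000, cond=0b010)
print(f"{abs_word:#011x}")                # prints 0x800203533
```

The low 32 bits (0x00203533) carry the standard R-type fields with cond[4:0] in the rs2 position; the leading 0x8 nibble is PRED-EN with the three extension bits clear.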

6.5 L1 Norm (Compound: Xcrisp + Xcond)

Sum of absolute values over an array — a common DSP primitive (e.g., for signal energy estimation):

int sum_abs = 0;
for (int i = 0; i < n; i++) {
    int v = x[i];
    if (v < 0) v = -v;
    sum_abs += v;
}

Standard RV64 (with Xcrisp auto-inc, no Zbb):

loop:
        LWPI    t0, 0(p)+
        sraiw   t1, t0, 31
        xor     t0, t0, t1
        sub     t0, t0, t1
        add     sum, sum, t0
        addi    cnt, cnt, -1
        bnez    cnt, loop

6 inner-loop instructions plus the loop branch.

Standard RV64 (Xcrisp + Zbb):

loop:
        LWPI    t0, 0(p)+
        neg     t1, t0
        max     t0, t0, t1
        add     sum, sum, t0
        addi    cnt, cnt, -1
        bnez    cnt, loop

5 inner-loop instructions plus the loop branch.

With Xcrisp + Xcond:

loop:
        LWPI       t0, 0(p)+
        RSUB.cond  t0, x0, LT_RD                 ; t0 = abs(t0)
        add        sum, sum, t0
        addi       cnt, cnt, -1
        bnez       cnt, loop

4 inner-loop instructions plus the loop branch: a 20% reduction versus the Zbb baseline and 33% versus the non-Zbb baseline. No mispredict in the abs step.

With Xcrisp + Xcond + load-op fusion (LWADD from Xcrisp):

loop:
        LWPI       t0, 0(p)+
        RSUB.cond  t0, x0, LT_RD                 ; t0 = abs(t0)
        add        sum, sum, t0
        ; (could fuse the add with the load if we could predicate the fused op — see open items)
        addi       cnt, cnt, -1
        bnez       cnt, loop

The load-op fusion path is not currently extended to predicated forms; see §10.2 and the §12 open items.

6.6 Self-Modifying Flag Update

Toggle a bit only when it is currently set (paired with sticky-flag clearing):

if (flags & DONE_BIT) flags ^= DONE_BIT;   // clear DONE bit if set

This particular example is degenerate — unconditional andi flags, flags, ~DONE_BIT produces the same result in 1 instruction with no Xcond needed.

The genuinely useful pattern arises when the test and the modification use the same multi-bit mask and the operation is something other than clear (i.e., the modification cannot be expressed as a simpler unconditional bitop). One narrow example:

if (flags & RETRY_BITS) {
    flags ^= RETRY_BITS;     // flip all retry bits at once if any was set
    flags |= TRIED_FLAG;     // (this second op is the part Xcond can't fold)
}

For the first line alone:

        XOR.cond   flags, retry_mask, ANY     ; if (flags & retry_mask) flags ^= retry_mask

1 instruction versus 3 (and/branch/xor) for the branched form. Real win when the test mask equals the modify mask and the unconditional form is not equivalent.

In practice this combination is rare. It is tempting to use the bitmask test mode (TA) to predicate other operations on a flag's state:

if (status & READY_BIT) counter = max_value;   // tested register: status; modified register: counter

This does not fit the encoding. MOV.cond tests rd (against rs1) and writes rd, so the modified register (counter) cannot differ from the tested register (status). The bitmask predicate works only when rd doubles as the tested register; patterns like this one must be rearranged or fall back to a branch.

Assessment: Bitmask predication (mode TA) is useful but narrow. It enables single-instruction self-modifying flag idioms but does not generalise to "test register A, modify register B."

6.7 Verification Summary

Example	Pattern	Standard	With Xcond	Saving
6.1	Conditional accumulate	4–5 instr + mispredict	4 instr	~1.6× speedup (data-dependent)
6.2	Modulo counter	3 instr (1 branch)	2 instr	1 instr + branch eliminated
6.3	Saturating decrement	2 instr	1 instr	1 instr
6.4	Abs	2–3 instr	1 instr	1–2 instr
6.5	L1 norm inner loop	5–6 instr/iter	4 instr/iter	20–33%
6.6	Same-mask flag toggle	3 instr (1 branch)	1 instr	Narrow but real

Patterns that do not benefit from Xcond:

General c ? a : b select with three independent registers — use Zicond.
min/max — Zbb does these in 1 instruction; Xcond ties at best.
Cross-register predicate ("test register A, modify register B", where A is neither rd nor rs1 of the operation) requires a branch or Zicond.
Predicated loads/stores — not in v0.1.

7. CSR Allocation

CSR	Address (suggested)	Privilege	Description
`mxcond`	`0xFC3`	MRO	Xcond version and feature bits

Bit layout of mxcond:

Bits	Field	Meaning
`[0]`	PRESENT	1 if Xcond implemented
`[7:1]`	VERSION	Xcond version (1 = v0.1)
`[8]`	HAS_RSUB	1 if RSUB-cond is implemented (mandatory in v0.1; reserved bit for future variants that omit it)
`[9]`	HAS_MOV	1 if MOV-cond is implemented (mandatory in v0.1)
`[10]`	HAS_W_VARIANTS	1 if 32-bit (`*W.cond`) variants are implemented
`[63:11]`	reserved	—
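
Decoding a value read from mxcond is a matter of masking the fields listed above. The Python sketch below shows one way to do it; the function name and the example value are hypothetical, and how the raw value is obtained (a CSR read in machine mode) is outside the sketch.

```python
def decode_mxcond(value):
    """Split a raw mxcond CSR value into its named fields."""
    return {
        "present": bool(value & 0x1),                  # bit 0
        "version": (value >> 1) & 0x7F,                # bits 7:1
        "has_rsub": bool((value >> 8) & 0x1),          # bit 8
        "has_mov": bool((value >> 9) & 0x1),           # bit 9
        "has_w_variants": bool((value >> 10) & 0x1),   # bit 10
    }


# Example value: PRESENT, VERSION = 1, and all three feature bits set.
example = (1 << 0) | (1 << 1) | (1 << 8) | (1 << 9) | (1 << 10)
fields = decode_mxcond(example)
```

Software would typically check `present` first and treat the remaining fields as meaningful only when it is set.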

8. Compiler and Toolchain Integration

8.1 Target Flags

The +xcond target feature enables Xcond emission. The full FireStorm feature set is +xfirestorm = +xwide,+xcrisp,+xstack,+xcond.

Xcond requires Xwide (wide-mode execution), so +xcond implies +xwide at the toolchain level. Code generation falls back to non-predicated forms for any function that the compiler determines may execute in narrow mode.

Per-function annotation: __attribute__((target("xcond"))) enables predicated codegen for one function, __attribute__((target("no-xcond"))) disables it.

8.2 ABI Compatibility

Xcond does not alter the calling convention or any data layout. A predicated instruction with a false condition leaves rd unchanged and has no other architectural effect, so it cannot disturb registers that the ABI requires to be preserved.

8.3 When to Use Xcond

The compiler should emit predicated forms when all of the following hold:

The function is in a .text.wide section (executes in wide mode).
The predicated form replaces a short conditional region whose guarding branch has an expected mispredict cost greater than the cost of always executing the predicated instructions. For very short forward branches over single-instruction bodies, predication is almost always preferable.
The predicate fits the rd-as-test-operand model. Where it does not (general select), use Zicond or fall back to branches.

The compiler should not emit Xcond when:

The branch is highly predictable (e.g., loop-exit branches). Predication adds work even when the condition is false; a correctly predicted branch costs only its own issue slot and skips the body.
The conditional region is long (more than ~4 instructions). Multiple predicated instructions accumulate cost; a single branch is cheaper.
The pattern fits Zbb min/max more directly.

8.4 Inline Assembly

GCC/Clang inline-assembly support:

New constraint modifier %c<n> formats a condition operand as mode/cond shorthand.
The asm template emits .cond mnemonics directly; the compiler does not optimise across them.

9. Implementation Guidance

9.1 Decode Path

The predicate-enable bit is bit 35 of the 36-bit fetch word. Decoders should:

Read bit 35 (PRED-EN) at the start of decode for R-type / OP-32 instructions.
If PRED-EN = 0, decode normally per the parent doc.
If PRED-EN = 1:
- Recompute the funct3 → operation map (substitute MOV for SLT, RSUB for SLTU).
- Suppress the rs2 register read (rs2 field is condition, not register index).
- Schedule an rd-read in addition to the rd-write (rd is now also a source).
- Extract the 6-bit condition field from {bit 32, bits[24:20]}.
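
The decode steps above can be sketched as a Python field extractor. It is a model of the bit slicing only, assuming the layout in §3.4 and that each ext bit supplies bit 5 of its register number; the function name and the returned dictionary shape are hypothetical.

```python
def decode_wide_rtype(word):
    """Extract fields from a 36-bit fetch word holding an OP or OP-32 instruction."""
    opcode = word & 0x7F
    if opcode not in (0x33, 0x3B):
        raise ValueError("not an R-type / OP-32 opcode")
    fields = {
        "opcode": opcode,
        "rd": ((word >> 7) & 0x1F) | (((word >> 34) & 1) << 5),
        "funct3": (word >> 12) & 0x7,
        "rs1": ((word >> 15) & 0x1F) | (((word >> 33) & 1) << 5),
        "funct7": (word >> 25) & 0x7F,
        "pred_en": (word >> 35) & 1,
    }
    rs2_field = (word >> 20) & 0x1F
    bit32 = (word >> 32) & 1
    if fields["pred_en"]:
        cond6 = (bit32 << 5) | rs2_field       # {bit 32, bits[24:20]}
        fields["mode"] = cond6 >> 3
        fields["cond"] = cond6 & 0x7
        fields["reads"] = ("rd", "rs1")        # rs2 read suppressed, rd read added
    else:
        fields["rs2"] = rs2_field | (bit32 << 5)
        fields["reads"] = ("rs1", "rs2")       # standard R-type operand reads
    return fields
```

The `reads` entry makes the operand-read change explicit: the predicated form swaps the rs2 read for an rd read.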

The rd-read is the most pipeline-significant change: the issue stage must read rd in addition to rs1 for predicated R-type. Because the rs2 read is suppressed, the operand count stays at two (rd and rs1), so a 2-read-port implementation can route the rd index to the port that rs2 would otherwise use, at the cost of a mux on that port's address input. An implementation that instead dedicates a separate port to the rd read needs three ports. Sizing the register file port count is an implementation choice.

9.2 Condition Evaluation

The condition logic computes:

The test value (selected by mode, computed from rd and rs1 as per §3.2.1). For mode TS under ADD.cond and mode TD under SUB.cond, the test value is the same as the ALU output computed by the operation under predicate, so implementations may share the adder.
The condition result (true/false) based on the test value and condition code per §3.2.2.

The condition result gates the writeback. A false-predicate instruction must not write rd; this is functionally a write-mask in the writeback stage. Implementations using register renaming should suppress the rename commit on false predicate; in-order implementations gate the GPR write-enable.

9.3 Forwarding

Because rd is read for the test in modes TZ_RD, TC, TA, TO, TX, TS, and TD, the implementation must handle forwarding for the rd-read just like any other source operand. A back-to-back sequence where instruction N+1 has its rd dependent on instruction N's rd commit may require one cycle of forwarding stall in a shallow pipeline.

For false-predicate instructions, the "rd value" that subsequent instructions see is the pre-predicate value (because the false predicate did not write). Implementations using bypass logic should bypass the original rd, not the predicated-op result.

9.4 Trap Behaviour

A predicated instruction that traps (e.g., due to an illegal funct3/funct7 combination, or due to a memory access in a future predicated-memory variant) traps regardless of whether the predicate is true or false. The architectural state on trap entry reflects no commit of the predicated op (rd unchanged).

A false-predicate ALU op does not raise integer-overflow or similar traps even if the operation, were it to execute, would have triggered such. This matches the standard RV64 behaviour (RV64 does not trap on integer overflow at all) but is worth stating explicitly for clarity.

9.5 Cycle Cost

In a typical in-order implementation:

A true-predicate R-type op costs the same as a non-predicated op (1 cycle).
A false-predicate R-type op also costs 1 cycle. The savings come from eliminating the branch entirely, not from skipping the predicated op's pipeline slot.

The break-even calculation versus branchy code is: predication wins when (branch + expected body + expected mispredict penalty) > (predicated op). For a single-instruction body with a 4-cycle mispredict penalty and 50% mispredict rate, the branchy form costs 1 (branch) + 0.5 (body, half the time) + 2 (mispredict, half the time) = 3.5 cycles. The predicated form costs 1 cycle. Predication wins by 2.5 cycles per such opportunity.
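
The arithmetic above can be reproduced with a small Python cost model. It is a sketch under the same assumptions as the text (in-order issue, one cycle per instruction); the function and parameter names are hypothetical.

```python
def branchy_cost(body_len, skip_rate, mispredict_rate, penalty):
    """Expected cycles for a forward branch over a body of 1-cycle instructions.

    skip_rate is the fraction of executions in which the body is skipped.
    """
    branch = 1.0
    body = body_len * (1.0 - skip_rate)
    mispredict = mispredict_rate * penalty
    return branch + body + mispredict


def predicated_cost(body_len):
    """Every predicated instruction occupies one cycle, true or false."""
    return float(body_len)


# The single-instruction case worked in the text: 4-cycle penalty,
# body executed half the time, branch mispredicted half the time.
branchy = branchy_cost(body_len=1, skip_rate=0.5,
                       mispredict_rate=0.5, penalty=4)
predicated = predicated_cost(body_len=1)
print(branchy, predicated, branchy - predicated)   # 3.5 1.0 2.5
```

Lowering `mispredict_rate` or raising `body_len` in this model shrinks the gap, which is the trade-off a compiler heuristic has to weigh.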

For longer bodies, the compiler must trade off: branch cost is fixed, predicated-body cost scales with body length. Beyond approximately 4 instructions in the body, branchy code wins.

9.6 Power

False-predicate ALU ops consume datapath energy even though they do not commit. Aggressive implementations may clock-gate the ALU on detected false predicates if the predicate evaluation is fast enough to disable the path before it activates. This is a microarchitectural choice; the architecture does not require it.

10. Interaction with Other Extensions

10.1 Xwide (Mandatory)

Xcond requires wide-mode execution. In narrow mode the predicate-enable bit does not exist (the fetch word is 32 bits, not 36). All Xcond instructions must be placed in .text.wide sections.

The extended register file (x0–x63 visible in wide mode) is fully accessible to Xcond instructions because ext_rd and ext_rs1 are preserved in the predicated encoding. Only ext_rs2 is repurposed.

10.2 Xcrisp

Xcond and Xcrisp are independent. Xcrisp's load-op fusion currently does not include predicated forms — a predicated LWADD would require both a memory-access path and predicate evaluation in one instruction, and the encoding space in Xcrisp custom-1 funct3=000 is shaped for a separate rd (load destination) and rs2 (operand). Adding predication would require either:

Restructuring the LWADD encoding to give up the auto-inc base register (collapsing rd and rs2-as-destination), or
Allocating a separate custom-2 sub-family for predicated load-op fusion.

Both options are deferred to v0.2 of one or both extensions.

For now, Xcrisp auto-inc loads (LWPI, etc.) compose naturally with subsequent Xcond predicated arithmetic, as shown in §6.5.

10.3 Xstack

Xcond and Xstack are independent and combinable.

10.4 Zicond

Xcond and Zicond cover overlapping but distinct cases (§2.1). The compiler may emit Zicond for narrow-mode code or for general-select patterns and Xcond for wide-mode predicated arithmetic. The two do not share encoding space.

10.5 Zbb

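Xcond offers no advantage for min/max. MOV.cond can express the in-place two-register form in 1 instruction (§2.2), which only ties Zbb, and Zbb's min/max also allow a separate destination register. Code expecting min/max should continue to use Zbb. Xcond complements Zbb for patterns Zbb does not cover (abs, conditional accumulate, modulo wrap).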

10.6 Vector Extension (Future)

The reserved RVV opcodes are unaffected by Xcond. Per-lane vector predication will use the standard RVV mask register mechanism, not Xcond. This separation is intentional: scalar predication and vector masking have different cost/benefit trade-offs.

11. Encoding Summary

11.1 At-a-Glance Map

Mnemonic	opcode	funct3	funct7[5]	PRED-EN	Notes
ADD.cond	`0x33`	000	0	1	Predicated add
SUB.cond	`0x33`	000	1	1	Predicated sub
SLL.cond	`0x33`	001	0	1	Predicated shift left
MOV.cond	`0x33`	010	0	1	Replaces SLT
RSUB.cond	`0x33`	011	0	1	Replaces SLTU
XOR.cond	`0x33`	100	0	1	Predicated XOR
SRL.cond	`0x33`	101	0	1	Predicated logical shift right
SRA.cond	`0x33`	101	1	1	Predicated arithmetic shift right
OR.cond	`0x33`	110	0	1	Predicated OR
AND.cond	`0x33`	111	0	1	Predicated AND
ADDW.cond	`0x3B`	000	0	1	32-bit predicated add
SUBW.cond	`0x3B`	000	1	1	32-bit predicated sub
SLLW.cond	`0x3B`	001	0	1	32-bit predicated shift left
MOVW.cond	`0x3B`	010	0	1	32-bit MOV (sign-extends rs1)
RSUBW.cond	`0x3B`	011	0	1	32-bit RSUB
SRLW.cond	`0x3B`	101	0	1	32-bit logical shift right
SRAW.cond	`0x3B`	101	1	1	32-bit arithmetic shift right

11.2 Reserved

PRED-EN = 1, funct7 = 0100000, funct3 ∈ {001, 010, 011, 100, 110, 111}: reserved (these are funct7-encoded variants that have no defined predicated meaning). Any funct7 value other than 0000000 and 0100000 is also reserved (§3.3).
PRED-EN = 1 on any non-R-type / non-OP-32 opcode: reserved.
Test modes TZ_RD (000) and TZ_RS1 (001) with conditions LTU (110) or GEU (111): reserved (unsigned compare against zero is degenerate).
mode 010 (TC), conditions EQ/NE: produce the same result as mode 111 (TD) with the same conditions for all operand values, because the wrapped difference rd − rs1 is zero exactly when rd == rs1. Both encodings are valid; the implementation may treat them identically.

12. Open Items

CSR address. mxcond at 0xFC3 is suggested; final assignment requires coordination with mxcrisp (0xFC1), mxstack (0xFC2), and the wide-dirty bit.
Predicated I-type. Should v0.2 add predicated immediate ALU ops (ADDI.cond, etc.) by stealing bits from the 12-bit immediate field? Cost: smaller imm range (e.g., 8 bits instead of 12). Benefit: predicated increment/decrement against a constant without consuming a register.
Predicated loads. Predicated load (e.g., for sparse gather patterns) would require a different encoding scheme entirely. Defer to v0.2.
Predicated load-op (Xcrisp interaction). A predicated LWADD would be valuable for predicated DSP filter accumulation. Encoding space and complexity are both significant; defer to v0.2 of Xcrisp.
Branch-cond fusion. A predicated branch (e.g., BEQ.cond on a flag mask) is theoretically possible but adds complexity to the front end. No clear use case yet; not pursued.
MOV.cond sign-extension semantics for W variants. MOVW.cond sign-extends the 32-bit rs1 to 64 bits before writing rd, per standard W conventions. This is consistent but worth confirming with compiler writers.
Test mode TC vs TD. Both are exposed and functionally equivalent for non-overflow cases. Implementation may collapse them to the same logic, or differentiate for code that relies on subtraction-result behaviour. Architectural mandate TBD.
Per-mode condition reservation. Some mode/condition combinations are degenerate (e.g., mode TZ_RS1 cond LTU = "rs1 < 0 unsigned" is always false, and cond GEU is always true). Currently these are reserved-illegal; an alternative is to define them by their constant result (always-false LTU compiles to a no-op, always-true GEU to the unpredicated operation). The illegal-trap form catches compiler bugs; the constant-result form is more permissive.

13. Glossary

Term	Meaning
Predicate	A run-time test whose result gates whether an instruction commits its writeback.
PRED-EN	The predicate-enable bit (bit 35 of the wide-mode R-type fetch word) that activates Xcond semantics for an instruction.
Test mode	A 3-bit field selecting what value the predicate compares against zero (or compares two registers).
Condition code	A 3-bit field selecting the comparison (EQ/NE/LT/GE/LE/GT/LTU/GEU).
MOV-cond	A predicated move (`rd = rs1` if predicate true); replaces SLT in the predicated funct3 map.
RSUB-cond	A predicated reverse-subtract (`rd = rs1 − rd` if predicate true); enables single-instruction abs via rs1 = x0; replaces SLTU.
TC / TD	Test-Compare (signed compare rd vs rs1) and Test-Difference (predicate on virtual `rd − rs1`). Equivalent for in-range values.
TA	Test-AND (predicate on `rd & rs1`); useful for bitmask probes.

End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension.