FireStorm Performance Examples

Document version: 0.3 (draft) Status: Verified example set including Xcond and Xctx Companion documents: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension, FireStorm Xctx Extension, FireStorm Xmath Extension

1. Introduction

This document presents concrete examples of code patterns where the FireStorm extensions (Xwide, Xcrisp, Xstack, Xcond, Xlate, Xctx, Xmath) deliver measurable wins over standard RV64GC. Each example shows:

The C source code.
The best-effort standard RV64GC assembly (representative compiler output).
The equivalent FireStorm assembly using the relevant extensions.
Comparison of instruction count and cycle estimate.

The examples were chosen to be realistic, representative of the workloads FireStorm targets (audio synthesis, signal processing, OS kernels, modular software), and small enough to verify by hand.

2. Methodology and Reference Microarchitecture

Cycle counts in this document assume the following reference microarchitecture, which is conservative — real FireStorm implementations are expected to match or exceed it:

Single-issue, in-order pipeline.
ALU operations: 1 cycle, fully bypassed.
Branches: 1 cycle when predicted correctly, +2 cycles on misprediction.
Loads: 2-cycle latency from memory; load-use stall of 1 cycle for immediately-dependent ALU op, none for further-out instructions.
Stores: 1 cycle; commit deferred to a store buffer, not blocking.
Xstack BSRAM port: 4 dwords per cycle (Ant64 baseline, mxstack[PORT_WIDTH_LOG2] = 2).
Xcrisp memory-fused instructions: 1-cycle throughput in steady state on properly pipelined implementations; 3-cycle latency.
Xcrisp PIC instructions: 1 cycle (LAPC, JALPC), 2 cycles (LDPC, LWPC), 2 cycles (CALLM, JMPM — the memory load is on the critical path).
Xcond predicated R-type: 1 cycle (same as non-predicated R-type). A false-predicate op consumes a pipeline slot but does not commit rd; the win is from eliminating short conditional branches, not from skipping the predicated op.
Xctx context switch (YIELD/HALT/FREE/NEW): ~30 cycles (576-bit BSRAM port). The switch is a single instruction architecturally but consumes the cycles needed to save/load 1 KB of context state through the BSRAM port.
Xctx RESUME: ~3 cycles. RESUME only updates the state table and ready queue; it does not itself context-switch.

Instruction count is implementation-independent and reported precisely. Cycle counts are estimates for the reference microarchitecture and should be read as rough ratios rather than absolute figures.

All assembly snippets use lowercase for standard RV64 mnemonics and uppercase for FireStorm extension mnemonics, making it easy to see at a glance which instructions are extension-specific.

3. Buffer Sum (Accumulator Pattern)

A textbook accumulator loop summing 32-bit signed integers from a buffer.

3.1 C Source

int64_t sum_buffer(const int32_t *buf, int n) {
    int64_t sum = 0;
    const int32_t *end = buf + n;
    while (buf < end) {
        sum += *buf++;
    }
    return sum;
}

3.2 Standard RV64GC

    ; a0 = buf, a1 = n
    ; on entry, a0 -> int32_t array, a1 = count
    li      a2, 0               ; sum = 0
    slli    a3, a1, 2           ; n * 4 = byte count
    add     a3, a0, a3          ; end pointer
    beq     a0, a3, .Lret       ; n == 0
.Lloop:
    lw      a4, 0(a0)           ; load buf[i]
    addi    a0, a0, 4           ; ++buf
    add     a2, a2, a4          ; sum += buf[i]
    bne     a0, a3, .Lloop      ; while (buf < end)
.Lret:
    mv      a0, a2              ; return sum
    ret

Inner loop: 4 instructions per iteration.

3.3 FireStorm (Xcrisp Auto-Inc Load)

    li      a2, 0
    slli    a3, a1, 2
    add     a3, a0, a3
    beq     a0, a3, .Lret
.Lloop:
    LWPI    a4, 4(a0)           ; a4 = sext32(mem32[a0]); a0 += 4
    add     a2, a2, a4
    bne     a0, a3, .Lloop
.Lret:
    mv      a0, a2
    ret

Inner loop: 3 instructions per iteration.

3.4 FireStorm (Xcrisp Load-Op Fusion)

If we prefer to keep the pointer increment separate (allowing the compiler more scheduling flexibility), LWADD does the load and the add in one instruction:

.Lloop:
    LWADD   a2, (a0), a2        ; a2 = sext32(mem32[a0]) + a2
    addi    a0, a0, 4
    bne     a0, a3, .Lloop

Inner loop: still 3 instructions; same shape, different decomposition. On a wider implementation that can issue the addi in parallel with the load-op's ALU stage, this form has higher peak throughput.

3.5 Comparison

Implementation	Inner-loop instructions	Cycles per element (reference µarch)
Standard RV64GC	4	4
FireStorm (LWPI)	3	3
FireStorm (LWADD + addi)	3	3

25% reduction in instruction count and a matching reduction in cycle count on a single-issue in-order pipeline. On wider implementations the win is the same in instructions but smaller in cycles, since the standard loop can issue two instructions per cycle if the load result is forwarded promptly.

4. In-Place Buffer Mutation

A loop that modifies every element of a buffer in place — clearing a bit, applying a bias, inverting all bits. The pattern is buf[i] = buf[i] OP constant.

4.1 C Source

void apply_mask(uint32_t *buf, uint32_t mask, int n) {
    for (int i = 0; i < n; i++) {
        buf[i] &= mask;
    }
}

4.2 Standard RV64GC

    ; a0 = buf, a1 = mask, a2 = n
    slli    a3, a2, 2
    add     a3, a0, a3          ; end pointer
    beq     a0, a3, .Lret
.Lloop:
    lw      a4, 0(a0)           ; load
    and     a4, a4, a1          ; AND with mask
    sw      a4, 0(a0)           ; store
    addi    a0, a0, 4           ; advance pointer
    bne     a0, a3, .Lloop
.Lret:
    ret

Inner loop: 5 instructions per iteration.

4.3 FireStorm (Xcrisp Load-Op-Store Fusion)

    slli    a3, a2, 2
    add     a3, a0, a3
    beq     a0, a3, .Lret
.Lloop:
    MMWAND  [a0], [a0], a1      ; mem32[a0] = mem32[a0] & a1 (in-place)
    addi    a0, a0, 4
    bne     a0, a3, .Lloop
.Lret:
    ret

Inner loop: 3 instructions per iteration.

4.4 Comparison

Implementation	Inner-loop instructions	Cycles per element
Standard RV64GC	5	5–6 (load-use stall on `and`)
FireStorm (MMWAND)	3	3 (pipelined load-op-store)

40% reduction in instruction count. No architectural temporary is ever written: the load result is forwarded through the ALU into the store-buffer entry without traversing the register file. On the reference microarchitecture, the load-use stall in the standard code disappears entirely in the fused form.

The same pattern applies to buf[i] += dc, buf[i] ^= 0xFFFFFFFF, buf[i] |= flag_bits, etc. — any C operation of the form *p = *p OP v compiles to a single load-op-store instruction.

5. Element-Wise Transform with Different Source and Destination

A loop that reads one buffer, transforms each element, and writes to another buffer — the classic vector-style operation. Example: applying a DC bias.

5.1 C Source

void add_bias(int32_t *dst, const int32_t *src, int32_t bias, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] + bias;
    }
}

5.2 Standard RV64GC

    ; a0 = dst, a1 = src, a2 = bias, a3 = n
    slli    a4, a3, 2
    add     a4, a1, a4          ; src end
    beq     a1, a4, .Lret
.Lloop:
    lw      a5, 0(a1)           ; load src[i]
    add     a5, a5, a2          ; + bias
    sw      a5, 0(a0)           ; store to dst[i]
    addi    a1, a1, 4
    addi    a0, a0, 4
    bne     a1, a4, .Lloop
.Lret:
    ret

Inner loop: 6 instructions per iteration.

5.3 FireStorm (Xcrisp Load-Op-Store)

    slli    a4, a3, 2
    add     a4, a1, a4
    beq     a1, a4, .Lret
.Lloop:
    MMWADD  [a0], [a1], a2      ; mem32[a0] = mem32[a1] + a2
    addi    a1, a1, 4
    addi    a0, a0, 4
    bne     a1, a4, .Lloop
.Lret:
    ret

Inner loop: 4 instructions per iteration.

5.4 Comparison

Implementation	Inner-loop instructions	Cycles per element
Standard RV64GC	6	6
FireStorm (MMWADD)	4	4

33% reduction. The fused form would benefit from a hypothetical "load-op-store with dual auto-increment" instruction that advances both pointers in one shot, but that doesn't fit in 32 bits and is parked as a v0.2 candidate. Even without it, the win is substantial.

6. Null-Terminated String Length

A classic strlen implementation. Note that real libc implementations of strlen use word-at-a-time scanning with bit-twiddling tricks (the "Mycroft" technique); this comparison shows the byte-at-a-time form as a representative tight scan loop, which is also what compilers typically emit for while(*p) p++; patterns when no library call optimisation is available.

6.1 C Source

size_t strlen_byte(const char *s) {
    const char *p = s;
    while (*p != 0) p++;
    return p - s;
}

6.2 Standard RV64GC

    ; a0 = s
    mv      a1, a0              ; p = s
.Lloop:
    lbu     a2, 0(a1)           ; load byte
    addi    a1, a1, 1           ; advance
    bne     a2, zero, .Lloop    ; loop while non-zero
.Lend:
    sub     a0, a1, a0          ; p - s
    addi    a0, a0, -1          ; account for past-the-terminator
    ret

Inner loop: 3 instructions per iteration.

6.3 FireStorm (Xcrisp Auto-Inc Load)

    mv      a1, a0
.Lloop:
    LBUPI   a2, 1(a1)           ; a2 = zext8(mem8[a1]); a1 += 1
    bne     a2, zero, .Lloop
.Lend:
    sub     a0, a1, a0
    addi    a0, a0, -1
    ret

Inner loop: 2 instructions per iteration.

6.4 Comparison

Implementation	Inner-loop instructions	Cycles per byte
Standard byte-at-a-time	3	3
FireStorm byte-at-a-time	2	2

33% reduction. Both forms can be replaced by word-at-a-time scanning for further speedup in libc implementations; the FireStorm form remains 33% faster than the standard form when both use the same algorithm.

7. Linear Table Search

Searching for a key in a table of structs, a common dispatch pattern.

7.1 C Source

typedef struct {
    uint32_t key;
    uint32_t value;
} entry_t;

entry_t *find_entry(entry_t *table, entry_t *end, uint32_t key) {
    while (table < end) {
        if (table->key == key) return table;
        table++;
    }
    return NULL;
}

7.2 Standard RV64GC

    ; a0 = table, a1 = end, a2 = key
    beq     a0, a1, .Lnotfound
.Lloop:
    lw      a3, 0(a0)           ; load entry->key
    beq     a3, a2, .Lret       ; match: return current pointer
    addi    a0, a0, 8           ; ++table (struct is 8 bytes)
    bne     a0, a1, .Lloop
.Lnotfound:
    li      a0, 0
.Lret:
    ret

Inner loop: 4 instructions per iteration.

7.3 FireStorm (Xcrisp Compare-Mem-Branch)

    beq     a0, a1, .Lnotfound
.Lloop:
    BEQM    a2, (a0), .Lret     ; if a2 == mem32[a0], branch
    addi    a0, a0, 8
    bne     a0, a1, .Lloop
.Lnotfound:
    li      a0, 0
.Lret:
    ret

Inner loop: 3 instructions per iteration.

7.4 Comparison

Implementation	Inner-loop instructions	Cycles per probe
Standard RV64GC	4	4–5 (load-use stall on `beq`)
FireStorm (BEQM)	3	3

25% reduction in instruction count, larger cycle reduction because the load-use stall vanishes — the compare-mem-branch instruction's load result is consumed directly by the compare unit without traversing the register file.

This pattern matters in dispatcher code (looking up a handler by message ID), parser tables, command-name matchers, and any associative lookup. In hot dispatch paths this 25% reduction can meaningfully reduce latency.

8. Memory Copy

A straightforward memcpy of n bytes.

8.1 C Source

void memcpy_simple(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    while (n--) *d++ = *s++;
}

8.2 Standard RV64GC

    ; a0 = dst, a1 = src, a2 = n
    beqz    a2, .Lret
.Lloop:
    lb      a3, 0(a1)
    sb      a3, 0(a0)
    addi    a1, a1, 1
    addi    a0, a0, 1
    addi    a2, a2, -1
    bnez    a2, .Lloop
.Lret:
    ret

Inner loop: 6 instructions per byte.

(A real memcpy libc routine uses word-at-a-time transfers and alignment handling; counts shown are for the simple byte loop, which is what naive C compiles to without special optimisation.)

8.3 FireStorm (Xcrisp Block Memory)

    BMCPY   a0, a1, a2          ; copy a2 bytes from a1 to a0
    ret

Total: 1 instruction.

The BMCPY instruction is restartable on interrupt and may issue wide transfers internally (multi-byte per cycle on aligned operands). Its execution time is variable but typically far less than the standard byte-loop equivalent.

8.4 Comparison

Implementation	Inner-loop instructions	Notes
Standard byte loop	6 per byte	Dominates execution time on small copies
Optimised standard (word + tail)	~3 amortised per byte for large `n`	Adds prologue/epilogue cost
FireStorm BMCPY	1 total instruction	Throughput depends on implementation; potentially 8 bytes per cycle on aligned wide transfers

For small copies, the standard byte loop's prologue overhead dominates; BMCPY's single-instruction issue beats it handily. For large copies where word-at-a-time matters, a well-implemented BMCPY ties or beats the optimised libc version while being one instruction at the call site.

9. Function with Saved Registers (Prologue/Epilogue)

A typical short function that calls another function and uses some saved registers across the call. The frame management overhead is a substantial fraction of total execution time for small functions.

9.1 C Source

extern int lookup(int);

int compute(int a, int b, int c, int d) {
    int x = lookup(a);
    int y = lookup(b);
    int z = lookup(c);
    int w = lookup(d);
    return x + y + z + w;
}

The compiler must save ra (return address) plus four callee-saved registers (s0–s3) to hold the four lookup results across the four calls.

9.2 Standard RV64GC

compute:
    addi    sp, sp, -48         ; allocate frame (must be 16-byte aligned)
    sd      ra, 40(sp)
    sd      s0, 32(sp)
    sd      s1, 24(sp)
    sd      s2, 16(sp)
    sd      s3, 8(sp)
    mv      s0, a1              ; save args before clobbering a0
    mv      s1, a2
    mv      s2, a3
    jal     ra, lookup
    mv      s3, a0              ; x = lookup(a)
    mv      a0, s0
    jal     ra, lookup
    mv      s0, a0              ; y = lookup(b)
    mv      a0, s1
    jal     ra, lookup
    mv      s1, a0              ; z = lookup(c)
    mv      a0, s2
    jal     ra, lookup
    add     a0, s3, s0          ; x + y
    add     a0, a0, s1          ; + z
    add     a0, a0, s2          ; not quite — see note
    ld      ra, 40(sp)
    ld      s0, 32(sp)
    ld      s1, 24(sp)
    ld      s2, 16(sp)
    ld      s3, 8(sp)
    addi    sp, sp, 48
    ret

Prologue: 6 instructions. Epilogue: 7 instructions. Frame overhead: 13 instructions.

(The a0/s0 shuffling between calls is independent of the frame management and is the same in both standard and Xstack versions; we focus on the prologue/epilogue cost.)

9.3 FireStorm (Xstack)

compute:
    PUSH    rlist=00100, spimm=0   ; save ra, s0–s3 (5 registers)
    mv      s0, a1
    mv      s1, a2
    mv      s2, a3
    jal     ra, lookup
    mv      s3, a0
    mv      a0, s0
    jal     ra, lookup
    mv      s0, a0
    mv      a0, s1
    jal     ra, lookup
    mv      s1, a0
    mv      a0, s2
    jal     ra, lookup
    add     a0, s3, s0
    add     a0, a0, s1
    add     a0, a0, s2
    POPRET  rlist=00100, spimm=0   ; restore ra, s0–s3, return

Prologue: 1 instruction. Epilogue: 1 instruction. Frame overhead: 2 instructions.

9.4 Comparison

Implementation	Frame overhead instructions	Cycles (reference µarch)
Standard RV64GC	13 (6 + 7)	~13
FireStorm Xstack	2 (1 + 1)	~4 (5 regs on 4-dword/cycle port: PUSH ≈ 2 cycles, POPRET ≈ 2 cycles, including the implicit return)

85% reduction in frame-overhead instruction count. For functions where the body is also small (say, 10–20 instructions), this changes the overall function cost dramatically: a 20-instruction body with 13 instructions of frame overhead becomes nearly 40% smaller as 20 + 2 = 22 instructions total.

The cycle saving is comparable: ~13 cycles of frame overhead becomes ~3 cycles. For a small function called frequently (in a hot path), this is the largest single optimisation FireStorm offers.

10. Interrupt Service Routine

A trap handler that saves the standard caller-saved set so it can call into C-language helpers.

10.1 Pseudo-C Source (Conceptual)

void trap_handler(void) {
    // Save user-mode registers t0-t6 and a0-a7
    // Call C handler
    c_dispatch_trap();
    // Restore and return to user mode
}

10.2 Standard RV64GC

A handler saving t0–t6 (7 regs) and a0–a7 (8 regs) needs a 128-byte frame:

trap_handler:
    addi    sp, sp, -128
    sd      t0, 0(sp)
    sd      t1, 8(sp)
    sd      t2, 16(sp)
    sd      t3, 24(sp)
    sd      t4, 32(sp)
    sd      t5, 40(sp)
    sd      t6, 48(sp)
    sd      a0, 56(sp)
    sd      a1, 64(sp)
    sd      a2, 72(sp)
    sd      a3, 80(sp)
    sd      a4, 88(sp)
    sd      a5, 96(sp)
    sd      a6, 104(sp)
    sd      a7, 112(sp)
    sd      ra, 120(sp)
    jal     ra, c_dispatch_trap
    ld      ra, 120(sp)
    ld      a7, 112(sp)
    ; ... (14 more loads in reverse) ...
    ld      t0, 0(sp)
    addi    sp, sp, 128
    mret

Save: 17 instructions (16 sd + 1 addi). Restore: 17 instructions. Total overhead: 34 instructions.

10.3 FireStorm (Xstack)

trap_handler:
    PUSHS   rlist=10010, spimm=0   ; save ra, t0–t6, a0–a7 (16 regs total)
    jal     ra, c_dispatch_trap
    POPSRET rlist=10010, spimm=0   ; restore all and return
    ; (POPSRET is the supervisor-mode equivalent of POPRET; the actual return-from-trap
    ; instruction depends on the privilege level; see §10.4)

Save: 1 instruction. Restore: 1 instruction. Total overhead: 2 instructions.

10.4 A Note on Trap Returns

The example above uses POPSRET as a placeholder. In a real trap handler the return path is mret (machine mode) or sret (supervisor mode), not a plain jalr — POPSRET combines pop with ret (which is jalr x0, 0(ra)), not with mret/sret. A clean handler structure would be:

trap_handler:
    PUSHS   rlist=10010, spimm=0
    jal     ra, c_dispatch_trap
    POPS    rlist=10010, spimm=0   ; restore without returning
    sret                            ; or mret per privilege

This is 3 instructions of overhead rather than 2, but still dramatically less than the standard's 34.

10.5 Comparison

Implementation	Overhead instructions	Cycles (4-dword/cycle BSRAM port)
Standard RV64GC	34 (17 + 17)	~34
FireStorm Xstack	3 (1 + 1 + 1)	~10 (PUSHS for 16 regs = 4 cycles, POPS = 4 cycles, sret = 1 cycle, jal = 1 cycle)

~91% reduction in instruction count and roughly 75% reduction in cycle overhead. For interrupt-heavy workloads (timer ticks, peripheral I/O, OS scheduling), this changes the achievable interrupt rate by nearly an order of magnitude.

11. Coroutine Yield

A cooperative coroutine that yields to a scheduler. Each coroutine maintains its full register state on its own user stack; switching is a stack-pointer swap.

11.1 Conceptual Model

Each coroutine has a coroutine_t structure holding (among other things) its parked usp. When coroutine A yields to the scheduler:

A saves its callee-saved registers to its own stack.
A switches usp to the scheduler's saved usp.
The scheduler restores its callee-saved registers from the (now-current) stack.

The scheduler subsequently restores from a chosen coroutine B via a symmetric sequence.

11.2 Standard RV64GC (Simplified)

A typical implementation might use a setjmp/longjmp-like mechanism, saving 12 callee-saved registers plus ra to the coroutine struct and restoring from the target:

coro_yield:
    ; a0 = coroutine_t* of self (caller)
    ; a1 = coroutine_t* of target
    sd      ra, 0(a0)
    sd      s0, 8(a0)
    sd      s1, 16(a0)
    sd      s2, 24(a0)
    sd      s3, 32(a0)
    sd      s4, 40(a0)
    sd      s5, 48(a0)
    sd      s6, 56(a0)
    sd      s7, 64(a0)
    sd      s8, 72(a0)
    sd      s9, 80(a0)
    sd      s10, 88(a0)
    sd      s11, 96(a0)
    sd      sp, 104(a0)         ; save sp too (frame pointer if used)
    ; Now restore from target
    ld      ra, 0(a1)
    ld      s0, 8(a1)
    ld      s1, 16(a1)
    ld      s2, 24(a1)
    ld      s3, 32(a1)
    ld      s4, 40(a1)
    ld      s5, 48(a1)
    ld      s6, 56(a1)
    ld      s7, 64(a1)
    ld      s8, 72(a1)
    ld      s9, 80(a1)
    ld      s10, 88(a1)
    ld      s11, 96(a1)
    ld      sp, 104(a1)
    ret

~28 instructions total (14 saves, 14 loads).

11.3 FireStorm (Xstack)

With each coroutine maintaining its full callee-saved state on its own hardware stack, yield is a save + stack swap + restore:

coro_yield:
    ; a0 = saved-usp register of caller's coroutine (read/write slot)
    ; a1 = saved-usp register of target coroutine (read)
    PUSH    rlist=01100, spimm=0   ; save ra, s0–s11 to current user stack
    USWAP   t0, a1                  ; t0 = old usp; new usp = a1 (target coroutine's stack)
    sd      t0, 0(a0)               ; store caller's parked usp into its coroutine_t
    POP     rlist=01100, spimm=0   ; restore ra, s0–s11 from new (target's) stack
    ret

5 instructions total.

11.4 Comparison

Implementation	Total instructions	Cycles (4-dword/cycle BSRAM port)
Standard RV64GC	~28	~28
FireStorm Xstack	5	~11 (PUSH 13 regs ≈ 4 cycles, USWAP 1, sd 1, POP 13 regs ≈ 4 cycles, ret 1)

82% reduction in instruction count, roughly 40–60% reduction in cycle count for the full yield. Combined with zero pressure on the tiny 8 KB D-cache from the saved register state (Xctx context lives in dedicated BSRAM), this makes green-thread schedulers practical at frequencies (thousands of yields per millisecond) where software fibers on FireStorm would thrash the small D-cache and serialise on DRAM accesses.

12. PC-Relative Global Address

Materialising the address of a global variable — the most basic PIC primitive.

12.1 C Source

extern int32_t global_var;

int32_t *get_addr_of_global(void) {
    return &global_var;
}

12.2 Standard RV64GC (PIC)

get_addr_of_global:
.La:
    auipc   a0, %pcrel_hi(global_var)
    addi    a0, a0, %pcrel_lo(.La)
    ret

2 instructions for the address materialisation plus the return.

12.3 FireStorm (Xcrisp PIC LAPC, wide-mode only)

get_addr_of_global:
    LAPC    a0, global_var
    ret

1 instruction for the address materialisation.

12.4 Comparison

Implementation	Address-materialisation instructions	Cycles
Standard RV64GC	2	2
FireStorm Xcrisp PIC	1	1

50% reduction in this primitive. While the absolute saving is small per occurrence, this pattern appears so frequently in PIC code (every reference to a global) that the cumulative impact across a translation unit is substantial — typically 10–15% of total instruction count in modular PIC code can be PIC-relocation pairs.

13. Function Call Through Global Function Pointer

A call to a function whose address is in a global variable. Common in OS dispatch tables, plugin systems, and library-internal indirection.

13.1 C Source

extern void (*handler)(int);

void invoke_handler(int arg) {
    handler(arg);
}

13.2 Standard RV64GC (PIC)

invoke_handler:
    addi    sp, sp, -16
    sd      ra, 8(sp)
.La:
    auipc   t0, %pcrel_hi(handler)
    ld      t0, %pcrel_lo(.La)(t0)   ; load handler pointer
    jalr    ra, 0(t0)
    ld      ra, 8(sp)
    addi    sp, sp, 16
    ret

8 instructions total.

13.3 FireStorm (Xcrisp PIC LDPC + Xstack PUSH)

invoke_handler:
    PUSH    rlist=00000, spimm=0     ; save just ra
    LDPC    t0, handler              ; load handler pointer
    jalr    ra, 0(t0)
    POPRET  rlist=00000, spimm=0     ; pop ra, return

4 instructions total.

13.4 Comparison

Implementation	Instructions	Cycles
Standard RV64GC	8	~10 (load-use stall on `jalr`)
FireStorm	4	~6

50% reduction in instruction count, with a further cycle saving from the eliminated load-use stall. For an OS or runtime that dispatches through global function tables on every system call, the savings compound.

14. Virtual Dispatch

The classic C++ / vtable-style virtual call: load a function pointer from a structure (the vtable), then call it.

14.1 C Source (C-flavoured equivalent)

typedef struct vtable {
    void (*method0)(void*);
    void (*method1)(void*);
    void (*method2)(void*);
    void (*method3)(void*);
} vtable_t;

typedef struct object {
    const vtable_t *vt;
    /* fields */
} object_t;

void call_method2(object_t *obj) {
    obj->vt->method2(obj);
}

14.2 Standard RV64GC

call_method2:
    addi    sp, sp, -16
    sd      ra, 8(sp)
    ld      t0, 0(a0)               ; obj->vt
    ld      t0, 16(t0)              ; vt->method2 (offset 16 = 2 × 8 bytes)
    jalr    ra, 0(t0)
    ld      ra, 8(sp)
    addi    sp, sp, 16
    ret

8 instructions total (3 for the dispatch chain, plus prologue/epilogue/return).

14.3 FireStorm (Xcrisp PIC CALLM + Xstack)

call_method2:
    PUSH    rlist=00000, spimm=0
    ld      t0, 0(a0)               ; obj->vt (vtable pointer)
    CALLM   ra, 16(t0)              ; load fn ptr at vt+16, call
    POPRET  rlist=00000, spimm=0

4 instructions total. The CALLM instruction collapses the second load + JALR pair into one instruction.

14.4 Comparison

Implementation	Instructions	Cycles
Standard RV64GC	8	~10
FireStorm (Xcrisp PIC + Xstack)	4	~6

50% reduction. Virtual dispatch is the canonical use case for CALLM; any C++ codebase or any object-oriented system in C with explicit vtables benefits proportionally to its dispatch frequency.

15. Wide-Mode Register Pressure: 8-Tap FIR Filter

A finite-impulse-response filter holds many coefficients live in registers simultaneously. In narrow mode the compiler must spill some; in wide mode all coefficients fit in the extended file.

15.1 C Source

int32_t fir8(int16_t x, int32_t *state, const int32_t *coeffs) {
    // Shift state
    state[7] = state[6];
    state[6] = state[5];
    state[5] = state[4];
    state[4] = state[3];
    state[3] = state[2];
    state[2] = state[1];
    state[1] = state[0];
    state[0] = (int32_t)x;
    // Convolve
    return coeffs[0] * state[0]
         + coeffs[1] * state[1]
         + coeffs[2] * state[2]
         + coeffs[3] * state[3]
         + coeffs[4] * state[4]
         + coeffs[5] * state[5]
         + coeffs[6] * state[6]
         + coeffs[7] * state[7];
}

15.2 Narrow-Mode Compilation Sketch

The function takes state and coeffs as pointers, so state shift and coefficient access touch memory. The body must:

Load coeffs[0..7]: 8 loads.
Load state[0..6] (the old values needed for the convolve and for shifting): 7 loads.
Multiply each coefficient by its corresponding state value: 8 muls (where state[0] in the new frame is x, so coef[0] × x doesn't need a state load).
Accumulate: 7 adds.
Store the shifted state values to state[0..7]: 8 stores.
Function entry/exit.

That's approximately 32 instructions of pure body work. The challenge is register pressure during the convolve: 8 coefficients + 7 state values + 1 accumulator = 16 live values needed simultaneously. In narrow mode, the available caller-saved register set (15 registers, minus the 3 used for function arguments) is roughly 12. The compiler must spill 4+ values, adding roughly 8 instructions (one store per spill, one reload per consumption).

Narrow-mode total: ~40 instructions, with 4+ spills causing additional cycle stalls.

15.3 Wide-Mode Compilation Sketch

In wide mode the compiler has 32 additional caller-saved registers (x32–x63). The full set of 8 coefficients + 7 state values + accumulator + temporaries fits comfortably without any spill. The body work is the same ~32 instructions, but with zero spill overhead.

Wide-mode total: ~32 instructions, no spills.

15.4 Comparison

Implementation	Body instructions	Live-register spills	Cycles (in-order, 1-cycle multiply)
Narrow-mode	~40	4+	~50+ (with load-use stalls on spill reload)
Wide-mode	~32	0	~35

~20% reduction in instruction count, ~30% reduction in cycle count. The exact numbers depend on the multiplier latency and the compiler's register allocator; the qualitative win is clear: wide mode eliminates the spill traffic that narrow mode is forced into by the register-pressure ceiling.

For larger FIR filters (16-tap, 32-tap) the win compounds — narrow-mode spill counts grow roughly linearly with the tap count, while wide-mode continues to fit the working set in registers up to ~24-tap depending on what else is live. A more aggressive API variant (where the caller maintains state in registers across calls, e.g., a fully-inlined biquad cascade) would expose much larger wide-mode wins, since the per-call memory traffic for state would disappear entirely.

16. Combined Extension Showcase: Audio Synth Voice Inner Loop

A combined example using Xwide, Xcrisp, and Xstack together: a simplified single-oscillator synth voice with an envelope and a one-pole low-pass filter, generating one sample.

16.1 C Source

extern const int16_t sine_table[1024];     /* shared LUT */

typedef struct {
    uint32_t phase;          /* 16.16 fixed-point phase accumulator */
    uint32_t phase_inc;      /* phase increment per sample */
    int32_t env;             /* envelope value, Q24 */
    int32_t env_inc;         /* envelope increment per sample */
    int32_t filter_state;    /* Q23 */
    int32_t filter_coef;     /* Q31 cutoff factor */
} voice_t;

int32_t voice_render_one(voice_t *v) {
    /* Oscillator: phase lookup */
    uint32_t idx = (v->phase >> 16) & 1023;
    int32_t osc = (int32_t)sine_table[idx];   /* sign-extended */
    v->phase += v->phase_inc;
    /* Envelope */
    int32_t enveloped = (osc * v->env) >> 15;
    v->env += v->env_inc;
    /* One-pole LPF: state = state + coef * (input - state), where coef ∈ [0,1) */
    int32_t diff = enveloped - v->filter_state;
    v->filter_state += (int32_t)(((int64_t)diff * v->filter_coef) >> 31);
    return v->filter_state;
}

16.2 Standard RV64GC (leaf-optimised, no frame)

A typical compilation for the leaf function. The compiler recognises voice_render_one as a leaf (no calls), so it omits the prologue/epilogue and ra save entirely:

voice_render_one:
    ; --- oscillator ---
    lwu     t0, 0(a0)                    ; t0 = phase
    srli    t1, t0, 16
    andi    t1, t1, 1023                 ; idx = (phase >> 16) & 1023
    slli    t1, t1, 1                    ; idx * 2 (sine entries are 16-bit)
.La:
    auipc   t2, %pcrel_hi(sine_table)
    addi    t2, t2, %pcrel_lo(.La)
    add     t2, t2, t1
    lh      t3, 0(t2)                    ; osc = sine_table[idx] (sign-extended)
    lwu     t4, 4(a0)                    ; phase_inc
    add     t0, t0, t4                   ; phase += phase_inc
    sw      t0, 0(a0)                    ; store back
    ; --- envelope ---
    lw      t5, 8(a0)                    ; env
    mul     t3, t3, t5                   ; osc * env
    srai    t3, t3, 15                   ; >> 15
    lw      t6, 12(a0)                   ; env_inc
    add     t5, t5, t6                   ; env += env_inc
    sw      t5, 8(a0)
    ; --- filter ---
    lw      t1, 16(a0)                   ; filter_state
    sub     t2, t3, t1                   ; diff = enveloped - state
    lw      t4, 20(a0)                   ; filter_coef
    mul     t2, t2, t4
    srai    t2, t2, 31                   ; >> 31
    add     t1, t1, t2                   ; state += diff * coef >> 31
    sw      t1, 16(a0)                   ; store new state
    mv      a0, t1                       ; return state
    ret

26 instructions.

16.3 FireStorm Single-Sample Render

For a single-sample leaf function, the available FireStorm wins are limited. The dominant Xcrisp instructions (auto-inc, load-op-store) target loop patterns and in-place memory updates — neither applies cleanly to this shape, where struct fields are loaded once, used in arithmetic, and stored back. The only clear win is LAPC replacing the standard PIC pair for the sine table base:

voice_render_one:           ; leaf function, no frame management needed
    ; --- oscillator ---
    lwu     t0, 0(a0)                    ; phase
    srli    t1, t0, 16
    andi    t1, t1, 1023
    slli    t1, t1, 1
    LAPC    t2, sine_table               ; replaces auipc + addi (saves 1)
    add     t2, t2, t1
    lh      t3, 0(t2)
    lwu     t4, 4(a0)                    ; phase_inc
    add     t0, t0, t4                   ; phase += phase_inc
    sw      t0, 0(a0)
    ; --- envelope ---
    lw      t5, 8(a0)
    mul     t3, t3, t5
    srai    t3, t3, 15
    lw      t6, 12(a0)
    add     t5, t5, t6
    sw      t5, 8(a0)
    ; --- filter ---
    lw      t1, 16(a0)
    sub     t2, t3, t1
    lw      t4, 20(a0)
    mul     t2, t2, t4
    srai    t2, t2, 31
    add     t1, t1, t2
    sw      t1, 16(a0)
    mv      a0, t1
    ret

25 instructions — a ~4% reduction. The two multiplies and the LUT lookup arithmetic dominate and are not addressed by the current FireStorm extensions. (A future load-op-multiply extension would be the natural next step for DSP-heavy workloads — added to the v0.2 candidate list.)

16.4 Where the Real Wins Live: Block Rendering

In practice, audio code does not call voice_render_one once per sample — it renders blocks of typically 32, 64, or 128 samples in a single call. Block rendering moves the state into registers across the inner loop and exposes auto-increment opportunities on the output buffer.

C source:

void voice_render_block(voice_t *v, int32_t *output, int n) {
    uint32_t phase = v->phase;
    uint32_t phase_inc = v->phase_inc;
    int32_t env = v->env;
    int32_t env_inc = v->env_inc;
    int32_t state = v->filter_state;
    int32_t coef = v->filter_coef;

    for (int i = 0; i < n; i++) {
        uint32_t idx = (phase >> 16) & 1023;
        int32_t osc = sine_table[idx];
        int32_t enveloped = (osc * env) >> 15;
        int32_t diff = enveloped - state;
        state += (int32_t)(((int64_t)diff * coef) >> 31);
        output[i] = state;
        phase += phase_inc;
        env += env_inc;
    }

    v->phase = phase;
    v->env = env;
    v->filter_state = state;
}

Standard inner loop (state in callee-saved registers, sine_base preloaded outside loop):

.Lloop:
    srli    t0, s0, 16              ; phase = s0
    andi    t0, t0, 1023
    slli    t0, t0, 1
    add     t0, s6, t0              ; s6 = sine_base
    lh      t1, 0(t0)               ; osc
    mul     t1, t1, s2              ; osc * env (s2 = env)
    srai    t1, t1, 15              ; enveloped
    sub     t2, t1, s4              ; diff = enveloped - state (s4 = state)
    mul     t2, t2, s5              ; diff * coef (s5 = coef)
    srai    t2, t2, 31
    add     s4, s4, t2              ; state += ...
    sw      s4, 0(a1)               ; output[i] = state
    addi    a1, a1, 4               ; output++
    add     s0, s0, s1              ; phase += phase_inc (s1 = phase_inc)
    add     s2, s2, s3              ; env += env_inc (s3 = env_inc)
    addi    a2, a2, -1              ; i counter (n decrementing)
    bnez    a2, .Lloop

17 instructions per iteration.

Plus prologue/epilogue to save callee-saved registers s0–s6 (7 regs), and one LAPC-equivalent for the sine_table base outside the loop.

Standard prologue: 1 addi (sp adjust) + 7 sd + 2 (auipc + addi for sine_base) + 6 (load state into s-regs) ≈ 16 instructions of setup. Standard epilogue: 7 ld + 1 addi + 3 sw (write state back) + ret ≈ 12 instructions.

FireStorm inner loop with SWPI (auto-inc store):

.Lloop:
    srli    t0, s0, 16
    andi    t0, t0, 1023
    slli    t0, t0, 1
    add     t0, s6, t0
    lh      t1, 0(t0)
    mul     t1, t1, s2
    srai    t1, t1, 15
    sub     t2, t1, s4
    mul     t2, t2, s5
    srai    t2, t2, 31
    add     s4, s4, t2
    SWPI    s4, 4(a1)               ; output[i] = state, output++ (saves 1)
    add     s0, s0, s1
    add     s2, s2, s3
    addi    a2, a2, -1
    bnez    a2, .Lloop

16 instructions per iteration (SWPI replaces sw + addi).

In wide mode, the compiler can keep state in caller-saved x32–x63 instead of callee-saved s-registers, avoiding the prologue/epilogue saves entirely. Combined with LAPC:

FireStorm prologue: 1 LAPC + 6 register-direct loads ≈ 7 instructions. FireStorm epilogue: 3 sw + ret = 4 instructions.

16.5 Comparison

For a block of 64 samples:

Implementation	Setup	Inner loop (×64)	Teardown	Total
Standard RV64GC	16	1088 (17×64)	12	1116
FireStorm (Xcrisp + Xwide)	7	1024 (16×64)	4	1035

~7% reduction for a 64-sample block. The win is dominated by the inner-loop saving from SWPI, plus the wide-mode elimination of the callee-saved spill in the function frame.

The remaining inner-loop cost is dominated by:

Two multiplies per sample (no load-op-multiply yet — v0.2 candidate)
Three additions for state update (phase, env, output write)
LUT lookup arithmetic (4 instructions to index a 16-bit table)

These are the patterns the next FireStorm DSP extension would target. The current extensions deliver a real but modest single-digit-percent win on this workload.

16.6 Where FireStorm Really Helps in Audio: Buffer Operations

Audio pipelines spend a substantial fraction of time on simple buffer operations — mixing voices into a master bus, applying gain, summing effect sends. These are the patterns where FireStorm's MMW (load-op-store) and auto-increment instructions shine.

Two-voice mix into output buffer, with auto-increment on the source pointer and load-op-store for the in-place accumulation:

C source:

void mix_into(int32_t *bus, const int32_t *voice, int n) {
    for (int i = 0; i < n; i++) {
        bus[i] += voice[i];
    }
}

Standard inner loop:

.Lloop:
    lw      t0, 0(a0)               ; bus[i]
    lw      t1, 0(a1)               ; voice[i]
    add     t0, t0, t1
    sw      t0, 0(a0)               ; store back to bus[i]
    addi    a0, a0, 4
    addi    a1, a1, 4
    addi    a2, a2, -1
    bnez    a2, .Lloop

8 instructions per element.

FireStorm with LWPI and MMWADD:

.Lloop:
    LWPI    t0, 4(a1)               ; t0 = voice[i], voice++
    MMWADD  [a0], [a0], t0          ; bus[i] += t0 (in-place)
    addi    a0, a0, 4
    addi    a2, a2, -1
    bnez    a2, .Lloop

5 instructions per element — a 37% reduction in inner-loop instruction count. For a master mix that sums 32 voices into a stereo bus across a 64-sample block (2 × 32 × 64 = 4096 elements processed), that's roughly 12 000 fewer instructions per audio block. At a 48 kHz / 64-sample block rate (750 blocks/sec), that's a measurable reduction in DSP load.

17. Conditional Execution Patterns (Xcond)

The Xcond extension eliminates short conditional branches by attaching predicates directly to R-type ALU operations. The wins are concentrated in inner loops with data-dependent conditionals (where standard code suffers from branch mispredicts) and in short arithmetic idioms that would otherwise require either a branch or a bit-twiddle sequence.

All Xcond examples in this section are wide-mode-only. The cycle figures assume a 4-cycle mispredict penalty (per §2 methodology) and a 50% mispredict rate for fully data-dependent branches on random input.

17.1 Sum of Positive Samples (Conditional Accumulate)

A common DSP / statistics primitive: sum only the positive elements of a signed array. The textbook expression of "accumulate when predicate holds" — and the inner conditional branch is exactly the case Xcond targets.

17.1.1 C Source

int64_t sum_positive(const int32_t *buf, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (buf[i] > 0) sum += buf[i];
    }
    return sum;
}

17.1.2 Standard RV64GC

sum_positive:
    li      a2, 0                       ; sum = 0
    slli    a3, a1, 2
    add     a3, a0, a3                  ; end = buf + n
    beq     a0, a3, .Lret
.Lloop:
    lw      a4, 0(a0)
    blez    a4, .Lskip                  ; data-dependent branch
    add     a2, a2, a4
.Lskip:
    addi    a0, a0, 4
    bne     a0, a3, .Lloop
.Lret:
    mv      a0, a2
    ret

Inner loop: 4 instructions on take-skip path, 5 on add path. For random-sign data, the average path length is 4.5 instructions. The blez is fully data-dependent — for unbiased random input it mispredicts ~50% of the time, costing 2 cycles per iteration on the reference pipeline. Effective cost: ~6.5 cycles per iteration.

17.1.3 FireStorm (Xcrisp + Xcond)

sum_positive:
    li      a2, 0
    slli    a3, a1, 2
    add     a3, a0, a3
    beq     a0, a3, .Lret
.Lloop:
    LWPI       t0, 4(a0)                ; t0 = *a0; a0 += 4
    ADD.cond   a2, t0, GT_RS1           ; if (t0 > 0) sum += t0
    bne        a0, a3, .Lloop
.Lret:
    mv      a0, a2
    ret

Inner loop: 3 instructions, no inner branch. The loop branch at the bottom is highly predictable (predicted-taken until the last iteration). Effective cost: ~3 cycles per iteration.

17.1.4 Comparison

Implementation	Inner-loop instr	Cycles/iter (random-sign data)
Standard RV64GC	4–5	~6.5
FireStorm (Xcrisp + Xcond)	3	~3

~2× speedup on random-sign data. The win is the elimination of the inner branch entirely, plus the merge of pointer-advance and load via LWPI.

For data with strong bias (e.g., 95% positive samples), the standard code's branch becomes predictable and the speedup shrinks; for unbiased data, the speedup approaches 2×.

17.2 L1 Norm (Sum of Absolute Values)

Sum-of-absolute-values is the L1 norm of a vector — used for signal energy estimation, robust statistics, simple distance metrics, and many DSP primitives.

17.2.1 C Source

int64_t l1_norm(const int32_t *buf, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        int32_t v = buf[i];
        sum += (v < 0) ? -v : v;
    }
    return sum;
}

17.2.2 Standard RV64GC (Zbb available)

l1_norm:
    li      a2, 0
    slli    a3, a1, 2
    add     a3, a0, a3
    beq     a0, a3, .Lret
.Lloop:
    lw      t0, 0(a0)
    neg     t1, t0                      ; t1 = -t0
    max     t0, t0, t1                  ; t0 = max(t0, -t0) = abs(t0)  (Zbb)
    add     a2, a2, t0
    addi    a0, a0, 4
    bne     a0, a3, .Lloop
.Lret:
    mv      a0, a2
    ret

Inner loop: 6 instructions. Branchless via the neg/max idiom — no mispredict cost. (The non-Zbb form using sraiw/xor/sub is 1 instruction longer.)

17.2.3 FireStorm (Xcrisp + Xcond)

l1_norm:
    li      a2, 0
    slli    a3, a1, 2
    add     a3, a0, a3
    beq     a0, a3, .Lret
.Lloop:
    LWPI       t0, 4(a0)
    RSUB.cond  t0, x0, LT_RD            ; if (t0 < 0) t0 = 0 - t0 = -t0  (abs in 1 instr)
    add        a2, a2, t0
    bne        a0, a3, .Lloop
.Lret:
    mv      a0, a2
    ret

Inner loop: 4 instructions. RSUB-cond does abs in a single instruction by predicating 0 − t0 on the test t0 < 0.

17.2.4 Comparison

Implementation	Inner-loop instr	Cycles/iter
Standard RV64GC (Zbb)	6	~6
FireStorm (Xcrisp + Xcond)	4	~4

~33% inner-loop reduction. This is the canonical example of where Xcond's RSUB-cond beats Zbb: abs is a 2-instruction idiom in Zbb (neg + max) but a 1-instruction operation with RSUB-cond.

17.3 Phase Accumulator with Wrap-Around

A phase accumulator that increments by a fixed step and wraps modulo a period — the fundamental building block of every oscillator, NCO, LFO, and many counter/scheduler patterns.

17.3.1 C Source

uint32_t phase_step(uint32_t phase, uint32_t inc, uint32_t period) {
    phase += inc;
    if (phase >= period) phase -= period;
    return phase;
}

(Pre-condition: inc < period, so the wrap is at most one subtraction. This is the common case for fixed-step phase advance.)

17.3.2 Standard RV64GC

phase_step:
    add     a0, a0, a1                  ; phase += inc
    bltu    a0, a2, .Lend               ; if (phase < period) skip
    sub     a0, a0, a2                  ; phase -= period
.Lend:
    ret

3 instructions + branch. The branch frequency is inc/period per call — for typical audio NCOs this is a few percent (unpredictable in detail; modern predictors do reasonably well but not perfectly).

17.3.3 FireStorm (Xcond)

phase_step:
    add        a0, a0, a1
    SUB.cond   a0, a2, GEU              ; if (phase >= period) phase -= period
    ret

2 instructions, no branch. Mode TC condition GEU compares a0 against a2 as unsigned (correct for unsigned phase / period), and SUB-cond conditionally subtracts.

17.3.4 Comparison and Inlined Use

Function-call overhead aside, the inlined body is the relevant comparison:

Implementation	Body instr	Branch
Standard RV64GC	3	1 (data-dependent)
FireStorm (Xcond)	2	0

For a synth oscillator updating phase 64 times per sample block, the savings compound: 64 calls × 1 instruction + 64 branches saved. For an FM synth with multiple operators per voice and many voices, the cycle reduction is real.

17.4 Saturating Decrement (Retry Counter / Time Slice)

Decrement a counter, clamped at zero. Used for retry counters, time-slice quanta, hysteresis timers — anywhere a "tick toward zero, then stop" semantic is wanted.

17.4.1 C Source

void tick(int32_t *counter) {
    if (*counter > 0) (*counter)--;
}

17.4.2 Standard RV64GC (branchless with Zbb)

tick:
    lw      t0, 0(a0)
    snez    t1, t0                      ; t1 = (t0 != 0) ? 1 : 0
    sub     t0, t0, t1                  ; t0 -= t1
    sw      t0, 0(a0)
    ret

5 instructions (including load/store). The branchless form using snez works because t0 > 0 implies t0 != 0 for any non-negative input; for a counter that never goes negative, this is correct.

Alternative branchy form:

tick:
    lw      t0, 0(a0)
    blez    t0, .Lend
    addi    t0, t0, -1
    sw      t0, 0(a0)
.Lend:
    ret

5 instructions with a branch when condition false. Either form is 5 instructions total.

17.4.3 FireStorm (Xcond with Xcrisp MMW)

tick:
    LWADD.cond  ...    -- NOT AVAILABLE in v0.1 (predicated load-op-store deferred)

Without predicated load-op-store, the cleanest Xcond form is:

tick:
    li         t1, 1
    lw         t0, 0(a0)
    SUB.cond   t0, t1, GT_RD            ; if (t0 > 0) t0 -= 1
    sw         t0, 0(a0)
    ret

5 instructions. The Xcond saves nothing here — Zbb's snez is already optimal for this exact pattern.

17.4.4 Where Xcond Saves: Inline in a Hot Loop

The Xcond form wins when the operation is inlined into a loop with a pre-loaded register holding 1, and the load/store are absent:

/* Inside a loop, processing 'count' items down to zero, with side effects */
for (int item = 0; item < n; item++) {
    if (timeslice > 0) timeslice--;
    process(items[item], timeslice);
}

Inner-loop body for the timeslice decrement:

Standard (Zbb): snez t1, timeslice + sub timeslice, timeslice, t1 = 2 instructions
Xcond: SUB.cond timeslice, one_reg, GT_RD = 1 instruction

Saves 1 instruction per iteration when the constant 1 is hoisted out of the loop.

This example illustrates an important Xcond limitation: single-occurrence wins are small. The big gains come from inline use in inner loops, where the predicated form replaces a multi-instruction Zbb idiom or a branch + body pair.

17.5 Where Xcond Does Not Help

Honest reporting of patterns where Xcond was considered but does not deliver a win:

17.5.1 min / max (Tie with Zbb)

a = min(a, b) is one instruction with Zbb (min a, a, b) and one instruction with Xcond (MOV.cond a, b, LT). No improvement — use Zbb for clarity and narrow-mode portability.

17.5.2 General Select (`c ? a : b`)

The classic three-register select cannot be expressed in Xcond's R-type predicated form: rd doubles as the test operand and destination, leaving only rs1 as the source. Use Zicond (czero.eqz + czero.nez + or) instead — 3 instructions in narrow mode, available wherever Zicond is implemented.

17.5.3 Cross-Register Predicate

Patterns of the form "test register A, modify register B" do not fit Xcond:

if (status & READY_BIT) counter++;   // test 'status', modify 'counter'

The predicate must operate on rd (and optionally rs1); there is no provision for a third register to hold the test value. Use a branch:

    andi    t0, status, READY_BIT
    beqz    t0, .Lend
    addi    counter, counter, 1
.Lend:

If the branch is highly predictable (one direction dominates), this is fine. If it is not, Zicond can synthesise the equivalent in 3 instructions.

17.5.4 Predicated Memory Operations

v0.1 does not predicate loads, stores, or load-op-store. A "sparse gather" pattern (load only when an index is in-range) needs a branch around the load:

if (idx < table_size) result = table[idx];

This is deferred to a possible v0.2 Xcond extension (open item §11 of ee_xcond).

18. Hardware Context Switching (Xctx)

The Xctx extension replaces software register-save/restore sequences with a single hardware instruction. The instruction-count wins are dramatic on context-switch-heavy code; the cycle-count wins are more modest because hardware context switches still take real cycles (~30 — see §2 methodology). The compelling cases are:

Fiber-heavy code where many yields per second amortise the BSRAM-backed switch cost.
Blocking synchronisation (producer/consumer, mutex, condition variables) where HALT/RESUME beats the spin-yield pattern by orders of magnitude.
Interrupt-driven I/O where ISR-to-wake latency matters.
Preemptive kernels where the hardware slice timer eliminates the entire timer-tick ISR.

The single biggest non-cycle win is simplification: kernel and fiber-runtime code shrinks substantially, with downstream wins in prefetch-buffer footprint, code maintenance, and predictability.

18.1 Cooperative Fiber Yield

The fundamental operation in any fiber or coroutine library.

18.1.1 C Source

typedef struct {
    void *sp;
    uint64_t saved[13];      /* ra + s0..s11 */
} fiber_t;

void yield_to(fiber_t *from, fiber_t *to);

18.1.2 Standard RV64GC (Hand-Tuned)

A best-effort hand-tuned yield_to saves and restores all callee-saved registers through the fiber's saved-state block (kept in DRAM):

yield_to:
    sd      sp,   0(a0)
    sd      ra,   8(a0)
    sd      s0,  16(a0)
    sd      s1,  24(a0)
    sd      s2,  32(a0)
    sd      s3,  40(a0)
    sd      s4,  48(a0)
    sd      s5,  56(a0)
    sd      s6,  64(a0)
    sd      s7,  72(a0)
    sd      s8,  80(a0)
    sd      s9,  88(a0)
    sd      s10, 96(a0)
    sd      s11, 104(a0)
    ld      sp,   0(a1)
    ld      ra,   8(a1)
    ld      s0,  16(a1)
    ld      s1,  24(a1)
    ld      s2,  32(a1)
    ld      s3,  40(a1)
    ld      s4,  48(a1)
    ld      s5,  56(a1)
    ld      s6,  64(a1)
    ld      s7,  72(a1)
    ld      s8,  80(a1)
    ld      s9,  88(a1)
    ld      s10, 96(a1)
    ld      s11, 104(a1)
    ret

29 instructions in the function body. Caller cost: 2 cycles (jal + ret).

Cycle cost on FireStorm: the saved-state block lives in DRAM. With a small 8 KB direct-mapped D-cache, the stack frame may stay in cache between successive yields if the fiber's working-set fits and is accessed regularly — in which case the cost is ~30 cycles. If the cache line gets evicted (a likely outcome for many-fiber schedulers where each fiber's saved state competes for cache slots), the cost rises to 50–100 cycles depending on DRAM latency. Realistic average for fiber-heavy workloads: 40–70 cycles per yield plus unpredictable jitter.

(For comparison, on a richly-cached RV64GC implementation with abundant L1 capacity: a consistent ~30 cycles. FireStorm's tiny D-cache helps, but doesn't fully hide the cost.)

18.1.3 FireStorm (Xctx)

yield_to:                        ; no longer takes any arguments
    YIELD
    ret                          ; (returns to whatever code follows the YIELD
                                 ;  in the resumed context)

Actually, yield_to(from, to) in the software-fiber API specifies which fiber to switch to. Xctx uses a queue model — YIELD pushes self onto the ready queue and pulls the next ready context, without naming a target. For fiber libraries that want to mimic the explicit-target API, the wrapper is:

yield_to:                        ; a0 = (unused with Xctx), a1 = target ctx ID
                                 ; (compiler arranges target to be runnable)
    YIELD                        ; single instruction
    ret

1 instruction in the body, ~30 cycles to execute (on Ant64).

18.1.4 Comparison

Implementation	Instructions	Cycles on FireStorm	DRAM bandwidth
Software fiber (yield_to)	29	40–70 typical (D-cache dependent)	13 stores + 13 loads through cache
FireStorm (Xctx YIELD)	1	~30 (BSRAM-bounded, deterministic)	0 (BSRAM-resident)

The cycle win on FireStorm is real but moderate (the tiny D-cache helps somewhat); the bigger wins are predictability and footprint:

Cycle count: 40–70 → ~30. Roughly 1.5–2.5× speedup per yield, and the Xctx path is deterministic while the software-fiber path varies with D-cache state.
Instruction count: 29 → 1. Every yield call site is ~120 bytes of code smaller. For a fiber-heavy program with many yield points, this is meaningful prefetch-buffer footprint savings.
D-cache pressure: 13 lines → 0. Each context switch in the software version touches over 100 bytes of stack data, evicting useful unrelated data from the small 8 KB D-cache. Xctx state lives in BSRAM and consumes zero D-cache capacity. For a hot loop that yields frequently between many fibers, this is the dominant practical win — the D-cache is small enough that fiber-state churn quickly saturates it.
Predictability. Software yield cost varies with D-cache state (30 cycles hot, 70+ cold); Xctx YIELD is always ~30 cycles regardless of memory traffic or D-cache state.

For a runtime doing 100,000 yields/second at 200 MHz, the instruction-count win plus the D-cache-eviction avoidance compounds easily reaches 10–15% real-world speedup in fiber-heavy workloads, mainly because the D-cache stops thrashing on fiber-state churn.

18.2 Task Creation

Spawning a new fiber or task.

18.2.1 C Source

fiber_t *spawn(void (*entry)(void *), void *arg, size_t stack_size) {
    fiber_t *f = malloc(sizeof(fiber_t));
    void *stack = malloc(stack_size);
    fiber_setup(f, stack, stack_size, entry, arg);
    add_to_ready_queue(f);
    return f;
}

The interesting cost is fiber_setup plus add_to_ready_queue; malloc is the same in both worlds.

18.2.2 Standard RV64GC

Software fiber setup writes a fake saved-state block so the first yield_to into the new fiber lands at entry:

fiber_setup:                 ; a0 = fiber_t*, a1 = stack base,
                             ; a2 = stack_size, a3 = entry, a4 = arg
    add     t0, a1, a2       ; sp_top = stack + size
    addi    t0, t0, -16
    sd      a4, 0(t0)        ; push arg onto stack
    sd      t0, 0(a0)        ; f->sp = sp_top - 16
    sd      a3, 8(a0)        ; f->saved[0] = entry (becomes ra on first switch)
    sd      x0, 16(a0)       ; zero s0
    sd      x0, 24(a0)
    sd      x0, 32(a0)
    sd      x0, 40(a0)
    sd      x0, 48(a0)
    sd      x0, 56(a0)
    sd      x0, 64(a0)
    sd      x0, 72(a0)
    sd      x0, 80(a0)
    sd      x0, 88(a0)
    sd      x0, 96(a0)
    sd      x0, 104(a0)      ; zero s11
    ret

add_to_ready_queue:          ; a0 = fiber_t*
    ld      t0, ready_tail   ; (assume some queue structure)
    sd      a0, 0(t0)
    addi    t0, t0, 8
    sd      t0, ready_tail
    ret

Setup: 17 instructions. Queue add: 5 instructions. Total: 22 instructions (plus malloc, which is the same in both implementations).

18.2.3 FireStorm (Xctx)

spawn_xctx:                  ; a0 = entry, a1 = stack_top (with arg already pushed),
                             ; entry ABI: a0 = arg
    NEW     t0, a0, a1       ; t0 = new context ID (or -1 if pool exhausted)
    ret

1 instruction (NEW also automatically pushes the new context onto the ready queue, per §5.4 of ee_xctx).

Note that NEW initialises the new context with all registers zeroed except PC = rs1 and x2 (sp) = rs2. The standard convention has the entry function read arg from a0. To pass an arg, the caller pushes it onto the stack before NEW, or constructs a small trampoline.

A cleaner pattern uses the new context's argument-register zeroing as a feature: the entry function can read arg from a known stack offset:

; Caller: spawn(entry, arg)
spawn:
    li      t0, STACK_SIZE
    add     a1, a1, t0           ; sp_top = stack + size
    addi    a1, a1, -8
    sd      a4, 0(a1)            ; push arg
    NEW     t0, a0, a1
    ret

; Entry function trampoline:
entry_trampoline:
    ld      a0, 0(sp)            ; pop arg
    addi    sp, sp, 8
    j       actual_entry         ; or call

Caller side: 5 instructions instead of 22. 77% reduction in spawn overhead (excluding the malloc cost which is identical).

18.2.4 Comparison

Implementation	Setup instructions
Standard RV64GC fiber	22 (setup + queue add)
FireStorm Xctx	5 (1 NEW + arg push)

For programs that spawn frequently (event-handling, parallel pipelines), the spawn cost is in the hot path.

18.3 Producer-Consumer Blocking

The "spinning yield" anti-pattern vs proper HALT/RESUME synchronisation.

18.3.1 C Source

A producer-consumer pair sharing a bounded buffer:

int buf[N];
int head, tail;

void produce(int item) {
    while ((head + 1) % N == tail) {
        wait_for_consumer();      /* block until consumer drains */
    }
    buf[head] = item;
    head = (head + 1) % N;
    signal_consumer();
}

int consume(void) {
    while (head == tail) {
        wait_for_producer();
    }
    int item = buf[tail];
    tail = (tail + 1) % N;
    signal_producer();
    return item;
}

The interesting cost is the implementation of wait_for_* and signal_*.

18.3.2 Standard RV64GC (Spin-Yield Pattern)

Without proper blocking primitives, wait_for_* spins by yielding:

void wait_for_consumer(void) {
    yield_to(scheduler);      /* costs 29 instructions per call */
}

void signal_consumer(void) {
    /* no-op: consumer will check on its next yield */
}

Wait cost when blocked: 29 instructions × N (where N = number of yields before consumer drains). In a tight scenario with both fibers active, N could be 1–10. Worst case (consumer not running): unbounded polling.

18.3.3 Standard RV64GC (Proper Blocking — Kernel Mutex)

A real implementation uses kernel mutexes:

wait_for_consumer:
    addi    sp, sp, -16
    sd      ra, 8(sp)
    la      a0, producer_cond
    la      a1, buffer_mutex
    call    cond_wait                ; goes through syscall path
    ld      ra, 8(sp)
    addi    sp, sp, 16
    ret

Syscall path: ~50–100 instructions for the user-to-kernel-and-back transition, plus the kernel's scheduler code. Total wait cost: ~150–300 cycles.

18.3.4 FireStorm (Xctx HALT/RESUME)

int producer_id = -1, consumer_id = -1;

void wait_for_consumer(void) {
    asm volatile("CTXID %0" : "=r"(producer_id));
    asm volatile("HALT");           /* will be RESUMEd by consumer */
}

void signal_consumer(void) {
    if (consumer_id >= 0) {
        int c = consumer_id;
        consumer_id = -1;
        asm volatile("RESUME %0" :: "r"(c));
    }
}

In assembly:

wait_for_consumer:
    CTXID    t0
    la       t1, producer_id
    sw       t0, 0(t1)
    HALT
    ret

signal_consumer:
    la       t1, consumer_id
    lw       t0, 0(t1)
    bltz     t0, .Lend
    li       t2, -1
    sw       t2, 0(t1)
    RESUME   t0
.Lend:
    ret

wait_for_consumer: 4 instructions + HALT. The HALT context-switches once (~30 cycles); the producer then sleeps until consumer issues RESUME.

signal_consumer: 5 instructions + RESUME. The RESUME just adds the woken context to the ready queue (~3 cycles); the woken context runs at the next natural scheduling point.

18.3.5 Comparison

Implementation	Wait cost	Signal cost	Notes
Spin-yield	29 × N instructions	0	N = waits before consumer drains; can be 1 to ∞
Kernel cond_wait	~150–300 cycles	~150–300 cycles	Each goes through full syscall
Xctx HALT/RESUME	~35 cycles	~10 cycles	No syscall, no polling

For a producer that blocks 10 times waiting for a consumer to drain:

Spin-yield: 290 instructions of pointless yielding.
Kernel cond_wait: 1500–3000 cycles (one wait, but expensive).
Xctx: 35 cycles total (one HALT, one RESUME from consumer).

The Xctx pattern is 40–100× cheaper than kernel cond_wait and avoids the unbounded spin-yield trap entirely.

18.4 Interrupt-Driven I/O Completion

The classic "task waits on I/O; ISR wakes it" pattern.

18.4.1 C Source

ssize_t blocking_read(int fd, void *buf, size_t len) {
    if (has_data(fd)) {
        return immediate_read(fd, buf, len);
    }
    register_waiter(fd, current_task_id());
    wait_for_io();                      /* HALT-like */
    return immediate_read(fd, buf, len);
}

/* ISR for device that produces data on fd */
void device_isr(int fd) {
    int waiter = lookup_waiter(fd);
    if (waiter >= 0) {
        clear_waiter(fd);
        signal_io(waiter);              /* RESUME-like */
    }
}

18.4.2 Standard RV64GC ISR (Preemptive Kernel)

Realistic ISR for a preemptive kernel that may reschedule on I/O completion:

device_isr:
    addi    sp, sp, -128
    ; save all caller-saved registers (~15 stores)
    sd      ra, 0(sp)
    sd      t0, 8(sp)
    sd      t1, 16(sp)
    ; ... 12 more saves ...
    sd      a7, 112(sp)
    ; ISR body
    csrr    a0, mcause              ; or read device register
    call    get_irq_source
    call    lookup_waiter
    bltz    a0, .Lnowaiter
    mv      s0, a0
    li      a0, -1
    call    clear_waiter
    mv      a0, s0
    call    scheduler_make_ready    ; queue the waiter
    call    scheduler_should_preempt
    beqz    a0, .Lnopreempt
    call    scheduler_yield_to       ; full context switch (29 instr)
.Lnopreempt:
.Lnowaiter:
    ; restore caller-saved
    ld      ra, 0(sp)
    ld      t0, 8(sp)
    ; ... 13 more loads ...
    ld      a7, 112(sp)
    addi    sp, sp, 128
    mret

ISR overhead (entry+exit save/restore): ~30 instructions. Body: ~10–40 instructions depending on preempt path. Total: 40–70 instructions per ISR.

Cycle cost: 50–100 cycles for the no-preempt path, 100–150 cycles for the preempt path.

18.4.3 FireStorm (Xctx)

The ISR doesn't need to context-switch; it just queues the woken task:

device_isr:
    addi    sp, sp, -32
    sd      ra, 0(sp)
    sd      t0, 8(sp)
    sd      t1, 16(sp)
    sd      a0, 24(sp)
    ; ISR body
    csrr    a0, mcause
    call    get_irq_source           ; a0 = fd
    call    lookup_waiter            ; a0 = waiter ctx ID or -1
    bltz    a0, .Lnowaiter
    mv      t0, a0
    li      a0, -1
    call    clear_waiter
    RESUME  t0                       ; wake the task (single instr, ~3 cycles)
.Lnowaiter:
    ld      ra, 0(sp)
    ld      t0, 8(sp)
    ld      t1, 16(sp)
    ld      a0, 24(sp)
    addi    sp, sp, 32
    mret

ISR overhead (save/restore): ~10 instructions (only the registers we use). Body: ~6 instructions + RESUME. Total: ~16 instructions.

Cycle cost: ~20 cycles.

18.4.4 Comparison

Implementation	ISR instructions	ISR cycles
Standard RV64GC (preemptive kernel)	40–70	50–150
FireStorm (Xctx RESUME)	~16	~20

ISR latency reduced by 3–7×. For interrupt-driven workloads (network I/O, audio interrupts, sensor readings), this directly improves throughput: more time for the ISR's actual purpose, less time in scheduling boilerplate.

The Xctx ISR is also smaller (fewer caller-saved registers touched), which improves prefetch-buffer locality of the trap-handler code and reduces trap-entry overhead pressure on the pipeline.

18.5 Preemptive Timer-Tick Elimination

For preemptive multitasking, the standard approach uses a hardware timer firing periodically; an ISR decides whether to preempt. With Xctx's hardware slice timer, no ISR runs at all — the hardware switches contexts directly.

18.5.1 Standard RV64GC (Software Timer ISR)

timer_isr:
    addi    sp, sp, -128
    ; save all caller-saved (~15 stores)
    sd      ra, 0(sp)
    ; ...
    sd      a7, 112(sp)
    csrr    t0, mtime
    li      t1, NEXT_TICK
    add     t1, t0, t1
    csrw    mtimecmp, t1            ; rearm timer
    call    decrement_quantum
    call    scheduler_should_preempt
    beqz    a0, .Lnopreempt
    call    scheduler_choose_next
    call    scheduler_yield_to      ; 29 instr context switch
.Lnopreempt:
    ; restore caller-saved
    ld      ra, 0(sp)
    ; ...
    ld      a7, 112(sp)
    addi    sp, sp, 128
    mret

Per-tick cost (no preempt): ~40 instructions, ~60 cycles. Per-tick cost (preempt): ~70 instructions, ~120 cycles.

At a 1 kHz tick rate (1 ms ticks), this is ~60–120k cycles per second of pure scheduling overhead at 1 GHz — 0.006–0.012% of CPU.

At a 10 kHz tick rate, it's 10× that = 0.06–0.12% of CPU.

For aggressive sub-millisecond preemption (10–100 kHz), the timer-ISR overhead becomes meaningful.

18.5.2 FireStorm (Xctx Hardware Slice Timer)

; Kernel setup
csrw    mxctxslice, t0              ; t0 = slice in cycles
; ... no ISR registration needed for preemption

That's it. No timer ISR exists for slicing purposes. Hardware decrements the slice counter; on zero, it executes the YIELD-equivalent automatically. Switch cost: 30 cycles on Ant64 (the BSRAM transfer cost), with zero software instructions.

(The kernel may still have a timer ISR for other purposes — e.g., wall-clock time advancement, watchdog kicks — but those are independent of preemption and can run at much lower frequencies.)

18.5.3 Comparison

For a 10 kHz preemption rate:

Implementation	Per-tick instructions	Per-second instruction overhead
Standard RV64GC	40–70	400k–700k instructions/sec
FireStorm (slice timer)	0	0 instructions/sec (only the 30-cycle BSRAM switch cost remains)

The Xctx kernel reclaims roughly 0.06–0.1% of CPU at this preemption rate. At 100 kHz preemption (10 µs slices for highly interactive workloads), the savings scale to 0.6–1.0% of CPU — a real gain for hard-real-time and high-fan-out workloads.

Secondary wins:

No jitter. Software timer ISRs vary in cost depending on the path taken; the hardware slice timer is exact.
Smaller kernel. The entire timer-tick ISR can be deleted (or kept for wall-clock at a low rate, decoupled from preemption).
No ISR-entry/exit cost. Each saved tick avoids the trap-entry/exit overhead which is itself non-trivial on aggressive pipelines.

18.6 Verification Summary

Example	Pattern	Standard	FireStorm Xctx	Win
§18.1 Cooperative yield	yield_to call body	29 instructions (40–70 cycles)	1 instruction (~30 cycles)	1.5–2.5× faster + deterministic
§18.2 Task spawn	Fiber setup + enqueue	22 instructions	5 instructions	Faster spawn
§18.3 Producer/consumer block	Wait via kernel cond_wait	~150 cycles per wait	~35 cycles per HALT+RESUME	4×–100× faster sync
§18.4 I/O completion ISR	Wake task on interrupt	40–70 instructions	~16 instructions	3×–7× lower ISR latency
§18.5 Preemptive timer-tick	10 kHz ISR overhead	400k–700k instr/sec	0 software, ~30-cycle switch only	0.06–1.0% CPU reclaimed

The Xctx wins are concentrated in synchronisation-heavy code (§18.3) and kernel-internal infrastructure (§18.4, §18.5). For pure computation, Xctx neither helps nor hurts.

19. Xmath: 8-Tap FIR Filter with MADD

This complements §15 (which showed wide-mode register-pressure benefit on a FIR filter); here we focus on the arithmetic acceleration from Xmath's integer fused multiply-add. The example uses Q15 fixed-point coefficients and 16-bit input samples, producing a 32-bit output.

19.1 C Source

int32_t fir8_q15(const int16_t *samples, const int16_t *coeffs) {
    int32_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc += (int32_t)samples[i] * (int32_t)coeffs[i];
    }
    return acc;
}

19.2 Standard RV64GC

The inner loop body per tap:

.Lloop:
    lh      t0, 0(a0)        ; load sample (sign-extend)
    lh      t1, 0(a1)        ; load coefficient
    mul     t2, t0, t1       ; multiply (3 cycles on DSP-backed unit)
    add     a2, a2, t2       ; accumulate
    addi    a0, a0, 2        ; advance sample pointer
    addi    a1, a1, 2        ; advance coefficient pointer
    addi    a3, a3, -1
    bnez    a3, .Lloop

Per tap: 8 instructions, ~10 cycles with MUL latency. Eight taps unrolled: ~80 cycles plus loop overhead. Fully unrolled and pipelined by the compiler: ~50 cycles.

19.3 FireStorm with Xmath MADD + Xcrisp Auto-Inc

Combined with Xcrisp's auto-incrementing loads, the inner loop body becomes:

.Lloop:
    LHPI    t0, 2(a0)        ; load sample, post-increment (Xcrisp)
    LHPI    t1, 2(a1)        ; load coefficient, post-increment (Xcrisp)
    MADD    a2, t0, t1, a2   ; acc = acc + sample * coef  (Xmath, 1 instruction, 2-3 cycle latency)
    addi    a3, a3, -1
    bnez    a3, .Lloop

Per tap: 5 instructions, ~5 cycles with MADD's pipelined throughput (one MADD result per cycle once started). Eight taps: ~15 cycles + setup. Fully unrolled by compiler with the help of dual-issue: ~12 cycles.

19.4 Comparison

Implementation	Instructions per tap	Cycles for 8-tap filter
Standard RV64GC	8	~50
FireStorm Xmath + Xcrisp	5	~12

~4× speedup on FIR throughput. Per-sample cost for an 8-tap filter at 48 kHz: ~30K cycles/sec at standard speed → ~7K cycles/sec with Xmath. The savings scale linearly with filter length — a 64-tap filter (audio room reverb component) sees the same 4× ratio but on much larger absolute work.

For 128 voices each with an 8-tap filter, the savings are roughly 6% of total CPU at 380 MHz. For room-reverb (4 channels × 64-tap), savings approach 15% of CPU.

20. Xmath: 3D Vector Normalisation with VNORM3

Normalising a 3D vector is a per-pixel or per-vertex operation in lighting and rendering setup, performed many thousands of times per frame in any non-trivial 3D game.

20.1 C Source

typedef struct { float x, y, z; } vec3;

vec3 normalize(vec3 v) {
    float len_sq = v.x*v.x + v.y*v.y + v.z*v.z;
    float inv_len = 1.0f / sqrtf(len_sq);
    return (vec3){ v.x * inv_len, v.y * inv_len, v.z * inv_len };
}

20.2 Standard RV64GC (with F/D)

normalize:
    fmul.s  ft0, fa0, fa0        ; x*x
    fmadd.s ft0, fa1, fa1, ft0   ; + y*y
    fmadd.s ft0, fa2, fa2, ft0   ; + z*z = len_sq
    fsqrt.s ft1, ft0             ; sqrt (typically 20+ cycles on FPU)
    fdiv.s  ft2, f1, ft1         ; 1/sqrt (typically 15+ cycles)
    fmul.s  fa0, fa0, ft2        ; x normalised
    fmul.s  fa1, fa1, ft2
    fmul.s  fa2, fa2, ft2
    ret

Cycle count: 3 (FMA chain) + 20 (sqrt) + 15 (div) + 3 (multiplies, pipelined) = ~41 cycles.

20.3 FireStorm with Xmath VNORM3

normalize:                            ; vec3 in f0, f1, f2 (a 3-tuple at f0)
    VNORM3.S  f0, f0                  ; one instruction, 8 cycles
    ret

VNORM3 internally:

Computes v.x² + v.y² + v.z² (3 internal FMAs, ~3 cycles)
Computes 1/sqrt(len_sq) via FRSQRT (~3 cycles)
Scales each component (3 parallel multiplies, ~2 cycles with multi-port writeback)

Total: 8 cycles including the internal forwarding.

20.4 Comparison

Implementation	Cycles	Instructions
Standard RV64GD	~41	8 (plus internal pipeline of sqrt/div)
FireStorm Xmath VNORM3	8	1

~5× speedup on per-vertex / per-pixel normalisation. For a Phong-shaded renderer doing per-pixel lighting at 1080p60 with normal interpolation, this is the difference between using ~20% of CPU on normalisations alone and ~4%.

The win comes from two sources: (1) VNORM3 is one instruction rather than 8, saving decode bandwidth and register pressure; (2) FRSQRT is a 3-cycle approximation vs. the 35+ cycles needed for separate FSQRT + FDIV at IEEE-754-correct precision. For visual rendering, the 0.1% accuracy of FRSQRT is far better than the precision of the colour pipeline; the loss is invisible.

21. Xmath: Ray-Triangle Intersection (Möller-Trumbore)

The ray-triangle test is the fundamental primitive for picking, raycasting bullets, line-of-sight checks, and BSP traversal in 3D games. The Möller-Trumbore algorithm is the standard branchless implementation.

21.1 C Source

bool ray_triangle(vec3 ro, vec3 rd, vec3 v0, vec3 v1, vec3 v2, float *out_t) {
    vec3 edge1 = sub(v1, v0);
    vec3 edge2 = sub(v2, v0);
    vec3 h = cross(rd, edge2);
    float a = dot(edge1, h);
    if (fabsf(a) < EPSILON) return false;
    float f = 1.0f / a;
    vec3 s = sub(ro, v0);
    float u = f * dot(s, h);
    if (u < 0 || u > 1) return false;
    vec3 q = cross(s, edge1);
    float v = f * dot(rd, q);
    if (v < 0 || u + v > 1) return false;
    float t = f * dot(edge2, q);
    if (t > EPSILON) { *out_t = t; return true; }
    return false;
}

21.2 Standard RV64GD

Each operation maps to multiple instructions:

Operation	Instructions	Cycles
`sub(v1, v0)`	3× FSUB	3
`sub(v2, v0)`	3× FSUB	3
`cross(rd, edge2)`	6× FMUL + 3× FSUB	~9
`dot(edge1, h)`	3× FMUL + 2× FADD (or 2× FMA + 1 FMUL)	~3
`1 / a`	FDIV	~15
`sub(ro, v0)`	3× FSUB	3
`dot(s, h)`	~3	3
`f * dot()`	FMUL	1
`cross(s, edge1)`	~9	9
`dot(rd, q)`	~3	3
`f * dot()`	FMUL	1
`dot(edge2, q)`	~3	3
`f * dot()`	FMUL	1

Plus EPSILON compares (~6 cycles).

Total: ~62 cycles per ray-triangle test (well-pipelined; out-of-order completion via scoreboarding hides some latency).

21.3 FireStorm with Xmath

ray_triangle:
    VSUB3.S    edge1, v1, v0           ; 3 cycles
    VSUB3.S    edge2, v2, v0           ; 3 cycles
    CROSS3.S   h, rd, edge2            ; 10 cycles
    DOT3.S     a, edge1, h             ; 6 cycles
    ; if (|a| < EPS) return false    
    FRECIP.S   f, a                    ; 3 cycles
    VSUB3.S    s, ro, v0               ; 3 cycles
    DOT3.S     ft0, s, h               ; 6 cycles
    fmul.s     u, f, ft0               ; 1 cycle
    ; clamp tests
    CROSS3.S   q, s, edge1             ; 10 cycles
    DOT3.S     ft0, rd, q              ; 6 cycles
    fmul.s     v, f, ft0               ; 1 cycle
    ; clamp tests
    DOT3.S     ft0, edge2, q           ; 6 cycles
    fmul.s     t, f, ft0               ; 1 cycle
    ; final EPSILON test
    ret

Cycle accounting (with scoreboarding pipelining the FMA chain through the FP unit): the dependency-chain depth determines the critical path. Through forwarding and scoreboarding, the cross-products and dot-products overlap. ~45 cycles for the full test on Ant64.

21.4 Comparison

Implementation	Instructions	Cycles
Standard RV64GD	~28 (excluding compares)	~62
FireStorm Xmath	~14	~45

~1.4× speedup, ~50% fewer instructions. For a game raycasting against 500 triangles per query (line-of-sight from one entity to another, for example), this saves ~9000 cycles per query = ~24 microseconds at 380 MHz. For 60 raycast queries per frame at 60 fps: ~1.4 ms saved, or ~9% of frame budget for that workload.

The bigger win is instruction count: half the I-cache pressure, half the decode bandwidth. In a tight raycasting inner loop where the prefetch buffer is competing with the rest of the renderer, the smaller code footprint helps disproportionately.

22. Xmath: A* Pathfinding with OCTILE2 Heuristic

A* pathfinding on a 2D grid is the workhorse algorithm for tile-based games, RTS unit movement, and dungeon-crawler enemy AI. The dominant per-node cost is heuristic computation.

22.1 C Source

// A* inner loop: expand current node, push neighbours with estimated cost
typedef struct { int32_t x, y; } point;

int32_t octile_dist(point a, point b) {
    int32_t dx = abs(a.x - b.x);
    int32_t dy = abs(a.y - b.y);
    int32_t d_max = (dx > dy) ? dx : dy;
    int32_t d_min = (dx > dy) ? dy : dx;
    return d_max * 1024 + d_min * 424;   // approximating (√2 - 1) * 1024
}

22.2 Standard RV64GC

octile_dist:
    sub     t0, a0, a2           ; dx = a.x - b.x
    sra     t1, t0, 31           ; sign mask
    xor     t0, t0, t1
    sub     t0, t0, t1           ; abs(dx)    (3 instructions for ABS)
    sub     t2, a1, a3           ; dy = a.y - b.y
    sra     t3, t2, 31
    xor     t2, t2, t3
    sub     t2, t2, t3           ; abs(dy)
    ; max and min via compare-and-branch or branchless
    blt     t0, t2, .Lswap       ; if dx < dy, swap
    mv      t4, t0               ; d_max = dx
    mv      t5, t2               ; d_min = dy
    j       .Lcompute
.Lswap:
    mv      t4, t2
    mv      t5, t0
.Lcompute:
    li      t6, 1024
    mul     t4, t4, t6           ; d_max * 1024
    li      t6, 424
    mul     t5, t5, t6           ; d_min * 424
    add     a0, t4, t5
    ret

Per call: ~16 instructions, ~12 cycles (with the conditional branch potentially mispredicting).

In A*'s inner loop with 8 neighbours per expansion and ~3000 nodes per query:

3000 × 8 × 12 = 288,000 cycles per pathfind = ~0.76 ms at 380 MHz

22.3 FireStorm with OCTILE2

octile_dist:
    ; Inputs: a0 = a (packed x:32, y:32), a1 = b (packed)
    OCTILE2 a0, a0, a1                ; 3 cycles, full computation
    ret

Per call: 1 instruction, 3 cycles. In A*'s inner loop:

3000 × 8 × 3 = 72,000 cycles per pathfind = ~0.19 ms at 380 MHz

22.4 Comparison

Implementation	Instructions per heuristic	Cycles per heuristic	Cycles per pathfind (3000 nodes × 8 neighbours)
Standard RV64GC	16	12	288,000
FireStorm Xmath	1	3	72,000

*~4× speedup on A throughput.** For a strategy game with 50 AI units pathfinding once per second, this saves ~11 ms per second = ~1% of CPU. For a tower-defence with many concurrent paths, the savings scale linearly.

The cleaner story is the inner loop becomes branchless and predictable: one instruction, fixed 3-cycle latency, no branch-predictor pollution from the max/min compare. This is significant for code that runs inside scoreboarded multi-cycle operations — A*'s heuristic computation no longer competes for branch-predictor entries with the rest of the game logic.

23. Xmath: BAM-Based Sprite Rotation

A 2D sprite rotating once per frame at constant angular velocity — the classic top-down game enemy turning to face the player, or a propeller, or an animated icon.

23.1 C Source

typedef struct { float sin_a, cos_a; } rotation;

rotation update_rotation_fp(rotation r, float dt, float turn_rate) {
    static float angle = 0.0f;  // persistent
    angle += turn_rate * dt;
    if (angle >= 2.0f * 3.14159f) angle -= 2.0f * 3.14159f;
    r.sin_a = sinf(angle);
    r.cos_a = cosf(angle);
    return r;
}

// BAM-based variant
typedef struct { uint32_t angle_bam; } rotation_bam;

rotation update_rotation_bam(rotation_bam *r, uint32_t turn_rate_bam) {
    r->angle_bam += turn_rate_bam;  // perfect modular wraparound
    // call site uses FSINCOSBAM to get sin/cos for the matrix
}

23.2 Standard RV64GC + FP Library

update_rotation_fp:
    flw     ft0, angle           ; load angle
    fmul.s  ft1, fa1, fa0        ; turn_rate * dt
    fadd.s  ft0, ft0, ft1        ; angle += turn_rate * dt
    li      a0, 0x40C90FDB        ; 2π in hex
    fmv.w.x ft2, a0
    flt.s   t0, ft0, ft2          ; angle < 2π ?
    bnez    t0, .Lnomod
    fsub.s  ft0, ft0, ft2        ; angle -= 2π
.Lnomod:
    fsw     ft0, angle
    fmv.s   fa0, ft0              ; argument for sin
    call    sinf                  ; ~50-100 cycles (library)
    fsw     fa0, sin_a
    flw     fa0, angle
    call    cosf                  ; ~50-100 cycles
    fsw     fa0, cos_a
    ret

Per frame: ~12 instructions of inline code + 2 library calls (~150 cycles total).

23.3 FireStorm with BAM + FSINCOSBAM

update_rotation_bam:
    ; r->angle_bam is at 0(a0); turn_rate_bam in a1
    lw      t0, 0(a0)             ; load angle_bam
    addw    t0, t0, a1            ; angle += turn_rate (automatic wraparound!)
    sw      t0, 0(a0)             ; store
    ; Now compute sin/cos for rendering:
    FSINCOSBAM.S  fa0, t0         ; fa0 = sin, fa1 = cos (3 cycles)
    ret

Per frame: 4 instructions, ~7 cycles total.

23.4 Comparison

Implementation	Instructions	Cycles
FP-radians + library sin/cos	~12 + 2 calls	~150
BAM + FSINCOSBAM	4	~7

~20× speedup on rotation update. For a game with many rotating elements — sprite enemies, particle systems, projectile orientation, UI animation, oscillator phase in a wavetable synth — the cumulative savings are enormous.

Beyond the cycle count, two qualitative wins:

Perfect angular precision over time. FP angle accumulation drifts: each frame's angle += turn_rate * dt adds a small floating-point error, and over thousands of frames the rotation phase drifts measurably. BAM accumulation is bit-exact forever — angle_bam += turn_rate_bam wraps with perfect modular semantics, no precision loss possible.
Branchless wraparound. The FP version needs a compare-and-conditional-subtract; the BAM version is just an integer add that naturally wraps in 2's complement. No branch-predictor entry, deterministic timing.

For retro-style code (rotozoomers, wavetable oscillators, particle effects), BAM is the natural primitive. FireStorm is one of the only modern architectures with hardware BAM trig.

24. Xmath: Quaternion Bone Update

Per-character skeletal animation: walking the bone hierarchy and computing each bone's world-space orientation by composing parent and local rotations as quaternions.

24.1 C Source

typedef struct { float w, x, y, z; } quat;

// Multiply two quaternions
quat quat_mul(quat a, quat b) {
    quat r;
    r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
    r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
    r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
    r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
    return r;
}

// Per-bone update
void update_bone(bone *b) {
    b->world_quat = quat_mul(b->parent->world_quat, b->local_quat);
}

24.2 Standard RV64GD

quat_mul compiles to 16 FMUL + 12 FADD/FSUB = 28 FP operations. With FMA, half the FMULs become FMAs but the total count and dependency depth stay similar. Best case with scoreboarding:

~28 instructions, ~25 cycles per quaternion multiplication.

24.3 FireStorm with QMUL.S

update_bone:
    ; a0 = bone pointer; parent's world_quat at offset 16, local_quat at offset 32, world_quat output at offset 48
    addi    t0, a0, 16              ; t0 → parent world_quat
    addi    t1, a0, 32              ; t1 → local_quat
    flw.4   f0, 0(t0)               ; load 4-tuple (4 FP loads — wide LD4 if available, else 4 separate)
    flw.4   f4, 0(t1)
    QMUL.S  f8, f0, f4              ; world = parent_world × local  (8 cycles)
    fsw.4   f8, 48(a0)              ; store 4-tuple
    ret

~12 instructions, ~14 cycles per bone (mostly memory load/store; QMUL itself is 8 cycles).

24.4 Comparison

Implementation	Instructions per bone	Cycles per bone
Standard RV64GD	~36 (load + math + store)	~32
FireStorm Xmath QMUL	~12	~14

~2.3× speedup on skeletal animation. For a character with 50 bones at 60 fps:

Implementation	Cycles per character	% CPU at 380 MHz
Standard	1,600	0.025%
Xmath	700	0.011%

For 20 characters on screen, total skeletal-animation CPU goes from ~0.5% to ~0.22%. Small absolutely, but at higher character counts (action games with 100+ characters, large-scale tactical games) the savings scale linearly and become significant.

QMUL's bigger contribution is clarity and density: animation code becomes shorter and more readable, leaving the compiler more flexibility for register allocation and the prefetch buffer less pressure from animation-code footprint.

25. Summary of Wins

The following table summarises the instruction-count savings across the examples in this document:

Example	Pattern	Standard	FireStorm	Reduction
§3 Buffer sum	Inner loop	4	3	25%
§4 Mask in-place	Inner loop	5	3	40%
§5 Bias transform	Inner loop	6	4	33%
§6 Strlen	Inner loop	3	2	33%
§7 Table search	Inner loop	4	3	25%
§8 Memcpy	Total	~6/byte	1	dramatic
§9 Function frame	Overhead	13	2	85%
§10 Trap handler	Overhead	34	3	91%
§11 Coroutine yield	Total	~28	5	82%
§12 Global address	Pattern	2	1	50%
§13 Indirect call	Total	8	4	50%
§14 Virtual dispatch	Total	8	4	50%
§15 FIR-8	Body	~40	~32	~20%
§16.3 Synth voice (one sample)	Total	26	25	~4%
§16.5 Synth block (64 samples)	Total	1116	1035	~7%
§16.6 Buffer mix	Inner loop	8	5	37%
§17.1 Sum positive	Inner loop	4–5 + mispredict	3	~2× speedup
§17.2 L1 norm	Inner loop	6	4	33%
§17.3 Phase wrap	Body	3 + branch	2	33% + branch eliminated
§17.4 Saturating dec	Inline body	2	1	50% (small)
§18.1 Fiber yield	Body	29	1	97%
§18.2 Task spawn	Setup + enqueue	22	5	77%
§18.3 Producer block	Per wait+wake	~150 cycles	~45 cycles	~3× faster
§18.4 I/O ISR	Total ISR	40–70	~16	60–77%
§18.5 Timer-tick ISR	Per tick	40–70	0	100% (ISR eliminated)
§19 FIR-8 with MADD	8-tap filter	~50 cycles	~12 cycles	~4× speedup
§20 Vector normalise	Full normalise	~41 cycles	~8 cycles	~5× speedup
§21 Ray-triangle test	Full test	~62 cycles, ~28 instr	~45 cycles, ~14 instr	1.4× speed, 50% fewer instr
§22 A* heuristic	OCTILE2 distance	16 instr, ~12 cycles	1 instr, 3 cycles	~4× speedup
§23 Sprite rotation	Per-frame update	~150 cycles (FP + lib)	~7 cycles (BAM)	~20× speedup
§24 Quaternion bone	Per-bone update	~32 cycles	~14 cycles	~2.3× speedup

The patterns showing the largest wins are: control-flow and frame management (§9, §10, §11) targeted by Xstack; PIC-heavy dispatch (§13, §14) targeted by Xcrisp PIC; branchy inner loops (§17.1) targeted by Xcond; synchronisation/scheduling (§18) targeted by Xctx; and arithmetic-heavy game/audio inner loops (§19–§24) targeted by Xmath. The patterns showing the smallest wins (§16.3, §16.5) are dense arithmetic kernels where the standard instruction mix is multiply/shift/add — even those see meaningful improvement now that Xmath provides MADD and saturating arithmetic. The patterns in between (§3–§7, §16.6, §17.2, §17.3) cover loop kernels and buffer/predicated operations where Xcrisp and Xcond deliver consistent 25–50% reductions plus branch elimination.

Honest takeaway: FireStorm extensions deliver large wins across the full workload spectrum:

Control-flow / frame management → Xstack (~50% on prologue/epilogue)
PIC-heavy dispatch → Xcrisp PIC (~50% on indirect calls)
Memory-loop code → Xcrisp (25–50% on buffer kernels)
Branchy predicated loops → Xcond (2× on mispredict-prone loops)
Scheduling / synchronisation → Xctx (60–100% on ISR / yield paths)
Game / audio / DSP math → Xmath (4–20× on transcendentals, FMA chains, BAM rotation, A* heuristics, quaternion math)

In aggregate, on typical modular system code with a realistic mix of arithmetic, control flow, memory access, concurrency, and game-state math, the combined extensions deliver roughly 40–60% reduction in instruction count and a comparable or greater reduction in cycle count once branch-mispredict savings, ISR elimination, and Xmath's transcendental speedups are factored in. Game inner loops — physics integration, AI pathfinding, animation, audio synthesis — see proportionally larger wins thanks to Xmath's fused operations and approximations.

26. What These Examples Don't Show

This document focuses on instruction-count and cycle-count wins of individual code patterns. There are several systemic wins that are harder to quantify but real:

Prefetch buffer pressure: smaller code means more of the working set fits in the buffer pool (§4 of ee_cpu), fewer fetch stalls in larger programs.
D-cache pressure for Xstack and Xctx: hardware stacks and context state live in dedicated BSRAM, not DRAM, so push/pop and context-switch traffic doesn't pass through the small 8 KB D-cache. This is critical because the D-cache is tiny — every avoided cache-line eviction matters disproportionately compared to a cache-rich architecture.
Scratchpad placement: hot data structures placed in scratchpad BSRAM (§5.3 of ee_cpu) — voice state, filter coefficients, sample LUTs, scheduler queues — get single-cycle access with no cache pressure. The compiler's .bsram section attribute makes this idiomatic.
Branch predictor pressure for Xcond: every branch eliminated by predication is one fewer entry competing for predictor table capacity. In branch-heavy code (interpreters, parsers, OS dispatch), this compounds across the working set.
Translator state amortisation for Xlate: a translator configured once outside a hot loop benefits every iteration. The setup cost is paid once; the per-access cost is zero.
Predictable preemption for Xctx: the hardware slice timer gives exact preemption boundaries — useful for hard real-time workloads where software-timer-jitter is unacceptable.
Multi-core load balancing (future): when Xctx v0.2 specifies multi-core handoff, the ready queue is shared and load balancing is automatic — no software dispatcher needed.
Power: fewer cycles spent in front-end fetch and decode means proportionally less switching activity.
Compiler flexibility: with the extended register file the compiler has more options, often finding better schedules than what hand-optimised assembly would have chosen.
Trap latency variability: Xstack's bounded prologue cost and Xctx's hardware slice timer both make interrupt and preemption latency more predictable.

These wins do not appear in any single example but compound across a system's lifetime.

End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension, FireStorm Xctx Extension, FireStorm Xmath Extension.