FireStorm Performance Examples
Document version: 0.3 (draft) Status: Verified example set including Xcond and Xctx Companion documents: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension, FireStorm Xctx Extension, FireStorm Xmath Extension
1. Introduction
This document presents concrete examples of code patterns where the FireStorm extensions (Xwide, Xcrisp, Xstack, Xcond, Xlate, Xctx, Xmath) deliver measurable wins over standard RV64GC. Each example shows:
- The C source code.
- The best-effort standard RV64GC assembly (representative compiler output).
- The equivalent FireStorm assembly using the relevant extensions.
- Comparison of instruction count and cycle estimate.
The examples were chosen to be realistic, representative of the workloads FireStorm targets (audio synthesis, signal processing, OS kernels, modular software), and small enough to verify by hand.
2. Methodology and Reference Microarchitecture
Cycle counts in this document assume the following reference microarchitecture, which is conservative — real FireStorm implementations are expected to match or exceed it:
- Single-issue, in-order pipeline.
- ALU operations: 1 cycle, fully bypassed.
- Branches: 1 cycle when predicted correctly, +2 cycles on misprediction.
- Loads: 2-cycle latency from memory; load-use stall of 1 cycle for immediately-dependent ALU op, none for further-out instructions.
- Stores: 1 cycle; commit deferred to a store buffer, not blocking.
- Xstack BSRAM port: 4 dwords per cycle (Ant64 baseline,
mxstack[PORT_WIDTH_LOG2] = 2). - Xcrisp memory-fused instructions: 1-cycle throughput in steady state on properly pipelined implementations; 3-cycle latency.
- Xcrisp PIC instructions: 1 cycle (LAPC, JALPC), 2 cycles (LDPC, LWPC), 2 cycles (CALLM, JMPM — the memory load is on the critical path).
- Xcond predicated R-type: 1 cycle (same as non-predicated R-type). A false-predicate op consumes a pipeline slot but does not commit rd; the win is from eliminating short conditional branches, not from skipping the predicated op.
- Xctx context switch (YIELD/HALT/FREE/NEW): ~30 cycles (576-bit BSRAM port). The switch is a single instruction architecturally but consumes the cycles needed to save/load 1 KB of context state through the BSRAM port.
- Xctx RESUME: ~3 cycles. RESUME only updates the state table and ready queue; it does not itself context-switch.
Instruction count is implementation-independent and reported precisely. Cycle counts are estimates for the reference microarchitecture and should be read as rough ratios rather than absolute figures.
All assembly snippets use lowercase for standard RV64 mnemonics and uppercase for FireStorm extension mnemonics, making it easy to see at a glance which instructions are extension-specific.
3. Buffer Sum (Accumulator Pattern)
A textbook accumulator loop summing 32-bit signed integers from a buffer.
3.1 C Source
int64_t sum_buffer(const int32_t *buf, int n) {
int64_t sum = 0;
const int32_t *end = buf + n;
while (buf < end) {
sum += *buf++;
}
return sum;
}
3.2 Standard RV64GC
; a0 = buf, a1 = n
; on entry, a0 -> int32_t array, a1 = count
li a2, 0 ; sum = 0
slli a3, a1, 2 ; n * 4 = byte count
add a3, a0, a3 ; end pointer
beq a0, a3, .Lret ; n == 0
.Lloop:
lw a4, 0(a0) ; load buf[i]
addi a0, a0, 4 ; ++buf
add a2, a2, a4 ; sum += buf[i]
bne a0, a3, .Lloop ; while (buf < end)
.Lret:
mv a0, a2 ; return sum
ret
Inner loop: 4 instructions per iteration.
3.3 FireStorm (Xcrisp Auto-Inc Load)
li a2, 0
slli a3, a1, 2
add a3, a0, a3
beq a0, a3, .Lret
.Lloop:
LWPI a4, 4(a0) ; a4 = sext32(mem32[a0]); a0 += 4
add a2, a2, a4
bne a0, a3, .Lloop
.Lret:
mv a0, a2
ret
Inner loop: 3 instructions per iteration.
3.4 FireStorm (Xcrisp Load-Op Fusion)
If we prefer to keep the pointer increment separate (allowing the compiler more scheduling flexibility), LWADD does the load and the add in one instruction:
.Lloop:
LWADD a2, (a0), a2 ; a2 = sext32(mem32[a0]) + a2
addi a0, a0, 4
bne a0, a3, .Lloop
Inner loop: still 3 instructions; same shape, different decomposition. On a wider implementation that can issue the addi in parallel with the load-op's ALU stage, this form has higher peak throughput.
3.5 Comparison
| Implementation | Inner-loop instructions | Cycles per element (reference µarch) |
|---|---|---|
| Standard RV64GC | 4 | 4 |
| FireStorm (LWPI) | 3 | 3 |
| FireStorm (LWADD + addi) | 3 | 3 |
25% reduction in instruction count and a matching reduction in cycle count on a single-issue in-order pipeline. On wider implementations the win is the same in instructions but smaller in cycles, since the standard loop can issue two instructions per cycle if the load result is forwarded promptly.
4. In-Place Buffer Mutation
A loop that modifies every element of a buffer in place — clearing a bit, applying a bias, inverting all bits. The pattern is buf[i] = buf[i] OP constant.
4.1 C Source
void apply_mask(uint32_t *buf, uint32_t mask, int n) {
for (int i = 0; i < n; i++) {
buf[i] &= mask;
}
}
4.2 Standard RV64GC
; a0 = buf, a1 = mask, a2 = n
slli a3, a2, 2
add a3, a0, a3 ; end pointer
beq a0, a3, .Lret
.Lloop:
lw a4, 0(a0) ; load
and a4, a4, a1 ; AND with mask
sw a4, 0(a0) ; store
addi a0, a0, 4 ; advance pointer
bne a0, a3, .Lloop
.Lret:
ret
Inner loop: 5 instructions per iteration.
4.3 FireStorm (Xcrisp Load-Op-Store Fusion)
slli a3, a2, 2
add a3, a0, a3
beq a0, a3, .Lret
.Lloop:
MMWAND [a0], [a0], a1 ; mem32[a0] = mem32[a0] & a1 (in-place)
addi a0, a0, 4
bne a0, a3, .Lloop
.Lret:
ret
Inner loop: 3 instructions per iteration.
4.4 Comparison
| Implementation | Inner-loop instructions | Cycles per element |
|---|---|---|
| Standard RV64GC | 5 | 5–6 (load-use stall on and) |
| FireStorm (MMWAND) | 3 | 3 (pipelined load-op-store) |
40% reduction in instruction count. No architectural temporary is ever written: the load result is forwarded through the ALU into the store-buffer entry without traversing the register file. On the reference microarchitecture, the load-use stall in the standard code disappears entirely in the fused form.
The same pattern applies to buf[i] += dc, buf[i] ^= 0xFFFFFFFF, buf[i] |= flag_bits, etc. — any C operation of the form *p = *p OP v compiles to a single load-op-store instruction.
5. Element-Wise Transform with Different Source and Destination
A loop that reads one buffer, transforms each element, and writes to another buffer — the classic vector-style operation. Example: applying a DC bias.
5.1 C Source
void add_bias(int32_t *dst, const int32_t *src, int32_t bias, int n) {
for (int i = 0; i < n; i++) {
dst[i] = src[i] + bias;
}
}
5.2 Standard RV64GC
; a0 = dst, a1 = src, a2 = bias, a3 = n
slli a4, a3, 2
add a4, a1, a4 ; src end
beq a1, a4, .Lret
.Lloop:
lw a5, 0(a1) ; load src[i]
add a5, a5, a2 ; + bias
sw a5, 0(a0) ; store to dst[i]
addi a1, a1, 4
addi a0, a0, 4
bne a1, a4, .Lloop
.Lret:
ret
Inner loop: 6 instructions per iteration.
5.3 FireStorm (Xcrisp Load-Op-Store)
slli a4, a3, 2
add a4, a1, a4
beq a1, a4, .Lret
.Lloop:
MMWADD [a0], [a1], a2 ; mem32[a0] = mem32[a1] + a2
addi a1, a1, 4
addi a0, a0, 4
bne a1, a4, .Lloop
.Lret:
ret
Inner loop: 4 instructions per iteration.
5.4 Comparison
| Implementation | Inner-loop instructions | Cycles per element |
|---|---|---|
| Standard RV64GC | 6 | 6 |
| FireStorm (MMWADD) | 4 | 4 |
33% reduction. The fused form would benefit from a hypothetical "load-op-store with dual auto-increment" instruction that advances both pointers in one shot, but that doesn't fit in 32 bits and is parked as a v0.2 candidate. Even without it, the win is substantial.
6. Null-Terminated String Length
A classic strlen implementation. Note that real libc implementations of strlen use word-at-a-time scanning with bit-twiddling tricks (the "Mycroft" technique); this comparison shows the byte-at-a-time form as a representative tight scan loop, which is also what compilers typically emit for while(*p) p++; patterns when no library call optimisation is available.
6.1 C Source
size_t strlen_byte(const char *s) {
const char *p = s;
while (*p != 0) p++;
return p - s;
}
6.2 Standard RV64GC
; a0 = s
mv a1, a0 ; p = s
.Lloop:
lbu a2, 0(a1) ; load byte
addi a1, a1, 1 ; advance
bne a2, zero, .Lloop ; loop while non-zero
.Lend:
sub a0, a1, a0 ; p - s
addi a0, a0, -1 ; account for past-the-terminator
ret
Inner loop: 3 instructions per iteration.
6.3 FireStorm (Xcrisp Auto-Inc Load)
mv a1, a0
.Lloop:
LBUPI a2, 1(a1) ; a2 = zext8(mem8[a1]); a1 += 1
bne a2, zero, .Lloop
.Lend:
sub a0, a1, a0
addi a0, a0, -1
ret
Inner loop: 2 instructions per iteration.
6.4 Comparison
| Implementation | Inner-loop instructions | Cycles per byte |
|---|---|---|
| Standard byte-at-a-time | 3 | 3 |
| FireStorm byte-at-a-time | 2 | 2 |
33% reduction. Both forms can be replaced by word-at-a-time scanning for further speedup in libc implementations; the FireStorm form remains 33% faster than the standard form when both use the same algorithm.
7. Linear Table Search
Searching for a key in a table of structs, a common dispatch pattern.
7.1 C Source
typedef struct {
uint32_t key;
uint32_t value;
} entry_t;
entry_t *find_entry(entry_t *table, entry_t *end, uint32_t key) {
while (table < end) {
if (table->key == key) return table;
table++;
}
return NULL;
}
7.2 Standard RV64GC
; a0 = table, a1 = end, a2 = key
beq a0, a1, .Lnotfound
.Lloop:
lw a3, 0(a0) ; load entry->key
beq a3, a2, .Lret ; match: return current pointer
addi a0, a0, 8 ; ++table (struct is 8 bytes)
bne a0, a1, .Lloop
.Lnotfound:
li a0, 0
.Lret:
ret
Inner loop: 4 instructions per iteration.
7.3 FireStorm (Xcrisp Compare-Mem-Branch)
beq a0, a1, .Lnotfound
.Lloop:
BEQM a2, (a0), .Lret ; if a2 == mem32[a0], branch
addi a0, a0, 8
bne a0, a1, .Lloop
.Lnotfound:
li a0, 0
.Lret:
ret
Inner loop: 3 instructions per iteration.
7.4 Comparison
| Implementation | Inner-loop instructions | Cycles per probe |
|---|---|---|
| Standard RV64GC | 4 | 4–5 (load-use stall on beq) |
| FireStorm (BEQM) | 3 | 3 |
25% reduction in instruction count, larger cycle reduction because the load-use stall vanishes — the compare-mem-branch instruction's load result is consumed directly by the compare unit without traversing the register file.
This pattern matters in dispatcher code (looking up a handler by message ID), parser tables, command-name matchers, and any associative lookup. In hot dispatch paths this 25% reduction can meaningfully reduce latency.
8. Memory Copy
A straightforward memcpy of n bytes.
8.1 C Source
void memcpy_simple(void *dst, const void *src, size_t n) {
char *d = dst;
const char *s = src;
while (n--) *d++ = *s++;
}
8.2 Standard RV64GC
; a0 = dst, a1 = src, a2 = n
beqz a2, .Lret
.Lloop:
lb a3, 0(a1)
sb a3, 0(a0)
addi a1, a1, 1
addi a0, a0, 1
addi a2, a2, -1
bnez a2, .Lloop
.Lret:
ret
Inner loop: 6 instructions per byte.
(A real memcpy libc routine uses word-at-a-time transfers and alignment handling; counts shown are for the simple byte loop, which is what naive C compiles to without special optimisation.)
8.3 FireStorm (Xcrisp Block Memory)
BMCPY a0, a1, a2 ; copy a2 bytes from a1 to a0
ret
Total: 1 instruction.
The BMCPY instruction is restartable on interrupt and may issue wide transfers internally (multi-byte per cycle on aligned operands). Its execution time is variable but typically far less than the standard byte-loop equivalent.
8.4 Comparison
| Implementation | Inner-loop instructions | Notes |
|---|---|---|
| Standard byte loop | 6 per byte | Dominates execution time on small copies |
| Optimised standard (word + tail) | ~3 amortised per byte for large n |
Adds prologue/epilogue cost |
| FireStorm BMCPY | 1 total instruction | Throughput depends on implementation; potentially 8 bytes per cycle on aligned wide transfers |
For small copies, the standard byte loop's prologue overhead dominates; BMCPY's single-instruction issue beats it handily. For large copies where word-at-a-time matters, a well-implemented BMCPY ties or beats the optimised libc version while being one instruction at the call site.
9. Function with Saved Registers (Prologue/Epilogue)
A typical short function that calls another function and uses some saved registers across the call. The frame management overhead is a substantial fraction of total execution time for small functions.
9.1 C Source
extern int lookup(int);
int compute(int a, int b, int c, int d) {
int x = lookup(a);
int y = lookup(b);
int z = lookup(c);
int w = lookup(d);
return x + y + z + w;
}
The compiler must save ra (return address) plus four callee-saved registers (s0–s3) to hold the four lookup results across the four calls.
9.2 Standard RV64GC
compute:
addi sp, sp, -48 ; allocate frame (must be 16-byte aligned)
sd ra, 40(sp)
sd s0, 32(sp)
sd s1, 24(sp)
sd s2, 16(sp)
sd s3, 8(sp)
mv s0, a1 ; save args before clobbering a0
mv s1, a2
mv s2, a3
jal ra, lookup
mv s3, a0 ; x = lookup(a)
mv a0, s0
jal ra, lookup
mv s0, a0 ; y = lookup(b)
mv a0, s1
jal ra, lookup
mv s1, a0 ; z = lookup(c)
mv a0, s2
jal ra, lookup
add a0, s3, s0 ; x + y
add a0, a0, s1 ; + z
add a0, a0, s2 ; not quite — see note
ld ra, 40(sp)
ld s0, 32(sp)
ld s1, 24(sp)
ld s2, 16(sp)
ld s3, 8(sp)
addi sp, sp, 48
ret
Prologue: 6 instructions. Epilogue: 7 instructions. Frame overhead: 13 instructions.
(The a0/s0 shuffling between calls is independent of the frame management and is the same in both standard and Xstack versions; we focus on the prologue/epilogue cost.)
9.3 FireStorm (Xstack)
compute:
PUSH rlist=00100, spimm=0 ; save ra, s0–s3 (5 registers)
mv s0, a1
mv s1, a2
mv s2, a3
jal ra, lookup
mv s3, a0
mv a0, s0
jal ra, lookup
mv s0, a0
mv a0, s1
jal ra, lookup
mv s1, a0
mv a0, s2
jal ra, lookup
add a0, s3, s0
add a0, a0, s1
add a0, a0, s2
POPRET rlist=00100, spimm=0 ; restore ra, s0–s3, return
Prologue: 1 instruction. Epilogue: 1 instruction. Frame overhead: 2 instructions.
9.4 Comparison
| Implementation | Frame overhead instructions | Cycles (reference µarch) |
|---|---|---|
| Standard RV64GC | 13 (6 + 7) | ~13 |
| FireStorm Xstack | 2 (1 + 1) | ~4 (5 regs on 4-dword/cycle port: PUSH ≈ 2 cycles, POPRET ≈ 2 cycles, including the implicit return) |
85% reduction in frame-overhead instruction count. For functions where the body is also small (say, 10–20 instructions), this changes the overall function cost dramatically: a 20-instruction body with 13 instructions of frame overhead becomes nearly 40% smaller as 20 + 2 = 22 instructions total.
The cycle saving is comparable: ~13 cycles of frame overhead becomes ~3 cycles. For a small function called frequently (in a hot path), this is the largest single optimisation FireStorm offers.
10. Interrupt Service Routine
A trap handler that saves the standard caller-saved set so it can call into C-language helpers.
10.1 Pseudo-C Source (Conceptual)
void trap_handler(void) {
// Save user-mode registers t0-t6 and a0-a7
// Call C handler
c_dispatch_trap();
// Restore and return to user mode
}
10.2 Standard RV64GC
A handler saving t0–t6 (7 regs) and a0–a7 (8 regs) needs a 128-byte frame:
trap_handler:
addi sp, sp, -128
sd t0, 0(sp)
sd t1, 8(sp)
sd t2, 16(sp)
sd t3, 24(sp)
sd t4, 32(sp)
sd t5, 40(sp)
sd t6, 48(sp)
sd a0, 56(sp)
sd a1, 64(sp)
sd a2, 72(sp)
sd a3, 80(sp)
sd a4, 88(sp)
sd a5, 96(sp)
sd a6, 104(sp)
sd a7, 112(sp)
sd ra, 120(sp)
jal ra, c_dispatch_trap
ld ra, 120(sp)
ld a7, 112(sp)
; ... (14 more loads in reverse) ...
ld t0, 0(sp)
addi sp, sp, 128
mret
Save: 17 instructions (16 sd + 1 addi). Restore: 17 instructions. Total overhead: 34 instructions.
10.3 FireStorm (Xstack)
trap_handler:
PUSHS rlist=10010, spimm=0 ; save ra, t0–t6, a0–a7 (16 regs total)
jal ra, c_dispatch_trap
POPSRET rlist=10010, spimm=0 ; restore all and return
; (POPSRET is the supervisor-mode equivalent of POPRET; the actual return-from-trap
; instruction depends on the privilege level; see §10.4)
Save: 1 instruction. Restore: 1 instruction. Total overhead: 2 instructions.
10.4 A Note on Trap Returns
The example above uses POPSRET as a placeholder. In a real trap handler the return path is mret (machine mode) or sret (supervisor mode), not a plain jalr — POPSRET combines pop with ret (which is jalr x0, 0(ra)), not with mret/sret. A clean handler structure would be:
trap_handler:
PUSHS rlist=10010, spimm=0
jal ra, c_dispatch_trap
POPS rlist=10010, spimm=0 ; restore without returning
sret ; or mret per privilege
This is 3 instructions of overhead rather than 2, but still dramatically less than the standard's 34.
10.5 Comparison
| Implementation | Overhead instructions | Cycles (4-dword/cycle BSRAM port) |
|---|---|---|
| Standard RV64GC | 34 (17 + 17) | ~34 |
| FireStorm Xstack | 3 (1 + 1 + 1) | ~10 (PUSHS for 16 regs = 4 cycles, POPS = 4 cycles, sret = 1 cycle, jal = 1 cycle) |
~91% reduction in instruction count and roughly 75% reduction in cycle overhead. For interrupt-heavy workloads (timer ticks, peripheral I/O, OS scheduling), this changes the achievable interrupt rate by nearly an order of magnitude.
11. Coroutine Yield
A cooperative coroutine that yields to a scheduler. Each coroutine maintains its full register state on its own user stack; switching is a stack-pointer swap.
11.1 Conceptual Model
Each coroutine has a coroutine_t structure holding (among other things) its parked usp. When coroutine A yields to the scheduler:
- A saves its callee-saved registers to its own stack.
- A switches
uspto the scheduler's savedusp. - The scheduler restores its callee-saved registers from the (now-current) stack.
The scheduler subsequently restores from a chosen coroutine B via a symmetric sequence.
11.2 Standard RV64GC (Simplified)
A typical implementation might use a setjmp/longjmp-like mechanism, saving 12 callee-saved registers plus ra to the coroutine struct and restoring from the target:
coro_yield:
; a0 = coroutine_t* of self (caller)
; a1 = coroutine_t* of target
sd ra, 0(a0)
sd s0, 8(a0)
sd s1, 16(a0)
sd s2, 24(a0)
sd s3, 32(a0)
sd s4, 40(a0)
sd s5, 48(a0)
sd s6, 56(a0)
sd s7, 64(a0)
sd s8, 72(a0)
sd s9, 80(a0)
sd s10, 88(a0)
sd s11, 96(a0)
sd sp, 104(a0) ; save sp too (frame pointer if used)
; Now restore from target
ld ra, 0(a1)
ld s0, 8(a1)
ld s1, 16(a1)
ld s2, 24(a1)
ld s3, 32(a1)
ld s4, 40(a1)
ld s5, 48(a1)
ld s6, 56(a1)
ld s7, 64(a1)
ld s8, 72(a1)
ld s9, 80(a1)
ld s10, 88(a1)
ld s11, 96(a1)
ld sp, 104(a1)
ret
~28 instructions total (14 saves, 14 loads).
11.3 FireStorm (Xstack)
With each coroutine maintaining its full callee-saved state on its own hardware stack, yield is a save + stack swap + restore:
coro_yield:
; a0 = saved-usp register of caller's coroutine (read/write slot)
; a1 = saved-usp register of target coroutine (read)
PUSH rlist=01100, spimm=0 ; save ra, s0–s11 to current user stack
USWAP t0, a1 ; t0 = old usp; new usp = a1 (target coroutine's stack)
sd t0, 0(a0) ; store caller's parked usp into its coroutine_t
POP rlist=01100, spimm=0 ; restore ra, s0–s11 from new (target's) stack
ret
5 instructions total.
11.4 Comparison
| Implementation | Total instructions | Cycles (4-dword/cycle BSRAM port) |
|---|---|---|
| Standard RV64GC | ~28 | ~28 |
| FireStorm Xstack | 5 | ~11 (PUSH 13 regs ≈ 4 cycles, USWAP 1, sd 1, POP 13 regs ≈ 4 cycles, ret 1) |
82% reduction in instruction count, roughly 40–60% reduction in cycle count for the full yield. Combined with zero pressure on the tiny 8 KB D-cache from the saved register state (Xctx context lives in dedicated BSRAM), this makes green-thread schedulers practical at frequencies (thousands of yields per millisecond) where software fibers on FireStorm would thrash the small D-cache and serialise on DRAM accesses.
12. PC-Relative Global Address
Materialising the address of a global variable — the most basic PIC primitive.
12.1 C Source
extern int32_t global_var;
int32_t *get_addr_of_global(void) {
return &global_var;
}
12.2 Standard RV64GC (PIC)
get_addr_of_global:
.La:
auipc a0, %pcrel_hi(global_var)
addi a0, a0, %pcrel_lo(.La)
ret
2 instructions for the address materialisation plus the return.
12.3 FireStorm (Xcrisp PIC LAPC, wide-mode only)
get_addr_of_global:
LAPC a0, global_var
ret
1 instruction for the address materialisation.
12.4 Comparison
| Implementation | Address-materialisation instructions | Cycles |
|---|---|---|
| Standard RV64GC | 2 | 2 |
| FireStorm Xcrisp PIC | 1 | 1 |
50% reduction in this primitive. While the absolute saving is small per occurrence, this pattern appears so frequently in PIC code (every reference to a global) that the cumulative impact across a translation unit is substantial — typically 10–15% of total instruction count in modular PIC code can be PIC-relocation pairs.
13. Function Call Through Global Function Pointer
A call to a function whose address is in a global variable. Common in OS dispatch tables, plugin systems, and library-internal indirection.
13.1 C Source
extern void (*handler)(int);
void invoke_handler(int arg) {
handler(arg);
}
13.2 Standard RV64GC (PIC)
invoke_handler:
addi sp, sp, -16
sd ra, 8(sp)
.La:
auipc t0, %pcrel_hi(handler)
ld t0, %pcrel_lo(.La)(t0) ; load handler pointer
jalr ra, 0(t0)
ld ra, 8(sp)
addi sp, sp, 16
ret
8 instructions total.
13.3 FireStorm (Xcrisp PIC LDPC + Xstack PUSH)
invoke_handler:
PUSH rlist=00000, spimm=0 ; save just ra
LDPC t0, handler ; load handler pointer
jalr ra, 0(t0)
POPRET rlist=00000, spimm=0 ; pop ra, return
4 instructions total.
13.4 Comparison
| Implementation | Instructions | Cycles |
|---|---|---|
| Standard RV64GC | 8 | ~10 (load-use stall on jalr) |
| FireStorm | 4 | ~6 |
50% reduction in instruction count, with a further cycle saving from the eliminated load-use stall. For an OS or runtime that dispatches through global function tables on every system call, the savings compound.
14. Virtual Dispatch
The classic C++ / vtable-style virtual call: load a function pointer from a structure (the vtable), then call it.
14.1 C Source (C-flavoured equivalent)
typedef struct vtable {
void (*method0)(void*);
void (*method1)(void*);
void (*method2)(void*);
void (*method3)(void*);
} vtable_t;
typedef struct object {
const vtable_t *vt;
/* fields */
} object_t;
void call_method2(object_t *obj) {
obj->vt->method2(obj);
}
14.2 Standard RV64GC
call_method2:
addi sp, sp, -16
sd ra, 8(sp)
ld t0, 0(a0) ; obj->vt
ld t0, 16(t0) ; vt->method2 (offset 16 = 2 × 8 bytes)
jalr ra, 0(t0)
ld ra, 8(sp)
addi sp, sp, 16
ret
8 instructions total (3 for the dispatch chain, plus prologue/epilogue/return).
14.3 FireStorm (Xcrisp PIC CALLM + Xstack)
call_method2:
PUSH rlist=00000, spimm=0
ld t0, 0(a0) ; obj->vt (vtable pointer)
CALLM ra, 16(t0) ; load fn ptr at vt+16, call
POPRET rlist=00000, spimm=0
4 instructions total. The CALLM instruction collapses the second load + JALR pair into one instruction.
14.4 Comparison
| Implementation | Instructions | Cycles |
|---|---|---|
| Standard RV64GC | 8 | ~10 |
| FireStorm (Xcrisp PIC + Xstack) | 4 | ~6 |
50% reduction. Virtual dispatch is the canonical use case for CALLM; any C++ codebase or any object-oriented system in C with explicit vtables benefits proportionally to its dispatch frequency.
15. Wide-Mode Register Pressure: 8-Tap FIR Filter
A finite-impulse-response filter holds many coefficients live in registers simultaneously. In narrow mode the compiler must spill some; in wide mode all coefficients fit in the extended file.
15.1 C Source
int32_t fir8(int16_t x, int32_t *state, const int32_t *coeffs) {
// Shift state
state[7] = state[6];
state[6] = state[5];
state[5] = state[4];
state[4] = state[3];
state[3] = state[2];
state[2] = state[1];
state[1] = state[0];
state[0] = (int32_t)x;
// Convolve
return coeffs[0] * state[0]
+ coeffs[1] * state[1]
+ coeffs[2] * state[2]
+ coeffs[3] * state[3]
+ coeffs[4] * state[4]
+ coeffs[5] * state[5]
+ coeffs[6] * state[6]
+ coeffs[7] * state[7];
}
15.2 Narrow-Mode Compilation Sketch
The function takes state and coeffs as pointers, so state shift and coefficient access touch memory. The body must:
- Load coeffs[0..7]: 8 loads.
- Load state[0..6] (the old values needed for the convolve and for shifting): 7 loads.
- Multiply each coefficient by its corresponding state value: 8 muls (where state[0] in the new frame is
x, so coef[0] × x doesn't need a state load). - Accumulate: 7 adds.
- Store the shifted state values to state[0..7]: 8 stores.
- Function entry/exit.
That's approximately 32 instructions of pure body work. The challenge is register pressure during the convolve: 8 coefficients + 7 state values + 1 accumulator = 16 live values needed simultaneously. In narrow mode, the available caller-saved register set (15 registers, minus the 3 used for function arguments) is roughly 12. The compiler must spill 4+ values, adding roughly 8 instructions (one store per spill, one reload per consumption).
Narrow-mode total: ~40 instructions, with 4+ spills causing additional cycle stalls.
15.3 Wide-Mode Compilation Sketch
In wide mode the compiler has 32 additional caller-saved registers (x32–x63). The full set of 8 coefficients + 7 state values + accumulator + temporaries fits comfortably without any spill. The body work is the same ~32 instructions, but with zero spill overhead.
Wide-mode total: ~32 instructions, no spills.
15.4 Comparison
| Implementation | Body instructions | Live-register spills | Cycles (in-order, 1-cycle multiply) |
|---|---|---|---|
| Narrow-mode | ~40 | 4+ | ~50+ (with load-use stalls on spill reload) |
| Wide-mode | ~32 | 0 | ~35 |
~20% reduction in instruction count, ~30% reduction in cycle count. The exact numbers depend on the multiplier latency and the compiler's register allocator; the qualitative win is clear: wide mode eliminates the spill traffic that narrow mode is forced into by the register-pressure ceiling.
For larger FIR filters (16-tap, 32-tap) the win compounds — narrow-mode spill counts grow roughly linearly with the tap count, while wide-mode continues to fit the working set in registers up to ~24-tap depending on what else is live. A more aggressive API variant (where the caller maintains state in registers across calls, e.g., a fully-inlined biquad cascade) would expose much larger wide-mode wins, since the per-call memory traffic for state would disappear entirely.
16. Combined Extension Showcase: Audio Synth Voice Inner Loop
A combined example using Xwide, Xcrisp, and Xstack together: a simplified single-oscillator synth voice with an envelope and a one-pole low-pass filter, generating one sample.
16.1 C Source
extern const int16_t sine_table[1024]; /* shared LUT */
typedef struct {
uint32_t phase; /* 16.16 fixed-point phase accumulator */
uint32_t phase_inc; /* phase increment per sample */
int32_t env; /* envelope value, Q24 */
int32_t env_inc; /* envelope increment per sample */
int32_t filter_state; /* Q23 */
int32_t filter_coef; /* Q31 cutoff factor */
} voice_t;
int32_t voice_render_one(voice_t *v) {
/* Oscillator: phase lookup */
uint32_t idx = (v->phase >> 16) & 1023;
int32_t osc = (int32_t)sine_table[idx]; /* sign-extended */
v->phase += v->phase_inc;
/* Envelope */
int32_t enveloped = (osc * v->env) >> 15;
v->env += v->env_inc;
/* One-pole LPF: state = state + coef * (input - state), where coef ∈ [0,1) */
int32_t diff = enveloped - v->filter_state;
v->filter_state += (int32_t)(((int64_t)diff * v->filter_coef) >> 31);
return v->filter_state;
}
16.2 Standard RV64GC (leaf-optimised, no frame)
A typical compilation for the leaf function. The compiler recognises voice_render_one as a leaf (no calls), so it omits the prologue/epilogue and ra save entirely:
voice_render_one:
; --- oscillator ---
lwu t0, 0(a0) ; t0 = phase
srli t1, t0, 16
andi t1, t1, 1023 ; idx = (phase >> 16) & 1023
slli t1, t1, 1 ; idx * 2 (sine entries are 16-bit)
.La:
auipc t2, %pcrel_hi(sine_table)
addi t2, t2, %pcrel_lo(.La)
add t2, t2, t1
lh t3, 0(t2) ; osc = sine_table[idx] (sign-extended)
lwu t4, 4(a0) ; phase_inc
add t0, t0, t4 ; phase += phase_inc
sw t0, 0(a0) ; store back
; --- envelope ---
lw t5, 8(a0) ; env
mul t3, t3, t5 ; osc * env
srai t3, t3, 15 ; >> 15
lw t6, 12(a0) ; env_inc
add t5, t5, t6 ; env += env_inc
sw t5, 8(a0)
; --- filter ---
lw t1, 16(a0) ; filter_state
sub t2, t3, t1 ; diff = enveloped - state
lw t4, 20(a0) ; filter_coef
mul t2, t2, t4
srai t2, t2, 31 ; >> 31
add t1, t1, t2 ; state += diff * coef >> 31
sw t1, 16(a0) ; store new state
mv a0, t1 ; return state
ret
26 instructions.
16.3 FireStorm Single-Sample Render
For a single-sample leaf function, the available FireStorm wins are limited. The dominant Xcrisp instructions (auto-inc, load-op-store) target loop patterns and in-place memory updates — neither applies cleanly to this shape, where struct fields are loaded once, used in arithmetic, and stored back. The only clear win is LAPC replacing the standard PIC pair for the sine table base:
voice_render_one: ; leaf function, no frame management needed
; --- oscillator ---
lwu t0, 0(a0) ; phase
srli t1, t0, 16
andi t1, t1, 1023
slli t1, t1, 1
LAPC t2, sine_table ; replaces auipc + addi (saves 1)
add t2, t2, t1
lh t3, 0(t2)
lwu t4, 4(a0) ; phase_inc
add t0, t0, t4 ; phase += phase_inc
sw t0, 0(a0)
; --- envelope ---
lw t5, 8(a0)
mul t3, t3, t5
srai t3, t3, 15
lw t6, 12(a0)
add t5, t5, t6
sw t5, 8(a0)
; --- filter ---
lw t1, 16(a0)
sub t2, t3, t1
lw t4, 20(a0)
mul t2, t2, t4
srai t2, t2, 31
add t1, t1, t2
sw t1, 16(a0)
mv a0, t1
ret
25 instructions — a ~4% reduction. The two multiplies and the LUT lookup arithmetic dominate and are not addressed by the current FireStorm extensions. (A future load-op-multiply extension would be the natural next step for DSP-heavy workloads — added to the v0.2 candidate list.)
16.4 Where the Real Wins Live: Block Rendering
In practice, audio code does not call voice_render_one once per sample — it renders blocks of typically 32, 64, or 128 samples in a single call. Block rendering moves the state into registers across the inner loop and exposes auto-increment opportunities on the output buffer.
C source:
void voice_render_block(voice_t *v, int32_t *output, int n) {
uint32_t phase = v->phase;
uint32_t phase_inc = v->phase_inc;
int32_t env = v->env;
int32_t env_inc = v->env_inc;
int32_t state = v->filter_state;
int32_t coef = v->filter_coef;
for (int i = 0; i < n; i++) {
uint32_t idx = (phase >> 16) & 1023;
int32_t osc = sine_table[idx];
int32_t enveloped = (osc * env) >> 15;
int32_t diff = enveloped - state;
state += (int32_t)(((int64_t)diff * coef) >> 31);
output[i] = state;
phase += phase_inc;
env += env_inc;
}
v->phase = phase;
v->env = env;
v->filter_state = state;
}
Standard inner loop (state in callee-saved registers, sine_base preloaded outside loop):
.Lloop:
srli t0, s0, 16 ; phase = s0
andi t0, t0, 1023
slli t0, t0, 1
add t0, s6, t0 ; s6 = sine_base
lh t1, 0(t0) ; osc
mul t1, t1, s2 ; osc * env (s2 = env)
srai t1, t1, 15 ; enveloped
sub t2, t1, s4 ; diff = enveloped - state (s4 = state)
mul t2, t2, s5 ; diff * coef (s5 = coef)
srai t2, t2, 31
add s4, s4, t2 ; state += ...
sw s4, 0(a1) ; output[i] = state
addi a1, a1, 4 ; output++
add s0, s0, s1 ; phase += phase_inc (s1 = phase_inc)
add s2, s2, s3 ; env += env_inc (s3 = env_inc)
addi a2, a2, -1 ; i counter (n decrementing)
bnez a2, .Lloop
17 instructions per iteration.
Plus prologue/epilogue to save callee-saved registers s0–s6 (7 regs), and one LAPC-equivalent for the sine_table base outside the loop.
Standard prologue: 1 addi (sp adjust) + 7 sd + 2 (auipc + addi for sine_base) + 6 (load state into s-regs) ≈ 16 instructions of setup. Standard epilogue: 7 ld + 1 addi + 3 sw (write state back) + ret ≈ 12 instructions.
FireStorm inner loop with SWPI (auto-inc store):
.Lloop:
srli t0, s0, 16
andi t0, t0, 1023
slli t0, t0, 1
add t0, s6, t0
lh t1, 0(t0)
mul t1, t1, s2
srai t1, t1, 15
sub t2, t1, s4
mul t2, t2, s5
srai t2, t2, 31
add s4, s4, t2
SWPI s4, 4(a1) ; output[i] = state, output++ (saves 1)
add s0, s0, s1
add s2, s2, s3
addi a2, a2, -1
bnez a2, .Lloop
16 instructions per iteration (SWPI replaces sw + addi).
In wide mode, the compiler can keep state in caller-saved x32–x63 instead of callee-saved s-registers, avoiding the prologue/epilogue saves entirely. Combined with LAPC:
FireStorm prologue: 1 LAPC + 6 register-direct loads ≈ 7 instructions. FireStorm epilogue: 3 sw + ret = 4 instructions.
16.5 Comparison
For a block of 64 samples:
| Implementation | Setup | Inner loop (×64) | Teardown | Total |
|---|---|---|---|---|
| Standard RV64GC | 16 | 1088 (17×64) | 12 | 1116 |
| FireStorm (Xcrisp + Xwide) | 7 | 1024 (16×64) | 4 | 1035 |
~7% reduction for a 64-sample block. The win is dominated by the inner-loop saving from SWPI, plus the wide-mode elimination of the callee-saved spill in the function frame.
The remaining inner-loop cost is dominated by:
- Two multiplies per sample (no load-op-multiply yet — v0.2 candidate)
- Three additions for state update (phase, env, output write)
- LUT lookup arithmetic (4 instructions to index a 16-bit table)
These are the patterns the next FireStorm DSP extension would target. The current extensions deliver a real but modest single-digit-percent win on this workload.
16.6 Where FireStorm Really Helps in Audio: Buffer Operations
Audio pipelines spend a substantial fraction of time on simple buffer operations — mixing voices into a master bus, applying gain, summing effect sends. These are the patterns where FireStorm's MMW (load-op-store) and auto-increment instructions shine.
Two-voice mix into output buffer, with auto-increment on the source pointer and load-op-store for the in-place accumulation:
C source:
void mix_into(int32_t *bus, const int32_t *voice, int n) {
for (int i = 0; i < n; i++) {
bus[i] += voice[i];
}
}
Standard inner loop:
.Lloop:
lw t0, 0(a0) ; bus[i]
lw t1, 0(a1) ; voice[i]
add t0, t0, t1
sw t0, 0(a0) ; store back to bus[i]
addi a0, a0, 4
addi a1, a1, 4
addi a2, a2, -1
bnez a2, .Lloop
8 instructions per element.
FireStorm with LWPI and MMWADD:
.Lloop:
LWPI t0, 4(a1) ; t0 = voice[i], voice++
MMWADD [a0], [a0], t0 ; bus[i] += t0 (in-place)
addi a0, a0, 4
addi a2, a2, -1
bnez a2, .Lloop
5 instructions per element — a 37% reduction in inner-loop instruction count. For a master mix that sums 32 voices into a stereo bus across a 64-sample block (2 × 32 × 64 = 4096 elements processed), that's roughly 12 000 fewer instructions per audio block. At a 48 kHz / 64-sample block rate (750 blocks/sec), that's a measurable reduction in DSP load.
17. Conditional Execution Patterns (Xcond)
The Xcond extension eliminates short conditional branches by attaching predicates directly to R-type ALU operations. The wins are concentrated in inner loops with data-dependent conditionals (where standard code suffers from branch mispredicts) and in short arithmetic idioms that would otherwise require either a branch or a bit-twiddle sequence.
All Xcond examples in this section are wide-mode-only. The cycle figures assume a 4-cycle mispredict penalty (per §2 methodology) and a 50% mispredict rate for fully data-dependent branches on random input.
17.1 Sum of Positive Samples (Conditional Accumulate)
A common DSP / statistics primitive: sum only the positive elements of a signed array. The textbook expression of "accumulate when predicate holds" — and the inner conditional branch is exactly the case Xcond targets.
17.1.1 C Source
int64_t sum_positive(const int32_t *buf, int n) {
int64_t sum = 0;
for (int i = 0; i < n; i++) {
if (buf[i] > 0) sum += buf[i];
}
return sum;
}
17.1.2 Standard RV64GC
sum_positive:
li a2, 0 ; sum = 0
slli a3, a1, 2
add a3, a0, a3 ; end = buf + n
beq a0, a3, .Lret
.Lloop:
lw a4, 0(a0)
blez a4, .Lskip ; data-dependent branch
add a2, a2, a4
.Lskip:
addi a0, a0, 4
bne a0, a3, .Lloop
.Lret:
mv a0, a2
ret
Inner loop: 4 instructions on take-skip path, 5 on add path. For random-sign data, the average path length is 4.5 instructions. The blez is fully data-dependent — for unbiased random input it mispredicts ~50% of the time, costing 2 cycles per iteration on the reference pipeline. Effective cost: ~6.5 cycles per iteration.
17.1.3 FireStorm (Xcrisp + Xcond)
sum_positive:
li a2, 0
slli a3, a1, 2
add a3, a0, a3
beq a0, a3, .Lret
.Lloop:
LWPI t0, 4(a0) ; t0 = *a0; a0 += 4
ADD.cond a2, t0, GT_RS1 ; if (t0 > 0) sum += t0
bne a0, a3, .Lloop
.Lret:
mv a0, a2
ret
Inner loop: 3 instructions, no inner branch. The loop branch at the bottom is highly predictable (predicted-taken until the last iteration). Effective cost: ~3 cycles per iteration.
17.1.4 Comparison
| Implementation | Inner-loop instr | Cycles/iter (random-sign data) |
|---|---|---|
| Standard RV64GC | 4–5 | ~6.5 |
| FireStorm (Xcrisp + Xcond) | 3 | ~3 |
~2× speedup on random-sign data. The win is the elimination of the inner branch entirely, plus the merge of pointer-advance and load via LWPI.
For data with strong bias (e.g., 95% positive samples), the standard code's branch becomes predictable and the speedup shrinks; for unbiased data, the speedup approaches 2×.
17.2 L1 Norm (Sum of Absolute Values)
Sum-of-absolute-values is the L1 norm of a vector — used for signal energy estimation, robust statistics, simple distance metrics, and many DSP primitives.
17.2.1 C Source
int64_t l1_norm(const int32_t *buf, int n) {
int64_t sum = 0;
for (int i = 0; i < n; i++) {
int32_t v = buf[i];
sum += (v < 0) ? -v : v;
}
return sum;
}
17.2.2 Standard RV64GC (Zbb available)
l1_norm:
li a2, 0
slli a3, a1, 2
add a3, a0, a3
beq a0, a3, .Lret
.Lloop:
lw t0, 0(a0)
neg t1, t0 ; t1 = -t0
max t0, t0, t1 ; t0 = max(t0, -t0) = abs(t0) (Zbb)
add a2, a2, t0
addi a0, a0, 4
bne a0, a3, .Lloop
.Lret:
mv a0, a2
ret
Inner loop: 6 instructions. Branchless via the neg/max idiom — no mispredict cost. (The non-Zbb form using sraiw/xor/sub is 1 instruction longer.)
17.2.3 FireStorm (Xcrisp + Xcond)
l1_norm:
li a2, 0
slli a3, a1, 2
add a3, a0, a3
beq a0, a3, .Lret
.Lloop:
LWPI t0, 4(a0)
RSUB.cond t0, x0, LT_RD ; if (t0 < 0) t0 = 0 - t0 = -t0 (abs in 1 instr)
add a2, a2, t0
bne a0, a3, .Lloop
.Lret:
mv a0, a2
ret
Inner loop: 4 instructions. RSUB-cond does abs in a single instruction by predicating 0 − t0 on the test t0 < 0.
17.2.4 Comparison
| Implementation | Inner-loop instr | Cycles/iter |
|---|---|---|
| Standard RV64GC (Zbb) | 6 | ~6 |
| FireStorm (Xcrisp + Xcond) | 4 | ~4 |
~33% inner-loop reduction. This is the canonical example of where Xcond's RSUB-cond beats Zbb: abs is a 2-instruction idiom in Zbb (neg + max) but a 1-instruction operation with RSUB-cond.
17.3 Phase Accumulator with Wrap-Around
A phase accumulator that increments by a fixed step and wraps modulo a period — the fundamental building block of every oscillator, NCO, LFO, and many counter/scheduler patterns.
17.3.1 C Source
uint32_t phase_step(uint32_t phase, uint32_t inc, uint32_t period) {
phase += inc;
if (phase >= period) phase -= period;
return phase;
}
(Pre-condition: inc < period, so the wrap is at most one subtraction. This is the common case for fixed-step phase advance.)
17.3.2 Standard RV64GC
phase_step:
add a0, a0, a1 ; phase += inc
bltu a0, a2, .Lend ; if (phase < period) skip
sub a0, a0, a2 ; phase -= period
.Lend:
ret
3 instructions + branch. The branch frequency is inc/period per call — for typical audio NCOs this is a few percent (unpredictable in detail; modern predictors do reasonably well but not perfectly).
17.3.3 FireStorm (Xcond)
phase_step:
add a0, a0, a1
SUB.cond a0, a2, GEU ; if (phase >= period) phase -= period
ret
2 instructions, no branch. Mode TC condition GEU compares a0 against a2 as unsigned (correct for unsigned phase / period), and SUB-cond conditionally subtracts.
17.3.4 Comparison and Inlined Use
Function-call overhead aside, the inlined body is the relevant comparison:
| Implementation | Body instr | Branch |
|---|---|---|
| Standard RV64GC | 3 | 1 (data-dependent) |
| FireStorm (Xcond) | 2 | 0 |
For a synth oscillator updating phase 64 times per sample block, the savings compound: 64 calls × 1 instruction + 64 branches saved. For an FM synth with multiple operators per voice and many voices, the cycle reduction is real.
17.4 Saturating Decrement (Retry Counter / Time Slice)
Decrement a counter, clamped at zero. Used for retry counters, time-slice quanta, hysteresis timers — anywhere a "tick toward zero, then stop" semantic is wanted.
17.4.1 C Source
void tick(int32_t *counter) {
if (*counter > 0) (*counter)--;
}
17.4.2 Standard RV64GC (branchless with Zbb)
tick:
lw t0, 0(a0)
snez t1, t0 ; t1 = (t0 != 0) ? 1 : 0
sub t0, t0, t1 ; t0 -= t1
sw t0, 0(a0)
ret
5 instructions (including load/store). The branchless form using snez works because t0 > 0 implies t0 != 0 for any non-negative input; for a counter that never goes negative, this is correct.
Alternative branchy form:
tick:
lw t0, 0(a0)
blez t0, .Lend
addi t0, t0, -1
sw t0, 0(a0)
.Lend:
ret
5 instructions with a branch when condition false. Either form is 5 instructions total.
17.4.3 FireStorm (Xcond with Xcrisp MMW)
tick:
LWADD.cond ... -- NOT AVAILABLE in v0.1 (predicated load-op-store deferred)
Without predicated load-op-store, the cleanest Xcond form is:
tick:
li t1, 1
lw t0, 0(a0)
SUB.cond t0, t1, GT_RD ; if (t0 > 0) t0 -= 1
sw t0, 0(a0)
ret
5 instructions. The Xcond saves nothing here — Zbb's snez is already optimal for this exact pattern.
17.4.4 Where Xcond Saves: Inline in a Hot Loop
The Xcond form wins when the operation is inlined into a loop with a pre-loaded register holding 1, and the load/store are absent:
/* Inside a loop, processing 'count' items down to zero, with side effects */
for (int item = 0; item < n; item++) {
if (timeslice > 0) timeslice--;
process(items[item], timeslice);
}
Inner-loop body for the timeslice decrement:
- Standard (Zbb):
snez t1, timeslice+sub timeslice, timeslice, t1= 2 instructions - Xcond:
SUB.cond timeslice, one_reg, GT_RD= 1 instruction
Saves 1 instruction per iteration when the constant 1 is hoisted out of the loop.
This example illustrates an important Xcond limitation: single-occurrence wins are small. The big gains come from inline use in inner loops, where the predicated form replaces a multi-instruction Zbb idiom or a branch + body pair.
17.5 Where Xcond Does Not Help
Honest reporting of patterns where Xcond was considered but does not deliver a win:
17.5.1 min / max (Tie with Zbb)
a = min(a, b) is one instruction with Zbb (min a, a, b) and one instruction with Xcond (MOV.cond a, b, LT). No improvement — use Zbb for clarity and narrow-mode portability.
17.5.2 General Select (c ? a : b)
The classic three-register select cannot be expressed in Xcond's R-type predicated form: rd doubles as the test operand and destination, leaving only rs1 as the source. Use Zicond (czero.eqz + czero.nez + or) instead — 3 instructions in narrow mode, available wherever Zicond is implemented.
17.5.3 Cross-Register Predicate
Patterns of the form "test register A, modify register B" do not fit Xcond:
if (status & READY_BIT) counter++; // test 'status', modify 'counter'
The predicate must operate on rd (and optionally rs1); there is no provision for a third register to hold the test value. Use a branch:
andi t0, status, READY_BIT
beqz t0, .Lend
addi counter, counter, 1
.Lend:
If the branch is highly predictable (one direction dominates), this is fine. If it is not, Zicond can synthesise the equivalent in 3 instructions.
17.5.4 Predicated Memory Operations
v0.1 does not predicate loads, stores, or load-op-store. A "sparse gather" pattern (load only when an index is in-range) needs a branch around the load:
if (idx < table_size) result = table[idx];
This is deferred to a possible v0.2 Xcond extension (open item §11 of ee_xcond).
18. Hardware Context Switching (Xctx)
The Xctx extension replaces software register-save/restore sequences with a single hardware instruction. The instruction-count wins are dramatic on context-switch-heavy code; the cycle-count wins are more modest because hardware context switches still take real cycles (~30 — see §2 methodology). The compelling cases are:
- Fiber-heavy code where many yields per second amortise the BSRAM-backed switch cost.
- Blocking synchronisation (producer/consumer, mutex, condition variables) where HALT/RESUME beats the spin-yield pattern by orders of magnitude.
- Interrupt-driven I/O where ISR-to-wake latency matters.
- Preemptive kernels where the hardware slice timer eliminates the entire timer-tick ISR.
The single biggest non-cycle win is simplification: kernel and fiber-runtime code shrinks substantially, with downstream wins in prefetch-buffer footprint, code maintenance, and predictability.
18.1 Cooperative Fiber Yield
The fundamental operation in any fiber or coroutine library.
18.1.1 C Source
typedef struct {
void *sp;
uint64_t saved[13]; /* ra + s0..s11 */
} fiber_t;
void yield_to(fiber_t *from, fiber_t *to);
18.1.2 Standard RV64GC (Hand-Tuned)
A best-effort hand-tuned yield_to saves and restores all callee-saved registers through the fiber's saved-state block (kept in DRAM):
yield_to:
sd sp, 0(a0)
sd ra, 8(a0)
sd s0, 16(a0)
sd s1, 24(a0)
sd s2, 32(a0)
sd s3, 40(a0)
sd s4, 48(a0)
sd s5, 56(a0)
sd s6, 64(a0)
sd s7, 72(a0)
sd s8, 80(a0)
sd s9, 88(a0)
sd s10, 96(a0)
sd s11, 104(a0)
ld sp, 0(a1)
ld ra, 8(a1)
ld s0, 16(a1)
ld s1, 24(a1)
ld s2, 32(a1)
ld s3, 40(a1)
ld s4, 48(a1)
ld s5, 56(a1)
ld s6, 64(a1)
ld s7, 72(a1)
ld s8, 80(a1)
ld s9, 88(a1)
ld s10, 96(a1)
ld s11, 104(a1)
ret
29 instructions in the function body. Caller cost: 2 cycles (jal + ret).
Cycle cost on FireStorm: the saved-state block lives in DRAM. With a small 8 KB direct-mapped D-cache, the stack frame may stay in cache between successive yields if the fiber's working-set fits and is accessed regularly — in which case the cost is ~30 cycles. If the cache line gets evicted (a likely outcome for many-fiber schedulers where each fiber's saved state competes for cache slots), the cost rises to 50–100 cycles depending on DRAM latency. Realistic average for fiber-heavy workloads: 40–70 cycles per yield plus unpredictable jitter.
(For comparison, on a richly-cached RV64GC implementation with abundant L1 capacity: a consistent ~30 cycles. FireStorm's tiny D-cache helps, but doesn't fully hide the cost.)
18.1.3 FireStorm (Xctx)
yield_to: ; no longer takes any arguments
YIELD
ret ; (returns to whatever code follows the YIELD
; in the resumed context)
Actually, yield_to(from, to) in the software-fiber API specifies which fiber to switch to. Xctx uses a queue model — YIELD pushes self onto the ready queue and pulls the next ready context, without naming a target. For fiber libraries that want to mimic the explicit-target API, the wrapper is:
yield_to: ; a0 = (unused with Xctx), a1 = target ctx ID
; (compiler arranges target to be runnable)
YIELD ; single instruction
ret
1 instruction in the body, ~30 cycles to execute (on Ant64).
18.1.4 Comparison
| Implementation | Instructions | Cycles on FireStorm | DRAM bandwidth |
|---|---|---|---|
| Software fiber (yield_to) | 29 | 40–70 typical (D-cache dependent) | 13 stores + 13 loads through cache |
| FireStorm (Xctx YIELD) | 1 | ~30 (BSRAM-bounded, deterministic) | 0 (BSRAM-resident) |
The cycle win on FireStorm is real but moderate (the tiny D-cache helps somewhat); the bigger wins are predictability and footprint:
- Cycle count: 40–70 → ~30. Roughly 1.5–2.5× speedup per yield, and the Xctx path is deterministic while the software-fiber path varies with D-cache state.
- Instruction count: 29 → 1. Every yield call site is ~120 bytes of code smaller. For a fiber-heavy program with many yield points, this is meaningful prefetch-buffer footprint savings.
- D-cache pressure: 13 lines → 0. Each context switch in the software version touches over 100 bytes of stack data, evicting useful unrelated data from the small 8 KB D-cache. Xctx state lives in BSRAM and consumes zero D-cache capacity. For a hot loop that yields frequently between many fibers, this is the dominant practical win — the D-cache is small enough that fiber-state churn quickly saturates it.
- Predictability. Software yield cost varies with D-cache state (30 cycles hot, 70+ cold); Xctx YIELD is always ~30 cycles regardless of memory traffic or D-cache state.
For a runtime doing 100,000 yields/second at 200 MHz, the instruction-count win plus the D-cache-eviction avoidance compounds easily reaches 10–15% real-world speedup in fiber-heavy workloads, mainly because the D-cache stops thrashing on fiber-state churn.
18.2 Task Creation
Spawning a new fiber or task.
18.2.1 C Source
fiber_t *spawn(void (*entry)(void *), void *arg, size_t stack_size) {
fiber_t *f = malloc(sizeof(fiber_t));
void *stack = malloc(stack_size);
fiber_setup(f, stack, stack_size, entry, arg);
add_to_ready_queue(f);
return f;
}
The interesting cost is fiber_setup plus add_to_ready_queue; malloc is the same in both worlds.
18.2.2 Standard RV64GC
Software fiber setup writes a fake saved-state block so the first yield_to into the new fiber lands at entry:
fiber_setup: ; a0 = fiber_t*, a1 = stack base,
; a2 = stack_size, a3 = entry, a4 = arg
add t0, a1, a2 ; sp_top = stack + size
addi t0, t0, -16
sd a4, 0(t0) ; push arg onto stack
sd t0, 0(a0) ; f->sp = sp_top - 16
sd a3, 8(a0) ; f->saved[0] = entry (becomes ra on first switch)
sd x0, 16(a0) ; zero s0
sd x0, 24(a0)
sd x0, 32(a0)
sd x0, 40(a0)
sd x0, 48(a0)
sd x0, 56(a0)
sd x0, 64(a0)
sd x0, 72(a0)
sd x0, 80(a0)
sd x0, 88(a0)
sd x0, 96(a0)
sd x0, 104(a0) ; zero s11
ret
add_to_ready_queue: ; a0 = fiber_t*
ld t0, ready_tail ; (assume some queue structure)
sd a0, 0(t0)
addi t0, t0, 8
sd t0, ready_tail
ret
Setup: 17 instructions. Queue add: 5 instructions. Total: 22 instructions (plus malloc, which is the same in both implementations).
18.2.3 FireStorm (Xctx)
spawn_xctx: ; a0 = entry, a1 = stack_top (with arg already pushed),
; entry ABI: a0 = arg
NEW t0, a0, a1 ; t0 = new context ID (or -1 if pool exhausted)
ret
1 instruction (NEW also automatically pushes the new context onto the ready queue, per §5.4 of ee_xctx).
Note that NEW initialises the new context with all registers zeroed except PC = rs1 and x2 (sp) = rs2. The standard convention has the entry function read arg from a0. To pass an arg, the caller pushes it onto the stack before NEW, or constructs a small trampoline.
A cleaner pattern uses the new context's argument-register zeroing as a feature: the entry function can read arg from a known stack offset:
; Caller: spawn(entry, arg)
spawn:
li t0, STACK_SIZE
add a1, a1, t0 ; sp_top = stack + size
addi a1, a1, -8
sd a4, 0(a1) ; push arg
NEW t0, a0, a1
ret
; Entry function trampoline:
entry_trampoline:
ld a0, 0(sp) ; pop arg
addi sp, sp, 8
j actual_entry ; or call
Caller side: 5 instructions instead of 22. 77% reduction in spawn overhead (excluding the malloc cost which is identical).
18.2.4 Comparison
| Implementation | Setup instructions |
|---|---|
| Standard RV64GC fiber | 22 (setup + queue add) |
| FireStorm Xctx | 5 (1 NEW + arg push) |
For programs that spawn frequently (event-handling, parallel pipelines), the spawn cost is in the hot path.
18.3 Producer-Consumer Blocking
The "spinning yield" anti-pattern vs proper HALT/RESUME synchronisation.
18.3.1 C Source
A producer-consumer pair sharing a bounded buffer:
int buf[N];
int head, tail;
void produce(int item) {
while ((head + 1) % N == tail) {
wait_for_consumer(); /* block until consumer drains */
}
buf[head] = item;
head = (head + 1) % N;
signal_consumer();
}
int consume(void) {
while (head == tail) {
wait_for_producer();
}
int item = buf[tail];
tail = (tail + 1) % N;
signal_producer();
return item;
}
The interesting cost is the implementation of wait_for_* and signal_*.
18.3.2 Standard RV64GC (Spin-Yield Pattern)
Without proper blocking primitives, wait_for_* spins by yielding:
void wait_for_consumer(void) {
yield_to(scheduler); /* costs 29 instructions per call */
}
void signal_consumer(void) {
/* no-op: consumer will check on its next yield */
}
Wait cost when blocked: 29 instructions × N (where N = number of yields before consumer drains). In a tight scenario with both fibers active, N could be 1–10. Worst case (consumer not running): unbounded polling.
18.3.3 Standard RV64GC (Proper Blocking — Kernel Mutex)
A real implementation uses kernel mutexes:
wait_for_consumer:
addi sp, sp, -16
sd ra, 8(sp)
la a0, producer_cond
la a1, buffer_mutex
call cond_wait ; goes through syscall path
ld ra, 8(sp)
addi sp, sp, 16
ret
Syscall path: ~50–100 instructions for the user-to-kernel-and-back transition, plus the kernel's scheduler code. Total wait cost: ~150–300 cycles.
18.3.4 FireStorm (Xctx HALT/RESUME)
int producer_id = -1, consumer_id = -1;
void wait_for_consumer(void) {
asm volatile("CTXID %0" : "=r"(producer_id));
asm volatile("HALT"); /* will be RESUMEd by consumer */
}
void signal_consumer(void) {
if (consumer_id >= 0) {
int c = consumer_id;
consumer_id = -1;
asm volatile("RESUME %0" :: "r"(c));
}
}
In assembly:
wait_for_consumer:
CTXID t0
la t1, producer_id
sw t0, 0(t1)
HALT
ret
signal_consumer:
la t1, consumer_id
lw t0, 0(t1)
bltz t0, .Lend
li t2, -1
sw t2, 0(t1)
RESUME t0
.Lend:
ret
wait_for_consumer: 4 instructions + HALT. The HALT context-switches once (~30 cycles); the producer then sleeps until consumer issues RESUME.
signal_consumer: 5 instructions + RESUME. The RESUME just adds the woken context to the ready queue (~3 cycles); the woken context runs at the next natural scheduling point.
18.3.5 Comparison
| Implementation | Wait cost | Signal cost | Notes |
|---|---|---|---|
| Spin-yield | 29 × N instructions | 0 | N = waits before consumer drains; can be 1 to ∞ |
| Kernel cond_wait | ~150–300 cycles | ~150–300 cycles | Each goes through full syscall |
| Xctx HALT/RESUME | ~35 cycles | ~10 cycles | No syscall, no polling |
For a producer that blocks 10 times waiting for a consumer to drain:
- Spin-yield: 290 instructions of pointless yielding.
- Kernel cond_wait: 1500–3000 cycles (one wait, but expensive).
- Xctx: 35 cycles total (one HALT, one RESUME from consumer).
The Xctx pattern is 40–100× cheaper than kernel cond_wait and avoids the unbounded spin-yield trap entirely.
18.4 Interrupt-Driven I/O Completion
The classic "task waits on I/O; ISR wakes it" pattern.
18.4.1 C Source
ssize_t blocking_read(int fd, void *buf, size_t len) {
if (has_data(fd)) {
return immediate_read(fd, buf, len);
}
register_waiter(fd, current_task_id());
wait_for_io(); /* HALT-like */
return immediate_read(fd, buf, len);
}
/* ISR for device that produces data on fd */
void device_isr(int fd) {
int waiter = lookup_waiter(fd);
if (waiter >= 0) {
clear_waiter(fd);
signal_io(waiter); /* RESUME-like */
}
}
18.4.2 Standard RV64GC ISR (Preemptive Kernel)
Realistic ISR for a preemptive kernel that may reschedule on I/O completion:
device_isr:
addi sp, sp, -128
; save all caller-saved registers (~15 stores)
sd ra, 0(sp)
sd t0, 8(sp)
sd t1, 16(sp)
; ... 12 more saves ...
sd a7, 112(sp)
; ISR body
csrr a0, mcause ; or read device register
call get_irq_source
call lookup_waiter
bltz a0, .Lnowaiter
mv s0, a0
li a0, -1
call clear_waiter
mv a0, s0
call scheduler_make_ready ; queue the waiter
call scheduler_should_preempt
beqz a0, .Lnopreempt
call scheduler_yield_to ; full context switch (29 instr)
.Lnopreempt:
.Lnowaiter:
; restore caller-saved
ld ra, 0(sp)
ld t0, 8(sp)
; ... 13 more loads ...
ld a7, 112(sp)
addi sp, sp, 128
mret
ISR overhead (entry+exit save/restore): ~30 instructions. Body: ~10–40 instructions depending on preempt path. Total: 40–70 instructions per ISR.
Cycle cost: 50–100 cycles for the no-preempt path, 100–150 cycles for the preempt path.
18.4.3 FireStorm (Xctx)
The ISR doesn't need to context-switch; it just queues the woken task:
device_isr:
addi sp, sp, -32
sd ra, 0(sp)
sd t0, 8(sp)
sd t1, 16(sp)
sd a0, 24(sp)
; ISR body
csrr a0, mcause
call get_irq_source ; a0 = fd
call lookup_waiter ; a0 = waiter ctx ID or -1
bltz a0, .Lnowaiter
mv t0, a0
li a0, -1
call clear_waiter
RESUME t0 ; wake the task (single instr, ~3 cycles)
.Lnowaiter:
ld ra, 0(sp)
ld t0, 8(sp)
ld t1, 16(sp)
ld a0, 24(sp)
addi sp, sp, 32
mret
ISR overhead (save/restore): ~10 instructions (only the registers we use). Body: ~6 instructions + RESUME. Total: ~16 instructions.
Cycle cost: ~20 cycles.
18.4.4 Comparison
| Implementation | ISR instructions | ISR cycles |
|---|---|---|
| Standard RV64GC (preemptive kernel) | 40–70 | 50–150 |
| FireStorm (Xctx RESUME) | ~16 | ~20 |
ISR latency reduced by 3–7×. For interrupt-driven workloads (network I/O, audio interrupts, sensor readings), this directly improves throughput: more time for the ISR's actual purpose, less time in scheduling boilerplate.
The Xctx ISR is also smaller (fewer caller-saved registers touched), which improves prefetch-buffer locality of the trap-handler code and reduces trap-entry overhead pressure on the pipeline.
18.5 Preemptive Timer-Tick Elimination
For preemptive multitasking, the standard approach uses a hardware timer firing periodically; an ISR decides whether to preempt. With Xctx's hardware slice timer, no ISR runs at all — the hardware switches contexts directly.
18.5.1 Standard RV64GC (Software Timer ISR)
timer_isr:
addi sp, sp, -128
; save all caller-saved (~15 stores)
sd ra, 0(sp)
; ...
sd a7, 112(sp)
csrr t0, mtime
li t1, NEXT_TICK
add t1, t0, t1
csrw mtimecmp, t1 ; rearm timer
call decrement_quantum
call scheduler_should_preempt
beqz a0, .Lnopreempt
call scheduler_choose_next
call scheduler_yield_to ; 29 instr context switch
.Lnopreempt:
; restore caller-saved
ld ra, 0(sp)
; ...
ld a7, 112(sp)
addi sp, sp, 128
mret
Per-tick cost (no preempt): ~40 instructions, ~60 cycles. Per-tick cost (preempt): ~70 instructions, ~120 cycles.
At a 1 kHz tick rate (1 ms ticks), this is ~60–120k cycles per second of pure scheduling overhead at 1 GHz — 0.006–0.012% of CPU.
At a 10 kHz tick rate, it's 10× that = 0.06–0.12% of CPU.
For aggressive sub-millisecond preemption (10–100 kHz), the timer-ISR overhead becomes meaningful.
18.5.2 FireStorm (Xctx Hardware Slice Timer)
; Kernel setup
csrw mxctxslice, t0 ; t0 = slice in cycles
; ... no ISR registration needed for preemption
That's it. No timer ISR exists for slicing purposes. Hardware decrements the slice counter; on zero, it executes the YIELD-equivalent automatically. Switch cost: 30 cycles on Ant64 (the BSRAM transfer cost), with zero software instructions.
(The kernel may still have a timer ISR for other purposes — e.g., wall-clock time advancement, watchdog kicks — but those are independent of preemption and can run at much lower frequencies.)
18.5.3 Comparison
For a 10 kHz preemption rate:
| Implementation | Per-tick instructions | Per-second instruction overhead |
|---|---|---|
| Standard RV64GC | 40–70 | 400k–700k instructions/sec |
| FireStorm (slice timer) | 0 | 0 instructions/sec (only the 30-cycle BSRAM switch cost remains) |
The Xctx kernel reclaims roughly 0.06–0.1% of CPU at this preemption rate. At 100 kHz preemption (10 µs slices for highly interactive workloads), the savings scale to 0.6–1.0% of CPU — a real gain for hard-real-time and high-fan-out workloads.
Secondary wins:
- No jitter. Software timer ISRs vary in cost depending on the path taken; the hardware slice timer is exact.
- Smaller kernel. The entire timer-tick ISR can be deleted (or kept for wall-clock at a low rate, decoupled from preemption).
- No ISR-entry/exit cost. Each saved tick avoids the trap-entry/exit overhead which is itself non-trivial on aggressive pipelines.
18.6 Verification Summary
| Example | Pattern | Standard | FireStorm Xctx | Win |
|---|---|---|---|---|
| §18.1 Cooperative yield | yield_to call body | 29 instructions (40–70 cycles) | 1 instruction (~30 cycles) | 1.5–2.5× faster + deterministic |
| §18.2 Task spawn | Fiber setup + enqueue | 22 instructions | 5 instructions | Faster spawn |
| §18.3 Producer/consumer block | Wait via kernel cond_wait | ~150 cycles per wait | ~35 cycles per HALT+RESUME | 4×–100× faster sync |
| §18.4 I/O completion ISR | Wake task on interrupt | 40–70 instructions | ~16 instructions | 3×–7× lower ISR latency |
| §18.5 Preemptive timer-tick | 10 kHz ISR overhead | 400k–700k instr/sec | 0 software, ~30-cycle switch only | 0.06–1.0% CPU reclaimed |
The Xctx wins are concentrated in synchronisation-heavy code (§18.3) and kernel-internal infrastructure (§18.4, §18.5). For pure computation, Xctx neither helps nor hurts.
19. Xmath: 8-Tap FIR Filter with MADD
This complements §15 (which showed wide-mode register-pressure benefit on a FIR filter); here we focus on the arithmetic acceleration from Xmath's integer fused multiply-add. The example uses Q15 fixed-point coefficients and 16-bit input samples, producing a 32-bit output.
19.1 C Source
int32_t fir8_q15(const int16_t *samples, const int16_t *coeffs) {
int32_t acc = 0;
for (int i = 0; i < 8; i++) {
acc += (int32_t)samples[i] * (int32_t)coeffs[i];
}
return acc;
}
19.2 Standard RV64GC
The inner loop body per tap:
.Lloop:
lh t0, 0(a0) ; load sample (sign-extend)
lh t1, 0(a1) ; load coefficient
mul t2, t0, t1 ; multiply (3 cycles on DSP-backed unit)
add a2, a2, t2 ; accumulate
addi a0, a0, 2 ; advance sample pointer
addi a1, a1, 2 ; advance coefficient pointer
addi a3, a3, -1
bnez a3, .Lloop
Per tap: 8 instructions, ~10 cycles with MUL latency. Eight taps unrolled: ~80 cycles plus loop overhead. Fully unrolled and pipelined by the compiler: ~50 cycles.
19.3 FireStorm with Xmath MADD + Xcrisp Auto-Inc
Combined with Xcrisp's auto-incrementing loads, the inner loop body becomes:
.Lloop:
LHPI t0, 2(a0) ; load sample, post-increment (Xcrisp)
LHPI t1, 2(a1) ; load coefficient, post-increment (Xcrisp)
MADD a2, t0, t1, a2 ; acc = acc + sample * coef (Xmath, 1 instruction, 2-3 cycle latency)
addi a3, a3, -1
bnez a3, .Lloop
Per tap: 5 instructions, ~5 cycles with MADD's pipelined throughput (one MADD result per cycle once started). Eight taps: ~15 cycles + setup. Fully unrolled by compiler with the help of dual-issue: ~12 cycles.
19.4 Comparison
| Implementation | Instructions per tap | Cycles for 8-tap filter |
|---|---|---|
| Standard RV64GC | 8 | ~50 |
| FireStorm Xmath + Xcrisp | 5 | ~12 |
~4× speedup on FIR throughput. Per-sample cost for an 8-tap filter at 48 kHz: ~30K cycles/sec at standard speed → ~7K cycles/sec with Xmath. The savings scale linearly with filter length — a 64-tap filter (audio room reverb component) sees the same 4× ratio but on much larger absolute work.
For 128 voices each with an 8-tap filter, the savings are roughly 6% of total CPU at 380 MHz. For room-reverb (4 channels × 64-tap), savings approach 15% of CPU.
20. Xmath: 3D Vector Normalisation with VNORM3
Normalising a 3D vector is a per-pixel or per-vertex operation in lighting and rendering setup, performed many thousands of times per frame in any non-trivial 3D game.
20.1 C Source
typedef struct { float x, y, z; } vec3;
vec3 normalize(vec3 v) {
float len_sq = v.x*v.x + v.y*v.y + v.z*v.z;
float inv_len = 1.0f / sqrtf(len_sq);
return (vec3){ v.x * inv_len, v.y * inv_len, v.z * inv_len };
}
20.2 Standard RV64GC (with F/D)
normalize:
fmul.s ft0, fa0, fa0 ; x*x
fmadd.s ft0, fa1, fa1, ft0 ; + y*y
fmadd.s ft0, fa2, fa2, ft0 ; + z*z = len_sq
fsqrt.s ft1, ft0 ; sqrt (typically 20+ cycles on FPU)
fdiv.s ft2, f1, ft1 ; 1/sqrt (typically 15+ cycles)
fmul.s fa0, fa0, ft2 ; x normalised
fmul.s fa1, fa1, ft2
fmul.s fa2, fa2, ft2
ret
Cycle count: 3 (FMA chain) + 20 (sqrt) + 15 (div) + 3 (multiplies, pipelined) = ~41 cycles.
20.3 FireStorm with Xmath VNORM3
normalize: ; vec3 in f0, f1, f2 (a 3-tuple at f0)
VNORM3.S f0, f0 ; one instruction, 8 cycles
ret
VNORM3 internally:
- Computes
v.x² + v.y² + v.z²(3 internal FMAs, ~3 cycles) - Computes
1/sqrt(len_sq)via FRSQRT (~3 cycles) - Scales each component (3 parallel multiplies, ~2 cycles with multi-port writeback)
Total: 8 cycles including the internal forwarding.
20.4 Comparison
| Implementation | Cycles | Instructions |
|---|---|---|
| Standard RV64GD | ~41 | 8 (plus internal pipeline of sqrt/div) |
| FireStorm Xmath VNORM3 | 8 | 1 |
~5× speedup on per-vertex / per-pixel normalisation. For a Phong-shaded renderer doing per-pixel lighting at 1080p60 with normal interpolation, this is the difference between using ~20% of CPU on normalisations alone and ~4%.
The win comes from two sources: (1) VNORM3 is one instruction rather than 8, saving decode bandwidth and register pressure; (2) FRSQRT is a 3-cycle approximation vs. the 35+ cycles needed for separate FSQRT + FDIV at IEEE-754-correct precision. For visual rendering, the 0.1% accuracy of FRSQRT is far better than the precision of the colour pipeline; the loss is invisible.
21. Xmath: Ray-Triangle Intersection (Möller-Trumbore)
The ray-triangle test is the fundamental primitive for picking, raycasting bullets, line-of-sight checks, and BSP traversal in 3D games. The Möller-Trumbore algorithm is the standard branchless implementation.
21.1 C Source
bool ray_triangle(vec3 ro, vec3 rd, vec3 v0, vec3 v1, vec3 v2, float *out_t) {
vec3 edge1 = sub(v1, v0);
vec3 edge2 = sub(v2, v0);
vec3 h = cross(rd, edge2);
float a = dot(edge1, h);
if (fabsf(a) < EPSILON) return false;
float f = 1.0f / a;
vec3 s = sub(ro, v0);
float u = f * dot(s, h);
if (u < 0 || u > 1) return false;
vec3 q = cross(s, edge1);
float v = f * dot(rd, q);
if (v < 0 || u + v > 1) return false;
float t = f * dot(edge2, q);
if (t > EPSILON) { *out_t = t; return true; }
return false;
}
21.2 Standard RV64GD
Each operation maps to multiple instructions:
| Operation | Instructions | Cycles |
|---|---|---|
sub(v1, v0) |
3× FSUB | 3 |
sub(v2, v0) |
3× FSUB | 3 |
cross(rd, edge2) |
6× FMUL + 3× FSUB | ~9 |
dot(edge1, h) |
3× FMUL + 2× FADD (or 2× FMA + 1 FMUL) | ~3 |
1 / a |
FDIV | ~15 |
sub(ro, v0) |
3× FSUB | 3 |
dot(s, h) |
~3 | 3 |
f * dot() |
FMUL | 1 |
cross(s, edge1) |
~9 | 9 |
dot(rd, q) |
~3 | 3 |
f * dot() |
FMUL | 1 |
dot(edge2, q) |
~3 | 3 |
f * dot() |
FMUL | 1 |
Plus EPSILON compares (~6 cycles).
Total: ~62 cycles per ray-triangle test (well-pipelined; out-of-order completion via scoreboarding hides some latency).
21.3 FireStorm with Xmath
ray_triangle:
VSUB3.S edge1, v1, v0 ; 3 cycles
VSUB3.S edge2, v2, v0 ; 3 cycles
CROSS3.S h, rd, edge2 ; 10 cycles
DOT3.S a, edge1, h ; 6 cycles
; if (|a| < EPS) return false
FRECIP.S f, a ; 3 cycles
VSUB3.S s, ro, v0 ; 3 cycles
DOT3.S ft0, s, h ; 6 cycles
fmul.s u, f, ft0 ; 1 cycle
; clamp tests
CROSS3.S q, s, edge1 ; 10 cycles
DOT3.S ft0, rd, q ; 6 cycles
fmul.s v, f, ft0 ; 1 cycle
; clamp tests
DOT3.S ft0, edge2, q ; 6 cycles
fmul.s t, f, ft0 ; 1 cycle
; final EPSILON test
ret
Cycle accounting (with scoreboarding pipelining the FMA chain through the FP unit): the dependency-chain depth determines the critical path. Through forwarding and scoreboarding, the cross-products and dot-products overlap. ~45 cycles for the full test on Ant64.
21.4 Comparison
| Implementation | Instructions | Cycles |
|---|---|---|
| Standard RV64GD | ~28 (excluding compares) | ~62 |
| FireStorm Xmath | ~14 | ~45 |
~1.4× speedup, ~50% fewer instructions. For a game raycasting against 500 triangles per query (line-of-sight from one entity to another, for example), this saves ~9000 cycles per query = ~24 microseconds at 380 MHz. For 60 raycast queries per frame at 60 fps: ~1.4 ms saved, or ~9% of frame budget for that workload.
The bigger win is instruction count: half the I-cache pressure, half the decode bandwidth. In a tight raycasting inner loop where the prefetch buffer is competing with the rest of the renderer, the smaller code footprint helps disproportionately.
22. Xmath: A* Pathfinding with OCTILE2 Heuristic
A* pathfinding on a 2D grid is the workhorse algorithm for tile-based games, RTS unit movement, and dungeon-crawler enemy AI. The dominant per-node cost is heuristic computation.
22.1 C Source
// A* inner loop: expand current node, push neighbours with estimated cost
typedef struct { int32_t x, y; } point;
int32_t octile_dist(point a, point b) {
int32_t dx = abs(a.x - b.x);
int32_t dy = abs(a.y - b.y);
int32_t d_max = (dx > dy) ? dx : dy;
int32_t d_min = (dx > dy) ? dy : dx;
return d_max * 1024 + d_min * 424; // approximating (√2 - 1) * 1024
}
22.2 Standard RV64GC
octile_dist:
sub t0, a0, a2 ; dx = a.x - b.x
sra t1, t0, 31 ; sign mask
xor t0, t0, t1
sub t0, t0, t1 ; abs(dx) (3 instructions for ABS)
sub t2, a1, a3 ; dy = a.y - b.y
sra t3, t2, 31
xor t2, t2, t3
sub t2, t2, t3 ; abs(dy)
; max and min via compare-and-branch or branchless
blt t0, t2, .Lswap ; if dx < dy, swap
mv t4, t0 ; d_max = dx
mv t5, t2 ; d_min = dy
j .Lcompute
.Lswap:
mv t4, t2
mv t5, t0
.Lcompute:
li t6, 1024
mul t4, t4, t6 ; d_max * 1024
li t6, 424
mul t5, t5, t6 ; d_min * 424
add a0, t4, t5
ret
Per call: ~16 instructions, ~12 cycles (with the conditional branch potentially mispredicting).
In A*'s inner loop with 8 neighbours per expansion and ~3000 nodes per query:
3000 × 8 × 12 = 288,000 cycles per pathfind = ~0.76 ms at 380 MHz
22.3 FireStorm with OCTILE2
octile_dist:
; Inputs: a0 = a (packed x:32, y:32), a1 = b (packed)
OCTILE2 a0, a0, a1 ; 3 cycles, full computation
ret
Per call: 1 instruction, 3 cycles. In A*'s inner loop:
3000 × 8 × 3 = 72,000 cycles per pathfind = ~0.19 ms at 380 MHz
22.4 Comparison
| Implementation | Instructions per heuristic | Cycles per heuristic | Cycles per pathfind (3000 nodes × 8 neighbours) |
|---|---|---|---|
| Standard RV64GC | 16 | 12 | 288,000 |
| FireStorm Xmath | 1 | 3 | 72,000 |
*~4× speedup on A throughput.** For a strategy game with 50 AI units pathfinding once per second, this saves ~11 ms per second = ~1% of CPU. For a tower-defence with many concurrent paths, the savings scale linearly.
The cleaner story is the inner loop becomes branchless and predictable: one instruction, fixed 3-cycle latency, no branch-predictor pollution from the max/min compare. This is significant for code that runs inside scoreboarded multi-cycle operations — A*'s heuristic computation no longer competes for branch-predictor entries with the rest of the game logic.
23. Xmath: BAM-Based Sprite Rotation
A 2D sprite rotating once per frame at constant angular velocity — the classic top-down game enemy turning to face the player, or a propeller, or an animated icon.
23.1 C Source
typedef struct { float sin_a, cos_a; } rotation;
rotation update_rotation_fp(rotation r, float dt, float turn_rate) {
static float angle = 0.0f; // persistent
angle += turn_rate * dt;
if (angle >= 2.0f * 3.14159f) angle -= 2.0f * 3.14159f;
r.sin_a = sinf(angle);
r.cos_a = cosf(angle);
return r;
}
// BAM-based variant
typedef struct { uint32_t angle_bam; } rotation_bam;
rotation update_rotation_bam(rotation_bam *r, uint32_t turn_rate_bam) {
r->angle_bam += turn_rate_bam; // perfect modular wraparound
// call site uses FSINCOSBAM to get sin/cos for the matrix
}
23.2 Standard RV64GC + FP Library
update_rotation_fp:
flw ft0, angle ; load angle
fmul.s ft1, fa1, fa0 ; turn_rate * dt
fadd.s ft0, ft0, ft1 ; angle += turn_rate * dt
li a0, 0x40C90FDB ; 2π in hex
fmv.w.x ft2, a0
flt.s t0, ft0, ft2 ; angle < 2π ?
bnez t0, .Lnomod
fsub.s ft0, ft0, ft2 ; angle -= 2π
.Lnomod:
fsw ft0, angle
fmv.s fa0, ft0 ; argument for sin
call sinf ; ~50-100 cycles (library)
fsw fa0, sin_a
flw fa0, angle
call cosf ; ~50-100 cycles
fsw fa0, cos_a
ret
Per frame: ~12 instructions of inline code + 2 library calls (~150 cycles total).
23.3 FireStorm with BAM + FSINCOSBAM
update_rotation_bam:
; r->angle_bam is at 0(a0); turn_rate_bam in a1
lw t0, 0(a0) ; load angle_bam
addw t0, t0, a1 ; angle += turn_rate (automatic wraparound!)
sw t0, 0(a0) ; store
; Now compute sin/cos for rendering:
FSINCOSBAM.S fa0, t0 ; fa0 = sin, fa1 = cos (3 cycles)
ret
Per frame: 4 instructions, ~7 cycles total.
23.4 Comparison
| Implementation | Instructions | Cycles |
|---|---|---|
| FP-radians + library sin/cos | ~12 + 2 calls | ~150 |
| BAM + FSINCOSBAM | 4 | ~7 |
~20× speedup on rotation update. For a game with many rotating elements — sprite enemies, particle systems, projectile orientation, UI animation, oscillator phase in a wavetable synth — the cumulative savings are enormous.
Beyond the cycle count, two qualitative wins:
-
Perfect angular precision over time. FP angle accumulation drifts: each frame's
angle += turn_rate * dtadds a small floating-point error, and over thousands of frames the rotation phase drifts measurably. BAM accumulation is bit-exact forever —angle_bam += turn_rate_bamwraps with perfect modular semantics, no precision loss possible. -
Branchless wraparound. The FP version needs a compare-and-conditional-subtract; the BAM version is just an integer add that naturally wraps in 2's complement. No branch-predictor entry, deterministic timing.
For retro-style code (rotozoomers, wavetable oscillators, particle effects), BAM is the natural primitive. FireStorm is one of the only modern architectures with hardware BAM trig.
24. Xmath: Quaternion Bone Update
Per-character skeletal animation: walking the bone hierarchy and computing each bone's world-space orientation by composing parent and local rotations as quaternions.
24.1 C Source
typedef struct { float w, x, y, z; } quat;
// Multiply two quaternions
quat quat_mul(quat a, quat b) {
quat r;
r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
return r;
}
// Per-bone update
void update_bone(bone *b) {
b->world_quat = quat_mul(b->parent->world_quat, b->local_quat);
}
24.2 Standard RV64GD
quat_mul compiles to 16 FMUL + 12 FADD/FSUB = 28 FP operations. With FMA, half the FMULs become FMAs but the total count and dependency depth stay similar. Best case with scoreboarding:
~28 instructions, ~25 cycles per quaternion multiplication.
24.3 FireStorm with QMUL.S
update_bone:
; a0 = bone pointer; parent's world_quat at offset 16, local_quat at offset 32, world_quat output at offset 48
addi t0, a0, 16 ; t0 → parent world_quat
addi t1, a0, 32 ; t1 → local_quat
flw.4 f0, 0(t0) ; load 4-tuple (4 FP loads — wide LD4 if available, else 4 separate)
flw.4 f4, 0(t1)
QMUL.S f8, f0, f4 ; world = parent_world × local (8 cycles)
fsw.4 f8, 48(a0) ; store 4-tuple
ret
~12 instructions, ~14 cycles per bone (mostly memory load/store; QMUL itself is 8 cycles).
24.4 Comparison
| Implementation | Instructions per bone | Cycles per bone |
|---|---|---|
| Standard RV64GD | ~36 (load + math + store) | ~32 |
| FireStorm Xmath QMUL | ~12 | ~14 |
~2.3× speedup on skeletal animation. For a character with 50 bones at 60 fps:
| Implementation | Cycles per character | % CPU at 380 MHz |
|---|---|---|
| Standard | 1,600 | 0.025% |
| Xmath | 700 | 0.011% |
For 20 characters on screen, total skeletal-animation CPU goes from ~0.5% to ~0.22%. Small absolutely, but at higher character counts (action games with 100+ characters, large-scale tactical games) the savings scale linearly and become significant.
QMUL's bigger contribution is clarity and density: animation code becomes shorter and more readable, leaving the compiler more flexibility for register allocation and the prefetch buffer less pressure from animation-code footprint.
25. Summary of Wins
The following table summarises the instruction-count savings across the examples in this document:
| Example | Pattern | Standard | FireStorm | Reduction |
|---|---|---|---|---|
| §3 Buffer sum | Inner loop | 4 | 3 | 25% |
| §4 Mask in-place | Inner loop | 5 | 3 | 40% |
| §5 Bias transform | Inner loop | 6 | 4 | 33% |
| §6 Strlen | Inner loop | 3 | 2 | 33% |
| §7 Table search | Inner loop | 4 | 3 | 25% |
| §8 Memcpy | Total | ~6/byte | 1 | dramatic |
| §9 Function frame | Overhead | 13 | 2 | 85% |
| §10 Trap handler | Overhead | 34 | 3 | 91% |
| §11 Coroutine yield | Total | ~28 | 5 | 82% |
| §12 Global address | Pattern | 2 | 1 | 50% |
| §13 Indirect call | Total | 8 | 4 | 50% |
| §14 Virtual dispatch | Total | 8 | 4 | 50% |
| §15 FIR-8 | Body | ~40 | ~32 | ~20% |
| §16.3 Synth voice (one sample) | Total | 26 | 25 | ~4% |
| §16.5 Synth block (64 samples) | Total | 1116 | 1035 | ~7% |
| §16.6 Buffer mix | Inner loop | 8 | 5 | 37% |
| §17.1 Sum positive | Inner loop | 4–5 + mispredict | 3 | ~2× speedup |
| §17.2 L1 norm | Inner loop | 6 | 4 | 33% |
| §17.3 Phase wrap | Body | 3 + branch | 2 | 33% + branch eliminated |
| §17.4 Saturating dec | Inline body | 2 | 1 | 50% (small) |
| §18.1 Fiber yield | Body | 29 | 1 | 97% |
| §18.2 Task spawn | Setup + enqueue | 22 | 5 | 77% |
| §18.3 Producer block | Per wait+wake | ~150 cycles | ~45 cycles | ~3× faster |
| §18.4 I/O ISR | Total ISR | 40–70 | ~16 | 60–77% |
| §18.5 Timer-tick ISR | Per tick | 40–70 | 0 | 100% (ISR eliminated) |
| §19 FIR-8 with MADD | 8-tap filter | ~50 cycles | ~12 cycles | ~4× speedup |
| §20 Vector normalise | Full normalise | ~41 cycles | ~8 cycles | ~5× speedup |
| §21 Ray-triangle test | Full test | ~62 cycles, ~28 instr | ~45 cycles, ~14 instr | 1.4× speed, 50% fewer instr |
| §22 A* heuristic | OCTILE2 distance | 16 instr, ~12 cycles | 1 instr, 3 cycles | ~4× speedup |
| §23 Sprite rotation | Per-frame update | ~150 cycles (FP + lib) | ~7 cycles (BAM) | ~20× speedup |
| §24 Quaternion bone | Per-bone update | ~32 cycles | ~14 cycles | ~2.3× speedup |
The patterns showing the largest wins are: control-flow and frame management (§9, §10, §11) targeted by Xstack; PIC-heavy dispatch (§13, §14) targeted by Xcrisp PIC; branchy inner loops (§17.1) targeted by Xcond; synchronisation/scheduling (§18) targeted by Xctx; and arithmetic-heavy game/audio inner loops (§19–§24) targeted by Xmath. The patterns showing the smallest wins (§16.3, §16.5) are dense arithmetic kernels where the standard instruction mix is multiply/shift/add — even those see meaningful improvement now that Xmath provides MADD and saturating arithmetic. The patterns in between (§3–§7, §16.6, §17.2, §17.3) cover loop kernels and buffer/predicated operations where Xcrisp and Xcond deliver consistent 25–50% reductions plus branch elimination.
Honest takeaway: FireStorm extensions deliver large wins across the full workload spectrum:
- Control-flow / frame management → Xstack (~50% on prologue/epilogue)
- PIC-heavy dispatch → Xcrisp PIC (~50% on indirect calls)
- Memory-loop code → Xcrisp (25–50% on buffer kernels)
- Branchy predicated loops → Xcond (2× on mispredict-prone loops)
- Scheduling / synchronisation → Xctx (60–100% on ISR / yield paths)
- Game / audio / DSP math → Xmath (4–20× on transcendentals, FMA chains, BAM rotation, A* heuristics, quaternion math)
In aggregate, on typical modular system code with a realistic mix of arithmetic, control flow, memory access, concurrency, and game-state math, the combined extensions deliver roughly 40–60% reduction in instruction count and a comparable or greater reduction in cycle count once branch-mispredict savings, ISR elimination, and Xmath's transcendental speedups are factored in. Game inner loops — physics integration, AI pathfinding, animation, audio synthesis — see proportionally larger wins thanks to Xmath's fused operations and approximations.
26. What These Examples Don't Show
This document focuses on instruction-count and cycle-count wins of individual code patterns. There are several systemic wins that are harder to quantify but real:
- Prefetch buffer pressure: smaller code means more of the working set fits in the buffer pool (§4 of ee_cpu), fewer fetch stalls in larger programs.
- D-cache pressure for Xstack and Xctx: hardware stacks and context state live in dedicated BSRAM, not DRAM, so push/pop and context-switch traffic doesn't pass through the small 8 KB D-cache. This is critical because the D-cache is tiny — every avoided cache-line eviction matters disproportionately compared to a cache-rich architecture.
- Scratchpad placement: hot data structures placed in scratchpad BSRAM (§5.3 of ee_cpu) — voice state, filter coefficients, sample LUTs, scheduler queues — get single-cycle access with no cache pressure. The compiler's
.bsramsection attribute makes this idiomatic. - Branch predictor pressure for Xcond: every branch eliminated by predication is one fewer entry competing for predictor table capacity. In branch-heavy code (interpreters, parsers, OS dispatch), this compounds across the working set.
- Translator state amortisation for Xlate: a translator configured once outside a hot loop benefits every iteration. The setup cost is paid once; the per-access cost is zero.
- Predictable preemption for Xctx: the hardware slice timer gives exact preemption boundaries — useful for hard real-time workloads where software-timer-jitter is unacceptable.
- Multi-core load balancing (future): when Xctx v0.2 specifies multi-core handoff, the ready queue is shared and load balancing is automatic — no software dispatcher needed.
- Power: fewer cycles spent in front-end fetch and decode means proportionally less switching activity.
- Compiler flexibility: with the extended register file the compiler has more options, often finding better schedules than what hand-optimised assembly would have chosen.
- Trap latency variability: Xstack's bounded prologue cost and Xctx's hardware slice timer both make interrupt and preemption latency more predictable.
These wins do not appear in any single example but compound across a system's lifetime.
End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension, FireStorm Xctx Extension, FireStorm Xmath Extension.