FireStorm: An Engineering Overview
FireStorm is the CPU at the heart of the Ant64 platform — a 64-bit RISC-V core with a stack of custom extensions targeting the specific workloads Ant64 cares about: real-time audio synthesis, retro emulation, graphics, modular system software, and creative tools. It is implemented in a GoWin GW5AST FPGA and runs the standard RV64GC instruction set as its baseline, augmented by seven extensions that together address most of the historical pain points in classic RISC and CISC architectures.
This page explains what makes FireStorm different and how it compares to the architectures it is most often measured against: ARM (AArch64), Motorola 68000-family, MIPS R3000/R4000, and vanilla RISC-V.
Design Philosophy
FireStorm follows what we call the retro-modern approach: take a clean modern RISC base, then layer on the addressing modes and primitives that classic architectures got right, while leaving behind the things that hurt them at scale. The result has the orthogonality and pipeline-friendliness of RISC, the addressing-mode richness of CISC, and a handful of features inspired by architectures we've worked with over the decades but that nobody else has put together in one design.
Three principles drive every decision:
- The 80% case is one instruction. If a code pattern dominates a real workload, it should compile to one instruction, not a sequence. Auto-incrementing loads, in-place memory updates, conditional accumulation, hardware push/pop: all are single instructions in FireStorm because they're things that get written millions of times in real code.
- No hidden cost. Every instruction takes a predictable number of cycles. There are no microcoded surprises, no exception-driven slow paths for common operations, no "this works on paper but trap-and-emulate in practice." If FireStorm says PUSH ra+s0..s7 completes in two cycles, that's what it does.
- Make the compiler's job easy. With 64 general-purpose registers in wide mode, hardware-managed stacks, and fused memory operations, the compiler has more options and less plumbing to emit. The same C code typically produces 25–40% fewer instructions than on standard RV64GC.
Memory Architecture: Harvard + Scratchpad + Tiny Cache
FireStorm's memory subsystem is unconventional and worth explaining. There are four memory regions, two serving instruction fetch and two serving data, each tuned for different workloads:
- Instruction fetch uses a pool of small BSRAM-backed prefetch buffers (8), each holding a contiguous range of recently-fetched code. There is no traditional I-cache. Each buffer has its own BSRAM port, so concurrent fetch and background refill from different buffers happen in parallel.
- Wide-mode SRAM holds code only — a Harvard restriction. Data loads/stores to the 36-bit SRAM range trap (except for M-mode code-deposit paths for JIT and loader). This simplifies the memory subsystem and prevents accidental code corruption.
- Scratchpad BSRAM (32 KB) is a directly-addressable fast region at a known address. Software places hot data structures there at compile time: audio voice state, filter coefficients, sample LUTs, scheduler tables. Single-cycle access, wide port.
- D-cache is a small 8 KB direct-mapped write-through cache covering DDR3 data accesses. It catches the patterns the scratchpad can't predict — pointer chasing through tree nodes, hash-table buckets, library data, dynamically-allocated objects.
This split is deliberate. A traditional cache hierarchy hides DRAM latency at the cost of tag RAM, replacement state, coherence logic, complex state machines around traps and DMA, and unpredictable timing. FireStorm's split does most of the work with simpler structures:
- Most hot data is in BSRAM (Xstack frames, Xctx contexts, scratchpad-resident application data). No cache needed; single-cycle by construction.
- Bulk DRAM data uses a tiny cache. Write-through direct-mapped is nearly trivial to verify and implement.
- Code fetch uses prefetch buffers. Predictable, pinnable for real-time guarantees, deterministic miss latency.
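To show how little state a direct-mapped write-through cache needs, here is a minimal host-side model in C. The 8 KB size comes from the text; the 64-byte line size is an assumption (inferred from the later statement that sixteen 32-bit keys fill one cache line), and every name in the sketch is hypothetical rather than part of any FireStorm toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

#define DC_SIZE_BYTES 8192u                           /* from the text: 8 KB D-cache */
#define DC_LINE_BYTES 64u                             /* assumption: 64-byte lines   */
#define DC_NUM_LINES  (DC_SIZE_BYTES / DC_LINE_BYTES) /* 128 lines                   */

/* Write-through means no dirty bits: a valid flag and a tag per line is all. */
typedef struct {
    bool     valid[DC_NUM_LINES];
    uint64_t tag[DC_NUM_LINES];
} dcache_model;

/* Direct-mapped: each address can live in exactly one line. */
static inline uint32_t dc_index(uint64_t addr)
{
    return (uint32_t)((addr / DC_LINE_BYTES) % DC_NUM_LINES);
}

static inline uint64_t dc_tag(uint64_t addr)
{
    return addr / DC_SIZE_BYTES;
}

/* Returns true on a hit; on a miss the line is refilled (no victim write-back). */
static inline bool dc_access(dcache_model *c, uint64_t addr)
{
    uint32_t i = dc_index(addr);
    bool hit = c->valid[i] && c->tag[i] == dc_tag(addr);
    if (!hit) {
        c->valid[i] = true;
        c->tag[i] = dc_tag(addr);
    }
    return hit;
}

/* DMA write to addr: drop the matching line, if present. */
static inline void dc_dma_invalidate(dcache_model *c, uint64_t addr)
{
    uint32_t i = dc_index(addr);
    if (c->valid[i] && c->tag[i] == dc_tag(addr))
        c->valid[i] = false;
}
```

Two addresses exactly 8 KB apart share a line in this model, which is the conflict behaviour that makes the scratchpad the better home for hot data.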
The benefits compound:
- Smaller silicon than a comparable cache hierarchy. Eight 2 KB prefetch buffers plus 8 KB direct-mapped D-cache plus 8–32 KB scratchpad — total roughly 50–80 KB of structured BSRAM, versus the ~100+ KB of cache plus tag-and-control RAM a comparable cached CPU would need.
- Predictable timing. The audio inner loop, with voice state in scratchpad and the loop body pinned in a prefetch buffer, has exactly known cycle counts. No cache replacement can introduce jitter.
- Per-buffer pinning gives deterministic real-time guarantees on top. Pinning a buffer to the trap vector means ISR fetch latency is exactly the pipeline drain.
- Simple coherence. DMA writes auto-invalidate any overlapping prefetch buffer and matching D-cache line; no software flush dance.
- Cache-bypass on demand. Every cached address has an uncached alias at addr | (1ULL << 63). Streaming code (audio buffers, framebuffers, DMA rings) reads/writes through the uncached alias to avoid evicting useful state from the tiny 8 KB D-cache: a single bit in the pointer, no new instructions, no CSR changes.
The cost: workloads with poor locality across a working set larger than the D-cache run slower than they would on a richly-cached architecture. FireStorm is not the right choice for general-purpose computing where this pattern dominates; it is the right choice for the workloads Ant64 targets.
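The uncached alias described above is plain address arithmetic. A minimal sketch in C, operating on integer addresses so it can be tried on any host; the helper and macro names are hypothetical, only the bit position comes from the text.

```c
#include <stdint.h>

/* From the text: bit 63 selects the uncached alias of a cached address. */
#define FS_UNCACHED_BIT (1ULL << 63)

/* Hypothetical helper: same location, bypassing the D-cache. */
static inline uint64_t fs_uncached(uint64_t addr)
{
    return addr | FS_UNCACHED_BIT;
}

/* Hypothetical helper: strip the alias bit to get the cached address back. */
static inline uint64_t fs_cached(uint64_t addr)
{
    return addr & ~FS_UNCACHED_BIT;
}
```

A streaming routine would convert its buffer pointer once with something like fs_uncached() and then use ordinary loads and stores.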
The Seven Extensions
FireStorm's architectural extensions stack on top of RV64GC. They are named with the X-prefix convention common in RISC-V custom extensions.
Xwide — 64 Registers and Wider Immediates in Wide Mode
When the CPU fetches code from 36-bit SRAM (rather than 32-bit DDR3), the extra 4 bits per instruction word do triple duty:
- 64-register access. Most instruction formats use 1–3 of the extra bits to extend register fields, giving access to 64 general-purpose registers and 64 floating-point registers instead of the standard 32 of each. The extra registers are all caller-saved, so they don't affect calling conventions — pure scratch space for code that needs the elbow room. DSP kernels, FIR filters, FFTs, polyphony synthesis, and compiler-intensive optimisation passes benefit.
- Wider immediates. The bits not consumed by register extension widen the immediate fields: LUI/AUIPC/JAL grow from 20-bit to 23-bit immediates (8× larger), and ADDI/loads/stores/branches grow from 12-bit to 14-bit immediates (4× larger). The compiler uses these automatically; many "constant just out of range" cases drop from 2 instructions to 1.
- Per-instruction predicates. R-type instructions reserve one bit as a predication enable, gating the Xcond predicated-execution extension (see below).
For 64-bit constants, two dedicated instructions in the wide-mode escape space (LIZ / LIK, modelled on ARM-A64's MOVZ/MOVK) build arbitrary 64-bit values from 16-bit chunks in 1–4 instructions, versus the 6–8 instructions standard RV64 needs.
The mode is determined by where the code lives: SRAM-resident hot paths get the wide register file and wider immediates; DDR3-resident bulk code uses standard RV64GC. Code compiled with +xfirestorm selects automatically per function.
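The effect of the wider immediates and of LIZ / LIK can be estimated on a host. The sketch below assumes that the immediates are signed two's-complement fields (as in standard RISC-V) and that LIZ / LIK behave like MOVZ / MOVK, one instruction per non-zero 16-bit chunk; both are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v fits in a signed two's-complement field of `bits` bits. */
static inline bool fits_signed(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Instruction count for a 64-bit constant under the MOVZ/MOVK-style
   assumption: LIZ writes one 16-bit chunk and zeroes the rest, each LIK
   patches one further non-zero chunk. */
static inline unsigned liz_lik_count(uint64_t v)
{
    unsigned n = 0;
    for (unsigned shift = 0; shift < 64; shift += 16)
        if ((v >> shift) & 0xFFFFu)
            n++;
    return n ? n : 1; /* zero still needs one LIZ */
}
```

Under these assumptions a constant such as 5000 misses the 12-bit ADDI range but fits imm14, and no constant needs more than four LIZ / LIK instructions.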
Xcrisp — Memory Primitives
A collection of single-instruction memory operations that take 3–7 instructions on standard RV64:
- Auto-increment loads and stores: LWPI rd, off(rs1)+ is "load word, increment pointer" in one instruction (Z80 enthusiasts will recognise this as LDI; 68k veterans as move.l (a0)+,d0).
- Indexed addressing: LWX rd, (base, idx, scale) for scaled-indexed loads up to ×128 stride; covers 2D matrix access, hash probes, struct-array access.
- Memory-fused arithmetic: LWADD rd, (rs1), rs2 is load-then-add; MMWADD [rd], [rs1], rs2 is "read memory, add, write back to a different memory location", which gives single-instruction in-place vector accumulation.
- Block memory: BMCPY and BMSET for synchronous memcpy/memset; DMACPY and DMASET for asynchronous DMA that overlaps with CPU work, with hardware register-tagging so reading the byte-count register shows live progress and writing it stalls until the DMA finishes.
- Compare-mem-branch: BEQM rs1, off(rs2), label does "load and branch if equal" in one instruction, eliminating the load-then-compare pair for tight scan loops.
- B-tree primitives: BSRCH.W rd, key, node finds the first key ≥ target in a 16-key sorted array (one cache line) in 4 cycles, branchless; that is ~12× faster than software search and the foundation for database/index workloads. Variants for 64×8-bit, 32×16-bit, and 8×64-bit keys. Companion BSCAN (equality) and BSHIFT (insert/delete slot shift) round out the family. Lookups on a 5-level B-tree drop from ~600 cycles to ~60.
- Position-independent code primitives: LAPC (load address PC-relative), JALPC (long-range call), CALLM (vtable dispatch), JMPXPC (PC-relative indexed jump for switch tables); every standard auipc + something pair collapses to one instruction.
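As a reference for what the B-tree search primitive computes, here is a branchless C model of the BSRCH.W result as described above. Because the node is sorted, the index of the first key ≥ target equals the number of keys below the target. Treating keys as signed and returning 16 when no key qualifies are assumptions; the function name is hypothetical.

```c
#include <stdint.h>

#define BSRCH_W_KEYS 16 /* from the text: sixteen 32-bit keys per node */

/* Index of the first key >= target in a sorted 16-key node.
   Assumptions: signed keys; result is 16 when every key is smaller. */
static inline unsigned bsrch_w(const int32_t node[BSRCH_W_KEYS], int32_t target)
{
    unsigned below = 0;
    /* All 16 comparisons are independent, which is what lets hardware
       evaluate them side by side instead of looping. */
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++)
        below += (unsigned)(node[i] < target);
    return below;
}
```

In a B-tree descent the returned index would select the child pointer to follow at each level.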
Xstack — Hardware Stacks in BSRAM
Three hardware-managed stacks (user / supervisor / machine), each backed by dedicated FPGA BSRAM with wide ports. A function prologue that saves ra plus s0–s7 is one instruction:
PUSH rlist=01000, spimm=4 ; save ra, s0–s7, allocate 64 bytes of frame
The corresponding epilogue is also one instruction:
POPRET rlist=01000, spimm=4 ; restore ra, s0–s7, free frame, return
Nine registers (ra plus s0–s7) transferred in 2 cycles on Ant64's 288-bit BSRAM port; the standard RV64GC prologue/epilogue pair is 21 instructions (two stack adjustments, nine stores, nine loads, and the return). Interrupt handlers see the same win on entry/exit, and a context switch (full callee-saved spill + USWAP + restore) is roughly 6–10 cycles instead of 30+.
The hardware stacks coexist with the standard x2/DDR3 stack; the compiler picks per function based on size and addressing patterns.
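The two-cycle figure follows from the port width: ra plus s0–s7 is nine 64-bit registers, 576 bits, which is exactly two beats of a 288-bit port. A small sketch of that arithmetic, assuming registers are packed back to back across beats with no per-transfer overhead (an assumption, as is the function name):

```c
#define XSTACK_PORT_BITS 288u /* from the text: 288-bit BSRAM port */
#define XLEN_BITS        64u  /* RV64 register width               */

/* Beats needed to move nregs registers, rounding up to whole beats. */
static inline unsigned xstack_transfer_cycles(unsigned nregs)
{
    unsigned bits = nregs * XLEN_BITS;
    return (bits + XSTACK_PORT_BITS - 1u) / XSTACK_PORT_BITS;
}
```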
Xcond — Conditional Execution
Every R-type ALU instruction in wide mode can be predicated. When the predicate-enable bit in the wide-fetch extension nibble is set, the instruction executes only when a 6-bit condition holds (8 test modes × 8 conditions). Use cases:
ADD.cond sum, t0, GT_RS1 ; if (t0 > 0) sum += t0
SUB.cond counter, limit, GE ; if (counter >= limit) counter -= limit
RSUB.cond x, x0, LT_RD ; if (x < 0) x = 0 - x (abs in one instruction)
For data-dependent inner-loop conditionals on random inputs, this eliminates the branch mispredict entirely — typically a 1.5×–2× speedup on the affected loop. RSUB-cond gives single-instruction abs, beating even Zbb's 2-instruction neg/max sequence.
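For readers more comfortable with C than with the assembly above, the three predicated examples correspond to the branchless forms below. This is a semantic sketch only, using mask arithmetic to mirror "execute or suppress the write-back"; the function names are hypothetical and signedness of the comparisons is assumed.

```c
#include <stdint.h>

/* ADD.cond sum, t0, GT_RS1 : if (t0 > 0) sum += t0 */
static inline int64_t add_if_positive(int64_t sum, int64_t t0)
{
    uint64_t mask = -(uint64_t)(t0 > 0); /* all ones when the predicate holds */
    return (int64_t)((uint64_t)sum + ((uint64_t)t0 & mask));
}

/* SUB.cond counter, limit, GE : if (counter >= limit) counter -= limit */
static inline int64_t wrap_counter(int64_t counter, int64_t limit)
{
    uint64_t mask = -(uint64_t)(counter >= limit);
    return (int64_t)((uint64_t)counter - ((uint64_t)limit & mask));
}

/* RSUB.cond x, x0, LT_RD : if (x < 0) x = 0 - x */
static inline int64_t abs_via_rsub(int64_t x)
{
    uint64_t mask = -(uint64_t)(x < 0);
    return (int64_t)(((uint64_t)x ^ mask) - mask); /* negate only when masked */
}
```

Each function has no data-dependent branch, which is the property the predicated instructions provide in hardware.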
Xlate — Memory Translators
Each register has a software-configurable read translator (applied on load) and write translator (applied on store), drawn from a bank of 12 fixed transformations: identity, nibble swap, bit reverse, byteswap-16/32/64, halfword/word swap, bit-reverse-16/32/64.
No new instructions: once configured, every existing load and store transparently passes data through the translator. Endian conversion for network protocols, bit reversal for SPI peripherals, mixed-endian DSP file formats, BCD digit reorder — all become zero-overhead per access:
__xlate_rd(t0, BSWAP32); // configure once
for (i = 0; i < n; i++) {
lw t0, 0(buf++); // value loaded byte-swapped, no extra instruction
process(t0);
}
The most distinctive feature is the involutory property: setting matching read and write translators on a register gives a "private host-order view" of foreign-format memory, where the program operates on host-order values throughout and the translation happens invisibly at every load and store.
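The involutory property can be checked in a few lines of C. The sketch models a register whose read and write translators are both BSWAP32: every load swaps, every store swaps back, so the program only ever sees host-order values. The function names are hypothetical; real Xlate needs no such calls because the translation happens inside the load/store path.

```c
#include <stdint.h>
#include <string.h>

/* The BSWAP32 transformation; applying it twice is the identity. */
static inline uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Model of a load through a BSWAP32 read translator. */
static inline uint32_t xlate_load32(const uint8_t *mem)
{
    uint32_t raw;
    memcpy(&raw, mem, sizeof raw);
    return bswap32(raw);
}

/* Model of a store through a matching BSWAP32 write translator. */
static inline void xlate_store32(uint8_t *mem, uint32_t v)
{
    uint32_t raw = bswap32(v);
    memcpy(mem, &raw, sizeof raw);
}
```

A read-modify-write such as incrementing a foreign-endian counter becomes load, add, store, with the stored bytes still in the foreign format.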
Xctx — Hardware Context Switching
A pool of hardware-resident execution contexts (32), each holding a full register file + key per-task state in dedicated BSRAM. A context switch is one instruction:
YIELD ; save current context, load next ready context (~30 cycles on Ant64)
HALT ; suspend until externally resumed
NEW rd, rs1, rs2 ; spawn new context at entry PC rs1, stack rs2
RESUME rs1 ; wake a halted context
A cooperative fiber yield is one instruction. Producer/consumer with proper blocking is HALT/RESUME (no kernel call, no syscall path). A preemptive scheduler is the slice-timer CSR plus no software ISR at all — the hardware switches contexts when the slice expires. Multi-core handoff is specified for v0.2: a context preempted on one core can resume on any other free core via a shared ready queue.
Xmath — Games, Audio, DSP Math Acceleration
The Xmath extension targets the math operations that dominate inner loops in games, audio synthesis, fixed-point DSP, and retro-style demoscene code. ~64 instructions across twelve groups:
- Integer fused MAC (MADD, MSUB, MADDH, MADDU, MADDW): the fixed-point DSP workhorse. 1-cycle throughput on DSP-block-backed hardware.
- Saturating arithmetic (ADDS, SUBS, SAT.B, SAT.H, SAT.W, SHIFTSAT): the audio mix-down primitive, clamping to the type range without wrap-around. Audio without saturation produces audible distortion; with it, clean output. SHIFTSAT.H is the universal Q15 → int16 audio output conversion in one instruction.
- Min/Max/Sign/Abs: branchless conditionals, clipping, bounding-box tests in 1 cycle each.
- FP approximations (FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2): game-grade transcendentals at ~0.05% relative error, 3–4 cycle latency. The Quake Q_rsqrt use case becomes a single instruction.
- BAM (Binary Angle Measure) trigonometry (FSINBAM, FCOSBAM, FSINCOSBAM, FRAD2BAM, FBAM2RAD): retro/demoscene-native angle representation. Perfect modular wraparound on integer angles, single-cycle range reduction, the right primitive for rotozoomers, plasma effects, wavetable oscillators, particle rotation.
- 3D Vector Math Bundles (DOT3, DOT4, CROSS3, LENSQ3, LERP): fixed-shape 3D vertex math. One instruction per dot/cross/length, with internal FMA chaining.
- Vector Componentwise Bundles (VADD3, VSUB3, VSCALE3, VMADD3, VNORM3): the physics / steering / collision-response toolkit. position += velocity * dt becomes a single VMADD3 instruction.
- 2D Math Primitives (DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2): the navmesh / top-down / raycaster setup primitives. CROSS2 is the funnel-algorithm core.
- Game / Animation Math (CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP): UI easing, shader uniforms, AI parameter clamping. The Perlin-noise SMOOTHERSTEP for procedural generation.
- Distance Heuristics (MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2): integer A* heuristics on grid worlds. The classic admissible heuristics, one instruction each.
- Quaternion Math (QMUL, QROT): the skeletal-animation primitive. Every bone update per character per frame.
- Multi-precision integer (ADDC, SUBC, ROLC, RORC): add-with-carry / borrow chains and through-carry rotates for 128-bit+ integer math, bignum shifts and crypto. Backed by a single carry CSR (xcarry), FireStorm's only condition-code state, kept deliberately minimal so it doesn't reintroduce the dual-issue-serialising flag register the rest of the design avoids. One instruction per 64-bit limb; no branch-on-carry (the bit is read into a GPR and tested with a standard branch).
MADD acc, sample, coef, acc ; FIR filter tap, 2 cycles
SAT.H out, mix ; audio output clamp, 1 cycle
FRSQRT.S inv_len, lensq ; vector normalisation, 3 cycles
FSINCOSBAM.S sin, angle_bam ; rotation matrix per frame, 3 cycles
DOT4.S result, mrow, vertex ; matrix-times-vector row, 8 cycles
VMADD3.S pos, pos, vel, dt ; physics integration, 4 cycles
OCTILE2 h, neighbour, goal ; A* heuristic, 3 cycles
QMUL.S world_q, parent_q, local_q ; bone composition, 8 cycles
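To make two of these groups concrete, the sketch below gives C reference semantics for saturation and for a carry chain. It is a model under assumptions, not the instruction definitions: the Q15 helper assumes the product is shifted right by 15 before clamping, and the carry is passed as a plain value rather than through the xcarry CSR. All names are hypothetical.

```c
#include <stdint.h>

/* SAT.H-style clamp: limit to the int16 range instead of wrapping. */
static inline int16_t sat_h(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Q15 multiply with saturated output (assumed SHIFTSAT.H-like behaviour).
   Note: >> on a negative value is an arithmetic shift on mainstream compilers. */
static inline int16_t q15_mul_sat(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b; /* Q30 intermediate */
    return sat_h(prod >> 15);
}

typedef struct { uint64_t lo, hi; } u128;

/* One ADDC-style limb: *sum = a + b + carry_in, returns the carry out. */
static inline unsigned addc(uint64_t a, uint64_t b, unsigned carry_in, uint64_t *sum)
{
    uint64_t s = a + b;
    unsigned c1 = s < a;        /* carry from a + b        */
    uint64_t t = s + carry_in;
    unsigned c2 = t < s;        /* carry from adding carry */
    *sum = t;
    return c1 | c2;
}

/* 128-bit add as a two-limb carry chain. */
static inline u128 add128(u128 x, u128 y)
{
    u128 r;
    unsigned c = addc(x.lo, y.lo, 0u, &r.lo);
    (void)addc(x.hi, y.hi, c, &r.hi);
    return r;
}
```

The classic Q15 corner case, -1.0 × -1.0, overflows to +1.0 and is clamped to the largest representable value instead of wrapping negative.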
Xmath is available in both narrow and wide modes — every Xmath instruction works on vanilla RV64GC FireStorm binaries running in DDR3. Wide mode adds access to x32–x63 / f32–f63 but unlocks no additional operations.
Wide-mode-only enhancements within Xmath:
- Xcond predication on all R-type Xmath instructions (G2–G11). One bit in the wide-mode nibble (PRED-EN at bit 35) gates the writeback by a predicate register, giving conditional FRSQRT, conditional VMADD3, conditional QMUL, etc. Branchless conditional math without explicit compare-and-jump.
- Precision-mode bit on G4 FP approximations. Bit 34 of the nibble selects between approximate (3 cycles, ~0.05% error) and refined Newton-Raphson (6 cycles, ~10⁻⁹ error). Same opcode, different precision/speed tradeoff via the .R assembly suffix: game inner loops default to approximate; physics simulation or precision-sensitive code uses .R.
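Newton-Raphson refinement of a reciprocal square root roughly squares the relative error at each step, which is why a few extra cycles buy several orders of magnitude. The sketch below shows the standard iteration in C; how many steps the hardware performs and in what internal precision is not stated in this overview, so treat the step count as an assumption. Function names are hypothetical.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):  y' = y * (1.5 - 0.5 * x * y * y).
   If y has relative error e, y' has relative error of about 1.5 * e * e. */
static inline double rsqrt_refine(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

/* Absolute value of the relative error of approx against exact. */
static inline double rel_err(double approx, double exact)
{
    double d = (approx - exact) / exact;
    return d < 0.0 ? -d : d;
}
```

Starting from a 0.05% estimate, one step lands near 4 × 10⁻⁷ and a second step falls below 10⁻⁹.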
Division of labour with dedicated hardware. The Ant64 platform pairs FireStorm with dedicated drawing/audio chipset hardware that handles pixel-level work (sprite blits, texture mapping, raster operations, audio mixdown). Xmath therefore focuses on the CPU's role: setting up draw calls (transform matrices, frustum culling, visibility), game state math (physics, AI, animation), and collision/pathfinding. The CPU and chipset hardware coordinate via memory-mapped command queues and DMA.
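As one concrete instance of the pathfinding math left to the CPU, the grid heuristics from the Distance Heuristics group have simple integer definitions. The sketch below uses the textbook formulas; the octile scaling of 10 per straight step and 14 per diagonal step is a common convention assumed here, not a documented property of OCTILE2, and the function names are hypothetical.

```c
#include <stdint.h>

/* |a - b| computed in unsigned arithmetic to avoid signed overflow. */
static inline uint32_t abs_diff(int32_t a, int32_t b)
{
    return a > b ? (uint32_t)a - (uint32_t)b : (uint32_t)b - (uint32_t)a;
}

/* 4-connected grid: dx + dy. */
static inline uint32_t manhattan2(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    return abs_diff(x0, x1) + abs_diff(y0, y1);
}

/* 8-connected grid, diagonal cost equal to straight cost: max(dx, dy). */
static inline uint32_t chebyshev2(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    uint32_t dx = abs_diff(x0, x1), dy = abs_diff(y0, y1);
    return dx > dy ? dx : dy;
}

/* 8-connected grid, assumed costs 10 (straight) and 14 (diagonal):
   10 * max(dx, dy) + 4 * min(dx, dy). */
static inline uint32_t octile2(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    uint32_t dx = abs_diff(x0, x1), dy = abs_diff(y0, y1);
    uint32_t hi = dx > dy ? dx : dy;
    uint32_t lo = dx > dy ? dy : dx;
    return 10u * hi + 4u * lo;
}
```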
Xmath occupies the opcode space previously reserved for the RISC-V V (vector) extension. FireStorm does not implement V; for its target workloads (games, audio, retro, demoscene), Xmath's scalar fused operations capture most of the practical benefit V would provide at vastly lower implementation complexity. V could still be added in v0.3+ at a different opcode allocation if a clear data-parallel workload emerges that Xmath doesn't address.
Microarchitecture Highlights
The seven extensions describe what FireStorm does at the ISA level. The microarchitecture that runs them adds three more wins on top:
- Dual-issue execution. In wide mode, two consecutive instructions can issue in the same cycle when they're independent (no register dependency, neither is memory or branch). Both RVC-pair and 32-bit-pair dual-issue are supported. Typical gain: 10–20% IPC on RVC-heavy code, up to 25% on ALU-dense 32-bit code.
- Register scoreboarding. Multi-cycle operations (MUL, DIV, FP arithmetic, D-cache-miss loads) issue to their functional units and mark their destination register as pending; subsequent instructions continue executing on the main pipeline until one tries to read a pending register. This hides the latency of slow operations as long as the result isn't immediately consumed. Typical gain: 10–15% on DSP code, 25–40% on DIV-heavy code, 30–50% on D-cache-miss-bound workloads.
- Hardened DSP-block multipliers. FireStorm uses the GW5AST's hardened DSP blocks (each containing a 27×18 multiplier, a 12×12 auxiliary, and a 48-bit accumulator) for integer MUL and FP FMA. This delivers 1-cycle throughput MUL and FP FMA at the full 380 MHz clock, with 2–3 cycle latency for MUL and 4–5 cycle for FP64 FMA. The GW5AST-138 has 298 DSP blocks total — the CPU uses ~16–20, leaving ~278 for future vector and DSP-extension features.
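The dual-issue rule stated in the first bullet (independent instructions, neither a memory operation nor a branch) can be written as a small predicate. This is a sketch of the stated rule only: treating a shared destination register as a dependency is an assumption, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal decoded-instruction summary; register 0 stands for x0 or "unused". */
typedef struct {
    uint8_t rd, rs1, rs2;
    bool is_mem, is_branch;
} insn_info;

/* Can `second` issue in the same cycle as the immediately preceding `first`? */
static inline bool can_dual_issue(insn_info first, insn_info second)
{
    /* From the text: neither instruction may be a memory op or a branch. */
    if (first.is_mem || first.is_branch || second.is_mem || second.is_branch)
        return false;

    if (first.rd != 0) {
        /* Read-after-write: second consumes first's result. */
        if (second.rs1 == first.rd || second.rs2 == first.rd)
            return false;
        /* Write-after-write: assumed to block pairing as well. */
        if (second.rd == first.rd)
            return false;
    }
    return true;
}
```

A compiler scheduler targeting this core would try to interleave independent ALU chains so that adjacent instructions pass this test.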
Execution model: in-order issue, out-of-order completion. This is sometimes called "shallow OoO" or "in-order superscalar with scoreboarding" in the literature. Closest industry analogues: ARM Cortex-A55 and SiFive U74, both in-order dual-issue cores. FireStorm does not do register renaming, branch speculation past unresolved branches, or use a reorder buffer; those are full-OoO features that would triple the verification surface for ~20% more performance. The shallow-OoO model captures most of the practical gain at a fraction of the complexity.
These compose with the ISA wins (Xwide register pressure relief, Xcrisp memory primitives, Xcond predication, etc.). On representative FireStorm workloads — audio synthesis, retro emulation, interpreter dispatch, OS event loops — the cumulative speedup vs vanilla RV64GC at the same clock can exceed 50%, half from the ISA changes and half from the microarchitecture. The target FPGA clock on GoWin GW5AST is ~380 MHz, set by the BSRAM peak rate; the pipeline (5–7 stages) is balanced to fit.
Compared to ARM (AArch64)
Modern ARM is FireStorm's closest peer in design philosophy. Both are clean 64-bit RISCs with a strong focus on compiler-friendliness and high performance. The differences are concentrated in what each treats as a primitive.
| Feature | AArch64 | FireStorm |
|---|---|---|
| GPRs | 31 (x0–x30 + sp + xzr) | 32 narrow / 64 wide |
| FPRs | 32 (v0–v31, shared with SIMD) | 32 narrow / 64 wide |
| Conditional select | csel, csneg, csinv, cset | Predicated R-type (every ALU op, including Xmath G2–G11) |
| Auto-increment addressing | pre/post-index on loads/stores | LBPI..LDPD, SBPI..SDPD families |
| Indexed addressing | LDR with shifted register offset | LBX..LDX with ×1–×128 scales |
| Load-multiple | LDP/STP (pair only) | PUSH/POP with rlist (up to 32 regs) |
| Hardware stacks | No | Yes: BSRAM-backed, one per privilege level (3 total) |
| Memory-fused arithmetic | No | LWADD, MMWADD families |
| Memory translators | No (rev/rbit instructions only) | Per-register read/write translators |
| Hardware threading | No (software pthreads) | Xctx with hardware ready queue |
| DMA in ISA | No | DMACPY/DMASET with register-tag sync |
| Immediate construction | MOVZ/MOVK 16-bit chunks; 4 instructions for 64-bit | Wide-mode imm14/imm23 + LIZ/LIK (1 instruction for many 32-bit; ≤4 for 64-bit) |
| Instruction fetch | Multi-level cache hierarchy | BSRAM prefetch buffer pool (no I-cache) |
| Data access | Multi-level cache hierarchy | Scratchpad BSRAM + 8 KB tiny D-cache + Harvard SRAM |
| Bulk SIMD | NEON / SVE / SVE2 (full vector engine) | Not in v0.1 (OP-V opcode 0x57 reallocated to Xmath) |
| Integer fused MAC | SMADDL, NEON SDOT | MADD (1 instr, same idea) |
| Saturating arithmetic | SQADD, SSAT, USAT family | ADDS, SAT.B/H/W, SHIFTSAT |
| FP reciprocal / 1/√ | FRECPE, FRSQRTE | FRECIP, FRSQRT (better accuracy: ~0.05% vs ~0.3%); .R refined to ~10⁻⁹ |
| Hardware sin/cos | Library only | FSIN, FCOS, FSINCOS (3 cycles) |
| BAM (binary angle) trig | None | FSINBAM, FCOSBAM, FSINCOSBAM |
| 3D dot product | FMLA chain (3+ instrs) | DOT3 (1 instr) |
| 3D cross product | EXT + FMUL + FSUB (~6 instrs) | CROSS3 (1 instr) |
| Quaternion math | Software | QMUL, QROT (1 instr) |
| Sorted-array search | NEON CMHS + UMINV (~3 instrs, 16 bytes) | BSRCH.W (1 instr, 16 keys per cache line, 4 cyc) |
| Octile pathfinding heuristic | Software | OCTILE2 (1 instr) |
| Memory model | Acquire/release, relaxed/SC | RVWMO inherited from RISC-V |
The areas where AArch64 unambiguously beats FireStorm are bulk SIMD (NEON/SVE have 128- to 2048-bit vector lanes; Xmath operates on scalar register tuples) and ecosystem (decades of Cortex-A optimisation; FireStorm is new). On clock speed, FireStorm is FPGA-bound — Ant64 targets a few hundred MHz, not the multi-GHz of mobile silicon.
Where FireStorm leads is in primitives ARM never added: hardware stacks, memory-fused arithmetic, per-register translators, hardware context switching, BAM trigonometry, quaternion math, single-instruction cross product / vector normalisation / LERP / smoothstep, B-tree node search, and integer pathfinding heuristics. For the workloads Ant64 targets — sustained audio synthesis, retro-emulation, game CPU work (physics / AI / pathfinding / animation), and in-memory indexed data structures — these primitives compound into substantial code-size and cycle-count wins versus equivalent AArch64 implementations.
Concrete comparison: an 8-tap FIR filter inner loop. AArch64 uses NEON for a vectorised 4-sample-at-a-time version that's hard to beat for pure throughput. FireStorm in wide mode does it scalar with Xmath MADD for the inner-product tap plus Xcrisp auto-inc loads — ~12 cycles per output sample vs ~50 on baseline RV64GC. NEON still wins on the high-end ARM, but the FireStorm version uses no SIMD hardware, fits in modest FPGA fabric, and is much simpler to verify formally.
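For reference, this is the loop being compared, written as portable C. It is a sketch assuming Q15 samples and coefficients and a saturated int16 output; the comments mark where the text's MADD and auto-increment loads would apply, and the function names are hypothetical.

```c
#include <stdint.h>

#define FIR_TAPS 8

/* Clamp a wide accumulator to the int16 range. */
static inline int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One output sample of an 8-tap Q15 FIR filter. */
static inline int16_t fir8_q15(const int16_t *samples, const int16_t *coefs)
{
    int64_t acc = 0;
    for (int i = 0; i < FIR_TAPS; i++) {
        /* Two post-increment loads plus one multiply-accumulate per tap. */
        acc += (int32_t)(*samples++) * (int32_t)(*coefs++);
    }
    /* Q30 sum back to Q15 (arithmetic shift assumed), then saturate. */
    return sat16(acc >> 15);
}
```

With eight coefficients of 1/8 each, the filter is a moving average and returns the input level unchanged for a constant signal.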
Quaternion bone update for skeletal animation: AArch64 needs ~28 instructions of NEON or scalar FP for quat_mul; FireStorm's QMUL is 1 instruction at 8 cycles. For a 50-bone character at 60 fps, and counting one cycle per AArch64 instruction, AArch64 spends ~84K cycles per second on animation alone (50 × 28 × 60); FireStorm spends ~24K (50 × 8 × 60), a ~3.5× reduction that scales linearly with character count.
Compared to Motorola 68000 Family
The 68k is the architecture FireStorm shows the most direct lineage to. Anyone who wrote 68k assembly for the Amiga, Atari ST, or Macintosh will recognise multiple FireStorm primitives as direct descendants.
| Pattern | 68k | FireStorm |
|---|---|---|
| Post-increment load | move.l (a0)+,d0 | LWPI x10, 4(x11)+ |
| Pre-decrement store | move.l d0,-(a0) | SWPD x10, 4(x11) |
| Indexed addressing | move.l (8,a0,d1.l),d0 | LWX x10, (x11, x12, ×4) (plus offset via separate add) |
| Register frame | LINK a6,#-N / UNLK a6 | ENTER spimm / LEAVE spimm (Xstack) |
| Move multiple | movem.l d0-d7/a0-a6,-(sp) | PUSH rlist=01100 (Xstack) |
| Conditional move | Scc instructions (set on condition) | MOV.cond rd, rs1, condition |
| PC-relative | move.l label(pc),d0 | LDPC x10, label |
| Bitwise tests then branch | btst #N,d0 + beq | BEQM x10, 0(x11), label (compare-mem-branch) |
What FireStorm inherits from 68k:
- Auto-increment/auto-decrement addressing modes are first-class in both. The 68k versions are general (any data instruction can use (An)+); FireStorm restricts to loads/stores but adds 64-bit width and explicit scale.
- Frame management is built in. 68k's LINK/UNLK and MOVEM map directly to FireStorm's ENTER/LEAVE and PUSH/POP rlist patterns. The 68k version stores frames in DRAM; FireStorm stores them in BSRAM, which is the headline architectural improvement.
- Rich addressing modes. 68k's d(An,Xn.s) and FireStorm's LWX are conceptually the same operation. FireStorm adds wider scales (×16, ×32, ×64, ×128) for matrix and DRAM-burst stride patterns the 68k era didn't anticipate.
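The shared idiom is the one C programmers write as *--p and *p++. A minimal sketch of a downward-growing stack built from exactly those two addressing modes, with the corresponding 68k and FireStorm mnemonics from the table noted in comments; the function names are hypothetical.

```c
#include <stdint.h>

/* Pre-decrement store: move.l d0,-(a0) on 68k, the SWPD family on FireStorm. */
static inline uint32_t *push32(uint32_t *sp, uint32_t value)
{
    *--sp = value; /* adjust the pointer first, then store */
    return sp;
}

/* Post-increment load: move.l (a0)+,d0 on 68k, the LWPI family on FireStorm. */
static inline uint32_t *pop32(uint32_t *sp, uint32_t *out)
{
    *out = *sp++;  /* load first, then adjust the pointer */
    return sp;
}
```

On an architecture without these modes, each call costs a separate pointer-adjust instruction in addition to the load or store.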
What FireStorm leaves behind from 68k:
- Variable-length instructions. 68k instructions are 2–10 bytes depending on addressing mode. This was great for code density but disastrous for pipelining and parallel decode. FireStorm is fixed 32-bit (with optional 16-bit compressed via RVC), trading a small density loss for substantial decode-throughput gain.
- Architectural distinction between data and address registers. 68k's split of D0–D7 and A0–A7 was elegant in 1979 but became an optimisation barrier. FireStorm has a single unified register file.
- Implicit condition codes. 68k operations implicitly update the CCR; FireStorm follows RISC-V in having no general flags register (the single xcarry bit used by Xmath's multi-precision group is the only exception), using explicit comparison results.
The 68k achieved beautiful assembly programming ergonomics; FireStorm aims for the same feel at modern pipeline depths. A direct port of well-written 68k C code to FireStorm typically compiles to similar instruction counts but runs an order of magnitude faster on equivalent silicon.
Compared to MIPS R3000/R4000
The MIPS R3000 (PlayStation, original Silicon Graphics) and R4000/R4300i (Nintendo 64, late SGI) defined classic 32/64-bit RISC. Anyone who hand-optimised assembly for the PS1 or N64 has direct experience with their strengths and limitations.
| Feature | MIPS R3000 | MIPS R4000 | FireStorm |
|---|---|---|---|
| Word size | 32-bit | 64-bit | 64-bit (32-bit narrow mode) |
| Pipeline | 5-stage in-order | 8-stage in-order | implementation-defined |
| Branch delay slot | Required | Required | No (RISC-V choice) |
| Load delay slot | Required (R3000) | Hidden | None |
| GPRs | 32 (r0 hardwired zero) | 32 | 32 narrow, 64 wide |
| HI/LO multiply | Separate registers | Separate registers | Standard GPR result (M extension) |
| Addressing modes | reg + offset only | reg + offset only | Auto-inc, indexed, PC-relative, all here |
| Conditional move | No | No (MOVN/MOVZ arrived later, with MIPS IV) | Full predication (Xcond) |
| Multiply/divide | Multi-cycle, async to HI/LO | Multi-cycle | 1-cycle-throughput multiply with 2–3 cycle latency (M ext); divide TBD |
| Frame management | Software only | Software only | Hardware (Xstack) |
The R3000 was the canonical example of RISC done right for its era: tiny, fast, easy to pipeline. FireStorm follows the same lineage but addresses the things that aged badly:
- No delay slots. The MIPS branch delay slot was a clever way to extract pipeline performance from simple silicon, but it complicated everything downstream — assemblers, compilers, exception handlers, emulators. RISC-V removed delay slots; FireStorm keeps that decision. The cost is one cycle per branch on naive implementations, which modern branch prediction recovers entirely.
- No restriction to single addressing mode. R3000's lw $t0, off($t1) was the only available addressing mode. FireStorm adds auto-inc, indexed, PC-relative, and memory-fused variants, exactly the patterns that compiled to 3–5 instruction sequences on MIPS.
- Unified multiply result. R3000's HI/LO registers required separate mflo/mfhi instructions to retrieve multiply results. The standard RISC-V M extension uses normal GPRs for multiply outputs, which composes much better with surrounding code. FireStorm inherits this.
Where MIPS shines is simplicity and provable timing — for a small in-order CPU, R3000 is hard to beat for verifiability. FireStorm's extensions add complexity (and BSRAM area for hardware stacks and contexts), trading verifiability for performance and code density. For embedded MCU-scale work, R3000 is still defensible; for the workloads Ant64 targets, FireStorm's primitives pay back the silicon cost many times over.
A direct comparison: a function prologue saving 8 callee-saved registers and allocating a frame, on each architecture:
MIPS R3000:
addiu $sp, $sp, -68
sw $ra, 64($sp)
sw $s0, 60($sp)
sw $s1, 56($sp)
sw $s2, 52($sp)
sw $s3, 48($sp)
sw $s4, 44($sp)
sw $s5, 40($sp)
sw $s6, 36($sp)
sw $s7, 32($sp)
; ... body ...
lw $ra, 64($sp)
lw $s0, 60($sp)
; ... 7 more loads ...
jr $ra
addiu $sp, $sp, 68 ; in branch delay slot
21 instructions (two stack adjustments, nine stores, nine loads, and the return), plus delay-slot scheduling complications.
FireStorm:
PUSH rlist=01000, spimm=4
; ... body ...
POPRET rlist=01000, spimm=4
2 instructions. 19 instructions saved per function call. The N64's tight ROM budget would have loved this.
Compared to Vanilla RISC-V (RV64GC)
This is the most direct comparison, since FireStorm is RV64GC at its core. The question is what the extensions add over the baseline.
| Feature | RV64GC (with Zba/Zbb/Zbs/Zcmp) | FireStorm |
|---|---|---|
| GPRs | 32 | 64 in wide mode |
| Auto-increment addressing | No | Yes (Xcrisp) |
| Indexed addressing | Zba sh1add/sh2add/sh3add (×2/×4/×8 only) | LWX with ×1–×128 |
| Memory-fused arithmetic | No | Yes (LWADD, MMWADD, etc.) |
| Block memory primitive | No | BMCPY, BMSET, DMACPY, DMASET |
| Sorted-array / B-tree search | No (software loop) | BSRCH.B/H/W/D, BSCAN.B/H/W/D, BSHIFT |
| Compressed prologue/epilogue | Zcmp (DRAM stack) | Xstack (BSRAM, faster) |
| Conditional execution | Zicond (czero only) | Xcond (full predication, including Xmath G2–G11) |
| PIC primitives | auipc-based pairs | LAPC, LDPC, JALPC, CALLM (single instruction) |
| Switch dispatch | auipc + load + sh3add + jr | JMPXPC (single instruction) |
| Bit reversal | Zbkb brev8 only | Per-register translator (Xlate) |
| Hardware threading | No | Xctx |
| Immediate construction | LUI imm20 + ADDI imm12; 6–8 instr for 64-bit constants | Wide-mode imm14 / imm23; LIZ/LIK for 64-bit (≤4 instr) |
| FPRs in wide mode | 32 | 64 |
| Integer fused MAC | mul + add (2 instr) | MADD (1 instr) |
| Add-with-carry / multi-precision | software (add + sltu carry recovery per limb) | ADDC / SUBC / ROLC / RORC (1 instr per limb, xcarry bit) |
| Saturating arithmetic | software (no Zbb support) | ADDS, SAT.B/H/W, SHIFTSAT (1 instr each) |
| FP reciprocal / 1/√x | FDIV / FSQRT (~30 cyc) | FRECIP / FRSQRT (3 cyc); .R refined (6 cyc, ~10⁻⁹) |
| Hardware sin / cos | None (library) | FSIN / FCOS / FSINCOS (3 cyc) |
| BAM trig | None | FSINBAM / FCOSBAM / FSINCOSBAM |
| 3D vector primitives | None | DOT3 / DOT4 / CROSS3 / LENSQ3 / VNORM3 / LERP |
| 2D vector primitives | None | DOT2 / LENSQ2 / CROSS2 / VNORM2 |
| Quaternion math | software | QMUL / QROT |
| Distance heuristics | software | MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2 |
| Game / animation math | software | CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP |
The Zba/Zbb/Zbs extensions are excellent and FireStorm benefits from them directly — they're part of the baseline (CLZ, CTZ, CPOP, MIN, MAX, etc. all available). What FireStorm adds is the patterns these extensions don't cover:
- Memory-fused arithmetic. Zba sh*add accelerates the address calculation but not the load itself. LWADD does both in one instruction.
- B-tree primitives. RV64GC has no equivalent — sorted-array search and shift are pure software loops with branchy compares. BSRCH gives ~10× lookup speedup on B-tree workloads.
- Hardware stacks. Zcmp compresses the DRAM-stack save/restore sequence; Xstack moves the entire stack to BSRAM. Different orders of improvement.
- Full predication. Zicond covers `c ? a : 0` patterns; Xcond covers `if (cond) ALU_OP` for every ALU op (including all Xmath G2–G11), with single-instruction abs via RSUB-cond.
- Address materialisation in one instruction. Standard RISC-V needs auipc + addi for any non-trivial PC-relative reference. LAPC does it in one wide-mode instruction.
- Hardware context switching. Vanilla RV64 software fibers cost ~30 instructions per yield; YIELD is one.
- Wider immediates and direct 64-bit construction. RV64's 12-bit ADDI and 20-bit LUI immediates frequently need 2-instruction sequences for values that don't fit. Wide-mode imm14 / imm23 handles many of these in one instruction, and LIZ/LIK build any 64-bit constant in ≤4 instructions — versus 6–8 for vanilla RV64.
- Game / audio / DSP math. Where the standard RISC-V world relies on either software libraries (sin, cos, sqrt are software) or the optional V extension (which FireStorm does not implement), Xmath provides scalar fused operations and approximations covering most game-engine and audio-synthesis inner loops.
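The "≤4 instructions for any 64-bit constant" bound can be illustrated with a small model. The sketch below plans an AArch64-style MOVZ/MOVK sequence, one 16-bit chunk per instruction and skipping zero chunks. This page later notes that LIZ/LIK borrow that pattern, but treating them as identical in chunk width and behaviour is an assumption, and the function names `movz_movk_plan` and `apply_plan` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def movz_movk_plan(value):
    """Plan a MOVZ/MOVK-style load of a 64-bit constant.

    Returns a list of (op, imm16, shift) tuples. Modelled on AArch64
    MOVZ/MOVK; the real LIZ/LIK encodings are an assumption here.
    """
    value &= MASK64
    plan = []
    for shift in (0, 16, 32, 48):
        chunk = (value >> shift) & 0xFFFF
        if chunk == 0:
            continue  # zero chunks need no instruction
        op = "MOVZ" if not plan else "MOVK"  # first write zeroes the rest
        plan.append((op, chunk, shift))
    return plan or [("MOVZ", 0, 0)]  # the constant 0 still needs one op


def apply_plan(plan):
    """Execute a plan on a modelled 64-bit register and return its value."""
    reg = 0
    for op, imm16, shift in plan:
        if op == "MOVZ":
            reg = imm16 << shift  # write chunk, clear everything else
        else:
            reg = (reg & ~(0xFFFF << shift)) | (imm16 << shift)  # keep the rest
    return reg & MASK64
```

Because a 64-bit value has four 16-bit chunks, the plan never exceeds four entries, and sparse constants such as `0x10000` need only one.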
Cumulatively, FireStorm code is typically 40–60% fewer instructions than RV64GC for the workloads Ant64 targets (audio synthesis, retro emulation, game CPU work, system code, database/indexed structures). For pure arithmetic kernels with little control flow, the savings are smaller (often single-digit-percent); for control-flow-heavy, memory-loop-heavy, or game-math-heavy code, the savings are larger (often 50%+).
The trade-off is silicon area for the BSRAM banks (Xstack, Xctx, BSRCH comparator array) and the extra decode logic for the custom opcodes. The full FireStorm budget — including the full concurrent context and stack capacity — fits comfortably in the GW5AST-138.
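The add-with-carry row in the comparison table above describes the vanilla RV64 software pattern as "add + sltu carry recovery per limb". As a rough illustration of what that means, the sketch below models multi-limb addition on 64-bit limbs in Python. The function name `add_limbs` is hypothetical, and the code models the software sequence only, not the ADDC instruction itself.

```python
MASK64 = (1 << 64) - 1


def add_limbs(a, b):
    """Add two little-endian lists of 64-bit limbs.

    Returns (result_limbs, carry_out). Each limb costs two wrapped adds
    and two unsigned compares, mirroring the add + sltu idiom.
    """
    assert len(a) == len(b)
    out, carry = [], 0
    for x, y in zip(a, b):
        s = (x + y) & MASK64        # add: wraps at 64 bits
        c1 = 1 if s < x else 0      # sltu: a wrapped sum is below its operand
        t = (s + carry) & MASK64    # add the incoming carry
        c2 = 1 if t < s else 0      # sltu again for the carry-in add
        out.append(t)
        carry = c1 | c2             # the two wraps cannot both happen
    return out, carry
```

A carry flag consumed and produced by the add itself removes the two compares per limb, which is the saving the table row points at.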
Cross-Architecture Reference Table
A consolidated view of FireStorm alongside the architectures it's most often compared to. Entries are short; refer to the per-CPU sections above for context. FireStorm is shown in both its modes since they have meaningfully different capability profiles.
RISC-V compatibility. FireStorm narrow mode is object-code compatible with standard RV64GC — an unmodified RV64GC binary runs on FireStorm in DDR3 with identical semantics, no recompilation needed. Code that wants Xcrisp / Xstack / Xlate / Xctx primitives recompiles to use them, but the existing binary keeps working. FireStorm wide mode is source-code compatible with RV64GC — the same C / Rust / assembly recompiles into wide-mode sections, gaining access to 64 registers, wider immediates, Xcond predication, and the rest of the wide-mode-only features. Object code is not portable between modes (the encoding differs in the extension nibble), but source code is. This means existing RISC-V toolchains and libraries work as a starting point, and FireStorm extensions are additive opt-ins rather than a separate ISA.
Register File and Data Width
| CPU (year) | Word | Int GPRs | Float regs | Reg width |
|---|---|---|---|---|
| MOS 6502 (1975) | 8-bit | 1 + 2 (A, X, Y) | — | 8-bit |
| Zilog Z80 (1976) | 8-bit | 7 + 4 pairs | — | 8 / 16-bit |
| Motorola 68000 (1979) | 16/32-bit | 8 data + 8 address | — (FPU on 68881/68882) | 32-bit |
| MIPS R4000 (1991) | 64-bit | 32 (R0 = zero) | 32 | 64-bit |
| ARMv7 AArch32 (2005) | 32-bit | 13 + SP/LR/PC | 32 (NEON) | 32-bit |
| x86-64 (2003) | 64-bit | 16 GPR | 16 XMM/YMM (32 ZMM with AVX-512) | 64-bit |
| ARMv8 AArch64 (2011) | 64-bit | 31 (X0–X30) + SP + XZR | 32 (V0–V31) | 64-bit |
| RV64GC (2014) | 64-bit | 32 (x0 = zero) | 32 | 64-bit |
| FireStorm narrow (2025) | 64-bit | 32 (x0 = zero) | 32 | 64-bit |
| FireStorm wide (2025) | 64-bit | 64 | 64 | 64-bit |
Immediates and Control Flow
| CPU | Largest 1-instr immediate | Branch range (relative) | Jump range | Conditional execution |
|---|---|---|---|---|
| 6502 | 8 bits | ±127 bytes | 16-bit absolute | Branches only (BEQ, BNE, BCC, BCS, BMI, BPL, BVC, BVS) |
| Z80 | 16 bits (LD HL,nn) | ±127 (JR) | 16-bit absolute | Conditional jumps, calls, returns |
| 68000 | 32 bits (variable-length encoding) | ±128 bytes (Bcc.S) / ±32 KB (Bcc.W); ±2 GB (Bcc.L, 68020+) | 32-bit | Bcc family + Scc (set on condition) |
| MIPS R4000 | 16 bits (ADDIU); 16-bit LUI for upper | ±128 KB | 256 MB (J/JAL within segment) | Branches only (MOVN/MOVZ conditional moves arrive with MIPS IV) |
| ARMv7 AArch32 | 12-bit rotated | ±32 MB (B/BL) | ±32 MB | Every instruction is predicated (4-bit cond field) |
| x86-64 | 32 bits (most ops); 64 bits (MOV imm64) | ±127 (short Jcc) / ±2 GB (long Jcc) | ±2 GB | Jcc + CMOVcc + SETcc |
| ARMv8 AArch64 | 16 bits per MOVZ/MOVK; up to 4 for 64-bit | ±1 MB (B.cond) / ±128 MB (B/BL) | ±128 MB | CSEL / CSNEG / CSINV / CSET family |
| RV64GC | 20 bits (LUI); 12 bits (ADDI/branches) | ±4 KB | ±1 MB (JAL) | Zicond only (czero.eqz / czero.nez) |
| FireStorm narrow | 20 / 12 bits (RV64GC) | ±4 KB | ±1 MB | Zicond (RV64GC base) |
| FireStorm wide | 23 / 14 bits + LIZ/LIK for 64-bit in ≤4 instructions | ±32 KiB | ±16 MiB (JAL, slot-aligned) | Xcond — predicated R-type on every ALU op |
Memory Access and Addressing
| CPU | Addressing modes | Indexed (scaled) | Auto-inc / dec | Hardware stack |
|---|---|---|---|---|
| 6502 | 13 (zero-page, abs, indirect, X/Y indexed) | yes (X, Y, no scale) | no | Fixed 256-byte page 1 |
| Z80 | 7 (reg, imm, abs, IX/IY ±d, indirect) | yes (IX, IY) | LDIR, LDDR, INI/IND | SP-based |
| 68000 | 12 (incl. d8(An,Xn) indexed-indirect) | yes (d8(An,Xn.s)) | (An)+, -(An) | SP plus USP / SSP for supervisor |
| MIPS R4000 | 1 (base + 16-bit offset) | no | no | SP-based, all software |
| ARMv7 | 8+ (reg, imm, scaled, pre/post-indexed) | yes (with shift) | pre/post-index on every load/store | SP-based |
| x86-64 | many (SIB + disp; full base-index-scale-disp) | yes (×1/2/4/8) | rep/movs idioms, not auto-inc per se | SP-based with PUSH/POP |
| ARMv8 | 11 (reg, imm, scaled, pre/post-indexed) | yes (×1/2/4/8) | pre/post-index | SP-based; LDP/STP for register pairs |
| RV64GC | 1 (base + 12-bit offset) | Zba sh1/2/3add only (×2/4/8) | no | SP-based, all software (Zcmp compresses) |
| FireStorm narrow | base + offset + Xcrisp auto-inc | LWX-family (×1–×128 scales) | LBPI..LDPD / SBPI..SDPD | Xstack BSRAM stacks (U/S/M) |
| FireStorm wide | + indexed addressing + PIC family (LAPC etc.) | full Xcrisp X-type with 8 scales | full Xcrisp | Xstack + Xctx contexts (8 / 32) |
Advanced Features
| CPU | SIMD / vector | Hardware threading | MMU | Atomics | Defining feature |
|---|---|---|---|---|---|
| 6502 | — | — | — | — | Cheap, simple, 1 MHz changed home computing |
| Z80 | — | — | — | — | Shadow register set; ubiquitous embedded / retro |
| 68000 | — | — | external 68851 | TAS only | Beautiful orthogonal CISC; rich addressing modes |
| MIPS R4000 | — | — | full TLB | LL / SC | Canonical classic 64-bit RISC; PS1 / N64 era |
| ARMv7 | NEON (64 / 128-bit SIMD) | — | full | LDREX / STREX | Every-instruction predication |
| x86-64 | SSE / AVX / AVX-512 | SMT (Hyper-Threading) | full paging | LOCK prefix (CMPXCHG / XADD / XCHG) | Deepest software ecosystem; high IPC |
| ARMv8 | NEON, SVE / SVE2 | optional SMT | full | LDXR / STXR + LSE atomics | Clean RISC reset of ARM; mobile + server |
| RV64GC | optional V extension | — | optional Sv39 / Sv48 | A extension (LR / SC + AMO) | Open ISA, modular |
| FireStorm narrow | Xmath (scalar fused MAC, transcendentals, BAM trig, vector bundles) | — | (TBD; not in v0.1) | RV64 A extension | Xcrisp memory + Xstack + Xlate + Xctx + Xmath on RV64GC base |
| FireStorm wide | Xmath (scalar fused MAC, transcendentals, BAM trig, vector bundles) | Xctx — 8 / 32 hardware contexts | (TBD) | RV64 A extension | + 64 registers + Xcond predication + 23-bit immediates + LIZ/LIK + indexed addressing + PIC family + RVC-pair (always) and 32-bit-pair (Ant64) dual-issue + register scoreboarding |
Math, DSP, and Game Operations
The Xmath extension targets math operations common in games, audio, and DSP. The following table shows how each operation maps to other architectures. "1 instr" indicates a single dedicated instruction; "N instr" indicates a software sequence.
| Operation | FireStorm Xmath | x86 (SSE/AVX) | ARMv8 (NEON) | MIPS | PowerPC | RV64GC base |
|---|---|---|---|---|---|---|
| Fused integer MAC | MADD (1 instr, 2 cyc) | not in base ISA; ~3 instr | SMADDL (1 instr) | MADD → HI/LO (1 instr) | not single-instr | mul+add (2 instr) |
| Saturating add | ADDS (1 instr) | PADDSB/W (SSE, packed) | SQADD (1 instr) | DSP ASE only | AltiVec vaddsws | not in base |
| Shift-and-saturate | SHIFTSAT.H (1 instr) | PSRA + PADDS (2-3 instr) | SRSHR + SQADD (2 instr) | DSP ASE multi-instr | multi-instr | software (4-5 instr) |
| FP reciprocal estimate | FRECIP (1 instr, 3 cyc, ~0.05%) | RCPSS (1 instr, ~5×10⁻⁴) | FRECPE (1 instr, ~3×10⁻³) | RECIP.S (MIPS-3D) | FRES (1 instr) | software (FDIV ~15 cyc) |
| FP 1/sqrt estimate | FRSQRT (1 instr, 3 cyc, ~0.1%) | RSQRTSS (1 instr) | FRSQRTE (1 instr) | RSQRT.S (MIPS-3D) | FRSQRTE (1 instr) | software (FSQRT+FDIV ~35 cyc) |
| Refined precision toggle | bit 34 = .R suffix (6 cyc, ~10⁻⁹) | software refinement | software refinement | — | — | — |
| Hardware sin/cos | FSIN/FCOS (1 instr, 3 cyc) | x87 FSIN (~50–100 cyc, dropped in modern x86) | library only | library only | library only | library only |
| sin + cos paired | FSINCOS (1 instr, 3 cyc) | x87 FSINCOS (~80 cyc) | library only | library only | library only | library only |
| BAM (binary angle) trig | FSINBAM (1 instr, 2 cyc) | none — radians only | none | none | none | none |
| Perfect modular angle accum | integer ADD on BAM | FP wraparound branch | FP wraparound | FP wraparound | FP wraparound | FP wraparound |
| 3D dot product | DOT3 (1 instr, 6 cyc) | DPPS with mask (1 instr) | FMLA chain (3 instr) | 3-FMUL+2-FADD | vmaddfp+sum | 3-FMUL+2-FADD |
| 4D dot product | DOT4 (1 instr, 8 cyc) | DPPS (1 instr) | FMLA chain (4 instr) | 4-FMUL+3-FADD | vmaddfp+sum | 4-FMUL+3-FADD |
| 3D cross product | CROSS3 (1 instr, 10 cyc) | shuffles+mul+sub (~6 instr) | EXT+FMUL+FMUL+FSUB (~6 instr) | software | software | software (~9 instr) |
| Vector normalise (3D) | VNORM3 (1 instr, 8 cyc) | DPPS+RSQRTSS+MUL (~4 instr) | FMLA+FRSQRTE+FMUL chain | software | software | software (~12 instr) |
| Linear interpolation | LERP (1 instr, 3 cyc) | FMSUB+FMADD (2 instr) | FMSUB+FMADD (2 instr) | software | software | software (2-3 instr) |
| Quaternion multiply | QMUL (1 instr, 8 cyc) | software (~28 instr) | software (~28 instr) | software | software | software |
| Rotate vec by quaternion | QROT (1 instr, 10 cyc) | software (~30 instr) | software (~30 instr) | software | software | software |
| Vector componentwise | VADD3/VSUB3/VSCALE3 (1 instr) | ADDPS/SUBPS (packed) | FADD/FSUB (packed) | software (3 instr) | vaddfp+ | software (3 instr) |
| 2D vector ops | DOT2/LENSQ2/CROSS2 (1 instr) | software (~3 instr) | software | software | software | software |
| FP clamp | CLAMP (1 instr, 2 cyc) | MAXSS+MINSS (2 instr) | FMINNM+FMAXNM (2 instr) | software | software | software (3-4 instr) |
| Cubic ease (smoothstep) | SMOOTHSTEP (1 instr, 3 cyc) | CLAMP+FMUL chain (4-5 instr) | CLAMP+FMUL chain | software | software | software (5+ instr) |
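The "software refinement" entries in the refined-precision row refer to Newton-Raphson iteration applied to a coarse estimate. A minimal sketch follows, using the classic bit-trick estimate as a stand-in for a hardware 1/√x estimate instruction. It does not model FRSQRT's actual algorithm or accuracy, and the helper names are hypothetical.

```python
import math
import struct


def rsqrt_estimate(x):
    """Coarse 1/sqrt(x) for positive float32 x via the well-known bit trick."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 -> raw bits
    bits = 0x5F3759DF - (bits >> 1)                      # halve and negate exponent
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def newton_step(x, y):
    """One Newton-Raphson step toward 1/sqrt(x), starting from estimate y."""
    return y * (1.5 - 0.5 * x * y * y)


def rel_error(x, y):
    """Relative error of y as an approximation of 1/sqrt(x)."""
    return abs(y * math.sqrt(x) - 1.0)
```

Each step roughly squares the relative error, so a few-percent estimate reaches about 10⁻³ after one step and about 10⁻⁶ after two. That quadratic convergence is why a single refinement stage on top of a good hardware estimate is enough for most uses.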
Database and Search Operations
The Xcrisp B-tree primitives target operations common in databases, key-value stores, and sorted-index workloads.
| Operation | FireStorm Xcrisp | x86 (SSE 4.2 / AVX-512) | ARMv8 (NEON) | MIPS | PowerPC | RV64GC base |
|---|---|---|---|---|---|---|
| Sorted-array search (find ≥) | BSRCH.W (1 instr, 4 cyc, 16 keys) | PCMPESTRI (1 instr, 16 bytes); AVX-512 mask+TZCNT (3 instr, 64 bytes) | CMHS+UMINV (3 instr) or software | software | software | software (loop, ~50 cyc) |
| First-match scan (==) | BSCAN.W (1 instr, 4 cyc, 16 keys) | PCMPEQ+PMOVMSKB+TZCNT (3 instr) | CMEQ+UMINV (3 instr) | software | software | software (loop) |
| Block shift in node | BSHIFT (1 instr, 5 cyc, 64 bytes) | PALIGNR + REP MOVSB (2-3 instr) | EXT + EXT (NEON, 2 instr) | software | software | software (memmove loop) |
| Manhattan distance (2D) | MANHATTAN2 (1 instr, 1 cyc) | PSADBW (packed bytes only) | SABD+ADDP (2-3 instr) | software | software | ABS+ABS+ADD (3-5 instr) |
| Chebyshev distance | CHEBYSHEV2 (1 instr, 1 cyc) | software (~4 instr) | UMAX+UMAX (2 instr) | software | software | software (~5 instr) |
| Octile distance | OCTILE2 (1 instr, 3 cyc) | software (~6-10 instr) | software (~8 instr) | software | software | software (~10 instr) |
| Population count | CPOP (Zbb, 1 instr) | POPCNT (1 instr, since SSE4.2) | CNT (1 instr) | not in base | popcntb (1 instr) | CPOP if Zbb |
| Count leading zeros | CLZ (Zbb, 1 instr) | LZCNT (1 instr, BMI1) | CLZ (1 instr) | CLZ (1 instr, MIPS32r2) | cntlzw (1 instr) | CLZ if Zbb |
| Count trailing zeros | CTZ (Zbb, 1 instr) | TZCNT (1 instr, BMI1) | RBIT+CLZ (2 instr) | software | software | CTZ if Zbb |
| Find first set bit + index | CLZ-on-mask sequence | BSF (1 instr); TZCNT preferred | RBIT+CLZ (2 instr) | software | cntlzw on reverse | similar |
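To make the "find ≥" and "first match" rows concrete, here is a software reference model of the two searches over one 16-key node. The return convention (node length when nothing matches) and the function names are assumptions rather than the documented BSRCH/BSCAN semantics. The `bsrch_parallel_model` variant shows how a bank of comparators could produce the same answer without a loop-carried branch, which is also an assumption about the hardware.

```python
from bisect import bisect_left


def bsrch_model(node, target):
    """Index of the first key >= target in a sorted node, else len(node)."""
    return bisect_left(node, target)


def bsrch_parallel_model(node, target):
    """Same result, computed as a count of keys strictly below target.

    In a sorted node the number of keys < target is the insertion index,
    so every key can be compared at once and the results summed.
    """
    return sum(1 for key in node if key < target)


def bscan_model(node, target):
    """Index of the first key == target, else len(node)."""
    for i, key in enumerate(node):
        if key == target:
            return i
    return len(node)
```

For a node holding the keys 10, 20, ..., 160, searching for 55 or 60 both land on index 5 (the key 60), which is the child pointer a B-tree descent would follow.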
Notes on the Comparison
A few observations the tables make concrete:
General architecture
- Register count. Wide-mode FireStorm has more general-purpose registers than any of these architectures (64 GPRs + 64 FPRs). The closest competitors are MIPS, ARMv8, and RV64GC at 31–32 GPRs. The high register count is a deliberate trade for code that wants to hold significant state in registers across an inner loop (audio synthesis, FIR filters, polyphony).
- Immediate construction. ARMv8's MOVZ/MOVK approach is the clear ergonomic reference; FireStorm's LIZ/LIK directly borrows that pattern. The wide-mode imm23/imm14 immediates give FireStorm an extra advantage in the common case (no follow-up MOVK needed for many 32-bit values).
- Branch range. FireStorm's wide-mode branch ranges (±32 KiB / ±16 MiB) are wider than RV64GC and similar to ARMv7's, smaller than ARMv8 / x86-64. For function-internal control flow this is more than enough; for long-range calls FireStorm uses the Xcrisp PIC family (JALPC, CALLM). The doubled range over a naive imm14/imm23 design comes from FireStorm's slot-indexed PC convention — wide-mode branch/jump targets are slot-aligned (4-byte), allowing the immediate to scale by 4 rather than 2.
- Conditional execution. ARMv7's every-instruction predication remains the most aggressive of the modern architectures; ARM dropped it in v8 because it complicated out-of-order execution. FireStorm's Xcond predication on R-type ALU ops — including all of Xmath G2–G11 in wide mode via the PRED-EN bit — is a deliberate middle ground that predicates the operations benefiting most from it without the verification cost of predicating loads and stores.
- Hardware threading. FireStorm is unusual in providing first-class hardware contexts with a dedicated instruction set (YIELD / HALT / NEW / RESUME / FREE). x86 SMT and ARM optional-SMT are different beasts (transparent thread-level parallelism on shared execution units); Xctx is closer to what older mainframes called "hardware coroutines."
- Stack management. FireStorm's BSRAM hardware stacks are unique among modern CPUs — closest historical comparison is the 6502's fixed page 1 stack (which was also dedicated SRAM at a fixed location, though far smaller and not user-extensible).
- Addressing modes. FireStorm's wide mode has more addressing-mode richness than any other modern RISC, approaching CISC-class flexibility (8 scale factors, auto-inc / auto-dec / pre-dec, PC-relative materialisation, compare-mem-branch). Standard RV64GC is the most addressing-mode-spartan of the modern set; FireStorm restores the kind of expressiveness 68000 programmers took for granted.
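The branch-range figures quoted above follow from simple arithmetic on the immediate width and the scale factor. The sketch below assumes the immediates are signed two's-complement fields (imm14 for branches, imm23 for jumps) and that "slot-aligned" means the field counts 4-byte units. The helper name `reach_bytes` is hypothetical.

```python
KIB = 1024
MIB = 1024 * 1024


def reach_bytes(imm_bits, scale):
    """(lowest, highest) byte offset of a signed imm_bits field times scale."""
    lowest = -(1 << (imm_bits - 1)) * scale
    highest = ((1 << (imm_bits - 1)) - 1) * scale
    return lowest, highest


# Wide-mode FireStorm, slot-aligned targets (scale 4)
wide_branch = reach_bytes(14, 4)   # about ±32 KiB
wide_jump = reach_bytes(23, 4)     # about ±16 MiB

# The same fields with a naive 2-byte scale give half the reach
naive_branch = reach_bytes(14, 2)  # about ±16 KiB

# Standard RV64GC for comparison: 12 encoded bits (branch), 20 (JAL), scale 2
rv_branch = reach_bytes(12, 2)     # about ±4 KiB
rv_jal = reach_bytes(20, 2)        # about ±1 MiB
```

The positive limit is always one unit short of the negative limit, which is why the table's "±" figures are approximate at the top end.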
Math, DSP, and game operations
- Fused multiply-add (integer). Mainstream since the mid-1990s (MIPS R5000 in 1996, ARMv5 in 1995). x86 is the holdout in its base ISA — even modern x86-64 has no single-instruction integer MAC except via SSE/AVX packed forms. FireStorm's MADD matches the well-established convention; nothing exotic here.
- Saturating arithmetic. ARM has the cleanest mainstream support (SQADD/SQSUB/SSAT/USAT since ARMv6); x86 has packed forms via MMX/SSE (PADDS family); RV64GC base has nothing. FireStorm's ADDS family matches ARM's ergonomics. SHIFTSAT.H — the universal Q15→int16 audio output primitive — combines two operations in one and appears to be unique to FireStorm at the CPU level; ARM and x86 require a 2-instruction shift-then-saturate sequence.
- FP reciprocal / 1/√x estimates. Standard since SSE (1999) and ARMv7 NEON. The Quake III `Q_rsqrt` integer-bit-twiddling hack that was once mandatory is now a single instruction everywhere. FireStorm's accuracy (~0.05%) is similar to SSE RCPSS (~5×10⁻⁴) and better than ARM FRECPE (~3×10⁻³). The precision-mode bit (`.R` suffix → Newton-Raphson refinement, ~10⁻⁹ error) is FireStorm-specific; most other architectures require explicit software refinement when better accuracy is needed.
- Hardware sin / cos. The x87 has had FSIN/FCOS since 1987, but at ~50–100 cycles per instruction, and they were dropped from SSE/AVX entirely. Modern x86-64 and ARM rely on library `sin`/`cos` (~100 cycles). FireStorm's 3-cycle hardware sin/cos is unusual among modern general-purpose CPUs, comparable only to historical DSP chips (TI C6x has hardware sin via CORDIC) and game-console graphics co-processors (PS2 VU0/VU1).
- BAM trigonometry. Genuinely distinctive. No mainstream CPU has hardware BAM (binary angle measure) trig. Some game consoles' graphics processors used BAM internally (PS1 GTE used 16-bit fixed-point angles; Sega Saturn VDP rotation), but always on the GPU side. Putting BAM trig in a general-purpose CPU is FireStorm-specific and reflects the retro / demoscene heritage of the project.
- 3D / 4D dot products. x86 SSE 4.1 introduced DPPS/DPPD (single instruction for 4-element FP dot product with optional mask) in 2007; FireStorm's DOT3/DOT4 are the same idea with similar latency (~6–8 cycles). ARM relies on FMLA chains, MIPS and RV64 on software sequences. 3D cross product (CROSS3) is unusual at the CPU level — virtually every architecture needs ~6 instructions of shuffles plus FMUL plus FSUB; only GPU shader ISAs typically have it as a primitive.
- Quaternion math. No mainstream CPU has hardware QMUL or QROT. Skeletal animation engines universally implement these in software (~28+ instructions per operation). PlayStation 2 VU0/VU1 had quaternion-friendly opcodes (VOPMSUB for cross-product helpers) but not single-instruction multiplies. FireStorm is genuinely distinctive here.
- Vector componentwise (VADD3, VSUB3, VMADD3). Functionally identical to packed-FP SIMD operations on x86 (ADDPS, SUBPS) and ARM (FADD with vector form). FireStorm's distinction is that they operate on explicitly-named register tuples rather than packed registers, eliminating the need for a separate vector register file.
- CLAMP / SMOOTHSTEP / LERP. Single-instruction forms are uncommon at CPU level. x86 needs 2-3 instructions for CLAMP (MAXSS+MINSS); ARM same (FMINNM+FMAXNM). SMOOTHSTEP is a software macro everywhere except FireStorm and GPU shaders. LERP via single instruction is similarly distinctive — most CPUs use FMSUB+FMADD pairs.
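The appeal of BAM angles, as described in the BAM trigonometry bullet, is that a full turn maps onto the full range of an unsigned integer, so wraparound is free and exact. The sketch below assumes a 16-bit angle purely for illustration (the FireStorm BAM width is not stated on this page) and uses library `math.sin` in place of FSINBAM. All names are hypothetical.

```python
import math

BAM_BITS = 16                    # assumed width, for illustration only
BAM_MASK = (1 << BAM_BITS) - 1
BAM_TURN = 1 << BAM_BITS         # one full revolution in BAM units


def bam_add(angle, step):
    """Advance a BAM angle; the mask is the whole wraparound logic."""
    return (angle + step) & BAM_MASK


def bam_to_radians(angle):
    """Convert a BAM angle to radians for use with library trig."""
    return angle * (2.0 * math.pi / BAM_TURN)


def bam_sin(angle):
    """Software stand-in for a BAM-input sine instruction."""
    return math.sin(bam_to_radians(angle))
```

Stepping by any fixed amount exactly `BAM_TURN` times returns the angle to its starting value with no drift. A floating-point phase accumulator needs an explicit range check and still accumulates rounding error over long runs.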
Database and search operations
- Parallel sorted-array search. x86 SSE 4.2 (Nehalem, 2008) introduced PCMPESTRI specifically for string and short-array searches — 16 bytes per instruction, primarily byte-granularity. AVX-512 extends this to 64 bytes via mask+TZCNT sequences (still 3 instructions). FireStorm's BSRCH is the closest analog at the CPU level, with cleaner semantics (single-instruction, supports 8/16/32/64-bit keys, returns position directly) and wider data (full cache line per instruction). For workloads that look like "find first key ≥ X in sorted array" — B-tree nodes, sorted index buckets, ordered hash chains — this is the most directly competitive feature in FireStorm's instruction set.
- Manhattan distance. x86 has had PSADBW (packed sum of absolute differences, byte data) since MMX (1996), originally for video motion estimation but useful for Manhattan distance on packed byte coordinates. ARM has SABD+ADDP. FireStorm's MANHATTAN2/3 work on standard integer registers (not packed) and are designed for pathfinding heuristics rather than image processing.
- Octile distance. Genuinely FireStorm-specific. Used in 8-directional grid pathfinding (`max(dx,dy) + (√2-1)*min(dx,dy)`); software everywhere else, typically 8–10 instructions with a compare-branch.
- Bit-manipulation primitives (CLZ / CTZ / CPOP). Standard across modern CPUs since ARMv5 (1995) and x86 SSE4.2 (2008). FireStorm inherits these from RISC-V's Zbb extension, which is implemented per §2 of `ee_cpu`. No FireStorm-specific addition needed.
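The three grid-distance heuristics named in the bullets and tables above are short enough to state directly. The reference sketch below uses the octile formula quoted in the text. Whether the hardware instructions take coordinate deltas or coordinate pairs, and in which numeric format, is not specified here, so the signed-delta interface and the function names are assumptions.

```python
import math

SQRT2_MINUS_1 = math.sqrt(2.0) - 1.0


def manhattan2(dx, dy):
    """4-directional grid distance: |dx| + |dy|."""
    return abs(dx) + abs(dy)


def chebyshev2(dx, dy):
    """8-directional distance when diagonal moves cost 1: max(|dx|, |dy|)."""
    return max(abs(dx), abs(dy))


def octile2(dx, dy):
    """8-directional distance when diagonal moves cost sqrt(2)."""
    dx, dy = abs(dx), abs(dy)
    return max(dx, dy) + SQRT2_MINUS_1 * min(dx, dy)
```

For any pair of deltas the three results are ordered Chebyshev ≤ octile ≤ Manhattan, and all three agree when one delta is zero.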
Honest Trade-offs
FireStorm is not the right CPU for every job. We are honest about what it doesn't do well:
- No bulk SIMD in v0.1. The RISC-V V opcodes have been reallocated to Xmath (see §10 of `ee_cpu`), so FireStorm does not currently implement standard V or wide-vector SIMD. For workloads dominated by data-parallel array math (bulk image filtering, large mass mixing into wide vectors, dense linear algebra), code runs scalar with Xmath fused-MAC acceleration rather than 4-wide or 8-wide parallel lanes. Xmath captures most of the practical win for game / audio / DSP workloads where the inner loop is per-voice or per-vertex math, but it is not a 1-to-1 V replacement. If a clear bulk data-parallel workload emerges, V could be added in v0.3+ at a different opcode allocation.
- FPGA-bound clock speeds. Ant64 targets ~380 MHz on a mid-range GoWin GW5AST (matched to the BSRAM peak rate; pipeline balanced to fit). The ISA-level wins partially offset this (25–40% fewer instructions at half the clock is competitive in many workloads), but raw single-thread throughput is not where FireStorm competes.
- Custom toolchain bring-up required. RV64GC is supported by mainline GCC and LLVM out of the box; the FireStorm extensions require toolchain patches (in progress). Until the patches land, FireStorm code is either hand-written assembly or compiled from intrinsics-using C. A bare RV64GC compile works but leaves the FireStorm wins on the table.
- Verification surface is large. Seven extensions interacting with each other (Xctx-with-Xlate state, Xcond-with-Xcrisp loads, Xstack-with-Xctx context, Xmath-with-Xcond predication, B-tree primitives with cache coherence) means more corner cases to verify. We are working through these systematically; the spec set has cross-document references to flag interaction points.
- Memory-mapped I/O ordering for DMA is still being formalised. FireStorm reserves the entire `0xFxxx_xxxx` range (256 MB) for hardware chip registers, accessed uncached and strongly ordered. The precise RVWMO interaction between CPU MMIO stores and DMA-to-MMIO writes (audio buffer to codec, network buffer to MAC, etc.) needs final spec; the suggested baseline is "DMA MMIO writes are strictly ordered relative to surrounding CPU MMIO stores."
Where It Shines
FireStorm is at its best on workloads that mix several of these characteristics:
- Sustained real-time audio synthesis with many concurrent voices. Wide register file holds polyphony state in registers across the inner loop; LWPI streams input; SWPI streams output; MMWADD fuses bus-mix accumulation; MADD accelerates per-tap filters and SHIFTSAT.H handles output saturation in one instruction. The 128-voice synth count target on Ant64 is achievable because every primitive in the inner loop maps to one instruction.
- Game CPU work: physics, AI, pathfinding, animation, draw-call setup. The Ant64 platform pairs FireStorm with dedicated drawing/audio chipset hardware that handles pixel-level work, so the CPU focuses on the math that sets up draws and runs game state. Xmath's VMADD3 makes physics integration one instruction; VNORM3 makes lighting normalisation 8 cycles; CROSS3 and DOT3 make 3D geometry primitives single-instruction; OCTILE2 makes the A* heuristic one instruction; FSINCOSBAM gives bit-exact rotation accumulation forever; QMUL accelerates per-bone skeletal animation. The combined effect is that the CPU runs game logic, AI, and animation comfortably while the chipset draws.
- Database and indexed data structures. The B-tree primitives (BSRCH, BSCAN, BSHIFT) turn the dominant operation in any sorted index — find first key ≥ target — from a branchy multi-cycle scan into a single-instruction parallel search. For in-memory ordered indexes (B+ trees, sorted vectors, ordered hash buckets), this delivers ~10× lookup speedup. Workloads dominated by index access (relational query engines, key-value stores, sorted-set caches) benefit substantially.
- Retro emulation cores where a CPU emulator runs dozens of guest CPUs as cooperative tasks. Xctx makes the guest-CPU dispatch nearly free; Xlate handles endian conversions for guests with foreign byte order; Xstack gives the emulator state its own BSRAM region without touching DRAM. For guest CPUs with their own math (early arcade boards, 16-bit consoles), Xmath's MADD and saturating arithmetic accelerate the guest ALU emulation.
- Modular system software where many small functions call each other through indirection tables. Xcrisp PIC's CALLM is one instruction for vtable dispatch; Xstack's PUSH/POP is one instruction for prologue/epilogue. The whole system feels lighter on dispatch overhead.
- Generated code and interpreters. The bytecode dispatch loop of an interpreter is fundamentally a switch statement plus a few state accesses. JMPXPC is one-instruction switch dispatch. Combined with Xctx-driven cooperative scheduling of multiple interpreters, this is the right architecture for language runtimes and bytecode VMs running as FireStorm applications — scripting engines embedded in games, custom interpreters, and similar workloads. (AntOS's own Luau runtime runs on DeMon, not FireStorm — see AntOS — so this strength is about interpreters a FireStorm application hosts, not the OS scripting layer.)
- Demoscene effects. BAM-based rotozoomers, plasma effects with multiple summed BAM-indexed sine waves, BAM-phase wavetable oscillators, fast normalisation for raycasters — Xmath's BAM trigonometry is the natively-suited primitive for retro / demoscene rendering techniques.
- Embedded creative tools — pixel art editors, music trackers, level designers. These mix UI dispatch, file I/O, and arithmetic in roughly equal measure. FireStorm has primitives for all three.
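The audio bullet at the top of this list describes a mix loop built from multiply-accumulate plus a final shift-and-saturate. A plain-Python model of that data flow is sketched below. It assumes Q15 gains, a 15-bit arithmetic right shift that rounds toward negative infinity, and clamping to the int16 range. The exact SHIFTSAT.H semantics are not given on this page, so treat those details and the function names as assumptions.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def shift_saturate_q15(acc):
    """Scale a wide Q15 accumulator back to int16 with saturation."""
    shifted = acc >> 15  # Python's >> on ints is an arithmetic shift
    return max(INT16_MIN, min(INT16_MAX, shifted))


def mix_voices(samples, gains_q15):
    """Mix int16 voice samples with Q15 gains into one int16 output sample."""
    acc = 0
    for sample, gain in zip(samples, gains_q15):
        acc += sample * gain  # the multiply-accumulate step, one per voice
    return shift_saturate_q15(acc)
```

Two loud voices at near-unity gain overflow the int16 range after the shift, and the clamp pins the output at 32767 instead of letting it wrap to a negative value. That wrap is the audible failure saturation is meant to prevent.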
For pure-throughput numerical computing (climate modeling, deep learning training), FireStorm is not the answer — those workloads want bulk SIMD vectors, multi-GHz clocks, and high memory bandwidth, none of which FireStorm v0.1 prioritises. The Ant64 platform pairs FireStorm with the DeMon (ESP32-P4) and Pulse (ESP32-P4) supervisors for tasks where FireStorm isn't the right tool.
Further Reading
The full FireStorm architectural specification is split across eight documents:
- CPU base architecture — RV64GC relationship, wide-mode mechanics, calling convention.
- Xcrisp — Memory primitives: auto-inc, indexed, memory-fused, block, B-tree, PIC, compare-mem-branch, DMA.
- Xstack — Hardware BSRAM stacks for U/S/M privilege levels.
- Xcond — Predicated R-type instructions in wide mode.
- Xlate — Per-register memory translators.
- Xctx — Hardware context switching with multi-core ready architecture.
- Xmath — Games, audio, DSP math: fused MAC, saturating, transcendentals, BAM trig, vector bundles, 2D math, game/animation math, distance heuristics, quaternion math.
- Performance examples — Worked code comparisons with cycle and instruction counts.
The spec set runs ~10,000 lines of detailed technical content with cross-references throughout. We aim to be exhaustive about edge cases and honest about open items; everything described as "v0.2" or "open item" is explicitly flagged rather than glossed over.
FireStorm and the Ant64 platform are designed and developed by Deluxe Pixel Limited. The CPU specification, FPGA implementation, and surrounding system architecture are open for technical review.