FireStorm: An Engineering Overview

FireStorm is the CPU at the heart of the Ant64 platform — a 64-bit RISC-V core with a stack of custom extensions targeting the specific workloads Ant64 cares about: real-time audio synthesis, retro emulation, graphics, modular system software, and creative tools. It is implemented in a GoWin GW5AST FPGA and runs the standard RV64GC instruction set as its baseline, augmented by seven extensions that together address most of the historical pain points in classic RISC and CISC architectures.

This page explains what makes FireStorm different and how it compares to the architectures it is most often measured against: ARM (AArch64), Motorola 68000-family, MIPS R3000/R4000, and vanilla RISC-V.

Design Philosophy

FireStorm follows what we call the retro-modern approach: take a clean modern RISC base, then layer on the addressing modes and primitives that classic architectures got right, while leaving behind the things that hurt them at scale. The result has the orthogonality and pipeline-friendliness of RISC, the addressing-mode richness of CISC, and a handful of features inspired by architectures we've worked with over the decades but that nobody else has put together in one design.

Three principles drive every decision:

The 80% case is one instruction. If a code pattern dominates a real workload, it should compile to one instruction, not a sequence. Auto-incrementing loads, in-place memory updates, conditional accumulation, hardware push/pop — all are single instructions in FireStorm because they're things that get written millions of times in real code.
No hidden cost. Every instruction takes a predictable number of cycles. There are no microcoded surprises, no exception-driven slow paths for common operations, no "this works on paper but traps and emulates in practice." If FireStorm says PUSH ra+s0..s7 takes two cycles, that's what it takes.
Make the compiler's job easy. With 64 general-purpose registers in wide mode, hardware-managed stacks, and fused memory operations, the compiler has more options and less plumbing to emit. The same C code typically produces 25–40% fewer instructions than on standard RV64GC.

Memory Architecture: Harvard + Scratchpad + Tiny Cache

FireStorm's memory subsystem is unconventional and worth explaining. There are four memory structures, two on the instruction side and two on the data side, each tuned for a different workload:

Instruction fetch uses a pool of small BSRAM-backed prefetch buffers (8), each holding a contiguous range of recently-fetched code. There is no traditional I-cache. Each buffer has its own BSRAM port, so concurrent fetch and background refill from different buffers happen in parallel.
Wide-mode SRAM holds code only — a Harvard restriction. Data loads/stores to the 36-bit SRAM range trap (except for M-mode code-deposit paths for JIT and loader). This simplifies the memory subsystem and prevents accidental code corruption.
Scratchpad BSRAM (32 KB) is a directly-addressable fast region at a known address. Software places hot data structures there at compile time: audio voice state, filter coefficients, sample LUTs, scheduler tables. Single-cycle access, wide port.
D-cache is a small 8 KB direct-mapped write-through cache covering DDR3 data accesses. It catches the patterns the scratchpad can't predict — pointer chasing through tree nodes, hash-table buckets, library data, dynamically-allocated objects.
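
To make the direct-mapped behaviour concrete, here is a minimal sketch in C of how an 8 KB direct-mapped cache selects a line. The 64-byte line size is an assumption (inferred from the "16 keys, one cache line" figure given for BSRCH.W, not stated for the D-cache itself), and every name below is illustrative rather than part of any FireStorm toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 8 KB direct-mapped, 64-byte lines, so 128 lines. */
#define DCACHE_BYTES      8192u
#define DCACHE_LINE_BYTES 64u    /* assumption, not a documented figure */
#define DCACHE_LINES      (DCACHE_BYTES / DCACHE_LINE_BYTES)

/* Line index that an address maps to. */
static inline uint32_t dcache_index(uint64_t addr)
{
    return (uint32_t)((addr / DCACHE_LINE_BYTES) % DCACHE_LINES);
}

/* Tag kept with the line to tell aliasing addresses apart. */
static inline uint64_t dcache_tag(uint64_t addr)
{
    return addr / DCACHE_BYTES;
}

/* Two addresses evict each other when they share an index but not a tag. */
static inline int dcache_conflict(uint64_t a, uint64_t b)
{
    return dcache_index(a) == dcache_index(b) && dcache_tag(a) != dcache_tag(b);
}

/* Usage: addresses exactly 8 KB apart collide; neighbours do not. */
void dcache_demo(void)
{
    assert(dcache_conflict(0x10000, 0x12000));   /* same index, different tag */
    assert(!dcache_conflict(0x10000, 0x10020));  /* same line                 */
    assert(!dcache_conflict(0x10000, 0x10040));  /* adjacent lines            */
}
```

Under these assumptions, two hot arrays whose bases differ by a multiple of 8 KB thrash the same lines, which is one reason to place predictable hot data in the scratchpad instead.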

This split is deliberate. A traditional cache hierarchy hides DRAM latency at the cost of tag RAM, replacement state, coherence logic, complex state machines around traps and DMA, and unpredictable timing. FireStorm's split does most of the work with simpler structures:

Most hot data is in BSRAM (Xstack frames, Xctx contexts, scratchpad-resident application data). No cache needed; single-cycle by construction.
Bulk DRAM data uses a tiny cache. Write-through direct-mapped is nearly trivial to verify and implement.
Code fetch uses prefetch buffers. Predictable, pinnable for real-time guarantees, deterministic miss latency.

The benefits compound:

Smaller silicon than a comparable cache hierarchy. Eight 2 KB prefetch buffers (16 KB) plus the 8 KB direct-mapped D-cache plus 8–32 KB of scratchpad come to roughly 32–56 KB of structured BSRAM, versus the ~100+ KB of cache plus tag-and-control RAM a comparable cached CPU would need.
Predictable timing. The audio inner loop, with voice state in scratchpad and the loop body pinned in a prefetch buffer, has exactly known cycle counts. No cache replacement can introduce jitter.
Per-buffer pinning gives deterministic real-time guarantees on top. Pinning a buffer to the trap vector means ISR fetch latency is exactly the pipeline drain.
Simple coherence. DMA writes auto-invalidate any overlapping prefetch buffer and matching D-cache line; no software flush dance.
Cache-bypass on demand. Every cached address has an uncached alias at addr | (1ULL << 63). Streaming code (audio buffers, framebuffers, DMA rings) reads/writes through the uncached alias to avoid evicting useful state from the tiny 8 KB D-cache — a single bit in the pointer, no new instructions, no CSR changes.
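
The uncached alias is plain pointer arithmetic, so it can be sketched in a few lines of C. The bit position comes from the text above; the helper names are hypothetical, and the functions work on integer addresses so the sketch also compiles on a host machine.

```c
#include <assert.h>
#include <stdint.h>

#define FS_UNCACHED_BIT (1ULL << 63)   /* alias bit, as given in the text */

/* Hypothetical helpers; these names are not part of any FireStorm SDK. */
static inline uint64_t fs_uncached(uint64_t addr) { return addr | FS_UNCACHED_BIT; }
static inline uint64_t fs_cached(uint64_t addr)   { return addr & ~FS_UNCACHED_BIT; }
static inline int fs_is_uncached(uint64_t addr)   { return (addr & FS_UNCACHED_BIT) != 0; }

/* Usage: derive the streaming view of a buffer and map it back again. */
void fs_alias_demo(void)
{
    uint64_t ring = 0x0000000040001000ULL;        /* made-up DDR3 address */
    uint64_t stream = fs_uncached(ring);
    assert(stream == 0x8000000040001000ULL);
    assert(fs_is_uncached(stream) && !fs_is_uncached(ring));
    assert(fs_cached(stream) == ring);
}
```

On the target, the aliased value would be cast to a pointer (typically volatile-qualified) before use; that step is left out here because it is only meaningful on FireStorm itself.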

The cost: workloads with poor locality across a working set larger than the D-cache run slower than they would on a richly-cached architecture. FireStorm is not the right choice for general-purpose computing where this pattern dominates; it is the right choice for the workloads Ant64 targets.

The Seven Extensions

FireStorm's architectural extensions stack on top of RV64GC. They are named with the X-prefix convention common in RISC-V custom extensions.

Xwide — 64 Registers and Wider Immediates in Wide Mode

When the CPU fetches code from 36-bit SRAM (rather than 32-bit DDR3), the extra 4 bits per instruction word do triple duty:

64-register access. Most instruction formats use 1–3 of the extra bits to extend register fields, giving access to 64 general-purpose registers and 64 floating-point registers instead of the standard 32 of each. The extra registers are all caller-saved, so they don't affect calling conventions — pure scratch space for code that needs the elbow room. DSP kernels, FIR filters, FFTs, polyphony synthesis, and compiler-intensive optimisation passes benefit.
Wider immediates. The bits not consumed by register extension widen the immediate fields: LUI/AUIPC/JAL grow from 20-bit to 23-bit immediates (8× larger), and ADDI/loads/stores/branches grow from 12-bit to 14-bit immediates (4× larger). The compiler uses these automatically; many "constant just out of range" cases drop from 2 instructions to 1.
Per-instruction predicates. R-type instructions reserve one bit as a predication enable, gating the Xcond predicated-execution extension (see below).
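
The reach of the wider immediates is easy to check numerically. The sketch below assumes wide-mode immediates are sign-extended two's-complement fields like the standard RISC-V ones (the text gives only their widths); fits_simm is an illustrative helper, not a toolchain function.

```c
#include <assert.h>
#include <stdint.h>

/* True when v fits a signed two's-complement immediate of `bits` bits. */
static inline int fits_simm(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}
```

Under that assumption imm12 covers -2048..2047 and imm14 covers -8192..8191, so a constant such as 5000 needs a two-instruction sequence in narrow mode but fits a single ADDI in wide mode.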

For 64-bit constants, two dedicated instructions in the wide-mode escape space (LIZ / LIK, modelled on ARM-A64's MOVZ/MOVK) build arbitrary 64-bit values from 16-bit chunks in 1–4 instructions, versus the 6–8 instructions standard RV64 needs.
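
As a rough guide to where the "1–4 instructions" figure comes from, the sketch below counts the instructions a LIZ/LIK sequence would need for a given constant. It assumes LIZ writes one 16-bit chunk and zeroes the rest while LIK overwrites one chunk and keeps the rest, mirroring MOVZ/MOVK; the actual FireStorm encodings may differ, and the function names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Number of non-zero 16-bit chunks in v. */
static inline unsigned nonzero_chunks16(uint64_t v)
{
    unsigned n = 0;
    for (unsigned shift = 0; shift < 64; shift += 16)
        n += ((v >> shift) & 0xFFFFu) != 0;
    return n;
}

/* Estimated sequence length: one LIZ for the first non-zero chunk (or for
 * the value 0 itself), then one LIK per remaining non-zero chunk. */
static inline unsigned liz_lik_count(uint64_t v)
{
    unsigned n = nonzero_chunks16(v);
    assert(n <= 4);                 /* a 64-bit value has four chunks */
    return n == 0 ? 1 : n;
}
```

A compiler could do better in some cases (for example by starting from an all-ones pattern if such a variant exists), so treat the count as an upper bound under the stated assumption.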

The mode is determined by where the code lives: SRAM-resident hot paths get the wide register file and wider immediates; DDR3-resident bulk code uses standard RV64GC. Code compiled with +xfirestorm selects automatically per function.

Xcrisp — Memory Primitives

A collection of single-instruction memory operations that take 3–7 instructions on standard RV64:

Auto-increment loads and stores: LWPI rd, off(rs1)+ is "load word, increment pointer" in one instruction (Z80 enthusiasts will recognise this as LDI; 68k veterans as move.l (a0)+,d0).
Indexed addressing: LWX rd, (base, idx, scale) for scaled-indexed loads up to ×128 stride — covers 2D matrix access, hash probes, struct-array access.
Memory-fused arithmetic: LWADD rd, (rs1), rs2 is load-then-add; MMWADD [rd], [rs1], rs2 is "read memory, add, write back to a different memory location" — single-instruction in-place vector accumulation.
Block memory: BMCPY and BMSET for synchronous memcpy/memset; DMACPY and DMASET for asynchronous DMA that overlaps with CPU work, with hardware register-tagging so reading the byte-count register shows live progress and writing it stalls until the DMA finishes.
Compare-mem-branch: BEQM rs1, off(rs2), label does "load and branch if equal" in one instruction — eliminates the load-then-compare pair for tight scan loops.
B-tree primitives: BSRCH.W rd, key, node finds the first key ≥ target in a 16-key sorted array (one cache line) in 4 cycles, branchless — ~12× faster than software search and the foundation for database/index workloads. Variants for 64×8-bit, 32×16-bit, and 8×64-bit keys. Companion BSCAN (equality) and BSHIFT (insert/delete slot shift) round out the family. Lookups on a 5-level B-tree drop from ~600 cycles to ~60.
Position-independent code primitives: LAPC (load address PC-relative), JALPC (long-range call), CALLM (vtable dispatch), JMPXPC (PC-relative indexed jump for switch tables) — every standard auipc+something pair collapses to one instruction.
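
To pin down what the B-tree search primitive computes, here is a software reference model of BSRCH.W in C. It assumes unsigned 32-bit keys and a result of 16 when every key is below the target; neither detail is stated above, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define BSRCH_W_KEYS 16   /* 16 x 32-bit keys = one 64-byte node */

/* Index of the first key >= target, or 16 if there is none.
 * Because the keys are sorted, that index equals the number of keys
 * strictly below the target, so no data-dependent branch is needed. */
static inline unsigned bsrch_w_model(const uint32_t keys[BSRCH_W_KEYS], uint32_t target)
{
    unsigned idx = 0;
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++)
        idx += (keys[i] < target);   /* 0 or 1, summed like a popcount */
    return idx;
}

/* Debug aid: check the sortedness precondition the model relies on. */
static inline void bsrch_w_check_sorted(const uint32_t keys[BSRCH_W_KEYS])
{
    for (unsigned i = 1; i < BSRCH_W_KEYS; i++)
        assert(keys[i - 1] <= keys[i]);
}
```

The sixteen comparisons are independent of one another, which is what lets hardware evaluate them side by side instead of walking a binary-search decision tree.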

Xstack — Hardware Stacks in BSRAM

Three hardware-managed stacks (user / supervisor / machine), each backed by dedicated FPGA BSRAM with wide ports. A function prologue that saves ra plus s0–s7 is one instruction:

PUSH rlist=01000, spimm=4    ; save ra, s0–s7, allocate 64 bytes of frame

The corresponding epilogue is also one instruction:

POPRET rlist=01000, spimm=4  ; restore ra, s0–s7, free frame, return

Nine registers (ra plus s0–s7) transfer in 2 cycles on Ant64's 288-bit BSRAM port, since 9 × 64 = 576 bits is exactly two port widths; the standard equivalent is ~20 instructions across prologue and epilogue. Interrupt handlers see the same win on entry/exit, and a context switch (full callee-saved spill + USWAP + restore) is roughly 6–10 cycles instead of 30+.

The hardware stacks coexist with the standard x2/DDR3 stack; the compiler picks per function based on size and addressing patterns.
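
The cycle figure for PUSH and POPRET follows from the port width. A minimal sketch of that arithmetic, assuming 64-bit registers are packed back to back across the 288-bit port with no per-transfer overhead (the packing scheme is our assumption, and the helper name is illustrative):

```c
#include <assert.h>

/* Cycles to move nregs 64-bit registers over a port of port_bits bits. */
static inline unsigned xstack_transfer_cycles(unsigned nregs, unsigned port_bits)
{
    assert(port_bits > 0);
    unsigned bits = nregs * 64u;
    return (bits + port_bits - 1u) / port_bits;   /* ceiling division */
}
```

With these assumptions xstack_transfer_cycles(9, 288) gives 2: ra plus s0–s7 is 576 bits, exactly two port widths.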

Xcond — Conditional Execution

Every R-type ALU instruction in wide mode can be predicated. When the predicate-enable bit in the wide-fetch extension nibble is set, the instruction executes only when a 6-bit condition holds (8 test modes × 8 conditions). Use cases:

ADD.cond  sum, t0, GT_RS1     ; if (t0 > 0) sum += t0
SUB.cond  counter, limit, GE  ; if (counter >= limit) counter -= limit
RSUB.cond x, x0, LT_RD        ; if (x < 0) x = 0 - x   (abs in one instruction)

For data-dependent inner-loop conditionals on random inputs, this eliminates the branch mispredict entirely — typically a 1.5×–2× speedup on the affected loop. RSUB-cond gives single-instruction abs, beating even Zbb's 2-instruction neg/max sequence.
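
For readers who want the semantics spelled out, the three examples above can be modelled in portable C as branchless selects. This is a sketch of what the comments on those instructions say, not a statement of the hardware implementation; the function names are ours, and the final conversions back to int64_t rely on the two's-complement behaviour every mainstream compiler provides.

```c
#include <assert.h>
#include <stdint.h>

/* ADD.cond sum, t0, GT_RS1 :  if (t0 > 0) sum += t0 */
static inline int64_t add_cond_gt_rs1(int64_t sum, int64_t t0)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(t0 > 0);   /* all ones when taken */
    return (int64_t)((uint64_t)sum + ((uint64_t)t0 & mask));
}

/* SUB.cond counter, limit, GE :  if (counter >= limit) counter -= limit */
static inline int64_t sub_cond_ge(int64_t counter, int64_t limit)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(counter >= limit);
    return (int64_t)((uint64_t)counter - ((uint64_t)limit & mask));
}

/* RSUB.cond x, x0, LT_RD :  if (x < 0) x = 0 - x */
static inline int64_t rsub_cond_lt_rd(int64_t x)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(x < 0);
    return (int64_t)(((uint64_t)x ^ mask) - mask);      /* negate when mask is set */
}

/* Usage: each helper leaves its first operand alone when the test fails. */
void xcond_demo(void)
{
    assert(add_cond_gt_rs1(10, 5) == 15);
    assert(add_cond_gt_rs1(10, -5) == 10);
    assert(sub_cond_ge(7, 5) == 2);
    assert(rsub_cond_lt_rd(-42) == 42);
}
```

The masks play the role of the predicate: the ALU result is always computed, and the condition only decides whether it is kept.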

Xlate — Memory Translators

Each register has a software-configurable read translator (applied on load) and write translator (applied on store), drawn from a bank of 12 fixed transformations, including identity, nibble swap, bit reverse, byteswap-16/32/64, halfword/word swap, and bit-reverse-16/32/64.

No new instructions: once configured, every existing load and store transparently passes data through the translator. Endian conversion for network protocols, bit reversal for SPI peripherals, mixed-endian DSP file formats, BCD digit reorder — all become zero-overhead per access:

__xlate_rd(t0, BSWAP32);            // configure once: loads into t0 now pass through byteswap-32
for (i = 0; i < n; i++) {
    t0 = *buf++;                     // compiles to a plain lw; value arrives byte-swapped, no extra instruction
    process(t0);
}

The most distinctive feature is the involutory property: setting matching read and write translators on a register gives a "private host-order view" of foreign-format memory, where the program operates on host-order values throughout and the translation happens invisibly at every load and store.
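
The involutory property can be demonstrated with a small software model. The sketch below stands in for a register with matching byteswap-32 read and write translators; the function names are ours, and the "raw" values are what a little-endian load would see for big-endian data.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the byteswap-32 translator. */
static inline uint32_t xlate_bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* One load/modify/store round trip with matching read and write
 * translators: `raw` is the word as it sits in foreign-order memory. */
static inline uint32_t xlate_increment_foreign(uint32_t raw)
{
    uint32_t host = xlate_bswap32(raw);   /* read translator on load    */
    host += 1;                            /* program works in host order */
    return xlate_bswap32(host);           /* write translator on store  */
}

void xlate_demo(void)
{
    /* Involution: applying the translator twice is the identity. */
    assert(xlate_bswap32(xlate_bswap32(0x11223344u)) == 0x11223344u);
    /* Big-endian 255 as a little-endian load sees it, incremented to 256. */
    assert(xlate_increment_foreign(0xFF000000u) == 0x00010000u);
}
```

Because each listed transformation is its own inverse, the same setting serves for both directions; on FireStorm the two xlate_bswap32 calls would disappear into the load and the store.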

Xctx — Hardware Context Switching

A pool of hardware-resident execution contexts (32), each holding a full register file + key per-task state in dedicated BSRAM. A context switch is one instruction:

YIELD       ; save current context, load next ready context  (~30 cycles on Ant64)
HALT        ; suspend until externally resumed
NEW rd, rs1, rs2   ; spawn new context at entry PC rs1, stack rs2
RESUME rs1  ; wake a halted context

A cooperative fiber yield is one instruction. Producer/consumer with proper blocking is HALT/RESUME (no kernel call, no syscall path). A preemptive scheduler needs only the slice-timer CSR and no software ISR at all: the hardware switches contexts when the slice expires. Multi-core handoff is specified for v0.2: a context preempted on one core can resume on any other free core via a shared ready queue.
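
The state changes behind these four instructions can be sketched as a tiny software model. Everything about the policy here is an assumption made for illustration: round-robin selection, lowest-free-slot allocation for NEW, and context 0 as the boot context. The model tracks only scheduling state, not register contents, and the real hardware ready queue may behave differently.

```c
#include <assert.h>

#define CTX_COUNT 32   /* pool size given in the text */

enum ctx_state { CTX_FREE, CTX_READY, CTX_HALTED };

struct ctx_pool {
    enum ctx_state state[CTX_COUNT];
    unsigned current;              /* context now running */
};

static void ctx_init(struct ctx_pool *p)
{
    for (unsigned i = 0; i < CTX_COUNT; i++) p->state[i] = CTX_FREE;
    p->state[0] = CTX_READY;       /* assumed boot context */
    p->current = 0;
}

/* NEW: claim the lowest free slot; returns its id, or -1 if the pool is full. */
static int ctx_new(struct ctx_pool *p)
{
    for (unsigned i = 0; i < CTX_COUNT; i++)
        if (p->state[i] == CTX_FREE) { p->state[i] = CTX_READY; return (int)i; }
    return -1;
}

/* YIELD: switch to the next ready context after the current one. */
static unsigned ctx_yield(struct ctx_pool *p)
{
    for (unsigned step = 1; step <= CTX_COUNT; step++) {
        unsigned cand = (p->current + step) % CTX_COUNT;
        if (p->state[cand] == CTX_READY) { p->current = cand; break; }
    }
    return p->current;
}

/* HALT: park the current context, then run whatever is ready. */
static unsigned ctx_halt(struct ctx_pool *p)
{
    p->state[p->current] = CTX_HALTED;
    return ctx_yield(p);
}

/* RESUME: make a halted context ready again. */
static void ctx_resume(struct ctx_pool *p, unsigned id)
{
    assert(id < CTX_COUNT);
    if (p->state[id] == CTX_HALTED) p->state[id] = CTX_READY;
}
```

A producer/consumer pair maps onto this directly: the consumer calls the HALT equivalent when its queue is empty and the producer calls the RESUME equivalent after enqueueing.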

Xmath — Games, Audio, DSP Math Acceleration

The Xmath extension targets the math operations that dominate inner loops in games, audio synthesis, fixed-point DSP, and retro-style demoscene code. ~64 instructions across twelve groups:

Integer fused MAC (MADD, MSUB, MADDH, MADDU, MADDW): the fixed-point DSP workhorse. 1-cycle throughput on DSP-block-backed hardware.
Saturating arithmetic (ADDS, SUBS, SAT.B, SAT.H, SAT.W, SHIFTSAT): the audio mix-down primitive — clamp to type range without wrap-around. Audio without saturation produces audible distortion; with it, clean output. SHIFTSAT.H is the universal Q15 → int16 audio output conversion in one instruction.
Min/Max/Sign/Abs: branchless conditionals, clipping, bounding-box tests in 1 cycle each.
FP approximations (FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2): game-grade transcendentals at ~0.05% relative error, 3–4 cycle latency. The Quake-Q_rsqrt use case becomes a single instruction.
BAM (Binary Angle Measure) trigonometry (FSINBAM, FCOSBAM, FSINCOSBAM, FRAD2BAM, FBAM2RAD): retro/demoscene-native angle representation. Perfect modular wraparound on integer angles, single-cycle range reduction, the right primitive for rotozoomers, plasma effects, wavetable oscillators, particle rotation.
3D Vector Math Bundles (DOT3, DOT4, CROSS3, LENSQ3, LERP): fixed-shape 3D vertex math. One instruction per dot/cross/length, with internal FMA chaining.
Vector Componentwise Bundles (VADD3, VSUB3, VSCALE3, VMADD3, VNORM3): the physics / steering / collision-response toolkit. position += velocity * dt becomes a single VMADD3 instruction.
2D Math Primitives (DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2): the navmesh / top-down / raycaster setup primitives. CROSS2 is the funnel-algorithm core.
Game / Animation Math (CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP): UI easing, shader uniforms, AI parameter clamping. The Perlin-noise SMOOTHERSTEP for procedural generation.
Distance Heuristics (MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2): integer A* heuristics on grid worlds. The classic admissible heuristics, one instruction each.
Quaternion Math (QMUL, QROT): the skeletal-animation primitive. Every bone update per character per frame.
Multi-precision integer (ADDC, SUBC, ROLC, RORC): add-with-carry / borrow chains and through-carry rotates for 128-bit+ integer math, bignum shifts and crypto. Backed by a single carry CSR (xcarry), the core's only condition-code state, kept deliberately minimal so it doesn't reintroduce the dual-issue-serialising flag register the rest of the design avoids. One instruction per 64-bit limb; no branch-on-carry (the bit is read into a GPR and tested with a standard branch).
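
The multi-precision group is easiest to understand through a reference model of the carry chain. The sketch below models one ADDC step and a multi-limb add built from it; the `xcarry` parameter stands in for the carry CSR, and the exact operand order of the real instruction is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one ADDC step: returns a + b + carry_in and writes the
 * carry-out back to *xcarry (standing in for the xcarry CSR bit). */
static inline uint64_t addc_model(uint64_t a, uint64_t b, unsigned *xcarry)
{
    assert(*xcarry <= 1u);
    uint64_t partial = a + b;                  /* wraps modulo 2^64        */
    unsigned c1 = partial < a;                 /* carry out of a + b       */
    uint64_t sum = partial + *xcarry;
    unsigned c2 = sum < partial;               /* carry out of + carry_in  */
    *xcarry = c1 | c2;                         /* at most one can be set   */
    return sum;
}

/* Multi-limb add, least significant limb first: one ADDC per 64-bit limb.
 * Returns the final carry-out. */
static inline unsigned bignum_add(uint64_t *dst, const uint64_t *a,
                                  const uint64_t *b, size_t limbs)
{
    unsigned carry = 0;
    for (size_t i = 0; i < limbs; i++)
        dst[i] = addc_model(a[i], b[i], &carry);
    return carry;
}
```

This is the add-plus-sltu carry recovery that standard RV64 code writes out per limb; with ADDC each loop iteration collapses to a single instruction.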

MADD     acc, sample, coef, acc                ; FIR filter tap, 2 cycles
SAT.H    out, mix                              ; audio output clamp, 1 cycle
FRSQRT.S inv_len, lensq                        ; vector normalisation, 3 cycles
FSINCOSBAM.S  sin, angle_bam                   ; rotation matrix per frame, 3 cycles
DOT4.S   result, mrow, vertex                  ; matrix-times-vector row, 8 cycles
VMADD3.S pos, pos, vel, dt                     ; physics integration, 4 cycles
OCTILE2  h, neighbour, goal                    ; A* heuristic, 3 cycles
QMUL.S   world_q, parent_q, local_q            ; bone composition, 8 cycles
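
The saturating primitives in the listing are simple to model in C, which also shows why they matter for audio. The sketch assumes SAT.H clamps to the int16 range and SHIFTSAT.H performs an arithmetic right shift followed by the same clamp; the rounding behaviour of the real instruction is not specified above, so truncation toward negative infinity is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right (well defined for negative values). */
static inline int64_t asr64(int64_t v, unsigned shift)
{
    assert(shift < 64);
    return v < 0 ? ~(~v >> shift) : v >> shift;
}

/* Model of SAT.H: clamp to the int16 range instead of wrapping. */
static inline int16_t sat_h(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of SHIFTSAT.H: scale an accumulator down, then clamp. */
static inline int16_t shiftsat_h(int64_t acc, unsigned shift)
{
    return sat_h(asr64(acc, shift));
}
```

Mixing two voices at 30000 each gives sat_h(60000) = 32767, a clipped but recognisable peak; a wrapping add would instead produce a large negative sample, which is the audible distortion the text refers to.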

Xmath is available in both narrow and wide modes: every Xmath instruction works in narrow-mode (32-bit instruction word) FireStorm binaries running from DDR3. Wide mode adds access to x32–x63 / f32–f63 but no additional opcodes.

Wide-mode-only enhancements within Xmath:

Xcond predication on all R-type Xmath instructions (G2–G11). One bit in the wide-mode nibble (PRED-EN at bit 35) gates the writeback by a predicate register, giving conditional FRSQRT, conditional VMADD3, conditional QMUL, etc. Branchless conditional math without explicit compare-and-jump.
Precision-mode bit on G4 FP approximations. Bit 34 of the nibble selects between approximate (3 cycles, ~0.05% error) and refined Newton-Raphson (6 cycles, ~10⁻⁹ error). Same opcode, different precision/speed tradeoff via the .R assembly suffix — game inner loops default to approximate; physics simulation or precision-sensitive code uses .R.
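
The approximate-versus-refined trade-off can be illustrated in software. The sketch below uses the well-known bit-trick seed for 1/sqrt(x) as a stand-in for a hardware approximation, followed by Newton-Raphson steps. The seed constant and its error figures belong to that software trick, not to FireStorm's FRSQRT; only the refinement idea carries over. It assumes IEEE-754 binary32 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Coarse 1/sqrt(x) seed: classic software bit trick, standing in for a
 * hardware approximation (assumption: this is not FireStorm's method). */
static inline float rsqrt_seed(float x)
{
    assert(x > 0.0f);
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* type-pun without UB */
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y^2 - x:
 *   y' = y * (1.5 - 0.5 * x * y * y)
 * Each step roughly squares the relative error. */
static inline float rsqrt_refine(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

Each refinement step costs a few multiplies and roughly doubles the number of correct bits, which is the same shape of trade-off the .R suffix exposes: more cycles for a much smaller error.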

Division of labour with dedicated hardware. The Ant64 platform pairs FireStorm with dedicated drawing/audio chipset hardware that handles pixel-level work (sprite blits, texture mapping, raster operations, audio mixdown). Xmath therefore focuses on the CPU's role: setting up draw calls (transform matrices, frustum culling, visibility), game state math (physics, AI, animation), and collision/pathfinding. The CPU and chipset hardware coordinate via memory-mapped command queues and DMA.

Xmath occupies the opcode space previously reserved for the RISC-V V (vector) extension. FireStorm does not implement V; for its target workloads (games, audio, retro, demoscene), Xmath's scalar fused operations capture most of the practical benefit V would provide at vastly lower implementation complexity. V could still be added in v0.3+ at a different opcode allocation if a clear data-parallel workload emerges that Xmath doesn't address.

Microarchitecture Highlights

The seven extensions describe what FireStorm does at the ISA level. The microarchitecture that runs them adds three more wins on top:

Dual-issue execution. In wide mode, two consecutive instructions can issue in the same cycle when they're independent (no register dependency, neither is memory or branch). Both RVC-pair and 32-bit-pair dual-issue are supported. Typical gain: 10–20% IPC on RVC-heavy code, up to 25% on ALU-dense 32-bit code.
Register scoreboarding. Multi-cycle operations (MUL, DIV, FP arithmetic, D-cache-miss loads) issue to their functional units and mark their destination register as pending; subsequent instructions continue executing on the main pipeline until one tries to read a pending register. This hides the latency of slow operations as long as the result isn't immediately consumed. Typical gain: 10–15% on DSP code, 25–40% on DIV-heavy code, 30–50% on D-cache-miss-bound workloads.
Hardened DSP-block multipliers. FireStorm uses the GW5AST's hardened DSP blocks (each containing a 27×18 multiplier, a 12×12 auxiliary, and a 48-bit accumulator) for integer MUL and FP FMA. This delivers 1-cycle throughput MUL and FP FMA at the full 380 MHz clock, with 2–3 cycle latency for MUL and 4–5 cycle for FP64 FMA. The GW5AST-138 has 298 DSP blocks total — the CPU uses ~16–20, leaving ~278 for future vector and DSP-extension features.
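
The dual-issue pairing rule described above can be written down as a small predicate. This is a sketch of the stated rule only: "no register dependency" is interpreted here as no read-after-write and no write-after-write hazard between the pair, which is our assumption, and the instruction record is a deliberately simplified stand-in for a decoded instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum insn_class { INSN_ALU, INSN_MEM, INSN_BRANCH };

struct insn {
    enum insn_class cls;
    uint8_t rd;          /* destination register, 0 = x0 / none */
    uint8_t rs1, rs2;    /* source registers, 0 = x0 / unused   */
};

/* True when `second` may issue in the same cycle as `first`. */
static inline bool can_dual_issue(const struct insn *first, const struct insn *second)
{
    assert(first && second);
    if (first->cls != INSN_ALU || second->cls != INSN_ALU)
        return false;                                   /* memory or branch  */
    if (first->rd != 0 &&
        (second->rs1 == first->rd || second->rs2 == first->rd))
        return false;                                   /* read-after-write  */
    if (first->rd != 0 && second->rd == first->rd)
        return false;                                   /* write-after-write */
    return true;
}
```

A scheduler that wants the quoted IPC gains would interleave independent ALU chains so that adjacent instructions pass this test as often as possible.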

Execution model: in-order issue, out-of-order completion. This is sometimes called "shallow OoO" or "in-order superscalar with scoreboarding" in the literature. Closest industry analogues: ARM Cortex-A55 and SiFive U74, both dual-issue in-order cores. FireStorm does not rename registers, speculate past unresolved branches, or use a reorder buffer; those are full-OoO features that would triple the verification surface for ~20% more performance. The shallow-OoO model captures most of the practical gain at a fraction of the complexity.

These compose with the ISA wins (Xwide register pressure relief, Xcrisp memory primitives, Xcond predication, etc.). On representative FireStorm workloads — audio synthesis, retro emulation, interpreter dispatch, OS event loops — the cumulative speedup vs vanilla RV64GC at the same clock can exceed 50%, half from the ISA changes and half from the microarchitecture. The target FPGA clock on GoWin GW5AST is ~380 MHz, set by the BSRAM peak rate; the pipeline (5–7 stages) is balanced to fit.

Compared to ARM (AArch64)

Modern ARM is FireStorm's closest peer in design philosophy. Both are clean 64-bit RISCs with a strong focus on compiler-friendliness and high performance. The differences are concentrated in what each treats as a primitive.

Feature	AArch64	FireStorm
GPRs	31 (x0–x30 + sp + xzr)	32 narrow / 64 wide
FPRs	32 (v0–v31, shared with SIMD)	32 narrow / 64 wide
Conditional select	csel, csneg, csinv, cset	Predicated R-type (every ALU op, including Xmath G2–G11)
Auto-increment addressing	pre/post-index on loads/stores	LBPI..LDPD, SBPI..SDPD families
Indexed addressing	LDR with shifted register offset	LBX..LDX with ×1–×128 scales
Load-multiple	LDP/STP (pair only)	PUSH/POP with rlist (up to 32 regs)
Hardware stacks	No	Yes: BSRAM-backed, one per privilege level (3 total)
Memory-fused arithmetic	No	LWADD, MMWADD families
Memory translators	No (rev/rbit instructions only)	Per-register read/write translators
Hardware threading	No (software pthreads)	Xctx with hardware ready queue
DMA in ISA	No	DMACPY/DMASET with register-tag sync
Immediate construction	MOVZ/MOVK 16-bit chunks; 4 instructions for 64-bit	Wide-mode imm14/imm23 + LIZ/LIK (1 instruction for many 32-bit; ≤4 for 64-bit)
Instruction fetch	Multi-level cache hierarchy	BSRAM prefetch buffer pool (no I-cache)
Data access	Multi-level cache hierarchy	Scratchpad BSRAM + 8 KB tiny D-cache + Harvard SRAM
Bulk SIMD	NEON / SVE / SVE2 (full vector engine)	Not in v0.1 (OP-V opcode `0x57` reallocated to Xmath)
Integer fused MAC	SMADDL, NEON SDOT	MADD (1 instr, same idea)
Saturating arithmetic	SQADD, SSAT, USAT family	ADDS, SAT.B/H/W, SHIFTSAT
FP reciprocal / 1/√	FRECPE, FRSQRTE	FRECIP, FRSQRT (better accuracy: ~0.05% vs ~0.3%); `.R` refined to ~10⁻⁹
Hardware sin/cos	Library only	FSIN, FCOS, FSINCOS (3 cycles)
BAM (binary angle) trig	None	FSINBAM, FCOSBAM, FSINCOSBAM
3D dot product	FMLA chain (3+ instrs)	DOT3 (1 instr)
3D cross product	EXT + FMUL + FSUB (~6 instrs)	CROSS3 (1 instr)
Quaternion math	Software	QMUL, QROT (1 instr)
Sorted-array search	NEON CMHS + UMINV (~3 instrs, 16 bytes)	BSRCH.W (1 instr, 16 keys per cache line, 4 cyc)
Octile pathfinding heuristic	Software	OCTILE2 (1 instr)
Memory model	Acquire/release, relaxed/SC	RVWMO inherited from RISC-V

The areas where AArch64 unambiguously beats FireStorm are bulk SIMD (NEON vectors are 128 bits wide and SVE scales from 128 to 2048 bits; Xmath operates on scalar register tuples) and ecosystem (decades of Cortex-A optimisation; FireStorm is new). On clock speed, FireStorm is FPGA-bound: Ant64 targets a few hundred MHz, not the multi-GHz of mobile silicon.

Where FireStorm leads is in primitives ARM never added: hardware stacks, memory-fused arithmetic, per-register translators, hardware context switching, BAM trigonometry, quaternion math, single-instruction cross product / vector normalisation / LERP / smoothstep, B-tree node search, and integer pathfinding heuristics. For the workloads Ant64 targets — sustained audio synthesis, retro-emulation, game CPU work (physics / AI / pathfinding / animation), and in-memory indexed data structures — these primitives compound into substantial code-size and cycle-count wins versus equivalent AArch64 implementations.

Concrete comparison: an 8-tap FIR filter inner loop. AArch64 uses NEON for a vectorised 4-sample-at-a-time version that's hard to beat for pure throughput. FireStorm in wide mode does it scalar with Xmath MADD for the inner-product tap plus Xcrisp auto-inc loads — ~12 cycles per output sample vs ~50 on baseline RV64GC. NEON still wins on the high-end ARM, but the FireStorm version uses no SIMD hardware, fits in modest FPGA fabric, and is much simpler to verify formally.
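
For reference, this is the computation the FIR comparison is about, written as portable C. It is a sketch of one output sample of an 8-tap Q15 filter; the comments mark where a FireStorm build would plausibly use auto-increment loads, MADD, and SHIFTSAT.H, but that mapping is our reading of the text, not compiler output.

```c
#include <assert.h>
#include <stdint.h>

#define FIR_TAPS 8

/* One output sample: dot product of 8 Q15 samples with 8 Q15 coefficients,
 * scaled back to Q15 and clamped to the int16 range. */
static inline int16_t fir8_q15(const int16_t *samples, const int16_t *coefs)
{
    assert(samples && coefs);
    int64_t acc = 0;
    for (unsigned tap = 0; tap < FIR_TAPS; tap++) {
        int32_t s = *samples++;          /* models an auto-increment load */
        int32_t c = *coefs++;            /* models an auto-increment load */
        acc += (int64_t)s * c;           /* models one MADD               */
    }
    int64_t q15 = acc < 0 ? ~(~acc >> 15) : acc >> 15;   /* arithmetic shift */
    if (q15 > INT16_MAX) q15 = INT16_MAX;                /* models SHIFTSAT.H */
    if (q15 < INT16_MIN) q15 = INT16_MIN;
    return (int16_t)q15;
}
```

With all eight coefficients set to 4096 (0.125 in Q15) the filter is a moving average, so a constant input of 8000 comes out as 8000.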

Quaternion bone update for skeletal animation: AArch64 needs ~28 instructions of NEON or scalar FP for quat_mul; FireStorm's QMUL is 1 instruction at 8 cycles. For a 50-bone character that is ~1,400 instructions per frame on AArch64 against ~400 cycles on FireStorm. At 60 fps, and counting each AArch64 instruction as one cycle, that is ~84K cycles per second versus ~24K, a ~3.5× reduction that scales linearly with character count.
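
The instruction count for the software version comes from the shape of the Hamilton product itself, shown below as plain C. The operand order and component layout used by QMUL are not given above, so treat the struct and function as an illustrative reference, not a definition of the instruction.

```c
#include <assert.h>

struct quat { float w, x, y, z; };

/* Hamilton product r = a * b: 16 multiplies and 12 adds/subtracts. */
static inline struct quat quat_mul(struct quat a, struct quat b)
{
    struct quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}

/* Usage: the basis quaternions obey i * j = k. */
void quat_demo(void)
{
    struct quat i = { 0, 1, 0, 0 }, j = { 0, 0, 1, 0 };
    struct quat k = quat_mul(i, j);
    assert(k.w == 0 && k.x == 0 && k.y == 0 && k.z == 1);
}
```

Sixteen multiplies plus twelve additions is 28 scalar operations before any fusing, which is consistent with the ~28-instruction estimate for a software quat_mul.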

Compared to Motorola 68000 Family

The 68k is the architecture FireStorm shows the most direct lineage to. Anyone who wrote 68k assembly for the Amiga, Atari ST, or Macintosh will recognise multiple FireStorm primitives as direct descendants.

Pattern	68k	FireStorm
Post-increment load	`move.l (a0)+,d0`	`LWPI x10, 4(x11)+`
Pre-decrement store	`move.l d0,-(a0)`	`SWPD x10, 4(x11)`
Indexed addressing	`move.l (8,a0,d1.l),d0`	`LWX x10, (x11, x12, ×4)` (plus offset via separate add)
Register frame	`LINK a6,#-N` / `UNLK a6`	`ENTER spimm` / `LEAVE spimm` (Xstack)
Move multiple	`movem.l d0-d7/a0-a6,-(sp)`	`PUSH rlist=01100` (Xstack)
Conditional move	`Scc` instructions (set on condition)	`MOV.cond rd, rs1, condition`
PC-relative	`move.l label(pc),d0`	`LDPC x10, label`
Bitwise tests then branch	`btst #N,d0` + `beq`	`BEQM x10, 0(x11), label` (compare-mem-branch)

What FireStorm inherits from 68k:

Auto-increment/auto-decrement addressing modes are first-class in both. The 68k versions are general (any data instruction can use (An)+); FireStorm restricts to loads/stores but adds 64-bit width and explicit scale.
Frame management is built in. 68k's LINK/UNLK and MOVEM map directly to FireStorm's ENTER/LEAVE and PUSH/POP rlist patterns. The 68k version stores frames in DRAM; FireStorm stores them in BSRAM, which is the headline architectural improvement.
Rich addressing modes. 68k's d(An,Xn.s) and FireStorm's LWX are conceptually the same operation. FireStorm adds wider scales (×16, ×32, ×64, ×128) for matrix and DRAM-burst stride patterns the 68k era didn't anticipate.

What FireStorm leaves behind from 68k:

Variable-length instructions. 68k instructions are 2–10 bytes on the original 68000 (and up to 22 bytes with the 68020's extended addressing modes), depending on addressing mode. This was great for code density but disastrous for pipelining and parallel decode. FireStorm is fixed 32-bit (with optional 16-bit compressed via RVC), trading a small density loss for substantial decode-throughput gain.
Architectural distinction between data and address registers. 68k's split of D0–D7 and A0–A7 was elegant in 1979 but became an optimisation barrier. FireStorm has a single unified register file.
Implicit condition codes. 68k operations implicitly update the CCR; FireStorm follows RISC-V in having no general flags register, using explicit comparison results. The single xcarry bit used by the Xmath multi-precision instructions is the one exception.

The 68k achieved beautiful assembly programming ergonomics; FireStorm aims for the same feel at modern pipeline depths. A direct port of well-written 68k C code to FireStorm typically compiles to similar instruction counts but runs an order of magnitude faster on equivalent silicon.

Compared to MIPS R3000/R4000

The MIPS R3000 (PlayStation, early Silicon Graphics workstations) and R4000/R4300i (Nintendo 64, later SGI machines) defined classic 32/64-bit RISC. Anyone who hand-optimised assembly for the PS1 or N64 has direct experience with their strengths and limitations.

Feature	MIPS R3000	MIPS R4000	FireStorm
Word size	32-bit	64-bit	64-bit registers (32-bit instruction words in narrow mode, 36-bit in wide mode)
Pipeline	5-stage in-order	8-stage in-order	5–7 stages, in-order issue (implementation-defined)
Branch delay slot	Required	Required	No (RISC-V choice)
Load delay slot	Required (R3000)	Hidden	None
GPRs	32 (r0 hardwired zero)	32	32 narrow, 64 wide
HI/LO multiply	Separate registers	Separate registers	Standard GPR result (M extension)
Addressing modes	`reg + offset` only	`reg + offset` only	Auto-inc, indexed, PC-relative, all here
Conditional move	No	No (MOVN/MOVZ arrived later, with MIPS IV)	Full predication (Xcond)
Multiply/divide	Multi-cycle, async to HI/LO	Multi-cycle	Multiply at 1-cycle throughput, 2–3 cycle latency (M ext); divide TBD
Frame management	Software only	Software only	Hardware (Xstack)

The R3000 was the canonical example of RISC done right for its era: tiny, fast, easy to pipeline. FireStorm follows the same lineage but addresses the things that aged badly:

No delay slots. The MIPS branch delay slot was a clever way to extract pipeline performance from simple silicon, but it complicated everything downstream: assemblers, compilers, exception handlers, emulators. RISC-V removed delay slots; FireStorm keeps that decision. The cost is one cycle per branch on naive implementations, which modern branch prediction largely recovers.
No restriction to a single addressing mode. R3000's lw $t0, off($t1) was the only available addressing mode. FireStorm adds auto-inc, indexed, PC-relative, and memory-fused variants, exactly the patterns that compiled to 3–5 instruction sequences on MIPS.
Unified multiply result. R3000's HI/LO registers required separate mflo/mfhi instructions to retrieve multiply results. The standard RISC-V M extension uses normal GPRs for multiply outputs, which composes much better with surrounding code. FireStorm inherits this.

Where MIPS shines is simplicity and provable timing — for a small in-order CPU, R3000 is hard to beat for verifiability. FireStorm's extensions add complexity (and BSRAM area for hardware stacks and contexts), trading verifiability for performance and code density. For embedded MCU-scale work, R3000 is still defensible; for the workloads Ant64 targets, FireStorm's primitives pay back the silicon cost many times over.

A direct comparison: a function prologue and epilogue saving ra plus 8 callee-saved registers and allocating a frame, on each architecture:

MIPS R3000:
    addiu  $sp, $sp, -68
    sw     $ra, 64($sp)
    sw     $s0, 60($sp)
    sw     $s1, 56($sp)
    sw     $s2, 52($sp)
    sw     $s3, 48($sp)
    sw     $s4, 44($sp)
    sw     $s5, 40($sp)
    sw     $s6, 36($sp)
    sw     $s7, 32($sp)
    ; ... body ...
    lw     $ra, 64($sp)
    lw     $s0, 60($sp)
    ; ... 7 more loads ...
    jr     $ra
    addiu  $sp, $sp, 68    ; in branch delay slot

21 instructions (10 in the prologue, 11 in the epilogue), plus delay-slot scheduling complications.

FireStorm:
    PUSH    rlist=01000, spimm=4
    ; ... body ...
    POPRET  rlist=01000, spimm=4

2 instructions, so 19 instructions saved per function call. The N64's tight ROM budget would have loved this.

Compared to Vanilla RISC-V (RV64GC)

This is the most direct comparison, since FireStorm is RV64GC at its core. The question is what the extensions add over the baseline.

Feature	RV64GC (with Zba/Zbb/Zbs/Zcmp)	FireStorm
GPRs	32	64 in wide mode
Auto-increment addressing	No	Yes (Xcrisp)
Indexed addressing	Zba sh1add/sh2add/sh3add (×2/×4/×8 only)	LWX with ×1–×128
Memory-fused arithmetic	No	Yes (LWADD, MMWADD, etc.)
Block memory primitive	No	BMCPY, BMSET, DMACPY, DMASET
Sorted-array / B-tree search	No (software loop)	BSRCH.B/H/W/D, BSCAN.B/H/W/D, BSHIFT
Compressed prologue/epilogue	Zcmp (DRAM stack)	Xstack (BSRAM, faster)
Conditional execution	Zicond (czero only)	Xcond (full predication, including Xmath G2–G11)
PIC primitives	auipc-based pairs	LAPC, LDPC, JALPC, CALLM (single instruction)
Switch dispatch	auipc + load + sh3add + jr	JMPXPC (single instruction)
Bit reversal	Zbkb brev8 only	Per-register translator (Xlate)
Hardware threading	No	Xctx
Immediate construction	LUI imm20 + ADDI imm12; 6–8 instr for 64-bit constants	Wide-mode imm14 / imm23; LIZ/LIK for 64-bit (≤4 instr)
FPRs in wide mode	32	64
Integer fused MAC	mul + add (2 instr)	MADD (1 instr)
Add-with-carry / multi-precision	software (`add` + `sltu` carry recovery per limb)	ADDC / SUBC / ROLC / RORC (1 instr per limb, `xcarry` bit)
Saturating arithmetic	software (no Zbb support)	ADDS, SAT.B/H/W, SHIFTSAT (1 instr each)
FP reciprocal / 1/√x	FDIV / FSQRT (~30 cyc)	FRECIP / FRSQRT (3 cyc); `.R` refined (6 cyc, ~10⁻⁹)
Hardware sin / cos	None (library)	FSIN / FCOS / FSINCOS (3 cyc)
BAM trig	None	FSINBAM / FCOSBAM / FSINCOSBAM
3D vector primitives	None	DOT3 / DOT4 / CROSS3 / LENSQ3 / VNORM3 / LERP
2D vector primitives	None	DOT2 / LENSQ2 / CROSS2 / VNORM2
Quaternion math	software	QMUL / QROT
Distance heuristics	software	MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2
Game / animation math	software	CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP

The Zba/Zbb/Zbs extensions are excellent and FireStorm benefits from them directly — they're part of the baseline (CLZ, CTZ, CPOP, MIN, MAX, etc. all available). What FireStorm adds is the patterns these extensions don't cover:

Indexed and memory-fused access. Zba sh*add accelerates the address calculation but not the load itself; LWX does the scaled address calculation and the load in one instruction, and LWADD fuses a load with the add that consumes it.
B-tree primitives. RV64GC has no equivalent — sorted-array search and shift are pure software loops with branchy compares. BSRCH gives ~10× lookup speedup on B-tree workloads.
Hardware stacks. Zcmp compresses the DRAM-stack save/restore sequence; Xstack moves the entire stack to BSRAM. Different orders of improvement.
Full predication. Zicond covers c ? a : 0 patterns; Xcond covers if (cond) ALU_OP for every ALU op (including all Xmath G2–G11), with single-instruction abs via RSUB-cond.
Address materialisation in one instruction. Standard RISC-V needs auipc + addi for any non-trivial PC-relative reference. LAPC does it in one wide-mode instruction.
Hardware context switching. Vanilla RV64 software fibers cost ~30 instructions per yield; YIELD is one.
Wider immediates and direct 64-bit construction. RV64's 12-bit ADDI and 20-bit LUI immediates frequently need 2-instruction sequences for values that don't fit. Wide-mode imm14 / imm23 handles many of these in one instruction, and LIZ/LIK build any 64-bit constant in ≤4 instructions — versus 6–8 for vanilla RV64.
Game / audio / DSP math. Where the standard RISC-V world relies on either software libraries (sin, cos, sqrt are software) or the optional V extension (which FireStorm does not implement), Xmath provides scalar fused operations and approximations covering most game-engine and audio-synthesis inner loops.
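
To make the Zicond versus Xcond distinction concrete, here is a small Python reference model. The `czero_eqz` / `czero_nez` functions follow the ratified Zicond semantics; the `predicated` helper and the reverse-subtract `rsub` are hypothetical stand-ins for an Xcond-style "write the result only if the condition holds" operation, since this page does not give the exact encoding or operand order.

```python
MASK64 = (1 << 64) - 1

def czero_eqz(rs1, rs2):
    """Zicond czero.eqz: result is 0 when rs2 == 0, otherwise rs1."""
    return 0 if rs2 == 0 else rs1

def czero_nez(rs1, rs2):
    """Zicond czero.nez: result is 0 when rs2 != 0, otherwise rs1."""
    return 0 if rs2 != 0 else rs1

def select_zicond(c, a, b):
    """c ? a : b built from two czero ops plus an OR (three operations)."""
    return czero_eqz(a, c) | czero_nez(b, c)

def to_signed(x):
    """Interpret a 64-bit pattern as a signed value."""
    return x - (1 << 64) if x >> 63 else x

def rsub(a, b):
    """Reverse subtract: b - a."""
    return b - a

def predicated(cond, op, old_rd, a, b):
    """Hypothetical predicated ALU op: rd = op(a, b) if cond, else rd is unchanged."""
    return op(a, b) & MASK64 if cond else old_rd

def abs_predicated(x):
    """abs as one predicated reverse-subtract: if x < 0 then x = 0 - x."""
    return predicated(to_signed(x) < 0, rsub, x, x, 0)
```

The select idiom needs three modelled operations, while the predicated form expresses the conditional update as a single guarded operation.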

Cumulatively, FireStorm code typically needs 25–40% fewer instructions than RV64GC for the workloads Ant64 targets (audio synthesis, retro emulation, game CPU work, system code, database/indexed structures). For pure arithmetic kernels with little control flow, the savings are smaller (often single-digit-percent); for control-flow-heavy, memory-loop-heavy, or game-math-heavy code, the savings are larger (often 50%+).

The trade-off is silicon area for the BSRAM banks (Xstack, Xctx, BSRCH comparator array) and the extra decode logic for the custom opcodes. The full FireStorm budget — including the full concurrent context and stack capacity — fits comfortably in the GW5AST-138.

Cross-Architecture Reference Table

A consolidated view of FireStorm alongside the architectures it's most often compared to. Entries are short; refer to the per-CPU sections above for context. FireStorm is shown in both its modes since they have meaningfully different capability profiles.

RISC-V compatibility. FireStorm narrow mode is object-code compatible with standard RV64GC — an unmodified RV64GC binary runs on FireStorm in DDR3 with identical semantics, no recompilation needed. Code that wants Xcrisp / Xstack / Xlate / Xctx primitives recompiles to use them, but the existing binary keeps working. FireStorm wide mode is source-code compatible with RV64GC — the same C / Rust / assembly recompiles into wide-mode sections, gaining access to 64 registers, wider immediates, Xcond predication, and the rest of the wide-mode-only features. Object code is not portable between modes (the encoding differs in the extension nibble), but source code is. This means existing RISC-V toolchains and libraries work as a starting point, and FireStorm extensions are additive opt-ins rather than a separate ISA.

Register File and Data Width

CPU (year)	Word	Int GPRs	Float regs	Reg width
MOS 6502 (1975)	8-bit	1 + 2 (A, X, Y)	—	8-bit
Zilog Z80 (1976)	8-bit	7 + 4 pairs	—	8 / 16-bit
Motorola 68000 (1979)	16/32-bit	8 data + 8 address	— (FPU on 68881/68882)	32-bit
MIPS R4000 (1991)	64-bit	32 (R0 = zero)	32	64-bit
ARMv7 AArch32 (2005)	32-bit	13 + SP/LR/PC	32 (NEON)	32-bit
x86-64 (2003)	64-bit	16 GPR	16 XMM/YMM/ZMM	64-bit
ARMv8 AArch64 (2011)	64-bit	31 (X0–X30) + SP + XZR	32 (V0–V31)	64-bit
RV64GC (2014)	64-bit	32 (x0 = zero)	32	64-bit
FireStorm narrow (2025)	64-bit	32 (x0 = zero)	32	64-bit
FireStorm wide (2025)	64-bit	64	64	64-bit

Immediates and Control Flow

CPU	Largest 1-instr immediate	Branch range (relative)	Jump range	Conditional execution
6502	8 bits	±127 bytes	16-bit absolute	Branches only (BEQ, BNE, BCC, BCS, BMI, BPL, BVC, BVS)
Z80	16 bits (LD HL,nn)	±127 (JR)	16-bit absolute	Conditional jumps, calls, returns
68000	32 bits (variable-length encoding)	±128 bytes (Bcc.S); ±32 KB (Bcc.W); ±2 GB (Bcc.L, 68020+)	32-bit	Bcc family + Scc (set on condition)
MIPS R4000	16 bits (ADDIU); 16-bit LUI for upper	±128 KB	256 MB (J/JAL within segment)	Branches + MOVN/MOVZ (move on non-zero/zero)
ARMv7 AArch32	12-bit rotated	±32 MB (B/BL)	±32 MB	Every instruction is predicated (4-bit cond field)
x86-64	32 bits (most ops); 64 bits (MOV imm64)	±127 (short Jcc) / ±2 GB (long Jcc)	±2 GB	Jcc + CMOVcc + SETcc
ARMv8 AArch64	16 bits per MOVZ/MOVK; up to 4 for 64-bit	±1 MB (B.cond) / ±128 MB (B/BL)	±128 MB	CSEL / CSNEG / CSINV / CSET family
RV64GC	20 bits (LUI); 12 bits (ADDI/branches)	±4 KB	±1 MB (JAL)	Zicond only (czero.eqz / czero.nez)
FireStorm narrow	20 / 12 bits (RV64GC)	±4 KB	±1 MB	Zicond (RV64GC base)
FireStorm wide	23 / 14 bits + LIZ/LIK for 64-bit in ≤4 instructions	±32 KiB	±16 MiB (JAL, slot-aligned)	Xcond — predicated R-type on every ALU op
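
The branch and jump ranges in this table follow from the immediate width and the scale applied to it. The Python sketch below reproduces the arithmetic. It assumes, as this page describes elsewhere, that wide-mode branches use a signed imm14 and jumps a signed imm23, both scaled by 4 bytes; `signed_range_bytes` is a name invented for the example.

```python
def signed_range_bytes(imm_bits, scale):
    """Reach in bytes of a signed imm_bits-wide immediate multiplied by `scale`."""
    lo = -(1 << (imm_bits - 1)) * scale
    hi = ((1 << (imm_bits - 1)) - 1) * scale
    return lo, hi

# RV64GC: 12-bit branch and 20-bit JAL immediates, both scaled by 2 bytes.
rv_branch = signed_range_bytes(12, 2)     # about ±4 KiB
rv_jal = signed_range_bytes(20, 2)        # about ±1 MiB

# Wide mode (assumed imm14 / imm23, slot-aligned so scaled by 4 bytes).
wide_branch = signed_range_bytes(14, 4)   # about ±32 KiB
wide_jal = signed_range_bytes(23, 4)      # about ±16 MiB
```

Scaling by 4 instead of 2 is what doubles the reach for the same number of immediate bits.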

Memory Access and Addressing

CPU	Addressing modes	Indexed (scaled)	Auto-inc / dec	Hardware stack
6502	13 (zero-page, abs, indirect, X/Y indexed)	yes (X, Y, no scale)	no	Fixed 256-byte page 1
Z80	7 (reg, imm, abs, IX/IY ±d, indirect)	yes (IX, IY)	LDIR, LDDR, INI/IND	SP-based
68000	12 (incl. d8(An,Xn) indexed-indirect)	yes (d8(An,Xn.s))	(An)+, -(An)	SP plus USP / SSP for supervisor
MIPS R4000	1 (base + 16-bit offset)	no	no	SP-based, all software
ARMv7	8+ (reg, imm, scaled, pre/post-indexed)	yes (with shift)	pre/post-index on every load/store	SP-based
x86-64	many (SIB + disp; full base-index-scale-disp)	yes (×1/2/4/8)	rep/movs idioms, not auto-inc per se	SP-based with PUSH/POP
ARMv8	11 (reg, imm, scaled, pre/post-indexed)	yes (×1/2/4/8)	pre/post-index	SP-based; LDP/STP for register pairs
RV64GC	1 (base + 12-bit offset)	Zba sh1/2/3add only (×2/4/8)	no	SP-based, all software (Zcmp compresses)
FireStorm narrow	base + offset + Xcrisp auto-inc	LWX-family (×1–×128 scales)	LBPI..LDPD / SBPI..SDPD	Xstack BSRAM stacks (U/S/M)
FireStorm wide	+ indexed addressing + PIC family (LAPC etc.)	full Xcrisp X-type with 8 scales	full Xcrisp	Xstack + Xctx contexts (8 / 32)

Advanced Features

CPU	SIMD / vector	Hardware threading	MMU	Atomics	Defining feature
6502	—	—	—	—	Cheap, simple, 1 MHz changed home computing
Z80	—	—	—	—	Shadow register set; ubiquitous embedded / retro
68000	—	—	external 68851	TAS only	Beautiful orthogonal CISC; rich addressing modes
MIPS R4000	—	—	full TLB	LL / SC	Canonical classic 64-bit RISC; PS1 / N64 era
ARMv7	NEON (64 / 128-bit SIMD)	—	full	LDREX / STREX	Every-instruction predication
x86-64	SSE / AVX / AVX-512	SMT (Hyper-Threading)	full paging	LOCK prefix + CMPXCHG / XADD	Deepest software ecosystem; high IPC
ARMv8	NEON, SVE / SVE2	optional SMT	full	LDXR / STXR + LSE atomics	Clean RISC reset of ARM; mobile + server
RV64GC	optional V extension	—	optional Sv39 / Sv48	A extension (LR / SC + AMO)	Open ISA, modular
FireStorm narrow	Xmath (scalar fused MAC, transcendentals, BAM trig, vector bundles)	—	(TBD; not in v0.1)	RV64 A extension	Xcrisp memory + Xstack + Xlate + Xctx + Xmath on RV64GC base
FireStorm wide	Xmath (scalar fused MAC, transcendentals, BAM trig, vector bundles)	Xctx — 8 / 32 hardware contexts	(TBD)	RV64 A extension	+ 64 registers + Xcond predication + 23-bit immediates + LIZ/LIK + indexed addressing + PIC family + RVC-pair (always) and 32-bit-pair (Ant64) dual-issue + register scoreboarding

Math, DSP, and Game Operations

The Xmath extension targets math operations common in games, audio, and DSP. The following table shows how each operation maps to other architectures. "1 instr" indicates a single dedicated instruction; "N instr" indicates a software sequence.

Operation	FireStorm Xmath	x86 (SSE/AVX)	ARMv8 (NEON)	MIPS	PowerPC	RV64GC base
Fused integer MAC	MADD (1 instr, 2 cyc)	not in base ISA; ~3 instr	SMADDL (1 instr)	MADD → HI/LO (1 instr)	not single-instr	mul+add (2 instr)
Saturating add	ADDS (1 instr)	PADDSB/W (SSE, packed)	SQADD (1 instr)	DSP ASE only	AltiVec vaddsws	not in base
Shift-and-saturate	SHIFTSAT.H (1 instr)	PSRA + PADDS (2-3 instr)	SRSHR + SQADD (2 instr)	DSP ASE multi-instr	multi-instr	software (4-5 instr)
FP reciprocal estimate	FRECIP (1 instr, 3 cyc, ~0.05%)	RCPSS (1 instr, ~5×10⁻⁴)	FRECPE (1 instr, ~3×10⁻³)	RECIP.S (MIPS-3D)	FRES (1 instr)	software (FDIV ~15 cyc)
FP 1/sqrt estimate	FRSQRT (1 instr, 3 cyc, ~0.1%)	RSQRTSS (1 instr)	FRSQRTE (1 instr)	RSQRT.S (MIPS-3D)	FRSQRTE (1 instr)	software (FSQRT+FDIV ~35 cyc)
Refined precision toggle	bit 34 = `.R` suffix (6 cyc, ~10⁻⁹)	software refinement	software refinement	—	—	—
Hardware sin/cos	FSIN/FCOS (1 instr, 3 cyc)	x87 FSIN (~50–100 cyc, dropped in modern x86)	library only	library only	library only	library only
sin + cos paired	FSINCOS (1 instr, 3 cyc)	x87 FSINCOS (~80 cyc)	library only	library only	library only	library only
BAM (binary angle) trig	FSINBAM (1 instr, 2 cyc)	none — radians only	none	none	none	none
Perfect modular angle accum	integer ADD on BAM	FP wraparound branch	FP wraparound	FP wraparound	FP wraparound	FP wraparound
3D dot product	DOT3 (1 instr, 6 cyc)	DPPS with mask (1 instr)	FMLA chain (3 instr)	3-FMUL+2-FADD	vmaddfp+sum	3-FMUL+2-FADD
4D dot product	DOT4 (1 instr, 8 cyc)	DPPS (1 instr)	FMLA chain (4 instr)	4-FMUL+3-FADD	vmaddfp+sum	4-FMUL+3-FADD
3D cross product	CROSS3 (1 instr, 10 cyc)	shuffles+mul+sub (~6 instr)	EXT+FMUL+FMUL+FSUB (~6 instr)	software	software	software (~9 instr)
Vector normalise (3D)	VNORM3 (1 instr, 8 cyc)	DPPS+RSQRTSS+MUL (~4 instr)	FMLA+FRSQRTE+FMUL chain	software	software	software (~12 instr)
Linear interpolation	LERP (1 instr, 3 cyc)	FMSUB+FMADD (2 instr)	FMSUB+FMADD (2 instr)	software	software	software (2-3 instr)
Quaternion multiply	QMUL (1 instr, 8 cyc)	software (~28 instr)	software (~28 instr)	software	software	software
Rotate vec by quaternion	QROT (1 instr, 10 cyc)	software (~30 instr)	software (~30 instr)	software	software	software
Vector componentwise	VADD3/VSUB3/VSCALE3 (1 instr)	ADDPS/SUBPS (packed)	FADD/FSUB (packed)	software (3 instr)	vaddfp+	software (3 instr)
2D vector ops	DOT2/LENSQ2/CROSS2 (1 instr)	software (~3 instr)	software	software	software	software
FP clamp	CLAMP (1 instr, 2 cyc)	MAXSS+MINSS (2 instr)	FMINNM+FMAXNM (2 instr)	software	software	software (3-4 instr)
Cubic ease (smoothstep)	SMOOTHSTEP (1 instr, 3 cyc)	CLAMP+FMUL chain (4-5 instr)	CLAMP+FMUL chain	software	software	software (5+ instr)
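
For readers who want the math behind the last few rows, the following Python sketch gives the conventional software definitions of clamp, linear interpolation, and cubic smoothstep. These are the textbook (shader-style) formulas. Whether CLAMP, LERP, and SMOOTHSTEP use exactly this operand order and edge handling is an assumption, as this page does not specify it.

```python
def clamp(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def lerp(a, b, t):
    """Linear interpolation: a at t = 0, b at t = 1."""
    return a + (b - a) * t

def smoothstep(edge0, edge1, x):
    """Cubic ease: 0 below edge0, 1 above edge1, 3t^2 - 2t^3 in between."""
    t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)
```

Written out this way, smoothstep is a subtract, a divide, a clamp, and a short multiply chain, which is the multi-instruction software sequence the table compares against.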

Database and Search Operations

The Xcrisp B-tree primitives target operations common in databases, key-value stores, and sorted-index workloads.

Operation	FireStorm Xcrisp	x86 (SSE 4.2 / AVX-512)	ARMv8 (NEON)	MIPS	PowerPC	RV64GC base
Sorted-array search (find ≥)	BSRCH.W (1 instr, 4 cyc, 16 keys)	PCMPESTRI (1 instr, 16 bytes); AVX-512 mask+TZCNT (3 instr, 64 bytes)	CMHS+UMINV (3 instr) or software	software	software	software (loop, ~50 cyc)
First-match scan (==)	BSCAN.W (1 instr, 4 cyc, 16 keys)	PCMPEQ+PMOVMSKB+TZCNT (3 instr)	CMEQ+UMINV (3 instr)	software	software	software (loop)
Block shift in node	BSHIFT (1 instr, 5 cyc, 64 bytes)	PALIGNR + REP MOVSB (2-3 instr)	EXT + EXT (NEON, 2 instr)	software	software	software (memmove loop)
Manhattan distance (2D)	MANHATTAN2 (1 instr, 1 cyc)	PSADBW (packed bytes only)	SABD+ADDP (2-3 instr)	software	software	ABS+ABS+ADD (3-5 instr)
Chebyshev distance	CHEBYSHEV2 (1 instr, 1 cyc)	software (~4 instr)	UMAX+UMAX (2 instr)	software	software	software (~5 instr)
Octile distance	OCTILE2 (1 instr, 3 cyc)	software (~6-10 instr)	software (~8 instr)	software	software	software (~10 instr)
Population count	CPOP (Zbb, 1 instr)	POPCNT (1 instr, since SSE4.2)	CNT (1 instr)	not in base	popcntb (1 instr)	CPOP if Zbb
Count leading zeros	CLZ (Zbb, 1 instr)	LZCNT (1 instr, BMI1)	CLZ (1 instr)	CLZ (1 instr, MIPS32r2)	cntlzw (1 instr)	CLZ if Zbb
Count trailing zeros	CTZ (Zbb, 1 instr)	TZCNT (1 instr, BMI1)	RBIT+CLZ (2 instr)	software	software	CTZ if Zbb
Find first set bit + index	CLZ-on-mask sequence	BSF (1 instr); TZCNT preferred	RBIT+CLZ (2 instr)	software	cntlzw on reverse	similar
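
The "find first key ≥ target" operation in the first row can be modelled as a parallel compare followed by a count-trailing-zeros on the resulting mask. The Python sketch below is a behavioural model only. The function name `bsrch_model` and the convention of returning the node length when no key matches are assumptions, not the documented BSRCH result format.

```python
from bisect import bisect_left

def bsrch_model(keys, target):
    """Index of the first key >= target in a sorted node, or len(keys) if none."""
    mask = 0
    for i, k in enumerate(keys):      # hardware would perform these compares in parallel
        if k >= target:
            mask |= 1 << i
    if mask == 0:
        return len(keys)
    # Isolate the lowest set bit, then convert it to a bit index (count trailing zeros).
    return (mask & -mask).bit_length() - 1

# A 16-key node, matching the table's "16 keys" figure for 32-bit keys.
node = list(range(0, 160, 10))
```

On sorted input the model agrees with `bisect_left`, which is the software baseline a B-tree node lookup would otherwise use.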

Notes on the Comparison

A few observations the tables make concrete:

General architecture

Register count. Wide-mode FireStorm has more general-purpose registers than any of these architectures (64 GPRs + 64 FPRs). The closest competitors are MIPS, ARMv8, and RV64GC at 31–32 GPRs. The high register count is a deliberate trade for code that wants to hold significant state in registers across an inner loop (audio synthesis, FIR filters, polyphony).
Immediate construction. ARMv8's MOVZ/MOVK approach is the clear ergonomic reference; FireStorm's LIZ/LIK directly borrows that pattern. The wide-mode imm23/imm14 immediates give FireStorm an extra advantage in the common case (no follow-up MOVK needed for many 32-bit values).
Branch range. FireStorm's wide-mode branch ranges (±32 KiB / ±16 MiB) are wider than RV64GC and similar to ARMv7's, smaller than ARMv8 / x86-64. For function-internal control flow this is more than enough; for long-range calls FireStorm uses the Xcrisp PIC family (JALPC, CALLM). The doubled range over a naive imm14/imm23 design comes from FireStorm's slot-indexed PC convention — wide-mode branch/jump targets are slot-aligned (4-byte), allowing the immediate to scale by 4 rather than 2.
Conditional execution. ARMv7's every-instruction predication remains the most aggressive of the modern architectures; ARM dropped it in v8 because it complicated out-of-order execution. FireStorm's Xcond predication on R-type ALU ops — including all of Xmath G2–G11 in wide mode via the PRED-EN bit — is a deliberate middle ground that predicates the operations benefiting most from it without the verification cost of predicating loads and stores.
Hardware threading. FireStorm is unusual in providing first-class hardware contexts with a dedicated instruction set (YIELD / HALT / NEW / RESUME / FREE). x86 SMT and ARM optional-SMT are different beasts (transparent thread-level parallelism on shared execution units); Xctx is closer to what older mainframes called "hardware coroutines."
Stack management. FireStorm's BSRAM hardware stacks are unique among modern CPUs — closest historical comparison is the 6502's fixed page 1 stack (which was also dedicated SRAM at a fixed location, though far smaller and not user-extensible).
Addressing modes. FireStorm's wide mode has more addressing-mode richness than any other modern RISC, approaching CISC-class flexibility (8 scale factors, auto-inc / auto-dec / pre-dec, PC-relative materialisation, compare-mem-branch). Standard RV64GC is the most addressing-mode-spartan of the modern set; FireStorm restores the kind of expressiveness 68000 programmers took for granted.
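
The MOVZ/MOVK pattern mentioned under immediate construction can be sketched as follows. The Python model implements the ARMv8 semantics (MOVZ writes one 16-bit chunk and zeroes the rest; MOVK replaces one chunk and keeps the rest). Mapping LIZ to the zeroing form and LIK to the keeping form, and the 16-bit chunk size, are assumptions here since this page does not give the LIZ/LIK encoding.

```python
MASK64 = (1 << 64) - 1
SHIFTS = (0, 16, 32, 48)

def movz(imm16, shift):
    """Write a 16-bit chunk at `shift`, zeroing every other bit."""
    return (imm16 & 0xFFFF) << shift

def movk(reg, imm16, shift):
    """Replace the 16-bit chunk at `shift`, keeping every other bit."""
    return (reg & ~(0xFFFF << shift) & MASK64) | ((imm16 & 0xFFFF) << shift)

def build_constant(value):
    """Build a 64-bit constant from 16-bit chunks; return (result, op_count)."""
    nonzero = [((value >> s) & 0xFFFF, s) for s in SHIFTS if (value >> s) & 0xFFFF]
    if not nonzero:
        return movz(0, 0), 1
    first_chunk, first_shift = nonzero[0]
    reg = movz(first_chunk, first_shift)
    for chunk, shift in nonzero[1:]:
        reg = movk(reg, chunk, shift)
    return reg, len(nonzero)
```

Under these assumptions no 64-bit value needs more than four operations, and values with zero chunks need fewer, which is consistent with the "≤4 instructions" figure quoted for LIZ/LIK.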

Math, DSP, and game operations

Fused multiply-add (integer). Mainstream since the mid-1990s (MIPS R5000 in 1996, ARMv5 in 1995). x86 is the holdout in its base ISA — even modern x86-64 has no single-instruction integer MAC except via SSE/AVX packed forms. FireStorm's MADD matches the well-established convention; nothing exotic here.
Saturating arithmetic. ARM has the cleanest mainstream support (SQADD/SQSUB/SSAT/USAT since ARMv6); x86 has packed forms via MMX/SSE (PADDS family); RV64GC base has nothing. FireStorm's ADDS family matches ARM's ergonomics. SHIFTSAT.H — the universal Q15→int16 audio output primitive — combines two operations in one and appears to be unique to FireStorm at the CPU level; ARM and x86 require a 2-instruction shift-then-saturate sequence.
FP reciprocal / 1/√x estimates. Standard since SSE (1999) and ARMv7 NEON. The Quake III Q_rsqrt integer-bit-twiddling hack that was once mandatory is now a single instruction everywhere. FireStorm's accuracy (~0.05%) is similar to SSE RCPSS (~5×10⁻⁴) and better than ARM FRECPE (~3×10⁻³). The precision-mode bit (.R suffix → Newton-Raphson refinement, ~10⁻⁹ error) is FireStorm-specific — most other architectures require explicit software refinement when better accuracy is needed.
Hardware sin / cos. The x87 has had FSIN/FCOS since 1987, but at ~50–100 cycles per instruction, and SSE/AVX never gained an equivalent. Modern x86-64 and ARM rely on library sin / cos (~100 cycles). FireStorm's 3-cycle hardware sin/cos is unusual among modern general-purpose CPUs; the closest comparisons are historical DSP chips (TI C6x has hardware sin via CORDIC) and game-console graphics co-processors (PS2 VU0/VU1).
BAM trigonometry. Genuinely distinctive. No mainstream CPU has hardware BAM (binary angle measure) trig. Some game consoles' graphics processors used BAM internally (PS1 GTE used 16-bit fixed-point angles; Sega Saturn VDP rotation), but always in the GPU side. Putting BAM trig in a general-purpose CPU is FireStorm-specific and reflects the retro / demoscene heritage of the project.
3D / 4D dot products. x86 SSE 4.1 introduced DPPS/DPPD (single instruction for 4-element FP dot product with optional mask) in 2007; FireStorm's DOT3/DOT4 are the same idea with similar latency (~6–8 cycles). ARM relies on FMLA chains, MIPS and RV64 on software sequences. 3D cross product (CROSS3) is unusual at the CPU level — virtually every architecture needs ~6 instructions of shuffles plus FMUL plus FSUB; only GPU shader ISAs typically have it as a primitive.
Quaternion math. No mainstream CPU has hardware QMUL or QROT. Skeletal animation engines universally implement these in software (~28+ instructions per operation). PlayStation 2 VU0/VU1 had quaternion-friendly opcodes (VOPMSUB for cross-product helpers) but not single-instruction multiplies. FireStorm is genuinely distinctive here.
Vector componentwise (VADD3, VSUB3, VMADD3). Functionally identical to packed-FP SIMD operations on x86 (ADDPS, SUBPS) and ARM (FADD with vector form). FireStorm's distinction is that they operate on explicitly-named register tuples rather than packed registers, eliminating the need for a separate vector register file.
CLAMP / SMOOTHSTEP / LERP. Single-instruction forms are uncommon at CPU level. x86 needs 2-3 instructions for CLAMP (MAXSS+MINSS); ARM same (FMINNM+FMAXNM). SMOOTHSTEP is a software macro everywhere except FireStorm and GPU shaders. LERP via single instruction is similarly distinctive — most CPUs use FMSUB+FMADD pairs.
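
The "perfect modular angle accumulation" point about BAM is worth seeing in code. In binary angle measure a full turn is a power of two, so adding angles is ordinary integer addition with wraparound and never accumulates rounding error. The Python sketch below assumes a 16-bit BAM width purely for illustration (this page does not state the width FSINBAM uses), and `bam_sin` is a software stand-in for the hardware lookup.

```python
import math

BAM_BITS = 16                 # assumed width, for illustration only
BAM_TURN = 1 << BAM_BITS      # one full turn in BAM units

def bam_add(angle, step):
    """Advance an angle; wraparound at a full turn is exact."""
    return (angle + step) & (BAM_TURN - 1)

def bam_sin(angle):
    """Software stand-in for a BAM sine: convert to radians and call math.sin."""
    return math.sin(2.0 * math.pi * angle / BAM_TURN)

def spin(step, n):
    """Accumulate `step` n times starting from angle 0."""
    angle = 0
    for _ in range(n):
        angle = bam_add(angle, step)
    return angle
```

A floating-point radian accumulator would need an explicit wraparound branch and would drift over long runs; the integer accumulator returns to exactly 0 after a whole number of turns.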

Database and search operations

Parallel sorted-array search. x86 SSE 4.2 (Nehalem, 2008) introduced PCMPESTRI specifically for string and short-array searches — 16 bytes per instruction, primarily byte-granularity. AVX-512 extends this to 64 bytes via mask+TZCNT sequences (still 3 instructions). FireStorm's BSRCH is the closest analog at the CPU level, with cleaner semantics (single-instruction, supports 8/16/32/64-bit keys, returns position directly) and wider data (full cache line per instruction). For workloads that look like "find first key ≥ X in sorted array" — B-tree nodes, sorted index buckets, ordered hash chains — this is the most directly competitive feature in FireStorm's instruction set.
Manhattan distance. x86 has had PSADBW (packed sum of absolute differences, byte data) since the SSE integer extensions (1999), originally for video motion estimation but useful for Manhattan distance on packed byte coordinates. ARM has SABD+ADDP. FireStorm's MANHATTAN2/3 work on standard integer registers (not packed) and are designed for pathfinding heuristics rather than image processing.
Octile distance. Genuinely FireStorm-specific. Used in 8-directional grid pathfinding (max(dx,dy) + (√2-1)*min(dx,dy)); software everywhere else, typically 8–10 instructions with a compare-branch.
Bit-manipulation primitives (CLZ / CTZ / CPOP). Standard across modern CPUs since ARMv5 (1995) and x86 SSE4.2 (2008). FireStorm inherits these from RISC-V's Zbb extension, which is implemented per §2 of ee_cpu. No FireStorm-specific addition needed.
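
The three grid-distance heuristics discussed above have simple closed forms, shown here as a Python reference. The octile formula is the one quoted in the text. Returning a float from `octile2` is an assumption for readability, since this page does not say which number format OCTILE2 produces.

```python
import math

def manhattan2(dx, dy):
    """4-directional grid distance: |dx| + |dy|."""
    return abs(dx) + abs(dy)

def chebyshev2(dx, dy):
    """8-directional distance with unit-cost diagonals: max(|dx|, |dy|)."""
    return max(abs(dx), abs(dy))

def octile2(dx, dy):
    """8-directional distance with sqrt(2)-cost diagonals."""
    a, b = abs(dx), abs(dy)
    return max(a, b) + (math.sqrt(2.0) - 1.0) * min(a, b)
```

In an A* search these would be called once per expanded node with the offset to the goal, which is why the per-call instruction count matters.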

Honest Trade-offs

FireStorm is not the right CPU for every job. We are honest about what it doesn't do well:

No bulk SIMD in v0.1. The RISC-V V opcodes have been reallocated to Xmath (see §10 of ee_cpu), so FireStorm does not currently implement standard V or wide-vector SIMD. For workloads dominated by data-parallel array math — bulk image filtering, large mass mixing into wide vectors, dense linear algebra — code runs scalar with Xmath fused-MAC acceleration rather than 4-wide or 8-wide parallel lanes. Xmath captures most of the practical win for game / audio / DSP workloads where the inner loop is per-voice or per-vertex math, but it is not a 1-to-1 V replacement. If a clear bulk data-parallel workload emerges, V could be added in v0.3+ at a different opcode allocation.
FPGA-bound clock speeds. Ant64 targets ~380 MHz on a mid-range GoWin GW5AST (matched to the BSRAM peak rate; pipeline balanced to fit). The ISA-level wins partially offset this — 25–40% fewer instructions at half the clock is competitive in many workloads — but raw single-thread throughput is not where FireStorm competes.
Custom toolchain bring-up required. RV64GC is supported by mainline GCC and LLVM out of the box; the FireStorm extensions require toolchain patches (in progress). Until the patches land, FireStorm code is either hand-written assembly or compiled from intrinsics-using C. A bare RV64GC compile works but leaves the FireStorm wins on the table.
Verification surface is large. Seven extensions interacting with each other (Xctx-with-Xlate state, Xcond-with-Xcrisp loads, Xstack-with-Xctx context, Xmath-with-Xcond predication, B-tree primitives with cache coherence) means more corner cases to verify. We are working through these systematically; the spec set has cross-document references to flag interaction points.
Memory-mapped I/O ordering for DMA is still being formalised. FireStorm reserves the entire 0xFxxx_xxxx region (256 MB) for hardware chip registers, accessed uncached and strongly ordered. The precise RVWMO interaction between CPU MMIO stores and DMA-to-MMIO writes (audio buffer to codec, network buffer to MAC, etc.) still needs a final spec; the suggested baseline is "DMA MMIO writes are strictly ordered relative to surrounding CPU MMIO stores."

Where It Shines

FireStorm is at its best on workloads that mix several of these characteristics:

Sustained real-time audio synthesis with many concurrent voices. Wide register file holds polyphony state in registers across the inner loop; LWPI streams input; SWPI streams output; MMWADD fuses bus-mix accumulation; MADD accelerates per-tap filters and SHIFTSAT.H handles output saturation in one instruction. The 128-voice synth count target on Ant64 is achievable because every primitive in the inner loop maps to one instruction.
Game CPU work: physics, AI, pathfinding, animation, draw-call setup. The Ant64 platform pairs FireStorm with dedicated drawing/audio chipset hardware that handles pixel-level work, so the CPU focuses on the math that sets up draws and runs game state. Xmath's VMADD3 makes physics integration one instruction; VNORM3 makes lighting normalisation 8 cycles; CROSS3 and DOT3 make 3D geometry primitives single-instruction; OCTILE2 makes the A* heuristic one instruction; FSINCOSBAM gives bit-exact rotation accumulation forever; QMUL accelerates per-bone skeletal animation. The combined effect is that the CPU runs game logic, AI, and animation comfortably while the chipset draws.
Database and indexed data structures. The B-tree primitives (BSRCH, BSCAN, BSHIFT) turn the dominant operation in any sorted index — find first key ≥ target — from a branchy multi-cycle scan into a single-instruction parallel search. For in-memory ordered indexes (B+ trees, sorted vectors, ordered hash buckets), this delivers ~10× lookup speedup. Workloads dominated by index access (relational query engines, key-value stores, sorted-set caches) benefit substantially.
Retro emulation cores where a CPU emulator runs dozens of guest CPUs as cooperative tasks. Xctx makes the guest-CPU dispatch nearly free; Xlate handles endian conversions for guests with foreign byte order; Xstack gives the emulator state its own BSRAM region without touching DRAM. For guest CPUs with their own math (early arcade boards, 16-bit consoles), Xmath's MADD and saturating arithmetic accelerate the guest ALU emulation.
Modular system software where many small functions call each other through indirection tables. Xcrisp PIC's CALLM is one instruction for vtable dispatch; Xstack's PUSH/POP is one instruction for prologue/epilogue. The whole system feels lighter on dispatch overhead.
Generated code and interpreters. The bytecode dispatch loop of an interpreter is fundamentally a switch statement plus a few state accesses. JMPXPC is one-instruction switch dispatch. Combined with Xctx-driven cooperative scheduling of multiple interpreters, this is the right architecture for language runtimes and bytecode VMs running as FireStorm applications — scripting engines embedded in games, custom interpreters, and similar workloads. (AntOS's own Luau runtime runs on DeMon, not FireStorm — see AntOS — so this strength is about interpreters a FireStorm application hosts, not the OS scripting layer.)
Demoscene effects. BAM-based rotozoomers, plasma effects with multiple summed BAM-indexed sine waves, BAM-phase wavetable oscillators, fast normalisation for raycasters — Xmath's BAM trigonometry is the natively-suited primitive for retro / demoscene rendering techniques.
Embedded creative tools — pixel art editors, music trackers, level designers. These mix UI dispatch, file I/O, and arithmetic in roughly equal measure. FireStorm has primitives for all three.
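
The audio inner loop described in the first item (multiply-accumulate per voice, then one shift-and-saturate on output) can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions: samples are int16, gains are Q15, the accumulator is wide enough not to overflow, and `shiftsat_q15` stands in for a shift-then-saturate step such as SHIFTSAT.H. All function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def sat16(x):
    """Saturate an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))

def shiftsat_q15(acc, shift=15):
    """Arithmetic shift right by `shift`, then saturate to int16."""
    return sat16(acc >> shift)

def mix_voices(samples, gains_q15):
    """Mix int16 samples with Q15 gains into one int16 output sample."""
    acc = 0
    for s, g in zip(samples, gains_q15):
        acc += s * g               # multiply-accumulate into a wide accumulator
    return shiftsat_q15(acc)
```

Two loud voices summed at near-unity gain exceed the int16 range, and the final step clips them to the rail instead of letting the value wrap, which is the behaviour an audio output stage needs.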

For pure-throughput numerical computing (climate modeling, deep learning training), FireStorm is not the answer — those workloads want bulk SIMD vectors, multi-GHz clocks, and high memory bandwidth, none of which FireStorm v0.1 prioritises. The Ant64 platform pairs FireStorm with the DeMon (ESP32-P4) and Pulse (ESP32-P4) supervisors for tasks where FireStorm isn't the right tool.