FireStorm Execution Engine (EE) — Instruction Set Architecture

Document version: 0.1 (draft)
Status: Initial design capture
Target: GoWin GW5AST FPGA (Ant64 platform)

1. Overview

FireStorm is the soft CPU implemented in the GoWin GW5AST FPGA on the Ant64 platform. It is referred to throughout the Ant64 documentation as the FireStorm Execution Engine, or simply the EE. This document is the detailed instruction-set reference for it; see the EE engineering overview for the higher-level picture. FireStorm is binary-compatible with the RV64GC baseline of RISC-V, so any standard RISC-V toolchain (gcc, llvm/clang, binutils, rustc) targeting riscv64-unknown-elf or riscv64-linux-gnu will produce code that runs on FireStorm without modification.

On top of the base ISA, FireStorm adds two layers of extension:

CRISP custom extensions. A set of CRISP-influenced instructions (auto-increment load/store, load-op and op-store fusion, compare-mem-branch, block memory primitives) encoded in the standard RISC-V custom-0/1/2/3 opcode space. These are available to all code, regardless of which memory it executes from, and improve density and performance of compiler-generated C code.
Wide register file (36-bit memory mode). FireStorm's hot-path memory is external 36-bit SRAM (the Ant64 platform fits a single 1M×36 bank, ~4.5 MB; the 36-bit width is a property of the SRAM chips themselves, supporting either four 9-bit bytes per word or 32 data bits plus 4 parity/tag bits, depending on the system's choice). FireStorm uses all 36 bits as data: 32 bits of standard RV64 instruction encoding plus 4 extra bits interpreted as a register-extension prefix that doubles the architectural register file from 32 to 64 GPRs and 32 to 64 FPRs. This is transparent to the rest of the system: code fetched from DDR3 executes as plain RV64GC; code fetched from 36-bit SRAM executes as RV64GC with extended register indices. Mode is determined entirely by the physical fetch address — there is no MSTATUS flag, no mode switch instruction, and no ABI break.

The combination gives a single ISA story (RV64GC + custom extension), one compiler target, one ABI, and a smooth performance gradient: ordinary code runs unmodified; hot code placed in 36-bit SRAM gains register-pressure relief without recompilation against a different ISA.

2. Relationship to Standard RISC-V

Layer	Status
RV64I	Base integer ISA, fully implemented
M	Integer multiply/divide, fully implemented
A	Atomics, fully implemented
F	Single-precision float, fully implemented
D	Double-precision float, fully implemented
C	16-bit compressed instructions, fully implemented
Zicsr	Control/status registers, fully implemented
Zifencei	Instruction-fetch fence, fully implemented
Zba, Zbb, Zbs	Bit-manipulation subsets, implemented
Zcmp, Zcmt	Code-size reduction (push/pop multiple), implemented
V	Vector extension — not implemented; OP-V opcode (`0x57`) reallocated to Xmath (§10)
Xcond	FireStorm conditional execution extension (separate document)
Xcrisp	FireStorm custom CRISP extension (§11)
Xctx	FireStorm hardware context switching extension (separate document)
Xlate	FireStorm memory translator extension (separate document)
Xstack	FireStorm hardware stack extension (separate document)
Xwide	FireStorm wide-register-file extension, active only in 36-bit memory (§7, §8)
Examples	See FireStorm Performance Examples for worked comparisons

Anything not listed is unimplemented. The Xwide extension is not visible in misa because it is not a mode the program selects — it is a property of the memory the instruction was fetched from.

The compiler target triple is riscv64-firestorm-elf. A vanilla riscv64-unknown-elf build also works; it simply will not emit Xcrisp instructions and will not place code in the wide section.

2.1 Compatibility with Standard RISC-V

FireStorm preserves RISC-V compatibility at two distinct levels, depending on which fetch mode is in play:

Narrow mode is object-code compatible with RV64GC. A binary produced by a vanilla riscv64-unknown-elf toolchain runs unchanged on FireStorm when placed in the DDR3 narrow region. There is no need to recompile, relink, or adapt — the standard RV64GC instruction encodings decode identically, and the standard ABI (lp64d) is preserved bit-for-bit. The custom-0/1/2/3 opcode slots used by Xcrisp are unused in vanilla RV64GC, so existing code never collides with them. Pre-built libraries, OS kernels, and prebuilt distributions can be dropped into a FireStorm system without modification.
Wide mode is source-code compatible with RV64GC. Existing RV64GC C, C++, Rust, or assembly source compiles into wide-mode sections (.text.wide) using the FireStorm toolchain, and the result behaves the same way semantically — same ABI, same register names for x0–x31, same instruction mnemonics. The wide-mode code can additionally name x32–x63, use Xcond predication, exploit the 23/14-bit extended immediates, and use LIZ/LIK and the Xcrisp PIC family. Object code is not portable between narrow and wide modes (a 36-bit slot's extension nibble does not exist in narrow DDR3), but source code is — so existing RISC-V libraries can be recompiled into either mode as needed.

The practical consequence: a FireStorm system can host an existing RV64GC ecosystem unchanged in narrow mode, while applications recompile selectively into wide mode for the parts that benefit. FireStorm is additive over standard RV64 — extensions are opt-in, never compulsory, and never break compatibility with code that doesn't use them.

3. Memory Architecture and Mode Selection

FireStorm distinguishes two fetch modes by physical address range:

Memory	Width	Fetch mode	Notes
DDR3 (external)	32 bits / 4 bytes	Narrow (standard RV64GC)	Main code and data store
36-bit SRAM (external)	36 bits / 4 bytes + 4 extra	Wide (RV64GC + Xwide)	Hot-path code only — Harvard restriction (§5.1)
Scratchpad BSRAM	32 bits	Narrow	Data only; single-cycle (§5.3)
Hardware chip registers	32 bits	Narrow	`0xFxxx_xxxx` MMIO region (§5.4); not normally executed from

Mode is fixed by the physical address of the instruction fetch:

An instruction fetched from a 36-bit-SRAM-mapped address is decoded as wide: the standard 32-bit RV64 encoding occupies bits [31:0] of the 36-bit word, and bits [35:32] form the extension nibble (§7).
An instruction fetched from any other address is decoded as narrow: the 4 extra bits do not exist and the decoder behaves as a standard RV64GC implementation.
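The address-based mode rule can be sketched in C as follows. This is an illustrative model, not the hardware decoder: the function and constant names are hypothetical, and the SRAM window is assumed to be the 0x8000_0000 to 0xBFFF_FFFF range given in the §3.1 comparison table.

```c
#include <stdint.h>

// Fetch modes from §3; the enum names are illustrative only.
typedef enum { FETCH_NARROW = 0, FETCH_WIDE = 1 } fetch_mode_t;

// Assumed 36-bit SRAM window, taken from the §3.1 comparison table.
#define SRAM36_BASE 0x80000000ULL
#define SRAM36_LAST 0xBFFFFFFFULL

// Wide if and only if the physical fetch address lies in the SRAM window;
// every other address (DDR3, scratchpad, MMIO) decodes as narrow.
static fetch_mode_t fetch_mode_for(uint64_t phys_pc)
{
    return (phys_pc >= SRAM36_BASE && phys_pc <= SRAM36_LAST)
               ? FETCH_WIDE
               : FETCH_NARROW;
}
```

Because the result depends only on the fetch address, no mode bit needs to be saved or restored across calls and traps.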

A branch, jump, or call that crosses between memory regions changes mode transparently on the next fetch. The architectural state (register file, CSRs, PC) is identical in both modes; only the interpretation of the next fetched instruction word differs. There is no pipeline stall and no architecturally visible effect beyond which registers the next instruction can name.

Implications:

A wide function may call a narrow function and vice versa with no special prologue.
A trap from wide code into a narrow trap handler works without any mode bookkeeping in mstatus.
Self-modifying code that copies an instruction from SRAM to DDR3 (or vice versa) will reinterpret the instruction's extension bits: this is not a binary-compatible move and is the programmer's responsibility.

Harvard restriction on 36-bit SRAM. The 36-bit SRAM region holds wide-mode code exclusively. Ordinary data loads and stores to this region trap; the only way to access SRAM contents is M-mode privileged code-loading paths. This means wide-mode programs operate in a Harvard configuration: code in SRAM, data in DDR3 (cached) or scratchpad BSRAM (uncached, single-cycle). The data memory subsystem is described in §5; the trap is documented at §5.1.

3.1 Narrow vs Wide Mode Comparison

The table below summarises every architectural difference between the two modes. Anything not listed is identical.

Physical and Encoding

Property	Narrow Mode	Wide Mode
Memory region	DDR3 (`0x0000_0000`–`0x7FFF_FFFF`)	36-bit SRAM (`0x8000_0000`–`0xBFFF_FFFF`)
Memory width	32 bits per fetch slot	36 bits per fetch slot (32 + 4 nibble)
Mode determined by	Physical address of instruction fetch	Physical address of instruction fetch
Extension nibble	Does not exist	bits [35:32], present in every 36-bit slot (§7.1)
Data accessible from this code	All data regions	All data regions except 36-bit SRAM itself (Harvard, §5.1)

Register File

Property	Narrow Mode	Wide Mode
Integer GPRs nameable	x0–x31	x0–x63
Floating-point regs nameable	f0–f31	f0–f63
Extra registers (x32–x63, f32–f63)	not addressable	caller-saved scratch (§13.1)
ABI	lp64d unchanged	lp64d + caller-saved extended bank
Wide-dirty tracking	n/a	per-hart bit (§6.3)

Immediate Widths

Instruction class	Narrow imm	Wide imm	Wide range
LUI / AUIPC	20 bits	23 bits	upper bits [34:12]
JAL	20 bits (±1 MB)	23 bits	±16 MB (×4 slot scaling, §7.3.2)
ADDI / ANDI / ORI / XORI / SLTI / SLTIU / ADDIW / JALR	12 bits (±2 KB)	14 bits	±8 KB
Loads (LB, LH, LW, LD, LBU, LHU, LWU)	12 bits	14 bits	±8 KB
Stores (SB, SH, SW, SD)	12 bits	14 bits	±8 KB
Branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)	12 bits (±4 KB)	14 bits	±32 KB (×4 slot scaling, §7.3.2)
Shift amount (SLLI, SRLI, SRAI, SLLIW, SRLIW, SRAIW)	6 bits (64-bit) / 5 bits (32-bit)	unchanged	(bounded by register width)
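The ranges in the table follow from one formula: a signed immediate of n bits, counted in units of s bytes, reaches about 2^(n-1) × s bytes either side of zero. A small sketch to check the arithmetic is given below. The helper name is hypothetical; the ×2 unit for narrow branches and jumps is the standard RISC-V 2-byte offset granularity, and the ×4 unit for wide mode is the slot scaling quoted in the table.

```c
#include <stdint.h>

// Reach in bytes, on the negative side of zero, of a signed immediate with
// `bits` bits whose unit is `scale` bytes. The positive reach is one unit less.
static uint64_t imm_reach_bytes(unsigned bits, unsigned scale)
{
    return ((uint64_t)1 << (bits - 1)) * scale;
}
```

For example, imm_reach_bytes(14, 4) gives 32768, matching the ±32 KB wide branch range, and imm_reach_bytes(23, 4) gives 16777216, matching the ±16 MB wide JAL range.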

Instructions Available

Instruction family	Narrow	Wide	Notes
Standard RV64GC (RV64I + M + A + F + D + C + Zicsr + Zifencei)	✓	✓	Identical decode
Xcrisp custom-0/1/2/3 (auto-inc, fused, BMC/BMS, compare-mem-branch, DMA)	✓	✓	Custom opcodes 0x0B, 0x2B, 0x5B, 0x7B
Xstack (PUSH, POP, POPRET, USWAP, PEEK, POKE)	✓	✓	Mode-agnostic
Xlate (per-register translators)	✓	✓	Translators are register state, not instructions
Xctx (YIELD, HALT, NEW, RESUME, etc.)	✓	✓	Mode-agnostic
Xcond predicated R-type	—	✓	Uses PRED-EN bit in R-type nibble
Xcrisp PIC (LDPC, LWPC, LAPC, JALPC, JALXPC, CALLM, JMPM)	—	✓	`0x7F` escape (§7.5)
Xcrisp X-type indexed loads (LBX, LHX, LWX, LDX, LBUX, LHUX, LWUX)	—	✓	`0x7F` escape
LIZ / LIK 16-bit immediate construction	—	✓	`0x7F` escape, funct3=111 (§7.6)
Wide-mode immediate extensions (imm14, imm23)	—	✓	Uses formerly-spare nibble bits (§7.3)

Toolchain Markers

Property	Narrow Mode	Wide Mode
Compatibility with standard RV64GC	Object-code compatible — unmodified RV64GC binaries run unchanged (§2.1)	Source-code compatible — RV64GC source recompiles for wide; object code is not portable between modes (§2.1)
Default section names	`.text`, `.text.crisp`	`.text.wide`, `.text.wide.*`
ELF section flag	(none — default)	`SHF_FIRESTORM_WIDE` (FireStorm-specific bit)
Mode directive (FireStorm-friendly)	`.fsmode narrow`	`.fsmode wide`
Mode directive (RISC-V canonical)	`.option arch -xwide`	`.option arch +xwide`
Target feature flag	(none required)	`+xwide` (`+xfirestorm` implies it)
Function attribute	(default)	`__attribute__((target("xwide")))`
Linker placement requirement	Section must land at address < `0x8000_0000`	Section must land in 36-bit SRAM range

Cross-mode calls (narrow→wide or wide→narrow) work transparently — the CPU switches mode on the next fetch based on the target address. There is no special call instruction, no mode-switch overhead beyond what a normal call would cost, and no ABI difference: arguments and return values follow lp64d in both modes.

4. Instruction Fetch and Prefetch Buffers

FireStorm uses a multi-buffer rolling prefetch system for instruction fetch in place of a traditional instruction cache. Data accesses take a separate path: general DDR3 data goes through a small write-through D-cache (§5.2), while the most frequent data structures (Xstack hardware stacks, Xctx context state, BSRAM-resident sample tables) bypass DRAM entirely via dedicated BSRAM banks. This section describes the fetch system; data memory is covered in §5, and per-instruction memory semantics are handled in the relevant extension docs.

The design point is simplicity, predictability, and silicon efficiency over peak hit-rate. FireStorm targets workloads — audio synthesis, retro emulation, OS event loops, interpreter dispatch — where the working set of code at any moment is small enough to fit in 4–8 small buffers. The compelling properties:

No tag RAM, no associativity logic. Hit detection is a handful of parallel range comparisons.
Predictable latency. A buffer hit is one cycle; a miss is a deterministic DRAM read.
Concurrent refill. Multiple buffers fill in parallel into independent BSRAMs while the CPU executes from another.
Per-buffer pinning. Trap-handler latency is one cycle of pipeline drain plus zero fetch stall when the trap vector is pinned.
Simple DMA coherence. DMA writes auto-invalidate any overlapping buffer; no cache flush dance.
Mode-aware. Narrow- and wide-mode buffers coexist in the pool; mode-switching alternation pays only LRU eviction cost.

4.1 Buffer Configuration

Per-variant sizing:

FireStorm variant	Buffers	Per-buffer	BSRAM per buffer	Total fetch BSRAM
All models (GW5AST-138)	8	2 KB	1 × 18-Kib block (512×36)	16 KB

Each buffer is implemented as one GoWin 18-Kib BSRAM block configured as 512 × 36 bits, holding 512 wide-mode 32-bit instructions (the 4 extension-nibble bits and the 32-bit RV64 word together fit one 36-bit slot). In narrow mode the same BSRAM holds 512 standard 32-bit instructions, or 1024 RVC 16-bit instructions when the buffer is used in RVC-decode mode.

Each BSRAM block is independently addressed with its own port. All buffers can read and refill concurrently; there is no port contention between buffers. The CSR mxbuf (§4.8) reports the actual variant configuration.

4.2 Buffer State

Each buffer has the following architectural state:

Field	Bits	Meaning
`valid`	1	Buffer contains valid instructions
`base_pc`	64	Physical address of the first instruction in the buffer
`length`	12	Number of valid instructions (0..1024)
`mode`	1	Narrow (`0`) or wide (`1`)
`pinned`	1	Excluded from LRU eviction when set
`lru_rank`	log₂(N)	LRU ranking among non-pinned buffers (3 bits for 8 buffers)

Total per-buffer state: ~85 bits. Held in a small register file alongside the BSRAM array. The state is M-mode-visible via mxbuf_state_N CSRs (§4.8).

4.3 Fetch Behaviour

On each instruction fetch at PC X, all valid buffers compare in parallel. A buffer hits if:

buffer[i].valid
AND buffer[i].mode == current_mode
AND buffer[i].base_pc ≤ X < buffer[i].base_pc + buffer[i].length × inst_width
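A minimal software model of this hit test is sketched below. The struct mirrors the fields of §4.2 with widths rounded up to convenient C types; the type and function names are hypothetical, and inst_width is passed in explicitly (4 bytes is assumed for non-RVC code).

```c
#include <stdbool.h>
#include <stdint.h>

// Per-buffer state from the §4.2 table, widened to convenient C types.
typedef struct {
    bool     valid;
    uint64_t base_pc;
    uint16_t length;    // number of valid instructions
    uint8_t  mode;      // 0 = narrow, 1 = wide
    bool     pinned;
    uint8_t  lru_rank;
} fetch_buf_t;

// True if a fetch at `pc` in `mode` hits buffer `b`.
// `inst_width` is the byte size of one instruction slot.
static bool buf_hit(const fetch_buf_t *b, uint64_t pc, uint8_t mode,
                    uint64_t inst_width)
{
    return b->valid
        && b->mode == mode
        && pc >= b->base_pc
        && pc < b->base_pc + (uint64_t)b->length * inst_width;
}
```

In hardware all buffers evaluate this condition in parallel; a software model would simply loop over the pool.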

Hit Path

The instruction is read from the matching buffer's BSRAM in one cycle. The hardware updates the buffer's lru_rank to "most recently used." Fetch throughput is one instruction per cycle from a hit buffer; aggressive implementations with 72-bit-wide BSRAM ports can deliver two instructions per cycle.

Miss Path — Buffer Allocation

A miss selects a victim buffer to refill. Selection priority:

The least-recently-used non-pinned buffer with valid = 0 (if any).
The least-recently-used non-pinned valid buffer.
If all buffers are pinned, the fetch traps with cause BUFFER_EXHAUSTED (§4.9). M-mode kernels are responsible for ensuring at least one buffer is available for eviction; the hardware does not silently violate pinning.
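The selection priority above might be modelled as in the following sketch. The names are hypothetical, and the encoding of lru_rank (a larger value means less recently used) is an assumption; the document does not specify it.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUFFERS 8

typedef struct {
    bool    valid;
    bool    pinned;
    uint8_t lru_rank;   // assumption: larger value = less recently used
} buf_meta_t;

// Returns the victim index, or -1 when every buffer is pinned
// (the case that raises BUFFER_EXHAUSTED).
static int pick_victim(const buf_meta_t bufs[NUM_BUFFERS])
{
    int best_invalid = -1, best_valid = -1;
    for (int i = 0; i < NUM_BUFFERS; i++) {
        if (bufs[i].pinned)
            continue;                       // pinned buffers are never evicted
        int *best = bufs[i].valid ? &best_valid : &best_invalid;
        if (*best < 0 || bufs[i].lru_rank > bufs[*best].lru_rank)
            *best = i;
    }
    // Prefer an invalid buffer; otherwise evict the coldest valid one.
    return best_invalid >= 0 ? best_invalid : best_valid;
}
```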

Miss Path — Fill Geometry

The new buffer's base_pc is computed as (X − BUFFER_SIZE / 3), rounded down to instruction-width alignment. This places the missing PC roughly 1/3 from the buffer start, giving:

2/3 of the buffer for forward execution — typical for loop bodies and function continuations.
1/3 of the buffer for backward branches: covers loops up to BUFFER_SIZE/3 ≈ 683 bytes (~170 four-byte instruction slots in either mode) without needing a second buffer.
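As a worked instance of the fill geometry, the sketch below computes base_pc for a given miss address. The function name is hypothetical, integer division is assumed for BUFFER_SIZE / 3, and inst_width is assumed to be a power of two.

```c
#include <stdint.h>

#define BUFFER_SIZE 2048u   // bytes per buffer, per §4.1

// base_pc = (X - BUFFER_SIZE / 3), rounded down to inst_width alignment.
// Assumes inst_width is a power of two and miss_pc >= BUFFER_SIZE / 3.
static uint64_t fill_base_pc(uint64_t miss_pc, uint64_t inst_width)
{
    uint64_t raw = miss_pc - BUFFER_SIZE / 3;   // integer division gives 682
    return raw & ~(inst_width - 1);             // round down to slot alignment
}
```

For a miss at X = 0x1000 with 4-byte slots this gives base_pc = 0xD54, leaving 684 bytes behind X for backward branches and 1364 bytes ahead of it for forward execution.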

The DRAM fill streams BUFFER_SIZE bytes starting at base_pc, critical-word-first: the requested instruction at PC X streams first to minimise time-to-first-instruction, then the surrounding range fills around it. The CPU stalls only until the requested instruction is written; subsequent instructions become available as the fill progresses.

Miss latency ≈ DRAM_first_access_cycles + 1 / DRAM_burst_rate for the requested word, plus the rest of the fill happening in parallel with execution. On a DDR3-1600 system at 200 MHz CPU clock, this is roughly 15–25 cycles to first instruction, then full speed.

Backward Branches

A backward branch from PC X to PC Y (Y < X) hits in the current buffer if Y ≥ base_pc. For a freshly-allocated buffer with X at 1/3 from start, backward reach is BUFFER_SIZE / 3. As execution proceeds forward through the buffer, the back-margin grows; at the buffer's end, backward reach is the full BUFFER_SIZE.

For typical inner loops well under 2 KB — voice render loops (~30–100 instructions), Z-machine dispatch (~50), retro-emulator handlers (~5–30) — backward branches within the loop pay zero fetch latency. This is the dominant case the buffer system optimises for.

Forward Jumps

A forward jump to PC Y (Y > X) hits if Y < base_pc + length × inst_width. Short forward jumps within the loop body hit; long forward jumps (calls to unrelated code) typically miss and allocate a new buffer for the target region.

4.4 Concurrent Refill and Speculative Prefetch

With per-buffer BSRAM ports, multiple buffers may fill from DRAM in parallel — the architecturally-visible bottleneck is DRAM bandwidth, not BSRAM port contention. Two fill triggers:

Demand fill (§4.3 miss path): a fetch miss starts a fill into the chosen victim. Highest priority.
Speculative fill: a buffer is allocated and filled for a PC the CPU is predicted to fetch soon. Two sub-triggers:
- Branch predictor: when a forward branch is predicted taken to a target outside any current buffer, hardware may allocate and pre-fill the target buffer.
- Software hint: M-mode code writes a PC to mxbuf_prefetch (§4.8), allocating a buffer and starting a fill.

Multiple speculative fills can proceed in parallel, limited by DRAM bandwidth. The DRAM controller serialises requests; demand fills preempt speculative fills.

If a speculative prediction is wrong, the buffer simply ages out via LRU. No architectural state is corrupted; the cost is wasted DRAM bandwidth on the speculative fill.

4.5 Pinning

The pinned bit excludes a buffer from LRU eviction. Pinned buffers remain in the pool until explicitly unpinned; M-mode software manages pinning policy via the mxbuf_pin CSR.

Suggested pinning uses:

Trap vector: pin a buffer to the trap-handler entry region for deterministic ISR fetch latency — a critical property for hard real-time audio interrupts.
Audio inner loop: pin a buffer to the most critical real-time DSP path; eliminates the worst-case fetch stall from the inner loop.
Xctx scheduler core: pin a buffer to the context-switch path so YIELD/HALT/FREE never trigger a fetch stall.

User code cannot pin. Privileged kernels decide pinning policy; this prevents user code from monopolising the buffer pool.

4.6 Mode Switching

Each buffer carries its mode tag. When the CPU transitions modes — typically through a cross-memory call (§12.1) — fetch looks for a buffer matching the new mode covering the target PC. Behaviours:

Both-mode hit: both buffers (narrow and wide) covering the target region exist; fetch picks the matching-mode one. Mode-alternating code (e.g., a wide-mode hot path that occasionally calls into narrow-mode utility code) pays only the mode-tag-compare cost.
New-mode miss: no buffer covers the target in the new mode. Hardware allocates a fresh buffer in the new mode.

The pool is mode-mixed: there is no fixed "narrow buffers" or "wide buffers" partition. LRU naturally evicts whichever mode's buffer is coldest.

4.7 DMA Coherence and Self-Modifying Code

Automatic DMA invalidation. When a DMA write (DMACPY or DMASET; Xcrisp §5.5.2) completes to physical address range [A, A+L), the hardware compares this range against each valid buffer's [base_pc, base_pc + length × inst_width). Any non-pinned buffer overlapping the DMA write is invalidated (valid set to 0); pinned buffers are exempt, as noted in the recommendations below. The hardware cost is N parallel range-overlap comparators per DMA write, which is trivial for N ≤ 8.

Manual flush. Software may invalidate buffers explicitly via M-mode CSRs:

mxbuf_flush_addr: write a PC; all buffers overlapping that address are invalidated.
mxbuf_flush: write a buffer-mask; named buffers are invalidated.
mxbuf_flush_all: write any value; all non-pinned buffers invalidated.

fence.i semantics. The standard RV64 fence.i instruction is interpreted as "flush all non-pinned buffers." Existing code using fence.i after self-modifying writes works correctly without modification.

Self-modifying code recommendations:

If the modification is large (>1 KB), use DMACPY/DMASET — auto-invalidation handles coherence.
If the modification is small, use standard stores, then issue fence.i or mxbuf_flush_addr.
Pinned buffers are not auto-invalidated by DMA. Code that intentionally protects a buffer from automatic invalidation must unpin before SMC writes to that region.
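The automatic DMA invalidation described in this section, including the exemption for pinned buffers, can be sketched as a range-overlap test over the buffer pool. The struct and function names are hypothetical, and a non-zero DMA length is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     pinned;
    uint64_t base_pc;
    uint64_t span_bytes;   // length × inst_width
} buf_range_t;

// Half-open ranges [a, a+alen) and [b, b+blen) overlap if and only if
// each one starts before the other ends.
static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

// Invalidate every valid, non-pinned buffer overlapping the DMA write.
// Returns how many buffers were invalidated.
static int dma_invalidate(buf_range_t *bufs, int n,
                          uint64_t dma_addr, uint64_t dma_len)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (bufs[i].valid && !bufs[i].pinned &&
            ranges_overlap(dma_addr, dma_len,
                           bufs[i].base_pc, bufs[i].span_bytes)) {
            bufs[i].valid = false;
            count++;
        }
    }
    return count;
}
```

The loop stands in for the N parallel comparators; a pinned buffer that overlaps the write keeps its stale contents until software unpins or flushes it.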

4.8 CSR Allocation

CSR	Address (suggested)	Privilege	Description
`mxbuf`	`0xFC6`	MRO	Detection, buffer count, size
`mxbuf_pin`	`0xBD0`	MRW	Bitmap: 1-bit per buffer indicating pinned status
`mxbuf_flush`	`0xBD1`	MRW	Write buffer-mask to invalidate those buffers
`mxbuf_flush_addr`	`0xBD2`	MRW	Write a PC; all buffers overlapping that address invalidated
`mxbuf_flush_all`	`0xBD3`	MRW	Write any value to flush all non-pinned buffers
`mxbuf_prefetch`	`0xBD4`	MRW	Write a PC; allocate a buffer and speculatively fill
`mxbuf_state_0`	`0xBE0`	MRO	State of buffer 0 (composite: base_pc/length/mode/pinned/valid)
`mxbuf_state_N`	`0xBE0`+N	MRO	State of buffer N (N = 0..7)

Bit layout of mxbuf:

Bits	Field	Meaning
`[0]`	PRESENT	1 if multi-buffer prefetch implemented
`[7:1]`	VERSION	Version (1 = v0.1)
`[11:8]`	NUM_BUFFERS	Number of buffers (4 or 8)
`[19:12]`	BUFFER_SIZE_LOG2	log₂ of buffer size in bytes (11 = 2 KB)
`[20]`	HAS_AUTO_DMA_INVAL	1 if DMA writes auto-invalidate overlapping buffers
`[21]`	HAS_SPECULATIVE_FILL	1 if `mxbuf_prefetch` is implemented
`[22]`	HAS_BRANCH_PREDICT_PREFETCH	1 if the branch predictor triggers speculative fills
`[63:23]`	reserved	—
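A decoder for this layout could look like the sketch below. It operates on a raw 64-bit value that is assumed to have been read from the mxbuf CSR already; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     present;
    unsigned version;
    unsigned num_buffers;
    uint64_t buffer_bytes;
    bool     auto_dma_inval;
    bool     speculative_fill;
    bool     branch_predict_prefetch;
} mxbuf_info_t;

// Split a raw mxbuf value into its fields, following the bit layout table.
static mxbuf_info_t mxbuf_decode(uint64_t raw)
{
    mxbuf_info_t info;
    unsigned size_log2 = (unsigned)((raw >> 12) & 0xFF);

    info.present                 = (raw & 1) != 0;
    info.version                 = (unsigned)((raw >> 1) & 0x7F);
    info.num_buffers             = (unsigned)((raw >> 8) & 0xF);
    // Guard the shift: a log2 value of 64 or more cannot be represented.
    info.buffer_bytes            = size_log2 < 64 ? (1ULL << size_log2) : 0;
    info.auto_dma_inval          = ((raw >> 20) & 1) != 0;
    info.speculative_fill        = ((raw >> 21) & 1) != 0;
    info.branch_predict_prefetch = ((raw >> 22) & 1) != 0;
    return info;
}
```

For instance, the value 0x10B803 decodes as present, version 1, 8 buffers of 2048 bytes, with DMA auto-invalidation and no speculative fill.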

4.9 Trap Causes

Cause	Mnemonic	Trigger
`48`	BUFFER_EXHAUSTED	Fetch miss with all buffers pinned (no eviction victim)

Cause number is suggested; final assignment coordinates with the other FireStorm trap causes (Xstack 24–31, Xcond 32–34, Xctx 40–42, Xlate 32–34 are currently allocated).

4.10 Reset State

On reset, all buffers are invalid (valid = 0, pinned = 0). The first instruction fetch causes a miss, allocates buffer 0, and begins filling at the reset vector. Reset behaviour is not observable to U-mode code; M-mode initialisation completes long before user code runs.

4.11 Why Not a Cache?

The prefetch buffer system is a deliberate alternative to a conventional I-cache. The trade-offs are:

Advantages of the buffer system:

Smaller silicon footprint. A 16 KB cache with 4-way associativity requires ~2 KB of tag RAM, replacement-state RAM, and substantial cache-coherence logic. The buffer system needs ~85 bits of state per buffer and a few range comparators.
Predictable timing. Cache replacement is data-dependent; buffer hit/miss depends only on whether the target PC is in one of N visible ranges.
Simpler verification. Cache state machines are complex around traps, page faults, and DMA snoops. The buffer system has a small finite state space.
Per-buffer pinning gives deterministic real-time guarantees that caches cannot match without dedicated SRAM regions.

Disadvantages versus a cache:

Lower hit rate for code with poor locality (random function-pointer dispatch through a large code base, JIT-compiled hot regions scattered across memory). For these workloads, a cache wins.
Less efficient use of memory bandwidth. A cache loads exactly the lines it needs; a buffer fill always loads BUFFER_SIZE bytes around the requested PC, some of which may never be executed.

For Ant64's workload mix (audio, retro emulation, OS event loops, modular software), the trade-off favours the buffer system. Variants with different workload targets — large-scale general-purpose computing, code generators, dynamic-loader-heavy applications — would benefit from a conventional cache instead.

5. Data Memory Architecture

FireStorm provides a layered data memory subsystem balancing predictability (BSRAM-resident hot data) and convenience (cached DRAM for general-purpose access). There are four distinct data memory regions, each with its own caching, latency, and use-case profile.

5.1 Harvard Restriction on 36-bit SRAM

The 36-bit SRAM region (0x8000_0000–0xBFFF_FFFF) holds wide-mode instructions exclusively. Ordinary data loads and stores to this range trap with cause DATA_IN_CODE_MEMORY (cause 49). This makes wide-mode programs Harvard-architecture: code lives in SRAM, data lives in DDR3 (cached) or scratchpad BSRAM.

M-Mode Exceptions

Two M-mode-only paths exist for code-loading scenarios:

Code-deposit stores. M-mode code may write to the SRAM range via standard stores to deposit wide-mode instructions (loading a JIT-compiled function, paging in code from external storage, copying a relocated function image into the wide code region). Each written value has its extension nibble taken from bits [35:32]. Writes are coherent with the prefetch buffer system: any overlapping prefetch buffer is auto-invalidated (§4.7).
Debug inspection. M-mode debug code may read SRAM via the mxsram_read CSR. Ordinary load instructions from M-mode still trap; the explicit CSR-mediated path prevents accidental code-as-data reads in normal kernel code.

Rationale

Simpler memory subsystem. With no data accesses to SRAM, the D-cache does not need to cover that range. Address-overlap logic for DMA invalidation is simpler. The SRAM's wide-port BSRAM design is dedicated to instruction fetch without contention.
Loud failure on bad pointers. Software that confuses code pointers with data pointers traps at the first access rather than returning garbage values or silently corrupting executable code.
No mode confusion for data. Data in SRAM would either carry an extension nibble (raising the question of what to do with bits [35:32] on read) or not (wasting the 4 extra bits per word). The Harvard restriction sidesteps the question.

Programming Model

Wide-mode programs look identical to narrow-mode programs from the data-access standpoint. Code is placed in SRAM by the OS at program load time (or paged in at runtime via the M-mode code-deposit path); data structures are allocated in DDR3 (cached) or scratchpad (uncached, fast). Pointers between code and data are ordinary 32-bit-address-space pointers. They cannot point into SRAM as data, but the 64-register file gives plenty of scratch space without needing to.

5.2 DDR3 Main Memory and D-Cache

The DDR3 range (0x0000_0000–0x7FFF_FFFF) is the primary working memory for general data. Capacity is platform-defined, up to the 2 GB that this address window spans. Access latency without caching would be ~15–30 cycles per random access; the D-cache reduces this to one cycle on hit.

Cache Configuration

Parameter	Value
Total size	8 KB
Associativity	Direct-mapped
Line size	32 bytes
Lines	256
Write policy	Write-through, no-write-allocate
Replacement	N/A (direct-mapped)
BSRAM cost	4 × 18-Kib blocks for data, small tag register file

Write-Through Rationale

Write-through eliminates the write-back state machine: no dirty bit, no eviction-from-dirty handling, no partial-write races, no fence-around-store complications. Stores update the cache and stream to DRAM through a write-combining store FIFO. Each retired store (SB/SH/SW/SD, Xcrisp auto-increment stores) pushes one entry; from the pipeline's perspective the store retires in one cycle, and the FIFO drains to DRAM in the background, coalescing address-contiguous stores into BL8 burst transactions. The CPU stalls only when the FIFO is full — and since DDR3 sustained write bandwidth (~4–5 GB/s on the 64-bit DDR3 bus) comfortably exceeds the rate at which the store stream can push entries, a depth of 16–32 entries makes FIFO-full stalls essentially absent in realistic workloads.

The FIFO is snoopable from the load path: a load checks pending FIFO entries (alongside the cache) and returns the freshest value rather than waiting for the DRAM commit — the same store-to-load forwarding used for top-of-stack caching in the Xstack path. The coalescing matters most for the no-write-allocate streaming cases below, where contiguous single-word stores would otherwise each cost a separate DRAM transaction.

(The original design used a 4-entry buffer with no combining. The deeper, BL8-coalescing FIFO is a refinement carried back from the eZX RV32 write path — that core has no D-cache, so the FIFO is its sole write path and proved out the deeper-plus-coalescing design under sustained store load.)

The cost of write-through is higher DRAM write bandwidth. For FireStorm's target workloads this is acceptable: the high-frequency stores (Xstack frames, Xctx state, audio voice updates in scratchpad) all go to BSRAM, not DRAM, so the cache write-through path sees only the lower-frequency general-data stores.

No-write-allocate means a store missing the cache writes directly to DRAM without bringing the line in. This avoids fetching a 32-byte line for a single-byte write — common in initialisation, output streaming, and ring-buffer producer code.

Cache Coherence with DMA

DMA writes auto-invalidate matching cache lines. For each DMA write to address A, the cache index (A >> 5) & 0xFF is computed and that line's valid bit is cleared. No software flush is needed before or after DMA.
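The index computation, together with the matching tag, can be written out as a short sketch. The helper names are hypothetical; the tag is assumed to be every address bit above the 5 offset bits and 8 index bits.

```c
#include <stdint.h>

#define DCACHE_LINE_SHIFT 5       // 32-byte lines
#define DCACHE_INDEX_BITS 8       // 256 lines
#define DCACHE_INDEX_MASK 0xFFu

// Line index for a physical address: (A >> 5) & 0xFF.
static unsigned dcache_index(uint64_t addr)
{
    return (unsigned)((addr >> DCACHE_LINE_SHIFT) & DCACHE_INDEX_MASK);
}

// Tag: the address bits above the offset and index fields.
static uint64_t dcache_tag(uint64_t addr)
{
    return addr >> (DCACHE_LINE_SHIFT + DCACHE_INDEX_BITS);
}
```

Two addresses 8 KB apart, such as 0x4000_0020 and 0x4000_2020, share index 1 but differ in tag, so in a direct-mapped cache they evict each other. Invalidating by index alone, as the DMA path does, may therefore clear a line belonging to a different address; this costs an extra miss but never returns stale data.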

DMA reads pass through DRAM directly (write-through guarantees DRAM is always current).

External DMA agents that bypass the FireStorm DMA engine must coordinate via software flush (mxdcache_flush_addr).

Cache State

Per-line field	Bits	Meaning
`valid`	1	Line contains valid data
`tag`	19	Upper bits of physical address (31 − 5 − 8 + spare)
`data`	256	The 32 bytes of cached data

Total: 32 bytes of data per line × 256 lines = 8 KB of data, plus a tag register file of about 1 KB (256 lines × 20 valid and tag bits = 640 bytes, rounded up).

Cache Control CSRs

CSR	Address (suggested)	Privilege	Description
`mxdcache`	`0xFC7`	MRO	Detection, size, line count
`mxdcache_flush`	`0xBD8`	MRW	Write any value to invalidate the entire cache
`mxdcache_flush_addr`	`0xBD9`	MRW	Write an address; the matching cache line is invalidated

Cache-Bypass Address Aliasing

Each cacheable physical address has two virtual aliases that differ only in bit 63 of the 64-bit address:

Bit 63 = 0 → cached access. Load/store goes through the D-cache normally.
Bit 63 = 1 → uncached access. Load/store bypasses the D-cache and goes directly to physical memory.

The lower 63 bits identify the physical location; both views map to the same memory. For a DDR3 address 0x0000_0000_4000_0000:

Alias	Behaviour
`0x0000_0000_4000_0000`	Cached view — hits/misses through the D-cache.
`0x8000_0000_4000_0000`	Uncached view — direct to DRAM, no cache lookup, no cache pollution.

Coherence between the two views. A bypassed store to address X automatically invalidates any cached line covering address X (cost: one cache-index compute + one valid-bit clear, identical to the DMA auto-invalidation logic of §5.2). A bypassed load reads the current DRAM value. After a bypassed store followed by a cached load to the same address, the cached load misses (line was invalidated), refills from DRAM, and sees the just-written value. Software does not need explicit fences between the two views in the common case.

Scope. The bit-63 bypass applies only to cacheable physical regions — currently DDR3 (0x0000_0000–0x7FFF_FFFF). For non-cacheable regions:

Scratchpad BSRAM: bit 63 is a no-op (BSRAM is direct-access by construction).
MMIO: bit 63 is a no-op (MMIO is uncached and strongly ordered by construction).
36-bit SRAM: bit 63 is a no-op (Harvard restriction blocks data access regardless).
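The scope rules can be condensed into a small predicate, sketched here under the assumption that DDR3 (0x0000_0000 to 0x7FFF_FFFF) is the only cacheable region, as stated above. The names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BYPASS_BIT (1ULL << 63)
#define DDR3_LAST  0x7FFFFFFFULL   // end of the cacheable DDR3 window

// Physical location named by an address: the lower 63 bits.
static uint64_t phys_addr(uint64_t va)
{
    return va & ~BYPASS_BIT;
}

// True when a data access at `va` actually skips the D-cache: bit 63 is set
// and the target is in the cacheable DDR3 window. Elsewhere bit 63 is a no-op.
static bool bypasses_dcache(uint64_t va)
{
    return (va & BYPASS_BIT) != 0 && phys_addr(va) <= DDR3_LAST;
}
```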

Instruction fetch. Instruction fetch is unaffected by bit 63. Fetches always use the lower 63 bits of PC, and the mode (narrow / wide) is determined by the physical fetch address as described in §3.

DMA. DMA addresses are interpreted as physical addresses; bit 63 is ignored on DMA operations (DMA is uncached by construction).

Software idiom. A simple macro materialises the bypass alias for any pointer:

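// Keeps the pointer's own type so the alias can be dereferenced directly
// (relies on the GCC/Clang __typeof__ extension; uintptr_t needs <stdint.h>).
#define UNCACHED(p) ((__typeof__(p))((uintptr_t)(p) | (1ULL << 63)))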

// Stream a large sample buffer without polluting the 8 KB D-cache:
for (int i = 0; i < big_size; i++)
    process_sample(*UNCACHED(buf + i));

Use cases. The bypass is most valuable when the access pattern doesn't benefit from caching:

Streaming reads/writes of buffers much larger than the 8 KB cache (audio samples, framebuffers, network buffers).
One-shot initialisation of large data structures.
DMA ring producer/consumer where coherence with a DMA partner is more important than reuse latency.
Predictable real-time paths where the variable D-cache miss latency is undesirable.

Rationale. Borrowed from MIPS's kseg0/kseg1 scheme and similar address-aliasing conventions used in many embedded SoCs. The technique avoids dedicating a new instruction encoding (no LW.NT or similar) and requires no new CSR — the bypass is encoded in the address itself, which composes naturally with pointer arithmetic.

5.3 Scratchpad BSRAM

A user-addressable BSRAM region for software-managed hot data.

Configuration

Variant	Size	Address range	BSRAM blocks
All models	32 KB	`0xC000_0000`–`0xC000_7FFF`	16

Properties

Single-cycle access for any load or store size (byte, halfword, word, doubleword).
Wide BSRAM ports — 576-bit — so multi-dword accesses pipeline at one per cycle.
DMA-accessible: DMACPY/DMASET can copy bulk data in and out (e.g., loading a sample LUT from DRAM to scratchpad at start-up).
Uncached: there is no separate cache layer; the BSRAM is the storage.
No translation, no permission beyond M/S/U physical address protection: scratchpad is treated as ordinary memory at a known fast address.
Global across contexts in v0.1: all Xctx contexts share the same scratchpad region. Per-context partitioning is a v0.2 candidate.

Typical Use

__attribute__((section(".bsram"))) static int16_t sine_lut[1024];
__attribute__((section(".bsram"))) static voice_t active_voices[32];
__attribute__((section(".bsram"))) static int scheduler_priorities[16];

The linker places .bsram symbols in the scratchpad region. A small runtime allocator may hand out remaining scratchpad space dynamically:

void *bsram_alloc(size_t size);
void  bsram_free(void *ptr);

For an audio synth, putting the active-voice array and filter coefficients in scratchpad means each per-sample state access is one cycle. The 8 KB / 32 KB sizes are tuned for these workloads — comfortably larger than the typical audio-rendering working set plus scheduler state plus small LUTs.

Memory Ordering

Scratchpad accesses follow RVWMO like any other memory. Stores are immediately visible to subsequent loads from the same hart; cross-context visibility (when contexts can run on different cores in v0.2) requires fences as usual.

5.4 Hardware Chip Registers (MMIO)

The entire address range 0xF000_0000–0xFFFF_FFFF (256 MB) is reserved for memory-mapped hardware registers. This hosts all chipset peripherals: the DMA engines, audio codec (WM8958/WM8960), video controllers, network controllers, storage interfaces, Sticky-controlled joystick I/O, FireStorm-internal control registers exposed to software, and any future hardware additions.

Uncached: every load and store goes directly to the target hardware. Neither the D-cache nor the prefetch buffer system covers this range.
Strongly ordered: MMIO accesses are not reordered by the pipeline; programmers see them in source order without explicit fences.
Not normally executed from: instruction fetches from this range are permitted but pointless — hardware registers do not generally contain instructions.
DMA permitted: DMACPY/DMASET targeting MMIO is allowed for streaming use cases (e.g., audio buffer to codec FIFO). The DMA engine treats MMIO writes as strictly ordered, identical to CPU MMIO stores.

Specific device addresses within this range are defined by the platform; the Ant64 chipset documentation lists individual peripheral register assignments. The CPU specification reserves the entire 0xFxxx_xxxx sixteenth of the 32-bit address space for this purpose so that platform variants have room to add peripherals without crowding the architectural address space.

The 256 MB allocation is much larger than any realistic peripheral set requires (typical chipset register footprint is 1–10 MB total across all peripherals), but reserving the whole top sixteenth of the address space gives clean address decode (one nibble compare: addr[31:28] == 0xF) and avoids future address-space contention.

5.5 Address Map Summary

Address range	Region	Caching	Latency	Data	Code
`0x0000_0000`–`0x7FFF_FFFF`	DDR3 main memory	8 KB D-cache	1/20 cyc	✓	✓ (narrow)
`0x8000_0000`–`0xBFFF_FFFF`	36-bit SRAM	None	N/A for data	✗ Harvard	✓ (wide)
`0xC000_0000`–`0xC000_7FFF`	Scratchpad BSRAM	Direct (uncached)	1 cyc	✓	✗
`0xC001_0000`–`0xDFFF_FFFF`	Reserved	—	trap	—	—
`0xE000_0000`–`0xE001_BFFF`	Xstack BSRAM	(Xstack-managed)	1 cyc	✗ (Xstack only)	✗
`0xE002_0000`–`0xE003_FFFF`	Xctx context BSRAM	(Xctx-managed)	1 cyc	✗ (M-mode debug)	✗
`0xE004_0000`–`0xEFFF_FFFF`	Reserved	—	trap	—	—
`0xF000_0000`–`0xFFFF_FFFF`	Hardware chip registers (MMIO)	Uncached, ordered	Device	✓ (load/store)	✗

Accesses to reserved ranges trap with cause ACCESS_FAULT (standard RV64 cause 5 for loads, 7 for stores, 1 for instruction fetch). Future architectural extensions may allocate regions out of the reserved space without breaking existing software.
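
A minimal address-decode sketch of the table above, for use in a simulator or test bench. The enum and function names are hypothetical. The scratchpad upper bound is taken from the 32 KB figure in §5.3, and the two small gaps the table does not list (just above the scratchpad and just above the Xstack BSRAM) are reported as `REGION_UNLISTED` rather than guessed at.

```c
#include <stdint.h>

typedef enum {
    REGION_DDR3, REGION_SRAM36, REGION_SCRATCHPAD, REGION_XSTACK,
    REGION_XCTX, REGION_MMIO, REGION_RESERVED, REGION_UNLISTED
} region_t;

typedef enum { ACCESS_FETCH, ACCESS_LOAD, ACCESS_STORE } access_t;

/* Classify a 32-bit physical address against the §5.5 map. */
region_t classify(uint32_t a)
{
    if (a <= 0x7FFFFFFFu) return REGION_DDR3;
    if (a <= 0xBFFFFFFFu) return REGION_SRAM36;
    if (a <= 0xC0007FFFu) return REGION_SCRATCHPAD;  /* 32 KB, per §5.3     */
    if (a <  0xC0010000u) return REGION_UNLISTED;    /* not in the table    */
    if (a <= 0xDFFFFFFFu) return REGION_RESERVED;
    if (a <= 0xE001BFFFu) return REGION_XSTACK;
    if (a <  0xE0020000u) return REGION_UNLISTED;    /* not in the table    */
    if (a <= 0xE003FFFFu) return REGION_XCTX;
    if (a <= 0xEFFFFFFFu) return REGION_RESERVED;
    return REGION_MMIO;                              /* addr[31:28] == 0xF  */
}

/* Standard RISC-V access-fault cause for an access into a reserved range. */
int access_fault_cause(access_t kind)
{
    switch (kind) {
    case ACCESS_FETCH: return 1;
    case ACCESS_LOAD:  return 5;
    default:           return 7;   /* store */
    }
}
```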

Code lives in DDR3 (narrow) or 36-bit SRAM (wide). Data lives in DDR3 (cached), scratchpad BSRAM (uncached fast), or hardware registers (uncached ordered). The Harvard restriction prevents data accesses to 36-bit SRAM (§5.1).

Cache-bypass aliasing. Every address in the cacheable DDR3 range has a corresponding uncached alias at addr | (1ULL << 63) — bit 63 set means "bypass D-cache." See §5.2 for the full semantics; software materialises uncached views via the UNCACHED(p) macro idiom.

5.6 Trap Causes

Cause	Mnemonic	Trigger
`49`	DATA_IN_CODE_MEMORY	Data load (or non-M-mode store) to the 36-bit SRAM range (§5.1)

5.7 Why a Tiny D-Cache?

The 8 KB D-cache is intentionally small. The reasoning:

Most hot data lives in BSRAM. Xstack frames, Xctx contexts, audio voice state, filter coefficients, scheduler tables — these go in BSRAM (dedicated, single-cycle, no cache complexity). The D-cache catches only what's left.
What's left is unpredictable. Pointer-chasing through B-tree nodes, hash-table buckets, library data, JIT structures, runtime-allocated objects — patterns where the programmer can't or shouldn't predict locality ahead of time. For these, a small reactive cache helps; a larger cache wouldn't help much more (the working set is irregular).
Write-through direct-mapped is nearly trivial. Verification surface is small, FPGA area is modest, and behaviour is predictable.

A larger cache (32 KB, set-associative) would help workloads with poor scratchpad-locality (large-scale enterprise software, complex compilers), but those are not Ant64's targets. The 8 KB / direct-mapped choice is calibrated for FireStorm's workload mix.

6. Register File

6.1 Integer Registers

FireStorm provides 64 general-purpose registers, x0–x63, each 64 bits wide. Narrow-mode instructions can address only x0–x31. Wide-mode instructions can address all 64.

Range	ABI role	Narrow-mode reachable	Wide-mode reachable	Compressed-reachable
x0	Hardwired zero	yes	yes	yes (via CR/CI special forms)
x1	`ra` — return address	yes	yes	yes (implicit in C.JALR)
x2	`sp` — stack pointer	yes	yes	yes (implicit in stack-relative forms)
x3	`gp` — global pointer	yes	yes	—
x4	`tp` — thread pointer	yes	yes	—
x5–x7	`t0–t2` — temporaries (caller-saved)	yes	yes	—
x8–x9	`s0–s1` — saved (callee-saved)	yes	yes	yes (bank 0)
x10–x15	`a0–a5` — arg/return (caller-saved)	yes	yes	yes (bank 0)
x16–x17	`a6–a7` — args (caller-saved)	yes	yes	—
x18–x27	`s2–s11` — saved (callee-saved)	yes	yes	—
x28–x31	`t3–t6` — temporaries (caller-saved)	yes	yes	—
x32–x39	extended scratch (caller-saved)	no	yes	—
x40–x47	extended scratch (caller-saved)	no	yes	yes (bank 1)
x48–x63	extended scratch (caller-saved)	no	yes	—

ABI policy for x32–x63. All extended registers are caller-saved. They are invisible to the calling convention: they are not used for argument passing, return values, or any cross-call contract. They exist only as additional allocatable scratch within a function. This makes vanilla-RV64 code automatically interoperable with wide-aware code — any call is conservatively assumed to clobber x32–x63, and any vanilla function physically cannot touch them, so the assumption is always safe.

6.2 Floating-Point Registers

FireStorm provides 64 floating-point registers, f0–f63, each 64 bits wide. The mirror of the GPR scheme applies exactly:

Range	ABI role	Narrow	Wide	Compressed
f0–f7	`ft0–ft7` — temporaries	yes	yes	—
f8–f9	`fs0–fs1` — saved	yes	yes	yes (bank 0)
f10–f15	`fa0–fa5` — args/return	yes	yes	yes (bank 0)
f16–f17	`fa6–fa7` — args	yes	yes	—
f18–f27	`fs2–fs11` — saved	yes	yes	—
f28–f31	`ft8–ft11` — temporaries	yes	yes	—
f32–f47	extended scratch (caller-saved)	no	yes	yes for f40–f47 (bank 1)
f48–f63	extended scratch (caller-saved)	no	yes	—

6.3 Wide-Dirty Tracking

Because extended registers are caller-saved, the function-call boundary requires no extra context save. However, traps that arrive at points the compiler cannot anticipate (timer interrupts, externally-vectored interrupts, and exceptions such as page faults) must preserve whatever wide register state was live at the point of interruption.

To avoid paying 256 bytes of extended-GPR save plus 256 bytes of extended-FPR save (32 registers × 8 bytes per bank) on every trap whether or not wide registers are in use, FireStorm maintains a per-hart wide-dirty bit:

The bit is set by hardware on any write to a register in x32–x63 or f32–f63.
The bit is cleared by software (typically the trap return / context restore path).
The trap handler's save logic tests the bit and saves the wide register banks only if set.

The bit is exposed as a read-write CSR at a FireStorm-allocated number (TBD; suggest 0x7C0, in the machine-mode custom region). A task that lives entirely in narrow code never sets the bit and never pays the save cost.
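
The three rules above can be illustrated with a small host-side model. Everything here is a sketch: `hart_t`, `trap_frame_t`, `write_gpr`, `trap_save_wide` and `trap_return_clear` are hypothetical names, and a real handler would read the CSR and use store instructions rather than `memcpy`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t x[64];        /* integer registers x0..x63        */
    uint64_t f[64];        /* floating-point registers f0..f63 */
    bool     wide_dirty;   /* models the per-hart wide-dirty bit */
} hart_t;

typedef struct {
    uint64_t wide_x[32];
    uint64_t wide_f[32];
    bool     has_wide;
} trap_frame_t;

/* Hardware rule: any write to x32..x63 sets the bit. */
void write_gpr(hart_t *h, unsigned idx, uint64_t value)
{
    if (idx == 0) return;              /* x0 is hardwired zero */
    h->x[idx] = value;
    if (idx >= 32) h->wide_dirty = true;
}

/* Handler rule: save the wide banks only if the bit is set.
   Returns the number of bytes saved. */
size_t trap_save_wide(const hart_t *h, trap_frame_t *tf)
{
    tf->has_wide = h->wide_dirty;
    if (!h->wide_dirty) return 0;
    memcpy(tf->wide_x, &h->x[32], sizeof tf->wide_x);
    memcpy(tf->wide_f, &h->f[32], sizeof tf->wide_f);
    return sizeof tf->wide_x + sizeof tf->wide_f;
}

/* Software rule: the restore path clears the bit. */
void trap_return_clear(hart_t *h)
{
    h->wide_dirty = false;
}
```

A task that only ever writes x0–x31 never reaches the `memcpy` calls, which is the saving the wide-dirty bit is meant to provide.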

7. 32-bit Instructions in Wide Mode

In wide mode, every 36-bit fetch contains a standard 32-bit RV64 instruction in bits [31:0] plus an extension nibble in bits [35:32]. The nibble extends the register-index fields of the instruction; the encoding of the 32-bit portion itself is unchanged.

7.1 Extension Nibble Layout per Format

Format	Examples	rd ext	rs1 ext	rs2 ext	rs3 ext	Other-purpose bits
R-type	ADD, SUB, AND, SLL, MUL	bit[32]	bit[33]	bit[34]	—	bit[35] = Xcond PRED-EN
I-type (imm/load/JALR)	ADDI, LW, JALR, ANDI, ORI	bit[32]	bit[33]	—	—	bits[35:34] = imm extension (§7.3)
I-type (shift)	SLLI, SRLI, SRAI, SLLIW	bit[32]	bit[33]	—	—	bits[35:34] reserved
S-type	SW, SD, SB, SH	—	bit[33]	bit[34]	—	bits[35], [32] = offset extension (§7.3)
B-type	BEQ, BNE, BLT, BGE	—	bit[33]	bit[34]	—	bits[35], [32] = imm extension (§7.3)
U-type	LUI, AUIPC	bit[32]	—	—	—	bits[35:33] = imm extension (§7.3)
J-type	JAL	bit[32]	—	—	—	bits[35:33] = imm extension (§7.3)
R4-type	FMADD, FMSUB, FNMADD, FNMSUB	bit[32]	bit[33]	bit[34]	bit[35]	none

The full register index for a field is formed by concatenating its extension bit with the standard 5-bit field: reg_index[5:0] = {ext_bit, field[4:0]}. With the extension bit clear, the instruction names x0–x31 exactly as in narrow mode.
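
As an illustration of the concatenation rule, the sketch below decodes the register fields of an R-type wide slot. The helper names (`make_slot`, `reg_index`, `decode_rtype_regs`) are hypothetical; the bit positions follow the R-type row of the table above.

```c
#include <stdint.h>

typedef struct { unsigned rd, rs1, rs2, pred_en; } rtype_regs_t;

/* Join a 32-bit RV64 instruction with its 4-bit extension nibble. */
uint64_t make_slot(uint32_t insn, unsigned nibble)
{
    return ((uint64_t)(nibble & 0xFu) << 32) | insn;
}

/* reg_index[5:0] = {ext_bit, field[4:0]} */
unsigned reg_index(unsigned ext_bit, unsigned field)
{
    return ((ext_bit & 1u) << 5) | (field & 0x1Fu);
}

/* R-type: rd ext = bit 32, rs1 ext = bit 33, rs2 ext = bit 34,
   bit 35 = Xcond PRED-EN. */
rtype_regs_t decode_rtype_regs(uint64_t slot)
{
    rtype_regs_t r;
    r.rd      = reg_index((unsigned)(slot >> 32), (unsigned)(slot >> 7));
    r.rs1     = reg_index((unsigned)(slot >> 33), (unsigned)(slot >> 15));
    r.rs2     = reg_index((unsigned)(slot >> 34), (unsigned)(slot >> 20));
    r.pred_en = (unsigned)(slot >> 35) & 1u;
    return r;
}
```

With a nibble of 0, `decode_rtype_regs` returns exactly the x0–x31 indices a narrow-mode decoder would see.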

The "Other-purpose bits" column shows how the previously-spare bits in each format are used by FireStorm wide mode. Most go to immediate extension (§7.3), giving wider immediate fields without changing instruction semantics. R-type's bit[35] is consumed by the Xcond PRED-EN mechanism (see FireStorm Xcond Extension). Shift-instruction nibble bits remain reserved because shift amounts are already bounded by register width.

7.2 Reserved Bits

After accounting for register extension (§7.1), immediate extension (§7.3), and Xcond PRED-EN, the only nibble bits that remain truly reserved in standard RV64 instruction encodings are the 2 bits in I-type shift instructions (SLLI, SRLI, SRAI, SLLIW, SRLIW, SRAIW). These bits must be written as zero by assemblers; FireStorm implementations may treat non-zero values as reserved-for-future-use, raise illegal-instruction, or ignore them.

Future revisions of the Xwide extension may assign meaning to these residual bits — for example, a non-temporal load/store hint, a branch-prediction hint, or a fence-flavour modifier.

7.3 Extended Immediates in Wide Mode

In wide mode, the previously-spare extension-nibble bits in each immediate-bearing format are used as high-order extension bits of the standard RV64 immediate. Instruction semantics are unchanged — sign extension, immediate position, and arithmetic behaviour all follow the standard RV64 rules — but the immediate range is wider.

7.3.1 Per-Format Summary

Format	Standard imm width	Wide-mode imm width	Wide-mode range	Improvement
U-type (LUI, AUIPC)	20 bits	23 bits	upper bits [34:12], ±16 GiB range	8× larger
J-type (JAL)	20 bits (target ±1 MB)	23 bits	±16 MiB jump range (×4 slot scaling)	16× longer
I-type imm ops	12 bits	14 bits	±8 KB signed	4× larger
I-type loads (offset)	12 bits	14 bits	±8 KB offset	4× larger
S-type stores (offset)	12 bits	14 bits	±8 KB offset	4× larger
B-type branches	12 bits (target ±4 KB)	14 bits	±32 KiB branch range (×4 slot scaling)	8× longer

Shift instructions (SLLI, SRLI, SRAI, SLLIW, SRLIW, SRAIW) do not extend their shift-amount immediate: the shift amount is already constrained by register width (6 bits for 64-bit shifts, 5 bits for 32-bit shifts), so the extra bits would be meaningless. The shift-instruction nibble spare bits remain reserved (§7.2).
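
The reach figures in the table follow from one piece of arithmetic: a signed field of n stored bits, scaled by s bytes, reaches down to −2^(n−1) × s. A small helper (hypothetical name `signed_reach`) makes the relationship easy to check:

```c
#include <stdint.h>

/* Magnitude, in bytes, of the most negative offset reachable with a
   signed immediate of `stored_bits` bits multiplied by `scale`. */
int64_t signed_reach(unsigned stored_bits, unsigned scale)
{
    return ((int64_t)1 << (stored_bits - 1)) * (int64_t)scale;
}
```

For instance, `signed_reach(12, 2)` gives 4096 (the standard ±4 KiB branch), `signed_reach(14, 4)` gives 32768 (the wide-mode ±32 KiB branch), and `signed_reach(23, 4)` gives 16 MiB for wide-mode JAL.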

7.3.2 Encoding Rules

The widened immediate value is constructed by concatenating the nibble extension bits as the most significant bits of the standard immediate, then sign-extending to 64 bits as usual:

U-type (LUI, AUIPC):

imm23 = {nibble[35:33], slot[31:12]}    // 3 nibble bits + 20 slot bits
result = sign_extend_to_64(imm23 << 12)

Effective placement: imm23 << 12 puts the immediate at bit positions [34:12] of rd, sign-extended from bit 34.

I-type immediate operations (ADDI, ADDIW, ANDI, ORI, XORI, SLTI, SLTIU, JALR offset):

imm14 = {nibble[35:34], slot[31:20]}    // 2 nibble bits + 12 slot bits
result = rs1 + sign_extend_to_64(imm14)    // (or AND/OR/XOR/SLT for others)

I-type loads (LB, LH, LW, LD, LBU, LHU, LWU):

offset14 = {nibble[35:34], slot[31:20]}    // 2 nibble bits + 12 slot bits
addr = rs1 + sign_extend_to_64(offset14)

S-type stores (SB, SH, SW, SD):

offset14 = {nibble[35], nibble[32], slot[31:25], slot[11:7]}    // 2 nibble + 12 slot bits
addr = rs1 + sign_extend_to_64(offset14)

B-type branches (BEQ, BNE, BLT, BGE, BLTU, BGEU):

imm14 = {nibble[35], nibble[32], slot[31], slot[7], slot[30:25], slot[11:8], 00}
       // bits[1:0] implicit zero in wide mode — slot-aligned targets
target = PC + sign_extend_to_64(imm14)

In wide mode, B-type targets are slot-aligned (4-byte aligned). The RV-standard branch immediate has an implicit zero at bit[0]; in wide mode, bit[1] is also implicit zero, giving ×4 scaling. The 14 stored bits form a signed slot offset, which is multiplied by 4 to yield a 16-bit signed byte offset. Effective branch reach: ±32 KiB, double the previous spec.

A label that would naturally land at the second-RVC position of a slot (byte offset 2 from a slot boundary) is pushed to the next slot boundary by an assembler-inserted c.nop. See §8.6 for the slot model and padding rules.

J-type (JAL):

imm23 = {nibble[35:33], slot[31], slot[19:12], slot[20], slot[30:21], 00}
       // bits[1:0] implicit zero in wide mode — slot-aligned targets
target = PC + sign_extend_to_64(imm23)

JAL reach: ±16 MiB in wide mode (was ±8 MiB), with the same ×4 slot scaling as B-type. Function entry points are slot-aligned by convention, so this rarely costs padding.
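
The I-, S-, B- and J-type rules above translate directly into bit extraction. The following is a sketch for a simulator or disassembler, not a description of the hardware decoder; the helper names (`sext`, `wide_imm_i`, `wide_imm_s`, `wide_imm_b`, `wide_imm_j`) and the `BIT`/`BITS` macros are hypothetical, and a slot is assumed to be held in the low 36 bits of a `uint64_t`.

```c
#include <stdint.h>

#define BIT(s, n)       (((s) >> (n)) & 1ULL)
#define BITS(s, hi, lo) (((s) >> (lo)) & ((1ULL << ((hi) - (lo) + 1)) - 1))

/* Sign-extend the low `bits` bits of v to 64 bits. */
int64_t sext(uint64_t v, unsigned bits)
{
    uint64_t m = 1ULL << (bits - 1);
    v &= (m << 1) - 1;
    return (int64_t)((v ^ m) - m);
}

/* I-type: {nibble[35:34], slot[31:20]} */
int64_t wide_imm_i(uint64_t s)
{
    return sext((BITS(s, 35, 34) << 12) | BITS(s, 31, 20), 14);
}

/* S-type: {nibble[35], nibble[32], slot[31:25], slot[11:7]} */
int64_t wide_imm_s(uint64_t s)
{
    return sext((BIT(s, 35) << 13) | (BIT(s, 32) << 12) |
                (BITS(s, 31, 25) << 5) | BITS(s, 11, 7), 14);
}

/* B-type: 14 stored bits, then x4 slot scaling. */
int64_t wide_imm_b(uint64_t s)
{
    uint64_t stored = (BIT(s, 35) << 13) | (BIT(s, 32) << 12) |
                      (BIT(s, 31) << 11) | (BIT(s, 7) << 10) |
                      (BITS(s, 30, 25) << 4) | BITS(s, 11, 8);
    return sext(stored, 14) * 4;
}

/* J-type: 23 stored bits, then x4 slot scaling. */
int64_t wide_imm_j(uint64_t s)
{
    uint64_t stored = (BITS(s, 35, 33) << 20) | (BIT(s, 31) << 19) |
                      (BITS(s, 19, 12) << 11) | (BIT(s, 20) << 10) |
                      BITS(s, 30, 21);
    return sext(stored, 23) * 4;
}
```

Setting only nibble bit 35 yields the most negative offset in each case: −32768 bytes for `wide_imm_b` and −16 MiB for `wide_imm_j`.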

7.3.3 Narrow-Mode Behaviour

In narrow mode, the extension nibble does not exist (32-bit fetches from DDR3 have no extra bits). Immediates are exactly 12/20 bits as in standard RV64. Code that uses wide-mode-only immediate ranges will not assemble correctly for narrow targets; the toolchain emits an error if a value out of the standard range is used in a narrow section.

7.3.4 Sign-Extension Behaviour and the LUI Hazard

RV64's LUI sign-extends its immediate to 64 bits. With imm20 (standard), lui rd, 0x80000 produces 0xFFFFFFFF_80000000 rather than the often-desired 0x00000000_80000000. This "LUI sign-extension hazard" forces software to use multi-instruction sequences for many positive 32-bit addresses (notably MMIO addresses in the 0xFxxx_xxxx range).

In wide mode, LUI imm23 covers bits [34:12], and the sign bit moves to bit 34. When bit 34 is clear the result is positive, so lui rd, 0x000F0000 in wide mode produces 0x00000000_F0000000 directly, with no follow-up correction. The hazard still exists at bit 34, three bits higher than in narrow mode, so a single LUI (or LUI plus ADDI with its wider imm14 when the low part is non-zero) constructs any positive address below 2^34 without sign correction.
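
The difference between the two modes comes down to which bit is treated as the sign. A host-side sketch, with hypothetical function names `lui_narrow` and `lui_wide`:

```c
#include <stdint.h>

/* Standard RV64 LUI: 20-bit immediate, sign taken from bit 31. */
uint64_t lui_narrow(uint32_t imm20)
{
    uint64_t v = (uint64_t)(imm20 & 0xFFFFFu) << 12;
    return (v ^ 0x80000000ULL) - 0x80000000ULL;
}

/* Wide-mode LUI: 23-bit immediate, sign taken from bit 34. */
uint64_t lui_wide(uint32_t imm23)
{
    uint64_t v = (uint64_t)(imm23 & 0x7FFFFFu) << 12;
    return (v ^ (1ULL << 34)) - (1ULL << 34);
}
```

`lui_narrow(0x80000)` returns `0xFFFFFFFF80000000`, while `lui_wide(0x0F0000)` returns `0x00000000F0000000`. The hazard reappears only once bit 22 of imm23 is set: `lui_wide(0x400000)` returns `0xFFFFFFFC00000000`.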

For constants where this isn't sufficient (rare in FireStorm's address space), the LIZ/LIK instructions of §7.6 sidestep sign-extension entirely.

7.3.5 Toolchain Support

The assembler recognises the extended ranges automatically when assembling into a wide section:

.section .text.wide
lui  t0, 0x123456         # imm23 = 0x123456, fits in wide mode
                          # In narrow mode this would error
addi t0, t0, 4096         # imm14 = 4096, fits (was out of range for imm12)

The compiler exploits extended immediates automatically when the function is compiled for wide mode (+xfirestorm and target memory is SRAM). It falls back to the standard imm12/imm20 + ADDI sequences for narrow code.

7.4 Worked Examples

ADD x40, x10, x44 — R-type, rd and rs2 in the extended range:

extension nibble [35:32] = 0_1_0_1 = 0x5   (rs2_ext=1, rs1_ext=0, rd_ext=1; bit[35]=Xcond PRED-EN=0)
ADD x40,x10,x44 standard encoding  = 0x00C50433  (with rd=8, rs1=10, rs2=12)
36-bit slot:                         0x500C50433

The base encoding uses rd=8, rs1=10, rs2=12 because the low 5 bits of x40, x10, x44 are 8, 10, 12. The extension bits supply the missing high bit for rd (x40 = x32+8) and rs2 (x44 = x32+12). rs1 = x10 needs no extension.

FMADD.D f50, f40, f44, f48 — R4-type, all four fields extended:

extension nibble [35:32] = 1_1_1_1 = 0xF   (rs3_ext=1, rs2_ext=1, rs1_ext=1, rd_ext=1)

This is the only encoding shape that spends the entire nibble on register extension.

LUI x60, 0x12345 — U-type, rd extended, imm fits in standard 20 bits:

imm23 = 0x012345 (high 3 bits of imm23 = 000)
nibble [35:32] = 0_0_0_1 = 0x1  (imm extension bits = 000, rd_ext = 1)

LUI x10, 0x2F0000 — U-type, wide-mode-extended immediate (would not fit in narrow):

imm23 = 0x2F0000 — bits [22:20] = 010, bits [19:0] = 0xF0000
nibble [35:32] = 0_1_0_0 = 0x4  (nibble[35:33] = 010 = high 3 bits of imm23)
slot[31:12] = 0xF0000 (low 20 bits of imm23)
result: rd = sign_extend(0x2F0000 << 12) = 0x00000002_F0000000 (bit 34 clear, so positive — no LUI hazard)

ADDI x12, x10, 4000 — I-type, wide-mode-extended immediate:

imm14 = 4000 = 0x0FA0
nibble[35:34] = 00 (high 2 bits of imm14)
slot[31:20] = 0xFA0 (low 12 bits)
Standard imm12 max is 2047; 4000 requires imm14.

7.5 Wide-Mode-Only Extension Escape (`0x7F`)

In addition to the extension-nibble scheme for standard RV64 instructions described above, wide mode reserves the opcode 0x7F (bits[6:0] = 1111111) as a wide-mode-only escape mechanism.

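In the standard RISC-V instruction-length encoding, bits[4:2] = 111 (with bits[1:0] = 11) marks instructions longer than 32 bits, and bits[6:0] = 1111111 in particular is set aside for the 80-bit-and-longer formats; a 32-bit slot with bits[6:0] = 0x7F is therefore unallocated in RV64GC and traps as illegal-instruction. FireStorm preserves this behaviour in narrow mode: a 0x7F instruction fetched from DDR3 traps as expected.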

In wide mode, the FireStorm decoder repurposes 0x7F as a marker for an extended encoding family:

 35 34                                7 6           0
+---+----------------------------------+-------------+
| F |       28-bit custom payload      |  1111111    |
+---+----------------------------------+-------------+

Bit [35] is a top-level format selector.
Bits [34:7] are a 28-bit payload available for custom sub-encoding.
Bits [6:0] mark the slot as a wide-mode extension.
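
A decoder front-end might split an escape slot into its three fields as sketched below. The type and function names (`escape_t`, `decode_escape`) are hypothetical, and the slot is assumed to sit in the low 36 bits of a `uint64_t`.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     is_escape;   /* bits [6:0] == 0x7F          */
    unsigned format;      /* bit [35], top-level selector */
    uint32_t payload;     /* bits [34:7]                  */
} escape_t;

escape_t decode_escape(uint64_t slot)
{
    escape_t e;
    e.is_escape = (slot & 0x7Fu) == 0x7Fu;
    e.format    = (unsigned)(slot >> 35) & 1u;
    e.payload   = (uint32_t)(slot >> 7) & 0x0FFFFFFFu;  /* keep bits [34:7] */
    return e;
}
```

The `is_escape` test is only meaningful for slots fetched in wide mode; in narrow mode the same opcode traps as described above.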

This mechanism is a general-purpose lane for any wide-mode-only instruction family — instructions that need more encoding room than the 32-bit-with-nibble scheme provides, or that would not fit cleanly inside the custom-0/1/2/3 opcode slots used by narrow-compatible extensions.

The first user of the escape is the Xcrisp PIC family (§7 of the Xcrisp doc), which uses both top-level format values: bit[35]=0 for PC-relative addressing instructions (LDPC, LWPC, LAPC, JALPC) with a 19-bit immediate, and bit[35]=1 for register-indirect calls (CALLM, JMPM) with 6-bit register fields directly addressing x0–x63 without an extension nibble.

Future wide-mode-only extensions (DSP primitives, graphics accelerators, cryptographic primitives, etc.) may allocate further sub-encodings within the 0x7F payload, coordinating with the format bit and choosing dispatch sub-fields that do not conflict with existing allocations.

Discipline. Any FireStorm extension using the 0x7F escape must:

Define its format-bit value ([35] = 0 or 1).
Allocate a dispatch sub-field within the 28-bit payload (typically funct3, funct4, or a similar small field) that does not collide with allocations made by other extensions sharing the same format bit.
Document the layout in the relevant extension specification.
Mark used encodings in the Xcrisp Reserved Spaces table (or equivalent) for any extension that follows.

This is a softer discipline than the custom-0/1/2/3 opcode space (which is RISC-V-standard reserved), but the FireStorm architecture treats it with the same rigour: the 0x7F escape is a finite resource and its allocations are spec-tracked.

7.6 Direct Immediate Construction: LIZ and LIK

Two instructions in the wide-mode escape (0x7F, §7.5) provide direct construction of 64-bit constants from 16-bit chunks. The design is borrowed from ARM-A64's MOVZ/MOVK family — a familiar pattern that builds any 64-bit value in at most four instructions and many useful values in one or two.

These instructions occupy the W-type funct3 = 111 slot that the Xcrisp PIC family currently leaves reserved (see Xcrisp §7.3), so they coexist cleanly with the Xcrisp allocations.

7.6.1 Instruction Semantics

LIZ rd, imm16, shift — Load Immediate, Zero-extend.

rd = zero_extend_to_64(imm16) << (shift × 16)

All other bits of rd are cleared. shift ∈ {0, 1, 2, 3} selects the 16-bit chunk position {bits [15:0], [31:16], [47:32], [63:48]} of rd.

LIK rd, imm16, shift — Load Immediate, Keep.

rd[(shift × 16 + 15) : (shift × 16)] = imm16

All other bits of rd are unchanged. Same shift positions as LIZ.

In both instructions, imm16 is treated as a raw 16-bit unsigned value (no sign extension applied within the chunk).
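
The two definitions can be written as plain C functions for use in a simulator or for checking hand-written sequences. The names `liz` and `lik` mirror the mnemonics; modelling `rd` as a value passed in and returned is a choice made for this sketch.

```c
#include <stdint.h>

/* LIZ: place imm16 in the selected chunk, all other bits zero. */
uint64_t liz(uint16_t imm16, unsigned shift)
{
    return (uint64_t)imm16 << ((shift & 3u) * 16);
}

/* LIK: replace the selected chunk of rd, all other bits kept. */
uint64_t lik(uint64_t rd, uint16_t imm16, unsigned shift)
{
    unsigned pos = (shift & 3u) * 16;
    return (rd & ~(0xFFFFULL << pos)) | ((uint64_t)imm16 << pos);
}
```

For example, `lik(liz(0xBEEF, 0), 0xDEAD, 1)` evaluates to `0x00000000DEADBEEF`.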

7.6.2 Encoding Format

LIZ and LIK occupy the W-type subspace (bit[35] = 0) at funct3 = 111. The 19-bit immediate field of the W-type format is repurposed:

 35  34   33 32   31              16   15 13   12     7   6           0
+---+----+-------+------------------+--------+---------+-------------+
| 0 |  L | shift |    imm[15:0]     |  111   |   rd    |  1111111    |
+---+----+-------+------------------+--------+---------+-------------+

Field	Bits	Meaning
W-format selector	[35]	`0` (W-type)
L	[34]	`0` = LIZ, `1` = LIK
shift	[33:32]	Chunk position: 00=bits[15:0], 01=bits[31:16], 10=bits[47:32], 11=bits[63:48]
imm16	[31:16]	16-bit immediate
funct3	[15:13]	`111` — selects the LI-family within W-type
rd	[12:7]	6-bit destination register (full x0–x63 access)
opcode	[6:0]	`0x7F`

The encoding is byte-clean: the immediate field aligns to bits [31:16] of the slot for trivial decoding, and the shift + LIZ/LIK selector live in 3 of the 4 nibble bits. No extension nibble bits are reserved for register-file extension because the W-type rd field is already 6 bits wide (§7.5).
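
Packing the fields from the table into a 36-bit slot is a matter of shifts and ORs. The function name `encode_li` and its parameter order are hypothetical; the field positions are those listed above.

```c
#include <stdint.h>

/* keep = 0 encodes LIZ, keep = 1 encodes LIK.
   Returns the 36-bit slot in the low bits of a uint64_t. */
uint64_t encode_li(int keep, unsigned shift, uint16_t imm16, unsigned rd)
{
    return ((uint64_t)(keep ? 1u : 0u) << 34)   /* L, bit [34]          */
         | ((uint64_t)(shift & 3u) << 32)       /* shift, bits [33:32]  */
         | ((uint64_t)imm16 << 16)              /* imm16, bits [31:16]  */
         | (0x7ULL << 13)                       /* funct3 = 111         */
         | ((uint64_t)(rd & 0x3Fu) << 7)        /* rd, bits [12:7]      */
         | 0x7FULL;                             /* opcode, bit [35] = 0 */
}
```

Under this layout, LIZ t0, 0xBEEF, 0 (with t0 = x5) packs to `0x0BEEFE2FF`, and LIK t0, 0xDEAD, 1 packs to `0x5DEADE2FF`.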

7.6.3 Use Patterns

Build any 64-bit constant K in 1–4 instructions:

Split K into four 16-bit chunks and skip the zero ones. Load the first non-zero chunk with LIZ (which zeroes the rest of rd), then insert each remaining non-zero chunk with LIK (which preserves the chunks already in place).
If only one chunk is non-zero, a single LIZ suffices. If all four are non-zero, the sequence is LIZ followed by three LIKs.
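
The procedure above is short enough to sketch as code. This is one possible planner, not the toolchain's actual algorithm; `li_op_t` and `plan_constant` are hypothetical names, and the all-zero constant is handled as a single LIZ of 0.

```c
#include <stdint.h>

typedef struct {
    int      keep;    /* 0 = LIZ, 1 = LIK */
    uint16_t imm16;
    unsigned shift;   /* chunk position 0..3 */
} li_op_t;

/* Fill ops[] with a LIZ/LIK sequence that builds k.
   Returns the number of instructions used (1 to 4). */
int plan_constant(uint64_t k, li_op_t ops[4])
{
    int n = 0;
    for (unsigned s = 0; s < 4; s++) {
        uint16_t chunk = (uint16_t)(k >> (s * 16));
        if (chunk == 0) continue;       /* zero chunks need no instruction */
        ops[n].keep  = (n > 0);         /* first op is LIZ, the rest LIK   */
        ops[n].imm16 = chunk;
        ops[n].shift = s;
        n++;
    }
    if (n == 0) {                       /* k == 0: one LIZ of zero */
        ops[0].keep = 0; ops[0].imm16 = 0; ops[0].shift = 0;
        n = 1;
    }
    return n;
}
```

The instruction count is simply the number of non-zero 16-bit chunks, with a floor of one.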

7.6.4 Examples

Build 0xDEADBEEF (32-bit constant):

LIZ  t0, 0xBEEF, 0    # t0 = 0x00000000_0000BEEF
LIK  t0, 0xDEAD, 1    # t0 = 0x00000000_DEADBEEF

2 instructions; clean across the bit-31 boundary, no LUI sign-extension hazard.

Build 0x80000000 (the LUI-hazard mid-range address):

LIZ  t0, 0x8000, 1    # t0 = 0x00000000_80000000

1 instruction; LUI would need 2 instructions with explicit sign correction.

Build the MMIO base 0xF0000000:

LIZ  t0, 0xF000, 1    # t0 = 0x00000000_F0000000

1 instruction.

Build a 64-bit hash seed 0x12345678_9ABCDEF0:

LIZ  t0, 0xDEF0, 0    # t0 = 0x00000000_0000DEF0
LIK  t0, 0x9ABC, 1    # t0 = 0x00000000_9ABCDEF0
LIK  t0, 0x5678, 2    # t0 = 0x00005678_9ABCDEF0
LIK  t0, 0x1234, 3    # t0 = 0x12345678_9ABCDEF0

4 instructions for an arbitrary 64-bit constant; standard RV64 typically needs 6–8 instructions or a literal-pool LD.

Build a sparse bitmask 0x80000000_00000001:

LIZ  t0, 0x0001, 0    # t0 = 0x00000000_00000001
LIK  t0, 0x8000, 3    # t0 = 0x80000000_00000001

2 instructions.

7.6.5 Comparison with Standard RV64 Sequences

Constant	Standard RV64	Wide-mode LUI/ADDI (§7.3)	LIZ/LIK
`0x000FFC` (small)	2 (LUI+ADDI)	1 (ADDI)	1 (LIZ)
`0x12345678` (32-bit)	2 (LUI+ADDI)	2 (LUI+ADDI)	2 (LIZ+LIK)
`0xF0000000` (high-32)	3 (LUI+ADDI+special-case)	1 (LUI)	1 (LIZ)
Arbitrary 48-bit	4–6	4–5	3 (LIZ+2×LIK)
Arbitrary 64-bit	6–8 (or LD from pool)	6–8	4 (LIZ+3×LIK)

For 32-bit constants the wins are modest (the extended immediates of §7.3 already cover most of this ground). LIZ/LIK shine for 64-bit constants and for high-bit-set 32-bit addresses, where they avoid the LUI sign-extension gymnastics entirely.

7.6.6 Narrow-Mode Behaviour

LIZ and LIK are wide-mode-only (they live in the 0x7F escape, which traps as illegal in narrow mode per §7.5). Narrow-mode code uses LUI + ADDI sequences as standard.

7.6.7 Toolchain Support

The assembler exposes liz and lik mnemonics directly. The compiler uses them automatically for constant materialisation in wide-mode functions:

16-bit constants: 1 LIZ.
32-bit constants: prefer wide-mode LUI imm23 + ADDIW imm14 (§7.3) when the immediate fits; fall back to LIZ + LIK when bit-pattern alignment doesn't suit LUI's shift.
64-bit constants: LIZ + LIK sequence as needed.
Constants for which a literal-pool load is cheaper than register construction (rare): retain the LD pattern.

8. Compressed Instructions (RVC) in Wide Mode

A 36-bit SRAM word may contain two RVC instructions in bits [31:0] (lower half at [15:0], upper half at [31:16]), with the 4 extension bits in [35:32] split evenly between them:

Bits	Belongs to
[33:32]	First compressed instruction (lower half, at slot [15:0])
[35:34]	Second compressed instruction (upper half, at slot [31:16])

Each compressed instruction has 2 extension bits available. How those bits map to register-field extension depends on the RVC format.

The standard RVC op and funct3 fields are sufficient for the decoder to identify the format, so per-format bit assignment is free in hardware.

8.1 RVC Formats with 5-bit Register Fields

Formats CR, CI, and CSS use full 5-bit register fields (matching the RV32I encoding). The extension bit prefixes the field to form a 6-bit index, exactly as in §7.

Format	Examples	rd/rs1 ext	rs2 ext	Spare
CR (2 fields active)	C.MV, C.ADD	bit[a]	bit[b]	none
CR (1 field active)	C.JR, C.JALR	bit[a]	—	bit[b]
CR (no fields)	C.EBREAK	—	—	bits[b:a]
CI	C.LI, C.LUI, C.ADDI, C.SLLI, C.LWSP, C.LDSP	bit[a]	—	bit[b]
CSS	C.SWSP, C.SDSP	—	bit[a]	bit[b]

(Where a and b are the two bits allocated to that compressed instruction: {a,b} = {32,33} for the first compressed insn, {34,35} for the second.)

C.JR / C.JALR use only the rd/rs1 field as the jump target; the rs2 field must be zero. C.EBREAK has no operand fields. Spare bits are reserved as in §7.2.

8.2 RVC Formats with 3-bit Register Fields — Bank Select

Formats CIW, CL, CS, CA, and CB use 3-bit register fields mapping to an 8-register subset. In narrow mode that subset is x8–x15 (the standard RVC window, comprising s0, s1, a0–a5). In wide mode the extension bit acts as a bank-select:

Extension bit	Field maps to
0	x8–x15 (standard RVC window) — backward compatible
1	x40–x47 (wide-RVC window — extended scratch)

The full reachable set from a wide-RVC 3-bit field is therefore {x8, x9, x10, x11, x12, x13, x14, x15, x40, x41, x42, x43, x44, x45, x46, x47} — sixteen registers, all caller-saved or ABI-defined, all in the "hot" register-allocator preference band.

Note: the bank select does not expose x0–x7 (which would include ra, sp, and other registers that RVC's 3-bit forms intentionally exclude). Existing RVC instruction semantics rely on x8–x15 being unprivileged scratch/argument registers; the wide-RVC window preserves that property.
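
The bank-select mapping and its inverse can be sketched as follows. The function names `rvc3_reg` and `rvc3_encode` are hypothetical; the inverse is the check an assembler would need before choosing a compressed form.

```c
#include <stdint.h>

/* Map a 3-bit RVC register field plus its bank-select bit to an
   architectural register number: bank 0 = x8..x15, bank 1 = x40..x47. */
unsigned rvc3_reg(unsigned ext_bit, unsigned field3)
{
    return 8u + (field3 & 7u) + ((ext_bit & 1u) ? 32u : 0u);
}

/* Inverse mapping. Returns 1 and fills the outputs if `reg` is
   reachable from a 3-bit field, 0 otherwise. */
int rvc3_encode(unsigned reg, unsigned *ext_bit, unsigned *field3)
{
    if (reg >= 8 && reg <= 15)  { *ext_bit = 0; *field3 = reg - 8;  return 1; }
    if (reg >= 40 && reg <= 47) { *ext_bit = 1; *field3 = reg - 40; return 1; }
    return 0;
}
```

For example, `rvc3_reg(1, 2)` is 42, and `rvc3_encode` rejects x32 because it lies outside both windows.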

Format	Examples	rs1' / rd' ext	rd' ext	rs2' ext	Spare
CIW	C.ADDI4SPN	—	bit[a]	—	bit[b]
CL	C.LW, C.LD	bit[a] (base)	bit[b] (dest)	—	none
CS	C.SW, C.SD	bit[a] (base)	—	— (rs2' narrow)	bit[b]
CA	C.AND, C.OR, C.XOR, C.SUB, C.ADDW, C.SUBW	bit[a] (dest=src1 merged)	—	— (rs2' narrow)	bit[b]
CB (branch)	C.BEQZ, C.BNEZ	bit[a] (rs1')	—	—	bit[b]
CB (immediate)	C.SRLI, C.SRAI, C.ANDI	bit[a] (rd'=rs1' merged)	—	—	bit[b]

Two-operand RVC rule. For RVC instructions with two distinct register operands in 3-bit fields (CL, CS, CA), the policy is:

The destination (or read-modify-write merged dest/src1) is extended.
The base address (in load/store) is extended.
The second source (rs2' — the value being stored, or the operand being added) is not extended; it remains in the standard x8–x15 window.

This matches typical compiler patterns where the destination and the base pointer are long-lived (likely allocated in the wide pool), while a freshly-loaded operand is naturally placed near the standard ABI registers.

8.3 RVC Formats with No Register Fields

Format	Examples	Ext bits used	Spare
CJ	C.J	none	bits[b:a]

Both bits are reserved.

8.4 RVC Worked Example

A 36-bit slot containing C.ADD x44, x40 (first half) and C.LW x42, 0(x45) (second half):

First compressed insn  (C.ADD x44, x40):
  bit[32] (rd/rs1 ext) = 1  (x44 = x32+12, low 5 bits = 12; high bit = 1)
  bit[33] (rs2 ext)    = 1  (x40 = x32+8, low 5 bits = 8; high bit = 1)
  CR encoding with rd/rs1=12, rs2=8

Second compressed insn (C.LW x42, 0(x45)):
  CL format: rd' = x42 → bank 1, low 3 bits = 2 (since x40 + 2 = x42)
  rs1' = x45 → bank 1, low 3 bits = 5 (since x40 + 5 = x45)
  bit[34] (rs1' ext, base) = 1
  bit[35] (rd' ext, dest)  = 1
  CL encoding with rd'=010, rs1'=101

Extension nibble [35:32] = 1_1_1_1 = 0xF
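
The slot layout used in this example can be packed and unpacked as sketched below. The names `pack_rvc_slot`, `rvc_half` and `rvc_half_t` are hypothetical; each 2-bit extension value holds bit a in its low bit and bit b in its high bit, following the {a,b} convention of §8.1.

```c
#include <stdint.h>

typedef struct {
    uint16_t insn;    /* the 16-bit RVC encoding   */
    unsigned ext_a;   /* bit 32 (first) or 34 (second) */
    unsigned ext_b;   /* bit 33 (first) or 35 (second) */
} rvc_half_t;

/* Build a 36-bit slot from two RVC halfwords and their 2-bit extensions. */
uint64_t pack_rvc_slot(uint16_t first, unsigned first_ext,
                       uint16_t second, unsigned second_ext)
{
    return ((uint64_t)(second_ext & 3u) << 34)
         | ((uint64_t)(first_ext & 3u) << 32)
         | ((uint64_t)second << 16)
         | (uint64_t)first;
}

/* Extract one half (second = 0 for the lower, 1 for the upper). */
rvc_half_t rvc_half(uint64_t slot, int second)
{
    rvc_half_t h;
    unsigned base = second ? 34u : 32u;
    h.insn  = (uint16_t)(slot >> (second ? 16 : 0));
    h.ext_a = (unsigned)(slot >> base) & 1u;
    h.ext_b = (unsigned)(slot >> (base + 1)) & 1u;
    return h;
}
```

With the standard halfword encodings for the field values above (0x9622 for the C.ADD, 0x4288 for the C.LW), `pack_rvc_slot(0x9622, 3, 0x4288, 3)` gives the slot `0xF42889622`.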

8.5 Extended Immediates in RVC

RVC formats that have a spare nibble bit (after register extension and bank-select consume their bits) use that bit to extend the standard RVC immediate by one bit at the high end. For CJ (C.J; RV64 has no C.JAL), both nibble bits are spare and extend the immediate by two bits. In wide mode, branch and jump immediates additionally use ×4 slot scaling (§8.6) instead of the standard ×2 byte scaling, which doubles the effective range again. The net result is a ±1 KiB range for compressed conditional branches (4× standard) and a ±16 KiB range for compressed unconditional jumps (8× standard).

Format	Instructions	Standard imm	Wide imm	Range
CI	C.LI, C.ADDI, C.ADDIW, C.ANDI (in CB)	6 bits (±32)	7 bits	±64
CI	C.LUI	6 bits	7 bits	1 extra bit of upper immediate
CI	C.ADDI16SP	6 bits × 16 (±512)	7 bits × 16	±1024-byte stack-frame adjust
CI	C.LWSP	6 bits × 4 (0–252)	7 bits × 4	0–508 SP-relative load reach
CI	C.LDSP	6 bits × 8 (0–504)	7 bits × 8	0–1016 SP-relative load reach
CSS	C.SWSP	6 bits × 4 (0–252)	7 bits × 4	0–508 SP-relative store reach
CSS	C.SDSP	6 bits × 8 (0–504)	7 bits × 8	0–1016 SP-relative store reach
CIW	C.ADDI4SPN	8 bits × 4 (0–1020)	9 bits × 4	0–2044-byte stack allocation
CS	C.SW	5 bits × 4 (0–124)	6 bits × 4	0–252 struct-offset store reach
CS	C.SD	5 bits × 8 (0–248)	6 bits × 8	0–504 struct-offset store reach
CB	C.BEQZ, C.BNEZ	8 bits × 2 (±256)	9 bits × 4	±1 KiB branch range (slot-aligned, §8.6)
CJ	C.J, C.JAL	11 bits × 2 (±2 KiB)	13 bits × 4	±16 KiB jump range (slot-aligned, §8.6)

The extra immediate bit is placed as the most significant bit of the immediate, before sign extension or scaling. Sign extension is unchanged, and scaling is unchanged except for branch and jump targets, which use the ×4 slot scaling of §8.6.
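The ranges in the table follow directly from the field width, the scale factor, and whether the field is signed. The sketch below recomputes them; imm_range is a hypothetical helper, not part of any FireStorm tool. It also shows that the "±" figures in the table are rounded: a signed field reaches one scale step less in the positive direction.

```python
def imm_range(bits: int, scale: int, signed: bool) -> tuple:
    """Return (lowest, highest) value reachable by a scaled immediate field."""
    if bits < 1 or scale < 1:
        raise ValueError("bits and scale must be positive")
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return lo * scale, hi * scale


# Wide-mode rows from the table above: (bits, scale, signed)
WIDE_ROWS = {
    "C.ADDI16SP": (7, 16, True),
    "C.LWSP": (7, 4, False),
    "C.LDSP": (7, 8, False),
    "C.ADDI4SPN": (9, 4, False),
    "C.SW": (6, 4, False),
    "C.SD": (6, 8, False),
    "C.BEQZ": (9, 4, True),
    "C.J": (13, 4, True),
}

if __name__ == "__main__":
    for name, row in WIDE_ROWS.items():
        print(name, imm_range(*row))
```

For example, the C.BEQZ row evaluates to (-1024, 1020) and the C.J row to (-16384, 16380), matching the ±1 KiB and ±16 KiB entries.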

CL Asymmetry (Loads)

The CL format (C.LW, C.LD) has zero spare nibble bits: both bits are consumed by bank-select on the base register (rs1') and the destination register (rd'). Compressed loads therefore cannot extend their offset in wide mode — they keep the standard 5-bit-scaled range (0–124 for C.LW, 0–248 for C.LD).

Code that needs a larger struct-field offset on a load falls back to the standard 32-bit LW/LD, which gets the wide-mode imm14 from §7.3. The cost is 2 extra bytes per access. The asymmetry between compressed stores (which do extend) and compressed loads (which do not) is structural — fixing it would require redesigning the CL bank-select scheme and breaking the wide-RVC window guarantees.

Shift-Amount Formats

C.SLLI (CI format) and C.SRLI / C.SRAI (CB format) use their immediate field as a shift amount, already bounded by register width (6 bits for 64-bit, 5 bits for 32-bit). Their spare bits remain reserved (§7.2), matching the standard-width shift-instruction policy.

Toolchain Behaviour

The wide-mode RVC assembler accepts the extended immediate ranges automatically. The compiler exploits them when targeting wide sections — typical wins:

Stack frame access: deeper frames stay in compressed form. A frame of 500 bytes can use C.LWSP/C.LDSP throughout instead of falling back to 32-bit LW/LD past the 252/504-byte threshold.
Struct stores: 5-bit offset isn't enough for medium-sized structs; the extra bit covers 256-byte and 512-byte structs without 32-bit fallback.
Branch density: C.BEQZ / C.BNEZ ±1 KiB range catches almost all in-function conditional branches, reducing 32-bit BEQ/BNE usage.
Local jumps: C.J / C.JAL ±16 KiB range is enough for nearly any function-internal goto or call to a nearby helper.

In typical wide-mode code, the extended RVC immediates eliminate 5–15% of 32-bit fallback encodings versus standard RVC. The exact win depends on function size and struct-access patterns.

8.6 The Slot Model

In wide mode, the slot is the atomic unit of instruction memory and PC arithmetic. A slot is 36 bits wide (32-bit instruction + 4-bit extension nibble), consecutive slots are 4 bytes apart in the address space, and a valid wide-mode PC is always a multiple of 4. Each slot holds either:

One 32-bit RV64 instruction (occupies the entire slot), or
Two 16-bit compressed (RVC) instructions (occupy the lower and upper halves; §8.1–§8.4).

Although PC values remain byte-addressed for software-visible purposes (AUIPC, mepc, return-address register, function pointers), in wide mode they always carry zeros in bits[1:0]. The slot model can be thought of as: "wide-mode memory is a sequence of 36-bit slots, indexed from 0 by slot number, where slot N starts at byte address N × 4."

Branch and Jump Target Alignment

In wide mode, all branch and jump targets are slot-aligned (4-byte aligned). This applies to every control-flow-changing instruction:

Instruction class	Wide-mode target alignment
B-type branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)	4-byte (slot-aligned)
JAL	4-byte (slot-aligned)
JALR	Computed at runtime; hardware masks bit[0] per RV convention. In wide mode, bit[1] of the computed target must also be zero, or an `instruction address misaligned` trap is raised.
C.BEQZ, C.BNEZ	4-byte (slot-aligned)
C.J, C.JAL	4-byte (slot-aligned)
C.JR, C.JALR	Same constraint as JALR — slot-aligned at runtime.

The slot-aligned target rule lets the encoding take ×4 scaling for branch/jump immediates instead of the standard ×2, doubling the effective branch range in wide mode (see §3.1 and §7.3.2 for the resulting ranges).
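The JALR row of the table can be modelled in a few lines. This is a sketch under stated assumptions: the SRAM window (0x8000_0000 to 0xBFFF_FFFF) is taken from the linker table in §13.3, the target's mode is assumed to be decided by that address alone, and the function and exception names are hypothetical.

```python
SRAM_FIRST, SRAM_LAST = 0x8000_0000, 0xBFFF_FFFF   # assumed from §13.3
XLEN_MASK = (1 << 64) - 1


class InstructionAddressMisaligned(Exception):
    """Stands in for the `instruction address misaligned` trap."""


def is_wide(addr: int) -> bool:
    """Mode is a function of the physical fetch address only."""
    return SRAM_FIRST <= addr <= SRAM_LAST


def slot_to_byte(slot: int) -> int:
    """Slot N starts at byte address N * 4."""
    return slot * 4


def jalr_target(base: int, imm: int) -> int:
    """Compute a JALR target, applying the wide-mode slot-alignment rule."""
    target = (base + imm) & XLEN_MASK & ~1   # bit[0] is always masked
    if is_wide(target) and target & 0b10:    # bit[1] must be zero in wide mode
        raise InstructionAddressMisaligned(hex(target))
    return target
```

A target of 0x8000_0105 is masked to 0x8000_0104 and accepted, while 0x8000_0102 raises; the same low bits in DDR3 (for example 0x1002) are accepted because narrow fetches only need 2-byte alignment.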

Padding with C.NOP

When a label needs to land at a slot boundary but the natural code flow would place it at slot+2 (the second RVC of a slot), the assembler inserts a c.nop (encoded as c.addi x0, 0 = 0x0001) in the second half of that slot, pushing the label to the next slot boundary. The c.nop executes as a single-cycle no-op and costs 2 bytes per padding event.

slot N:    [some_rvc_1]  [some_rvc_2]
slot N+1:  [some_rvc_3]  [label_target]   ← natural placement
                                            (label at slot+2: not slot-aligned)

After padding:
slot N:    [some_rvc_1]  [some_rvc_2]
slot N+1:  [some_rvc_3]  [c.nop]          ← second half of slot N+1 padded
slot N+2:  [label_target] [next_rvc]      ← label now slot-aligned

A simpler case, with a single RVC ahead of the label:

slot N:    [some_rvc]    [c.nop]          ← second half padded
slot N+1:  [label_target] [next_rvc]      ← label slot-aligned

The assembler chooses the form that minimises total padding bytes. Function entry points are slot-aligned by convention everywhere, so they never need padding; the cost falls only on mid-function labels that happen to land at slot+2 in natural code flow.
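A minimal layout pass illustrating the padding rule might look like the following. It is a sketch, not the assembler's algorithm: the input representation (a list of instruction sizes with a label flag) is hypothetical, and it assumes that 32-bit instructions, which occupy an entire slot, need the same slot alignment as label targets.

```python
def lay_out(items):
    """Assign byte offsets, inserting a c.nop wherever alignment requires it.

    items: list of (size_bytes, is_label_target), size 2 (RVC) or 4 (32-bit).
    Returns a list of (kind, offset) with kind in {"insn", "label", "c.nop"}.
    """
    placed = []
    offset = 0
    for size, is_label in items:
        if size not in (2, 4):
            raise ValueError("instruction size must be 2 or 4 bytes")
        needs_slot_boundary = is_label or size == 4
        if needs_slot_boundary and offset % 4 == 2:
            placed.append(("c.nop", offset))   # fill the second half of the slot
            offset += 2
        placed.append(("label" if is_label else "insn", offset))
        offset += size
    return placed


def padding_bytes(placed):
    """Total bytes spent on c.nop padding."""
    return 2 * sum(1 for kind, _ in placed if kind == "c.nop")
```

Feeding it three RVCs followed by a labelled RVC places the label at offset 8 with one c.nop at offset 6, which is the padded layout shown in the diagrams.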

Padding Overhead

Realistic measurements on typical wide-mode code:

Code type	Padding overhead
Audio / DSP inner loops (few branches)	~0%
Sequential numeric code	~0.5%
OS / dispatch / control-flow-heavy code	~1–3%
Densely-branched interpreter or compiler code	~2–4%

The trade-off — slightly larger code in branchy regions for doubled branch range — is overwhelmingly favourable for FireStorm's targets, where branch ranges matter more than code size at the margin.

Cross-Mode Calls

A call from narrow-mode code (DDR3) to wide-mode code (SRAM) targets a slot-aligned address in SRAM, typically a function entry, which is slot-aligned by convention. A ret from wide back to narrow jumps to the address held in the return-address register, a byte address that may be either 2-byte aligned (narrow caller) or 4-byte aligned (wide caller). Hardware handles both cases transparently because narrow-mode fetches do not impose slot alignment.

8.7 Dual-Issue Execution

FireStorm permits dual-issue execution of two consecutive instructions when they meet independence criteria. The slot model makes the common dual-issue case natural: two RVCs sharing a slot are already fetched together, decoded together, and can frequently execute together.

RVC-Pair Dual-Issue

A slot containing two 16-bit RVC instructions presents both to the decoder simultaneously. If they are independent (criteria below), they execute in parallel in a single cycle. If they are dependent, they execute sequentially over two cycles, with the second using forwarded results from the first.

Independence criteria for RVC-pair dual-issue:

No RAW hazard: rd of the first RVC is not rs1 or rs2 of the second RVC.
Neither is a memory operation (no C.LW / C.LD / C.SW / C.SD / C.LWSP / C.LDSP / C.SWSP / C.SDSP).
Neither is a branch, jump, or system instruction (no C.BEQZ / C.BNEZ / C.J / C.JAL / C.JR / C.JALR / C.EBREAK).
Neither is an FP operation (in v0.1).
Neither is multiply or divide (RVC does not include these; reserved for future safety).

WAW (both writing the same register) and WAR (first reading a register the second writes) are architecturally safe: hardware ensures program-order semantics (second's write wins; first's read sees the pre-write value).
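The independence criteria reduce to a small predicate. The sketch below is illustrative: the Insn record and its kind strings are hypothetical, and every instruction class excluded by the list above (memory, control/system, FP, multiply/divide) is represented simply as "not alu".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Insn:
    kind: str                      # "alu", "mem", "ctrl", "fp" or "muldiv"
    rd: Optional[int] = None       # destination register index, if any
    srcs: Tuple[int, ...] = ()     # source register indices (rs1, rs2)


def can_dual_issue(first: Insn, second: Insn) -> bool:
    """Apply the v0.1 pairing rules to two consecutive instructions."""
    if first.kind != "alu" or second.kind != "alu":
        return False               # memory, control, FP, mul/div never pair
    if first.rd is not None and first.rd in second.srcs:
        return False               # RAW hazard: second reads first's result
    return True                    # WAW and WAR are resolved by hardware
```

For instance, an add writing x44 pairs with an independent add writing x42, but not with an instruction that reads x44.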

RVC-pair dual-issue is architecturally permitted on all FireStorm variants. All models support it; the implementation cost is modest (~700 LUTs for the second integer ALU and register-file ports, since the BSRAM register file already supports 4R/2W in modern FPGA fabrics).

32-bit-Pair Dual-Issue

Two consecutive 32-bit instructions in adjacent slots can also dual-issue if they meet the same independence criteria. Since they're already in the prefetch buffer when the first executes, no additional fetch latency is involved — the question is purely whether decode and execute pipelines can handle both in one cycle.

Independence criteria for 32-bit-pair dual-issue:

Identical to the RVC-pair criteria above, applied to the 32-bit instructions in the two adjacent slots:

No RAW hazard between first and second.
Neither is a memory operation.
Neither is a branch, jump, or system instruction.
Neither is an FP operation (v0.1).
Neither is multiply or divide (v0.1).

Additional constraint: the buffer port must be wide enough to deliver both slots in the same cycle. A 256×72-bit BSRAM port delivers two 36-bit slots per cycle; a 512×36-bit port delivers one. The choice of buffer port width is per-implementation.

32-bit-pair dual-issue is architecturally permitted; whether it is actually performed is implementation-defined. Implementations may execute paired 32-bit instructions in parallel, or strictly sequentially. Software sees only performance variation, never functional difference.

Implementation Choices for v0.1

Variant	RVC-pair dual-issue	32-bit-pair dual-issue	Buffer port width
All models	Yes (always)	Yes (when independence permits)	72-bit

Both RVC-pair and 32-bit-pair dual-issue are implemented on the 72-bit buffer port; there is no per-model dual-issue difference to design around.

Trap Semantics

Dual-issue does not require any new architectural state. Each instruction in a pair has its own distinct PC, and the standard RV64 trap mechanism (mepc, mcause, mtval) handles dual-issue cleanly:

If the first instruction in a pair traps, mepc points to the first instruction's address. The second instruction was squashed (did not execute). After mret, execution resumes from mepc — the first instruction re-executes.
If the second instruction in a pair traps, mepc points to the second instruction's address. The first instruction's writeback has committed; its results are visible in the register file. After mret, execution resumes from mepc — the second instruction re-executes.

No "trap-half" or "trap-in-pair" CSR is needed; standard mepc identifies the trapping instruction precisely.

Compiler Cooperation

A wide-mode-aware compiler can maximise dual-issue rate by:

Packing independent RVCs into the same slot. When two consecutive RVCs are independent and would both fit, place them in one slot rather than separated by a c.nop.
Scheduling for 32-bit-pair issue. Position independent 32-bit ALU operations in adjacent slots where possible.
Avoiding false dependencies. Use the wide register file (x32–x63) as scratch to break artificial RAW chains.

These optimisations are entirely transparent to source code; they affect only emitted instruction order and slot packing. A compiler unaware of dual-issue still produces correct code that will dual-issue opportunistically when independence happens to occur — the optimised version simply produces more independence-creating arrangements.

Expected Performance Impact

Estimates from microbenchmark modelling:

Workload	Expected speedup vs single-issue
Audio synth voice loop (RVC-heavy)	~15–20%
DSP kernel (32-bit MAC chain)	~20–25%
Interpreter dispatch (branch-heavy)	~5–10%
Bulk integer code	~10–15%
Code-size-optimised builds (RVC-dominant)	~12–18% (RVC-pair only is fine)

The wins compose with the ISA-level wins from §11 (CRISP), Xcond (§3 of ee_xcond), Xstack, and the rest. A wide-mode FireStorm at full implementation can be 50%+ faster than a vanilla RV64GC core at the same clock on representative workloads — half from the ISA, half from microarchitecture.

9. Floating-Point

The floating-point register file (f0–f63) follows the GPR scheme exactly. All RV-F and RV-D instructions in wide mode extend their FPR operands via the same extension-nibble bit assignments as integer instructions (§7.1). The R4-type FMADD family consumes all four nibble bits in wide mode, providing full f0–f63 access to all four operands.

Mixed FPR/GPR instructions (FCVT, FMV.X.W, FMV.W.X, FCLASS, FLT, FLE, FEQ) extend FPR fields and GPR fields uniformly using the same nibble bits; the format remains the same as the underlying RV64GC encoding.

Wide-mode RVC FPR instructions (C.FLW, C.FSW, C.FLD, C.FSD where applicable) follow the 3-bit-field bank-select rule from §8.2: bit=0 maps to f8–f15, bit=1 maps to f40–f47.

The FPU rounding mode CSR (frm) and flags (fflags, fcsr) are unchanged and behave identically in narrow and wide modes.

10. Xmath Extension

The Xmath extension accelerates games, audio synthesis, fixed-point DSP, and retro-style demoscene code through fused multiply-add, saturating arithmetic, min/max/sign/abs, fast transcendental approximations, BAM-based trigonometry, and 3D vector math bundles. All Xmath instructions are available in both narrow and wide modes (wide mode adds access to the extended register file but no additional operations).

Full specification: see FireStorm Xmath Extension.

10.1 Summary

Xmath comprises eleven groups of instructions:

Group	Instructions	Latency	Throughput
G1 — Integer Fused MAC	MADD, MSUB, RMSUB, MADDH, MADDU, MADDW	2–3 cycles	1/cycle
G2 — Saturating Arithmetic	ADDS, SUBS, ADDSU, SUBSU, MULSAT, SAT.{B,H,W}, SHIFTSAT.{B,H,W}	1 cycle	1/cycle
G3 — Min/Max/Sign/Abs	MIN, MAX, MINU, MAXU, ABS, SIGN	1 cycle	1/cycle
G4 — FP Approximations	FRECIP.{S,D}, FRSQRT.{S,D}, FSIN.{S,D}, FCOS.{S,D}, FSINCOS.{S,D}, FATAN2.{S,D}	3–4 cycles	1/cycle
G5 — BAM Trigonometry	FSINBAM.{S,D}, FCOSBAM.{S,D}, FSINCOSBAM.{S,D}, FRAD2BAM, FBAM2RAD	2–3 cycles	1/cycle
G6 — 3D Vector Math Bundles	DOT3, DOT4, CROSS3, LENSQ3, LERP	3–10 cycles	depends on FMA unit
G7 — Vector Componentwise Bundles	VADD3, VSUB3, VSCALE3, VMADD3, VNORM3	3–8 cycles	1/cycle on Ant64
G8 — 2D Math Primitives	DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2	2–6 cycles	1/cycle
G9 — Game / Animation Math	CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP	1–5 cycles	1/cycle
G10 — Distance Heuristics	MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2	1–3 cycles	1/cycle
G11 — Quaternion Math	QMUL.{S,D}, QROT.{S,D}	8–10 cycles	1/8 cycles on Ant64

10.2 Encoding Allocation

Xmath uses two opcodes:

Opcode	Standard meaning	Type	Purpose
`0x57`	OP-V (vector, not implemented)	R-type	G2–G11 (all R-type Xmath instructions)
`0x6B`	Reserved (no standard claim)	R4-type	G1 — Integer fused multiply-add

This means FireStorm v0.1 does not implement the standard RISC-V V extension, and the OP-V opcode (0x57) has been allocated to Xmath instead. For FireStorm's target workloads — games, audio synthesis, retro emulation, DSP — Xmath's fast scalar fused operations capture most of the practical performance benefit V would provide, without V's substantial implementation and verification complexity. V remains a possible v0.3+ addition if a clear workload need emerges, but it would need a different opcode allocation at that point.

The integer fused multiply-add at 0x6B (R4-type) is the parallel-format counterpart to the FP FMA family (0x43/0x47/0x4B/0x4F, all R4-type) — same encoding shape, separate opcode. 0x6B is a genuinely reserved slot in the standard opcode map (inst[6:5]=11, inst[4:2]=010) with no standard claim, so it does not displace any RV64GC feature. In particular, the standard scalar floating-point opcode (0x53 OP-FP, used by F/D for FADD/FSUB/FMUL/FDIV/FSQRT/FMV/FCVT/FSGNJ/FCLASS) is completely untouched.

The standard FP opcodes 0x07 (LOAD-FP) and 0x27 (STORE-FP) are unchanged. The existing FP FMA opcodes 0x43 / 0x47 / 0x4B / 0x4F are unchanged.
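The opcode claims in this section can be verified by splitting each 7-bit opcode into the fields used by the RISC-V base opcode map. The helper below is a hypothetical convenience for doing so.

```python
def opcode_fields(opcode: int) -> tuple:
    """Split a 7-bit major opcode into (inst[6:5], inst[4:2], inst[1:0])."""
    if not 0 <= opcode <= 0x7F:
        raise ValueError("major opcode must fit in 7 bits")
    return (opcode >> 5) & 0b11, (opcode >> 2) & 0b111, opcode & 0b11


if __name__ == "__main__":
    for op in (0x57, 0x6B, 0x53, 0x43, 0x47, 0x4B, 0x4F, 0x07, 0x27):
        hi, mid, lo = opcode_fields(op)
        print(f"{op:#04x}: inst[6:5]={hi:02b} inst[4:2]={mid:03b} inst[1:0]={lo:02b}")
```

0x6B decodes to inst[6:5]=11, inst[4:2]=010 as stated, and the four FP FMA opcodes share inst[6:5]=10 with inst[4:2] running from 000 to 011.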

10.3 Mode Availability

Group	Narrow	Wide	Notes
G1 – G5	✓	✓	All instructions; wide mode adds access to x32–x63 / f32–f63
G6	✓	✓	Wide mode allows tuples to start at any register; narrow mode constrains starting register so the tuple fits in f0–f31

Every Xmath instruction works in both modes. Wide mode adds no operations: its benefit is the larger register file, plus the two encoding-level features listed below.

Wide-mode-only features within Xmath:

Xcond predication on all R-type Xmath instructions via PRED-EN bit (bit 35) — see §14 of Xmath spec.
Precision-mode bit on G4 FP approximations (bit 34) — selects between approximate (3 cycles, ~0.05% error) and refined (6 cycles, ~10⁻⁹ error) using the .R assembly suffix. See §6.7 of Xmath spec.

10.4 Detection

Xmath presence is signalled in mxfeatures (0xFC0), bit 6. See FireStorm Xmath Extension §13.

10.5 Implementation Cost

Total area: ~8,000–10,000 LUTs + 4 BSRAM blocks (for FP approximation tables) + ~10–15 dedicated DSP blocks beyond the M and F/D extension hardware. Fits comfortably on the GW5AST-138. See FireStorm Xmath Extension §16 for the implementation breakdown.

11. CRISP Custom Extensions (Xcrisp)

CRISP-influenced instructions are encoded in the standard RISC-V custom-0/1/2/3 opcode slots (0x0B, 0x2B, 0x5B, 0x7B). They are available in both narrow and wide mode: vanilla 32-bit RV64 code in DDR3 may use them just as wide code in SRAM may. In wide mode, their register fields extend via the §7.1 nibble scheme. Detection is via the Xcrisp target feature flag in the compiler; runtime detection (if needed) is via a FireStorm implementation-defined CSR.

11.1 Design Rationale

CRISP-style ISAs (AT&T Hobbit) achieved C-language performance gains principally through:

Memory-to-memory ALU ops — mem[a] = mem[b] + mem[c] style, eliminating spill-reload round-trips.
Top-of-stack register caching — frame-local variables transparently held in registers.
Stacked operand addressing — frame-pointer-relative addressing without explicit base setup.

RISC-V's 32-bit instruction format cannot fit three (base+displacement) memory operands in one instruction (~50 bits required), so true three-operand memory-to-memory operations are out of reach. But the format can fit two memory operands when both are register-indirect (zero offset), which is the closest a 32-bit slot can get to CRISP/Hobbit memory-to-memory and is exposed via the load-op-store family below.

The Xcrisp extension's family set:

Load-op fusion — combine a load with an ALU op (one memory operand + two register operands).
Op-store fusion — combine an ALU op with a store (one memory operand + two register operands, no register destination).
Load-op-store fusion — two memory operands plus one register operand, no register destination. The "true memory-to-memory" variant.
Auto-increment load/store: collapse *p++ and *--p patterns into single instructions.
Compare-mem-branch — branch on a memory comparison.
Block memory ops — accelerated memcpy/memset primitive.

Top-of-stack caching, where useful, is implemented microarchitecturally (forwarding from store buffer into register-file reads) and requires no ISA-visible instruction.

11.2 Instruction Categories

Bit-level encodings for every Xcrisp instruction are in the companion document FireStorm Xcrisp Extension. The summary below captures the design intent of each family.

11.2.1 Auto-increment Load/Store

LWPI  rd, (rs1)+imm    ; rd = mem[rs1]; rs1 += imm
LDPI  rd, (rs1)+imm    ; 64-bit load with post-increment
SWPI  rs2, (rs1)+imm   ; mem[rs1] = rs2; rs1 += imm
SDPI  rs2, (rs1)+imm   ; 64-bit store with post-increment
LWPD  rd, -imm(rs1)    ; rs1 -= imm; rd = mem[rs1]   (pre-decrement)
... and the SD/SW/LD/LW pre-decrement family

Collapses the common *p++ / *--p pattern and removes the explicit addi rs1, rs1, imm that follows every standard load/store in inner loops. Full encoding: §3 (loads) and §4 (stores) of the Xcrisp doc.
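The semantics in the listing can be expressed as a small executable model. This is a sketch for illustration: registers are a dict keyed by name, memory is a dict keyed by byte address, access width is not modelled, and the behaviour when rd and rs1 name the same register is left undefined here because this summary does not specify it.

```python
def lwpi(regs, mem, rd, rs1, imm):
    """LWPI rd, (rs1)+imm : load first, then post-increment the base."""
    regs[rd] = mem[regs[rs1]]
    regs[rs1] += imm


def lwpd(regs, mem, rd, rs1, imm):
    """LWPD rd, -imm(rs1) : pre-decrement the base, then load."""
    regs[rs1] -= imm
    regs[rd] = mem[regs[rs1]]


def sum_words(mem, base, count):
    """Model of `while (count--) sum += *p++;` with one LWPI per element."""
    regs = {"a0": 0, "a1": base, "t0": 0}
    for _ in range(count):
        lwpi(regs, mem, "t0", "a1", 4)   # replaces lw + addi
        regs["a0"] += regs["t0"]
    return regs["a0"], regs["a1"]
```

Summing three words at 0x100 returns the total together with a final pointer of 0x10C, with no separate pointer-update instruction in the loop body.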

11.2.2 Load-Op Fusion

LWADD  rd, (rs1), rs2  ; rd = mem[rs1] + rs2
LWAND  rd, (rs1), rs2  ; rd = mem[rs1] & rs2
LWOR   rd, (rs1), rs2  ; rd = mem[rs1] | rs2
LWXOR  rd, (rs1), rs2  ; rd = mem[rs1] ^ rs2

Two source operands: one from memory, one from a register. Targets accumulator-style C patterns like sum += array[i]. Full table including dword, unsigned-word, and shift/compare variants: Xcrisp §5.2.

11.2.3 Op-Store Fusion

ADDSW  [rs1], rs2, rs3  ; mem[rs1] = rs2 + rs3

Eliminates an intermediate temporary. Useful in store-after-compute patterns where the result is not needed in a register afterwards. The R-type rd field is repurposed as rs3 — see Xcrisp §5.3 for the full encoding and the rationale.

11.2.4 Load-Op-Store Fusion

MMWADD  [rd], [rs1], rs2   ; mem32[rd] = mem32[rs1] + rs2
MMDOR   [rd], [rd], rs2    ; mem64[rd] |= rs2 (in-place when rd == rs1)

The most aggressive fusion in the extension: two memory operands plus one register operand in a single instruction. Replaces lw t0, (rs1); add t0, t0, rs2; sw t0, (rd) (three instructions, one architectural temporary) with one instruction and no temporary. The R-type rd field names the destination memory base (read but not written). Performance and density both compound: one fetch, one decode, no register-file traffic for the intermediate value. Full table and pipeline guidance: Xcrisp §5.4.
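To make the equivalence concrete, the sketch below models MMWADD next to the three-instruction sequence it replaces, along with LWADD and ADDSW for comparison. The function names mirror the mnemonics but the model itself is hypothetical; truncating stored values to 32 bits is an assumption, and load sign extension is not modelled.

```python
MASK32 = 0xFFFF_FFFF   # assumed: 32-bit stores keep the low word only


def lwadd(regs, mem, rd, rs1, rs2):
    """LWADD rd, (rs1), rs2 : rd = mem[rs1] + rs2."""
    regs[rd] = mem[regs[rs1]] + regs[rs2]


def addsw(regs, mem, rs1, rs2, rs3):
    """ADDSW [rs1], rs2, rs3 : mem[rs1] = rs2 + rs3, no register written."""
    mem[regs[rs1]] = (regs[rs2] + regs[rs3]) & MASK32


def mmwadd(regs, mem, rd, rs1, rs2):
    """MMWADD [rd], [rs1], rs2 : mem32[rd] = mem32[rs1] + rs2."""
    mem[regs[rd]] = (mem[regs[rs1]] + regs[rs2]) & MASK32


def unfused(regs, mem, rd, rs1, rs2):
    """The replaced sequence, using t0 as the architectural temporary."""
    regs["t0"] = mem[regs[rs1]]                 # lw  t0, (rs1)
    regs["t0"] = regs["t0"] + regs[rs2]         # add t0, t0, rs2
    mem[regs[rd]] = regs["t0"] & MASK32         # sw  t0, (rd)
```

Both paths leave the same value in memory; the fused form differs only in that no register, t0 included, is modified.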

11.2.5 Compare-Mem-Branch

BEQM  rs1, (rs2), offset   ; branch if rs1 == mem[rs2]
BNEM  rs1, (rs2), offset   ; branch if rs1 != mem[rs2]

Targets sentinel scans (while (*p != 0)) and lookup-table walks without a separate load. Full table including ordered comparisons and dword variants: Xcrisp §6.

11.2.6 Block Memory Primitives

BMCPY  rd, rs1, rs2     ; copy rs2 bytes from rs1 to rd, advancing pointers
BMSET  rd, rs1, rs2     ; set rs2 bytes at rd to byte value in rs1

Restartable on interrupt (state held in the named registers). Allows libc memcpy/memset to be a small intrinsic rather than an inlined loop, and the FPGA implementation may issue wide DMA-style accesses internally. Restart semantics and encoding: Xcrisp §5.5.
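Restartability follows from keeping all progress in the three named registers. The sketch below models BMCPY as a sequence of bounded steps, where an interrupt may be taken between any two steps. It is an assumption-laden illustration: the step granularity (max_bytes) is invented for the model, and it assumes rs2 counts down as bytes are copied; the normative restart semantics are in Xcrisp §5.5.

```python
def bmcpy_step(regs, mem, rd, rs1, rs2, max_bytes):
    """Copy up to max_bytes, then update rd, rs1 and rs2 to record progress.

    Returns True once the remaining count in rs2 reaches zero.
    """
    n = min(regs[rs2], max_bytes)
    for i in range(n):
        mem[regs[rd] + i] = mem[regs[rs1] + i]
    regs[rd] += n        # destination pointer advances
    regs[rs1] += n       # source pointer advances
    regs[rs2] -= n       # assumed: remaining byte count
    return regs[rs2] == 0


def bmcpy(regs, mem, rd, rs1, rs2, max_bytes=3):
    """Re-issue the instruction until done, as a trap return would."""
    while not bmcpy_step(regs, mem, rd, rs1, rs2, max_bytes):
        pass             # an interrupt could be serviced here
```

Because each step leaves the registers describing exactly the work that remains, re-executing the same instruction after an interrupt resumes the copy rather than restarting it.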

11.3 Encoding Discipline

All Xcrisp instructions are 32 bits wide. In wide mode they participate in the standard extension-nibble scheme: any register field they declare is extended by the corresponding nibble bit per §7.1, following the format (R / I / S / B) they most closely resemble.

Xcrisp does not introduce compressed (16-bit) forms. The opcode density of RVC already covers the common case; a 16-bit fused load-op would not have enough operand space to be useful.

12. Mode Transitions

12.1 Cross-Memory Calls

A call from narrow code to wide code:

DDR3 region                      SRAM region
   ...                              wide_function:
   call wide_function   ───────►       addi x40, x10, 1
   ...                                 ret
   ; (continues in narrow mode)

The call (JAL or JALR) writes the return address in DDR3 to ra. The target instruction is fetched from SRAM and decoded as wide. The wide function may freely use x32–x63. On ret, control returns to DDR3 and the next instruction is decoded as narrow. Since x32–x63 are caller-saved and invisible to the calling convention, no save/restore is needed; the caller already assumed they may be clobbered.

A call from wide code to narrow code is symmetric. The narrow callee physically cannot reference x32–x63, so the caller's contents are preserved by inaction.

12.2 Traps

A trap from wide code transfers to the trap vector (in narrow memory by convention) and decodes as narrow. The trap handler must:

Save x1–x31 and f0–f31 (as in a standard RV64 trap).
Read the wide-dirty bit (§6.3).
If set, save x32–x63 and f32–f63, then clear the bit.
Service the trap.
Reverse on return.

The handler itself need not be aware of which mode the trapping code was in; the dirty bit is the only signal.
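The save-set decision in the steps above depends on a single bit. A minimal sketch, with a hypothetical function name and the registers represented as strings:

```python
def registers_to_save(wide_dirty: bool) -> list:
    """Return the registers a trap handler must save for the interrupted code."""
    saved = [f"x{i}" for i in range(1, 32)]        # x0 is hard-wired zero
    saved += [f"f{i}" for i in range(0, 32)]
    if wide_dirty:                                 # wide-dirty bit (§6.3) set
        saved += [f"x{i}" for i in range(32, 64)]
        saved += [f"f{i}" for i in range(32, 64)]
    return saved
```

With the bit clear the handler saves 63 registers; with it set, 127. Narrow-only workloads therefore never pay for the extended file.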

12.3 Self-Modifying Code

Copying instructions between SRAM and DDR3 is both a mode-changing operation and a Harvard-boundary crossing (§5.1). An instruction word valid in one memory may not be valid in the other:

An SRAM instruction that uses x40 has extension bits set. Copied to DDR3, those bits no longer exist, and the instruction would address x8 instead.
A DDR3 instruction copied to SRAM has zero extension bits, which decodes correctly (extension bit = 0 ⇒ standard register).

The asymmetry means SRAM-to-DDR3 copies require re-assembly with narrow-mode register pressure constraints. This is principally a concern for JIT compilers; ahead-of-time toolchains place each function in its target memory at link time.

Harvard restriction on data access to SRAM. Per §5.1, only M-mode code may write to the 36-bit SRAM range (data loads always trap). JIT compilers therefore run in M-mode, or delegate the code-deposit step to an M-mode service via an SBI call.

Prefetch buffer and D-cache coherence after SMC. Code that has been overwritten — whether copied between memories, JIT-generated, or modified in place — must invalidate any prefetch buffer that may hold stale instructions, and any D-cache line covering the modified DDR3 region. The mechanisms:

Writes via DMACPY / DMASET auto-invalidate any overlapping prefetch buffer (§4.7) and any matching D-cache line (§5.2).
Writes via standard stores: M-mode stores to SRAM auto-invalidate the matching prefetch buffer; stores to DDR3 update the D-cache (write-through). After SMC, fence.i flushes all non-pinned prefetch buffers; mxdcache_flush_addr invalidates a specific D-cache line if needed.
Pinned prefetch buffers are not auto-invalidated; M-mode code intending to modify a pinned region must unpin first.

13. ABI Specification

The FireStorm ABI is the standard lp64d RISC-V calling convention, unchanged. Argument passing, return values, stack alignment, and the register-saving categories for x0–x31 and f0–f31 are identical to vanilla RV64GC.

13.1 Extensions to the ABI

x32–x63 and f32–f63 are caller-saved scratch. They are not used for argument passing, return values, variadic argument expansion, or any other cross-call purpose.
No callee preserves them. A function may freely write any of them at any time. A caller wishing a value to survive a call must place it in a standard saved register (x8–x9, x18–x27, f8–f9, f18–f27).
No callee restores them on return. A function returning has no obligation regarding x32–x63.
Trap handlers must preserve them when the wide-dirty bit indicates they are in use. (See §12.2.)

13.2 Linkage

A wide-aware function and a vanilla-RV64 function are link-time interchangeable. The ABI does not change. The only consequence of compiling a translation unit with -march=...+xwide is that the compiler may emit instructions naming x32–x63 within that function's body.

A function compiled -march=...+xcrisp may emit Xcrisp instructions in its body but otherwise follows the ABI identically.

13.3 Linker Sections and Mode Identification

The toolchain identifies each text section's mode through two parallel mechanisms, both required to be consistent:

Section name convention (human-readable). The conventional names are listed in the table below. Linker scripts and human readers use these.
ELF section flag (machine-readable, authoritative). A FireStorm-specific flag bit in the ELF sh_flags field, named SHF_FIRESTORM_WIDE (bit value TBD; uses the SHF_MASKPROC range reserved for processor-specific flags). The linker and downstream tools (loaders, debuggers, profilers) treat this flag as authoritative.

When a section's name and flag disagree, the linker raises an error. The dual mechanism gives humans a readable cue and tools an unambiguous bit.

Section	Name	`SHF_FIRESTORM_WIDE`	Linker placement	Notes
Standard text	`.text`	0	DDR3 region (`0x0000_0000`–`0x7FFF_FFFF`)	Default
Wide-mode text	`.text.wide`	1	36-bit SRAM (`0x8000_0000`–`0xBFFF_FFFF`)	Compiled with `+xwide`
Xcrisp-aware narrow text	`.text.crisp`	0	DDR3 or SRAM, default DDR3	May contain Xcrisp; narrow encoding
Wide-mode named subsections	`.text.wide.<name>`	1	SRAM	Same as `.text.wide`, grouped by name

The default linker script for Ant64 places .text.wide* in SRAM and .text, .text.crisp in DDR3. A __attribute__((section(".text.wide"))) directive on a C function moves it into wide memory; the compiler then permits wide register allocation and wide-mode-only instructions within that function. Section attributes set the flag automatically; the compiler does not require source-level flag manipulation.

Linker Mode-Placement Check

After final layout, the linker verifies that every section's mode flag matches its physical placement:

A SHF_FIRESTORM_WIDE section placed below 0x8000_0000 (DDR3 range) is a link error: the CPU would decode the code in narrow mode, misinterpreting the extension-nibble bits.
A non-wide section placed at or above 0x8000_0000 (SRAM range) is also a link error: the CPU would decode the code in wide mode, but the section was assembled without extension nibbles. The high 4 bits of each slot would be whatever the SRAM happens to contain.

The error message identifies the offending section and indicates the expected placement range. The linker does not attempt to silently fix placement, since either the section attribute is wrong (programmer intent error) or the linker script is wrong (build-system error).
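The placement check amounts to comparing each section's flag against the address window it landed in. The sketch below uses the address ranges from the table in this section; the function names and message text are hypothetical, and treating a section that straddles the boundary as an error is an assumption.

```python
DDR3_LAST = 0x7FFF_FFFF
SRAM_FIRST, SRAM_LAST = 0x8000_0000, 0xBFFF_FFFF


def name_implies_wide(name: str) -> bool:
    """Section-name convention: .text.wide and .text.wide.<name> are wide."""
    return name == ".text.wide" or name.startswith(".text.wide.")


def check_placement(name: str, wide_flag: bool, start: int, size: int):
    """Return None if the section is placed consistently, else an error string."""
    if name_implies_wide(name) != wide_flag:
        return f"{name}: section name and SHF_FIRESTORM_WIDE flag disagree"
    end = start + size - 1
    if wide_flag:
        ok = SRAM_FIRST <= start and end <= SRAM_LAST
        expected = "0x8000_0000-0xBFFF_FFFF (SRAM)"
    else:
        ok = 0 <= start and end <= DDR3_LAST
        expected = "0x0000_0000-0x7FFF_FFFF (DDR3)"
    if ok:
        return None
    return f"{name}: placed at {start:#x}-{end:#x}, expected {expected}"
```

A wide section linked at 0x1000 and a narrow section linked at 0x8000_0000 both produce an error string naming the section and its expected range.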

Cross-Section Symbol References

Symbol references between narrow and wide sections are normal — the CPU handles mode transition transparently on the next fetch. The linker does not insert thunks or veneers; a call wide_function from a narrow section is simply a branch to an SRAM address, which causes the CPU to enter wide mode for the next fetch.

The only cross-section concern is the immediate-range mismatch: a wide-mode function emitting a JAL with 23-bit immediate cannot directly reach a narrow section more than ±16 MiB away in physical address space (and vice versa across the DDR3/SRAM boundary). Long-range calls go through auipc + jalr (narrow) or JALPC / CALLM (wide) trampolines as needed; the linker relaxes accordingly.
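Whether a direct JAL reaches its target is a range-and-alignment test. The sketch below uses the standard RV64 JAL encoding (20-bit field, ×2 scaling) for narrow mode and the 23-bit, ×4-scaled form described above for wide mode; jal_reaches is a hypothetical helper, and the real relaxation logic in the linker is not specified here.

```python
def jal_reaches(pc: int, target: int, wide: bool) -> bool:
    """True if a single JAL at pc can encode a jump to target."""
    bits, scale = (23, 4) if wide else (20, 2)
    lo = -(1 << (bits - 1)) * scale
    hi = ((1 << (bits - 1)) - 1) * scale
    offset = target - pc
    return offset % scale == 0 and lo <= offset <= hi


def needs_trampoline(pc: int, target: int, wide: bool) -> bool:
    """Out-of-range calls fall back to a long-range sequence."""
    return not jal_reaches(pc, target, wide)
```

A wide-mode caller at 0x8000_0000 can reach 16 MiB backwards into DDR3 (0x7F00_0000) directly, but anything further needs the long-range sequence.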

14. Toolchain Integration

14.1 Target Features

The compiler accepts the following target feature flags:

Flag	Meaning
`+xwide`	Permit emission of instructions naming x32–x63 / f32–f63. Implies the function is placed in `.text.wide`.
`+xcrisp`	Permit emission of CRISP custom instructions.
`+xfirestorm`	Shorthand for `+xwide,+xcrisp`.

A function may be annotated __attribute__((target("xwide"))) or __attribute__((target("xcrisp"))) to enable features per-function without changing the global build flags. Functions annotated with xwide are automatically placed in .text.wide unless an explicit section attribute overrides.

14.2 Assembler Syntax

The assembler accepts register names x32–x63 and f32–f63 in all instruction contexts. In a non-wide section, naming x32–x63 is a hard error. In a wide section, the assembler automatically emits the appropriate extension nibble for each instruction.

The CRISP mnemonics (LWPI, LWADD, BEQM, BMCPY, etc.) are accepted in any section if +xcrisp is enabled.

Disassembly of a 36-bit slot prints the standard RV64GC mnemonic with the resolved register names — c.add x44, x40 rather than the more honest but verbose c.add.wide x44, x40. The wide-mode property is implied by the section the address lives in, and is shown in the section header rather than per-instruction.

Mode Directives

The assembler determines the current mode from two sources, in priority order:

Mode override directive (if any active) — .fsmode, .option arch.
Section attribute — the section's name and SHF_FIRESTORM_WIDE flag (§13.3).

Mode directives change the encoding rules from the current point forward, without affecting section placement.

Directive	Effect
`.fsmode wide`	Subsequent instructions assemble in wide mode (extended immediates, x32–x63 valid, wide-mode-only opcodes permitted).
`.fsmode narrow`	Subsequent instructions assemble in narrow mode (standard RV64 ranges only).
`.fsmode push wide` / `.fsmode push narrow`	Push current mode onto an internal stack and switch.
`.fsmode pop`	Restore the previously pushed mode.
`.option arch +xwide`	RISC-V-canonical equivalent of `.fsmode wide`. Recognised for toolchain compatibility.
`.option arch -xwide`	Equivalent of `.fsmode narrow`.

Compilers emit .option arch +xwide (matching the RISC-V convention used by other extensions). Human-written assembly may prefer the shorter .fsmode wide. Both are accepted and interchangeable.

A mode directive that contradicts the enclosing section attribute is a warning, not an error, since some code-generation patterns legitimately want temporary mode overrides (e.g., a wide section containing a small block of narrow-mode initialisation code intended to be moved at link time). The output object file's section flag is determined by the section attribute, not by override directives within the section.

Example:

.section .text.wide, "ax", @progbits          # SHF_FIRESTORM_WIDE flag set
                                              # Mode is implicitly 'wide' from here
hot_function:
    PUSH    rlist=01000, spimm=4              # Xstack push
    LIZ     t0, 0x8000, 1                     # wide-mode-only — valid here
    LUI     t1, 0x123450                      # imm23 — valid in wide mode
    ADDI    t1, t1, 4000                      # imm14 — valid in wide mode
    ...
    POPRET  rlist=01000, spimm=4

    .fsmode push narrow
    .word   0x00010113                        # raw narrow-mode encoding for some reason
    .fsmode pop                               # back to wide

    .section .text, "ax", @progbits           # SHF_FIRESTORM_WIDE flag clear
                                              # Mode is implicitly 'narrow'
narrow_helper:
    addi    a0, a0, 1
    ret

The .fsmode directive does not affect the section flag, which remains controlled by .section. If a section's content disagrees with its flag at the end of assembly (e.g., wide-mode-only opcodes used inside a non-wide section without an override), the assembler issues a hard error.
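The priority rule and the push/pop behaviour can be sketched as a small state machine. This is a minimal illustrative model, not the assembler's implementation: the class and method names are hypothetical, and it assumes that `.fsmode pop` restores whatever override state (including "no override") was active at the matching push.

```python
class ModeTracker:
    """Illustrative model of assembler mode tracking within one section."""

    def __init__(self, section_is_wide):
        # The section attribute supplies the default mode and the ELF flag.
        self.section_is_wide = section_is_wide
        self.override = None   # None means no override directive is active
        self.stack = []        # override states saved by ".fsmode push"
        self.warnings = []

    def current(self):
        # An active override takes priority over the section attribute.
        if self.override is not None:
            return self.override
        return "wide" if self.section_is_wide else "narrow"

    def fsmode(self, *args):
        if args == ("pop",):
            self.override = self.stack.pop()
            return
        if args[0] == "push":
            self.stack.append(self.override)
            mode = args[1]
        else:
            mode = args[0]
        if mode not in ("wide", "narrow"):
            raise ValueError(f"unknown mode: {mode}")
        self.override = mode
        # A contradiction with the section attribute is a warning, not an error.
        if (mode == "wide") != self.section_is_wide:
            self.warnings.append(f".fsmode {mode} contradicts section attribute")

    def section_flag_wide(self):
        # Override directives never change the emitted section flag.
        return self.section_is_wide
```

Replaying the example listing with this model, the `.fsmode push narrow` inside `.text.wide` yields one warning, the mode reads as narrow until the pop, and the section flag stays wide throughout.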

14.3 Register Allocator Hints

The register allocator should treat x40–x47 / f40–f47 as the preferred extended pool, since instructions in those ranges remain expressible in compressed form. x32–x39 and x48–x63 are still valid but force 32-bit instruction encodings, increasing code size.

A reasonable allocator policy:

Allocate from x8–x15 (standard ABI hot registers) first.
Spill candidates: allocate from x40–x47 next.
If still under pressure: allocate from x32–x39, x48–x63.
Only after the entire extended pool is exhausted, spill to stack.

This naturally produces compressed code for inner loops and 32-bit code for code that has genuinely run out of registers.
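The policy above amounts to a fixed preference order over register pools. The following sketch assumes a deliberately simplified allocator that only knows about the pools named in the policy (a real allocator would also use the remaining standard registers and respect the ABI); the function names are hypothetical.

```python
def allocation_order():
    """Integer register numbers in the preference order of the policy above."""
    hot = list(range(8, 16))                                # x8-x15
    preferred_ext = list(range(40, 48))                     # x40-x47, compressible
    other_ext = list(range(32, 40)) + list(range(48, 64))   # force 32-bit encodings
    return hot + preferred_ext + other_ext


def is_rvc_addressable(reg):
    """True if the register is reachable from a 3-bit RVC field (either bank)."""
    return 8 <= reg <= 15 or 40 <= reg <= 47


def pick_register(in_use):
    """Return the first free register in policy order, or None to spill to stack."""
    for reg in allocation_order():
        if reg not in in_use:
            return reg
    return None
```

With x8–x15 occupied, `pick_register` returns x40, which keeps the instruction compressible; only when every pool is exhausted does it return `None` to signal a stack spill.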

15. Implementation Notes (Non-Architectural)

The following are properties of the initial FireStorm FPGA implementation, not architectural commitments. Future implementations may vary.

Pipeline depth. TBD — anticipated 5- to 7-stage in-order.
Issue width. Single-issue in narrow mode. In wide mode, both RVC-pair and 32-bit-pair dual-issue (see §8.7) when independence criteria are met. The decoder's pair-independence check is roughly 30 LUTs; the second integer ALU and additional register-file ports add ~700–1000 LUTs. Superscalar beyond 2-wide is a possible future revision.
Prefetch buffer port width. All models use a 256×72-bit BSRAM port (two slots per cycle, enabling 32-bit-pair dual-issue).
Register file. Implemented as 64-entry × 64-bit register array, replicated for read ports as appropriate. The extra 32 entries cost approximately 2 KiB of register storage.
Wide-memory capacity. Set by the choice of external SRAM chips. All models fit a single 1M × 36-bit SRAM bank (≈4.5 MiB of wide-mode code space), which far exceeds practical .text.wide requirements. The SRAM is code-only by Harvard restriction (§5.1) — wide-mode data lives in DDR3 or scratchpad BSRAM.
Wide-dirty bit. Implemented as a single flop on the register-file write path, OR'd with any write whose decoded destination index is ≥ 32.
No I-cache. Instruction fetch uses the multi-buffer prefetch system described in §4.
8 KB direct-mapped D-cache (§5.2) covers DDR3 data accesses. Write-through, no-write-allocate, 32-byte lines, ~4 BSRAM blocks plus a small tag register file.
8 KB / 32 KB scratchpad BSRAM (§5.3) for software-managed hot data. Single-cycle, wide-port, uncached.
Branch prediction. Bimodal or g-share predictor TBD; integrated with the speculative-prefetch system (§4.4) to trigger background buffer fills on predicted-taken branches to out-of-buffer targets.
Target clock frequency. ~380 MHz on GoWin GW5AST (the BSRAM peak rate; pipeline balanced to match). The pipeline depth (5–7 stages) is chosen so that the critical path of any stage fits within one BSRAM cycle at this frequency. Higher clocks would require deeper pipelines and likely additional forwarding paths; this is a v0.2 question.
Out-of-order completion (register scoreboarding). See §15.1 below.
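The wide-dirty bit in the list above is a sticky flag, which a few lines of code can model. This sketch is illustrative only: the class name is hypothetical, and the `clear` method stands in for whatever CSR write clears the bit (the CSR number is still an open item in §16).

```python
class WideDirtyBit:
    """Sticky flag set by any register write whose destination index is >= 32."""

    def __init__(self):
        self.dirty = False

    def on_register_write(self, rd_index):
        # OR the current value with "this write targets the extended file".
        self.dirty = self.dirty or rd_index >= 32

    def clear(self):
        # Assumed to correspond to a software-initiated clear via a CSR.
        self.dirty = False
```

A trap handler can test the flag and skip saving x32–x63 / f32–f63 when it is clear, which is the lazy-save use the bit exists for.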

15.1 Out-of-Order Completion (Register Scoreboarding)

FireStorm uses register scoreboarding to hide the latency of multi-cycle operations. The mechanism:

When a multi-cycle instruction (MUL, DIV, REM, FADD, FMUL, FDIV, FSQRT, FMA, or a load that misses the D-cache) issues to its functional unit, its destination register is marked pending in the scoreboard.
Subsequent instructions continue to execute on the main pipeline normally.
When a subsequent instruction's decode stage finds that one of its source registers is pending, the pipeline stalls at decode until the pending write completes.
When the multi-cycle unit finishes, it writes back to the register file and clears the pending bit.

This is an in-order-issue, out-of-order-completion model — a classic optimisation since the CDC 6600 (1964). It hides the cycles of slow operations as long as the result isn't immediately consumed.
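The effect of the four steps above can be seen in a toy cycle model. This is a sketch under simplifying assumptions (single-issue, one instruction issued per cycle, fixed latencies, and a destination that is still pending also stalls the issue); the function names and the instruction tuple format are hypothetical.

```python
def run_scoreboard(program):
    """Simulate in-order issue with a pending-register scoreboard.

    program: list of (dest, sources, latency) tuples, registers as integers.
    Returns the cycle in which the last result is written back.
    """
    ready_at = {}        # register -> cycle at which its pending write completes
    cycle = 0
    last_writeback = 0
    for dest, sources, latency in program:
        # Stall at decode while any source (or the destination) is pending.
        hazards = [ready_at.get(r, 0) for r in list(sources) + [dest]]
        cycle = max([cycle] + hazards)
        done = cycle + latency
        ready_at[dest] = done
        last_writeback = max(last_writeback, done)
        cycle += 1       # the next instruction can issue one cycle later
    return last_writeback


def run_blocking(program):
    """Baseline: every instruction waits for the previous one to finish."""
    return sum(latency for _, _, latency in program)


# A 30-cycle DIV, ten independent 1-cycle ADDs, then a consumer of the DIV.
program = [(5, (1, 2), 30)]
program += [(10 + i, (3, 4), 1) for i in range(10)]
program += [(6, (5, 3), 1)]
```

For this program the blocking baseline takes 41 cycles, while the scoreboard model finishes in 31: the ten independent ADDs execute in the shadow of the DIV, and only the final consumer stalls.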

Trap semantics with scoreboarding. RV64GC's multi-cycle operations are defined so that they never trap once issued:

MUL never traps.
DIV by zero returns -1; overflow (MININT / -1) returns MININT. Neither traps.
FP exceptions accumulate as flags in fcsr; they do not raise traps in the standard F and D extensions.
D-cache misses are transparent — the missing load issues to DRAM, the scoreboard bit holds, the pipeline continues.

This means scoreboarding requires no precise-exception machinery, no reorder buffer, no rollback. Traps that do occur (address misalignment, page fault, illegal instruction) are caught at issue time, before any later instruction has committed in the pipeline.
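The non-trapping DIV corner cases listed above can be written out as a reference function. This is a minimal sketch of the signed 64-bit DIV result only, assuming both operands are already valid signed 64-bit values; the function name is hypothetical, and the round-toward-zero behaviour comes from the RISC-V M extension rather than from this document.

```python
XLEN = 64
MININT = -(1 << (XLEN - 1))


def rv64_div(dividend, divisor):
    """Signed DIV result, including the two non-trapping corner cases."""
    if divisor == 0:
        return -1                  # division by zero: all bits set
    if dividend == MININT and divisor == -1:
        return MININT              # signed overflow returns MININT
    quotient = abs(dividend) // abs(divisor)   # round toward zero
    return -quotient if (dividend < 0) != (divisor < 0) else quotient
```

Because every input pair has a defined result, the divider can complete many cycles after issue without any mechanism for reporting a late exception.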

Per-FU latency tracking. Each multi-cycle unit has its own latency and produces a "completion ready" signal. The scoreboard arbitrates write-back when multiple units complete on the same cycle (priority order: integer ALU > load unit > multiplier > divider > FPU, with the lower-priority units staging their writeback one cycle).
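The priority order above can be expressed as a simple fixed-priority arbiter. This sketch assumes a single contended write-back port, which the text does not state explicitly; the unit names and the function are illustrative.

```python
PRIORITY = ["alu", "load", "mul", "div", "fpu"]   # highest priority first


def arbitrate(ready_units):
    """Return (winner, deferred) for the units signalling completion this cycle.

    Deferred units hold their result in a staging register and retry
    on the following cycle.
    """
    ordered = [unit for unit in PRIORITY if unit in ready_units]
    if not ordered:
        return None, []
    return ordered[0], ordered[1:]
```

If the multiplier and the FPU complete together, the multiplier writes back first and the FPU result is staged for one cycle.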

Composition with dual-issue. Scoreboarding composes cleanly with the dual-issue execution of §8.7:

A dual-issue pair where neither instruction is multi-cycle: both execute in one cycle, both writeback in one cycle.
A dual-issue pair where the first is multi-cycle: first issues to its FU and marks destination pending; second instruction executes if independent of the pending register. Effective IPC > 1 even with a long-latency op in flight.
A dual-issue pair where the second is multi-cycle: same model — both issue, but the slower one completes later.

The combination effectively pipelines independent work around long-latency operations.

Cost on FPGA:

Component	LUT estimate
Scoreboard (1 bit per architectural register; 64 GPRs + 64 FPRs = 128 bits)	~128
Decode-stage hazard check (compare 2 source-reg indices to 128-bit pending mask)	~150
Per-FU destination tracking (5 FUs × ~50 LUTs)	~250
Write-back arbitration (priority + register-file port mux)	~100
Total	~600–800 LUTs

Modest enough to implement; with both dual-issue modes available, scoreboarding and dual-issue compose cleanly.

Expected impact. Hard to give precise numbers without measurement, but typical workload categories (assuming DSP-backed MUL/FMA per §15.3):

Workload	Expected scoreboarding gain
Integer-heavy with occasional MUL	~3–7% (DSP MUL is 2–3 cycles, less to hide)
DSP with many independent MACs in flight	~10–15% (DSP FMA is 4–5 cycles per op)
FP-heavy (audio synthesis with FMA chains)	~10–20%
DIV/REM-heavy (cryptography, modular arithmetic)	~25–40% (DIV is 30–65 cycles, big hide opportunity)
D-cache-miss-bound (linked-list traversal, etc.)	~30–50%
Mixed enterprise code	~8–12%

These compose with dual-issue and the ISA-level wins — total improvement vs vanilla RV64GC at the same clock can exceed 50% on representative FireStorm workloads.

15.2 Execution Model

FireStorm uses an in-order-issue, out-of-order-completion execution model with 2-wide superscalar dispatch and register scoreboarding. The model is sometimes called:

Shallow out-of-order in performance-architecture literature;
In-order superscalar with completion scoreboard in academic taxonomies;
Statically scheduled superscalar to distinguish from dynamically-scheduled (full OoO) designs.

The closest industry analogue is the ARM Cortex-A55, an in-order, 2-wide dispatch, scoreboarding design widely deployed in mobile SoCs. Other comparable in-order designs include the ARM Cortex-A53 and the SiFive U54 / U74 (which also implement RV64GC).

What FireStorm v0.1 does:

Fetches in program order from the prefetch buffer (§4).
Issues up to 2 instructions per cycle when independent (§8.7 dual-issue rules).
Dispatches operations to their respective functional units (integer ALU, multiplier, divider, FPU, load unit), each pipelined independently.
Tracks pending destination registers in the scoreboard (§15.1); stalls decode of consumers until pending writes complete.
Allows out-of-order completion: a fast operation issued after a slow one can complete first and write back first.

What FireStorm v0.1 does not do:

Register renaming. No alias tables; instructions write directly to architectural registers. This means WAW hazards stall (rare in optimised code).
Speculative execution past unresolved branches. Branch prediction triggers speculative prefetch (§4.4), but execution stalls at the branch until resolution. No branch speculation, no recovery rollback needed.
Reorder buffer. Instructions issue in program order and write directly to architectural registers; out-of-order completion only means that writeback order can differ. No retirement window, no precise-exception machinery beyond what RV64 already requires.
Memory-load speculation. Loads issue in program order with respect to stores (no load-store reordering); they may complete out of order with respect to other operations.

This balance is estimated to capture 70–80% of full OoO's performance gain at roughly 20% of the verification complexity. The remaining performance is locked behind register renaming + reorder buffer + branch speculation, all of which together would more than triple the verification surface and significantly increase FPGA resource use. FireStorm chooses the practical-RISC sweet spot.

The execution model is a microarchitectural property, not architectural. Future FireStorm implementations may add register renaming, branch speculation, or wider issue, and software will see only performance change — never functional difference.

15.3 DSP Block Usage for Multiplication and FMA

The GoWin GW5AST family includes hardened DSP blocks with substantially better multiply performance than fabric-based multipliers. Each DSP block contains:

One 27×18 signed multiplier (M0), with optional pre-adders and pipeline registers.
One 12×12 multiplier (M1) auxiliary, sharing the ALU.
A 48-bit ALU/accumulator at the output with cascade chains to adjacent DSPs.
Combined 27×36 mode when M0+M1 work together — 63-bit signed product.
Pipeline mode on inputs and outputs (configurable per stage).

DSP block counts per Arora V variant:

Variant	LUTs	DSP Blocks	Headroom after CPU
GW5AT-15	15,120	34	small — not targeted
GW5AT-60	59,904	118	not targeted
GW5AT-75	86,688	213	~190 blocks free
GW5AST-138	138,240	298	~278 blocks free

FireStorm EE Core DSP Usage

Functional unit	Implementation	DSP blocks	Latency	Throughput
Integer MUL (64×64 → 64)	4× 27×18 multipliers + post-add	4	2–3 cycles	1/cycle
Integer MULH (64×64 → 128 high)	Same hardware, output selection	4 (shared)	2–3 cycles	1/cycle
FP64 FMA (53×53 + 53 → normalized)	27×36 mode + significand alignment + post-add + normalise	6	4–5 cycles	1/cycle
FP64 ADD/SUB	Uses DSP ALU only (no multiply)	2	3 cycles	1/cycle
FP32 FMA	27×18 + alignment + normalise	2	3 cycles	1/cycle
CPU core total		~16–20 blocks

Latency improvement over fabric-only: the DSP-backed implementation cuts MUL latency from 3–5 cycles (fabric) to 2–3 cycles, and runs at the full 380 MHz BSRAM clock without timing closure pressure. FMA gains even more: a fabric FMA would be 8–10 cycles; DSP-backed FMA is 4–5.

Effect on scoreboarding gains: the scoreboarding estimates in §15.1 already assume DSP-backed units (2–3 cycle MUL, 4–5 cycle FMA). Compared with a fabric-only implementation (3–5 cycle MUL, 8–10 cycle FMA), the absolute gain from scoreboarding is smaller because there is less latency to hide, but the baseline performance is higher. Net: faster multiply-heavy code, smaller relative scoreboarding contribution.

DSP Block Headroom for Future Extensions

On Ant64, the ~278 DSP blocks remaining after CPU allocation (~272–276 once Xmath's share is also deducted) are a substantial budget for v0.2 features:

Custom DSP / SIMD accelerators. Hardware accelerators for application-specific kernels (image processing pipelines, audio effects chains, neural-net inference) can be allocated from the DSP block pool. A 4-lane FP32 SIMD bank needs ~16 DSP blocks; substantially broader configurations are possible.
Xcrisp DSP extensions. Hardware FIR/IIR filters, MAC chains, butterfly operators for FFT — each costs a few DSP blocks and can be exposed via custom-opcode instructions. Several can coexist.
Audio synthesis accelerators. A wavetable oscillator with cubic interpolation needs ~4 DSP blocks; running 32 in parallel needs ~128, well within budget.
Application-visible MAC arrays. Memory-mapped DSP arrays accessible via Xcrisp DMA, for offloading hot audio kernels.

The DSP-block-rich nature of the GW5AST is one reason FireStorm's audio-and-DSP focus is realistic on this fabric. Other FPGA families with fewer DSP blocks would force more multiplies into the fabric, increasing both latency and area cost.

DSP Budget

The GW5AST-138 has 298 DSP blocks, of which ~16 go to the CPU core and ~6–10 to Xmath, leaving ~272–276 free — ample headroom for DSP-accelerator features.
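The budget arithmetic can be checked directly. The figures below are the ones quoted in this section; the variable names are illustrative, and the oscillator bank is the 32-voice example from the headroom list.

```python
TOTAL_DSP = 298           # GW5AST-138
CPU_CORE = 16             # approximate CPU core allocation
XMATH = (6, 10)           # approximate Xmath allocation, low and high estimate

# Free blocks: worst case uses the high Xmath estimate, best case the low one.
free_low = TOTAL_DSP - CPU_CORE - XMATH[1]
free_high = TOTAL_DSP - CPU_CORE - XMATH[0]

# 32 wavetable oscillators at ~4 DSP blocks each.
oscillator_bank = 32 * 4
fits = oscillator_bank <= free_low
```

This gives 272 to 276 free blocks, so the 128-block oscillator bank fits with more than half of the remaining budget to spare.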

16. Open Items

Items deferred from this draft, to be resolved in subsequent revisions:

Xcrisp instruction encoding. Bit-level layouts captured in FireStorm Xcrisp Extension. Remaining open items for Xcrisp are listed in that document's §13.
Wide-dirty CSR number. A specific allocation in the machine-mode custom CSR range.
Trap-vector mode awareness. Whether the trap vector itself may live in SRAM (currently assumed to live in DDR3 / narrow memory).
Vector extension (V) revisited. v0.1 allocates the OP-V opcode (0x57) to Xmath (§10), so V is not currently planned. If a clear workload need for SIMD-style data parallelism emerges post-implementation (bulk image processing, mass audio mixing into wide vectors), V could be added in v0.3+ at a different opcode allocation. All four custom-N opcodes are already claimed (§17.4), so the candidates are the reserved 0x77 opcode and the other options listed in §17.7.
Xmath open items. See FireStorm Xmath Extension §17 for Xmath-specific TBDs (FP16 variants, additional vector bundles, encoding finalisation, library integration).
Hardware performance counters. A cycle / instret baseline plus FireStorm-specific counters (buffer hit/miss, D-cache hit/miss, mode-switch count, DMA-invalidation count) for profiling.
Prefetch buffer and D-cache CSR addresses. All mxbuf* and mxdcache* CSR addresses in §4.8 and §5.2 are suggested; final assignment requires coordination with the other FireStorm extension CSRs.
JIT and self-modifying code. The fence.i semantics map to "flush all non-pinned buffers" (§4.7) and "invalidate D-cache" via mxdcache_flush (§5.2). Whether finer-grained fence.i.va (flush specific address) variants are needed is open.
Branch predictor integration with speculative prefetch (§4.4). The exact mechanism — predictor-driven fill triggers, training, eviction policy when speculation is wrong — is implementation-defined in v0.1.
mxsram_read CSR encoding. M-mode debug-only path for reading SRAM contents (§5.1). Format and exact CSR number TBD.
Per-context scratchpad (§5.3). v0.1 scratchpad is global. Whether to partition the scratchpad per Xctx context (more isolation, but added context-switch cost) is a v0.2 candidate.
MOVN-style negated-immediate variant of LIZ (§7.6). ARM-A64's MOVN constructs negative constants in fewer instructions than MOVZ+MOVK sequences. Whether to allocate the encoding space for a third LI-family instruction is a v0.2 question.
Wide-mode imm14 / imm23 in toolchain. GCC/LLVM assembler must accept the wider immediate ranges only in wide-mode sections (§13.3); the linker must check immediate-range compatibility when relocating wide-mode code to narrow placement.
SHF_FIRESTORM_WIDE ELF flag bit assignment. The exact bit value within the SHF_MASKPROC range needs final allocation in coordination with the binutils maintainers.

17. Extension Encoding Strategy

This section explains the mechanics of how the RISC-V instruction encoding leaves room for new instructions, and how FireStorm uses that room. It's a reference appendix — the specific allocations are stated in §10 (Xmath) and §11 (Xcrisp), and in the per-extension specs for Xstack, Xctx, Xlate, and Xcond. Here the question is why those allocations exist where they do, and what other choices the encoding scheme makes available.

17.1 The 32-bit Instruction Shape

Every standard 32-bit RISC-V instruction has the same low-level structure:

   bit 31                                          bit 0
   ┌──────────────────────────────────────────────┬───┐
   │  instruction body                            │ 11│ ← bits [1:0]
   └──────────────────────────────────────────────┴───┘

Bits [1:0] = 11 — flag that this is a "regular" 32-bit instruction (not a 16-bit compressed instruction, where bits [1:0] = 00, 01, or 10).
Bits [6:2] — the major opcode field within the 32-bit instruction space (5 bits = 32 possible major opcodes).
Bits [31:7] — payload, layout determined by the format the major opcode implies (R, I, S, B, U, J, or one of the more specialised types).

The full 7-bit field inst[6:0] is what we usually call "the opcode byte". Because bits [1:0] are fixed at 11 for normal 32-bit instructions, we have 32 major opcodes to allocate across all of RISC-V — standard and custom combined.
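The three fields described above can be extracted with shifts and masks. This is a minimal sketch; the function name is hypothetical.

```python
def split_instruction(inst):
    """Split a 32-bit word into (is_32bit, major_opcode, payload).

    is_32bit      -- True when bits [1:0] == 0b11
    major_opcode  -- bits [6:2], one of 32 major opcodes
    payload       -- bits [31:7], interpreted according to the format
    """
    low2 = inst & 0b11
    major = (inst >> 2) & 0b11111
    payload = (inst >> 7) & 0x1FFFFFF
    return low2 == 0b11, major, payload
```

For `add x1, x2, x3` (0x003100B3) this yields a 32-bit instruction with major opcode 0b01100 (OP); for `c.nop` (0x0001) the low two bits are 01, so the word is a compressed instruction and the other fields do not apply.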

17.2 The Standard Opcode Map

The RISC-V ISA Manual assigns these 32 opcodes as follows. The table is indexed by inst[4:2] across and inst[6:5] down (so the full opcode is <inst[6:5]><inst[4:2]>11):

inst[6:5] \ inst[4:2]	`000`	`001`	`010`	`011`	`100`	`101`	`110`	`111`
`00`	LOAD `0x03`	LOAD-FP `0x07`	custom-0 `0x0B`	MISC-MEM `0x0F`	OP-IMM `0x13`	AUIPC `0x17`	OP-IMM-32 `0x1B`	48-bit `0x1F`
`01`	STORE `0x23`	STORE-FP `0x27`	custom-1 `0x2B`	AMO `0x2F`	OP `0x33`	LUI `0x37`	OP-32 `0x3B`	64-bit `0x3F`
`10`	MADD `0x43`	MSUB `0x47`	NMSUB `0x4B`	NMADD `0x4F`	OP-FP `0x53`	OP-V `0x57`	custom-2 `0x5B`	48-bit `0x5F`
`11`	BRANCH `0x63`	JALR `0x67`	reserved `0x6B`	JAL `0x6F`	SYSTEM `0x73`	reserved `0x77`	custom-3 `0x7B`	≥80-bit `0x7F`

Key observations:

Most of the map is committed to base ISA and ratified extensions. LOAD, STORE, BRANCH, JAL, JALR, OP, OP-IMM, OP-FP, AMO, the FMA family — all of those are mandatory parts of RV64GC and untouchable.
Four slots are RISC-V-blessed for custom use: 0x0B, 0x2B, 0x5B, 0x7B (custom-0 through custom-3). These are the official extension slots; a conforming RISC-V implementation may use them for anything, and no standard extension will ever claim them.
Two slots are reserved with no current claim: 0x6B and 0x77. These are not officially "custom" but no standard extension uses them either. Using one is a calculated bet: it works today, and the bet pays off as long as RISC-V International doesn't later ratify an extension into that slot.
Four slots indicate longer instructions: 0x1F (48-bit), 0x3F (64-bit), 0x5F (also 48-bit), 0x7F (≥80-bit). These were reserved by the original RISC-V designers to allow future extensions of the encoding length.
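The table's row and column indexing maps to opcode values by simple bit composition, which makes the custom, reserved, and long-instruction slots easy to verify. The helper below is an illustrative sketch; its name is hypothetical.

```python
def opcode(row, col):
    """Compose a 7-bit opcode from inst[6:5] (row) and inst[4:2] (column)."""
    return (row << 5) | (col << 2) | 0b11


# custom-0 .. custom-3, by (row, column) position in the table
CUSTOM = [opcode(0, 0b010), opcode(1, 0b010), opcode(2, 0b110), opcode(3, 0b110)]

# reserved slots with no current claim
RESERVED = [opcode(3, 0b010), opcode(3, 0b101)]

# column 111 marks the longer-instruction encodings in every row
LONG = [opcode(row, 0b111) for row in range(4)]
```

Evaluating these reproduces the values in the table: 0x0B, 0x2B, 0x5B, 0x7B for the custom slots, 0x6B and 0x77 for the reserved slots, and 0x1F, 0x3F, 0x5F, 0x7F for the long-instruction markers.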

17.3 What an Opcode Buys: Format and Sub-Encoding

The opcode field alone is only 7 bits, far too few to enumerate every instruction in the ISA. The trick is that each opcode also implies a format (R, I, S, B, U, J, R4, or one of the specialised variants), and the format defines how the remaining 25 bits are subdivided. The formats most relevant for adding new instructions are:

R-type (used by OP 0x33, OP-32 0x3B, OP-FP 0x53, OP-V 0x57, etc.):

   bit 31                            bit 0
   ┌────────┬─────┬─────┬────┬────┬─────┐
   │ funct7 │ rs2 │ rs1 │f3  │ rd │ op  │
   └────────┴─────┴─────┴────┴────┴─────┘
       7      5     5    3    5    7

R-type gives you funct7 × funct3 = 128 × 8 = 1024 distinct sub-encodings within one opcode, two source register fields, and one destination — the standard 3-operand integer instruction shape. ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU, MUL, DIV, etc., all share opcode 0x33 and just differ in funct3/funct7.
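The R-type layout can be exercised with a small encoder. This is an illustrative sketch with a hypothetical function name; the ADD and SUB funct7 values (0x00 and 0x20) come from the base RISC-V ISA.

```python
def encode_r(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack R-type fields into a 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


OP = 0x33
add_x1_x2_x3 = encode_r(OP, rd=1, funct3=0b000, rs1=2, rs2=3, funct7=0x00)
sub_x1_x2_x3 = encode_r(OP, rd=1, funct3=0b000, rs1=2, rs2=3, funct7=0x20)

# Sub-encodings available within one R-type opcode.
sub_encodings = (1 << 7) * (1 << 3)
```

ADD and SUB differ only in funct7 (0x003100B3 versus 0x403100B3), which is the sense in which many instructions share opcode 0x33.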

R4-type (used by FMA family 0x43, 0x47, 0x4B, 0x4F):

   bit 31                                  bit 0
   ┌─────┬───┬─────┬─────┬────┬────┬─────┐
   │ rs3 │fmt│ rs2 │ rs1 │ f3 │ rd │ op  │
   └─────┴───┴─────┴─────┴────┴────┴─────┘
      5    2   5     5    3    5    7

R4-type spends 5 bits of the funct7 space on a third source register (rs3), leaving only 2 bits of fmt and 3 bits of funct3 = 32 sub-encodings per opcode. Far less encoding space, but you get a 4-operand instruction — essential for fused multiply-add (rd = rs1 * rs2 ± rs3).
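For comparison with R-type, the R4 layout can be packed and unpacked in the same way. This sketch uses hypothetical function names; the example word is a standard `fmadd.d` (opcode 0x43, fmt = 01 for double precision) with funct3 carrying the rounding mode.

```python
def encode_r4(opcode, rd, funct3, rs1, rs2, fmt, rs3):
    """Pack R4-type fields into a 32-bit instruction word."""
    return ((rs3 << 27) | (fmt << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def decode_r4(inst):
    """Unpack an R4-type word into a dict of its fields."""
    return {
        "opcode": inst & 0x7F,
        "rd": (inst >> 7) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rs1": (inst >> 15) & 0x1F,
        "rs2": (inst >> 20) & 0x1F,
        "fmt": (inst >> 25) & 0x3,
        "rs3": (inst >> 27) & 0x1F,
    }


# fmadd.d f1, f2, f3, f4 with dynamic rounding mode (funct3 = 0b111)
word = encode_r4(0x43, rd=1, funct3=0b111, rs1=2, rs2=3, fmt=0b01, rs3=4)

# Sub-encodings available within one R4-type opcode: fmt x funct3.
sub_encodings_r4 = (1 << 2) * (1 << 3)
```

The round trip through `decode_r4` recovers all seven fields, and the sub-encoding count comes out at 32, matching the figure above.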

I-type (0x03, 0x13, 0x67, 0x0B, ...): 12-bit immediate, one source, one destination, plus funct3. 8 sub-encodings per opcode. Used by loads, ALU-with-immediate, JALR, and Xcrisp's auto-inc loads.

S/B-type (stores and branches): 12-bit immediate, two sources, funct3. 8 sub-encodings per opcode.

U/J-type (LUI/AUIPC, JAL): 20-bit immediate, one destination, no funct fields. The whole opcode encodes the instruction.

So when an extension needs new instructions, the design question is:

What format? Three-operand register? Two-operand with immediate? Branch?
How many sub-encodings does it need? A handful (use 1 funct3 slot in an existing opcode), tens (use a full opcode with funct7+funct3), or hundreds (use a full opcode with all the format's space)?
Where does it land? In an existing custom slot, in a reserved slot, or by displacing an unused standard extension?

17.4 How FireStorm Uses the Map

FireStorm's allocations follow the conservative principle: use RISC-V-blessed custom slots wherever possible, and only displace standard opcodes when the extension being displaced is genuinely not implemented.

Opcode	Standard meaning	FireStorm use
`0x0B`	custom-0	Xcrisp auto-increment loads (I-type, §11)
`0x2B`	custom-1	Xcrisp auto-increment stores (S-type, §11)
`0x5B`	custom-2	Xcrisp memory-fused arithmetic, Xstack, Xctx — fully subdivided by funct3
`0x7B`	custom-3	Xcrisp compare-mem-branch (B-type)
`0x57`	OP-V (vector) — not implemented in v0.1	Xmath R-type (G2–G11, §10)
`0x6B`	reserved (no standard claim)	Xmath R4-type (G1 integer fused MAC, §10)
`0x7F`	≥80-bit instruction marker	Wide-mode-only escape for Xcrisp PIC and LIZ/LIK (see §7)

The four custom-N opcodes are claimed for Xcrisp + Xstack + Xctx. Within 0x5B the eight funct3 slots are divided cleanly:

funct3 in `0x5B`	Owner	Purpose
`000`–`011`	Xcrisp	Load-op, op-store, load-op-store, B-tree primitives
`100`	Xstack	User-stack push/pop family
`101`	Xstack	Supervisor-stack push/pop family
`110`	Xstack	Stack management
`111`	Xctx	Hardware context switching

That packs three extensions into one opcode without collision, each with a sub-family of its own.
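A decoder's first-level dispatch within 0x5B follows the table directly. The sketch below is illustrative; the function name is hypothetical and it models only the funct3 ownership, not the sub-decoding inside each family.

```python
def owner_of_0x5b(inst):
    """Return the extension that owns a custom-2 (0x5B) instruction word."""
    if inst & 0x7F != 0x5B:
        raise ValueError("not a custom-2 instruction")
    funct3 = (inst >> 12) & 0b111
    if funct3 <= 0b011:
        return "Xcrisp"     # load-op, op-store, load-op-store, B-tree primitives
    if funct3 <= 0b110:
        return "Xstack"     # user stack, supervisor stack, stack management
    return "Xctx"           # hardware context switching
```

Because ownership is decided by funct3 alone, each extension's decoder only needs to examine the remaining fields of words already routed to it.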

0x57 is reallocated from OP-V because v0.1 does not implement the standard Vector extension. This is a real trade-off: if V is added later it will need a different opcode allocation, and since all four custom-N opcodes are claimed, the remaining options are those listed in §17.7 (for example the reserved 0x77 slot). 0x6B is the cleanest "reserved" slot in the map and is the natural place for FireStorm's R4-type integer fused MAC: it lives alongside the FP FMA family at 0x43–0x4F in encoding shape, but at a distinct opcode that doesn't displace any standard feature.

17.5 Why Not Just Put Everything in Custom Slots?

A reasonable question: there are four custom slots (0x0B, 0x2B, 0x5B, 0x7B) — why does FireStorm reach for 0x57 and 0x6B at all?

Two reasons:

Format constraints. Each custom slot is one opcode, which implies one format. Xcrisp wants auto-inc loads (I-type) in one slot, auto-inc stores (S-type) in another, memory-fused arithmetic (R-type) in a third, and compare-mem-branch (B-type) in a fourth. That's four formats, and the custom slots are exactly the right shape for those. They're used. Xmath then needs additional R-type space (G2–G11 alone require ~60 distinct instructions, comfortable in 0x57's 1024 slots) and a separate R4-type slot for G1. Those don't fit in custom-N because custom-N is taken, and trying to share would force awkward encoding tricks.
The custom slots are best spent on extensions that integrate with the standard ISA fabric — Xcrisp's load/store/branch family naturally sits adjacent to the standard load/store/branch opcodes (the alignment of 0x0B/0x2B/0x7B to the standard 0x03/0x23/0x63 pattern at inst[6:5] is intentional and lets a decoder reuse most of its existing extract logic). Xmath, by contrast, is more like a coprocessor extension — a separate functional unit with its own opcode is the cleaner shape, and the OP-V opcode is exactly the right shape because it was designed for arithmetic-heavy extensions.

17.6 The Wide-Mode Escape (`0x7F`)

Wide mode adds a 36-bit fetch with a 4-bit extension nibble that augments register fields and a few sparse format bits. The escape mechanism is different: opcode 0x7F is recruited as a wide-mode-only family-prefix for instructions that need more encoding space than the nibble scheme provides — Xcrisp's PIC family, X-type indexed loads, and LIZ/LIK 16-bit immediate construction (see §7.5).

In standard RISC-V, 0x7F means "this is an ≥80-bit instruction, the rest is reserved for future definition." FireStorm preserves that meaning in narrow mode (a 0x7F instruction in DDR3 traps as illegal-instruction, because we don't implement the longer encoding). In wide mode, where the fetch is already 36 bits, 0x7F is reinterpreted as a 32-bit opcode plus 4-bit nibble — a non-standard repurposing that works specifically because wide mode is itself non-standard.
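The mode-dependent treatment of 0x7F can be summarised in code. This is a sketch under stated assumptions: the function names and result strings are hypothetical, and the slot split follows the glossary's placement of the extension nibble in bits [35:32].

```python
def split_slot(slot36):
    """Split a 36-bit wide-mode slot into (extension nibble, 32-bit word)."""
    return (slot36 >> 32) & 0xF, slot36 & 0xFFFFFFFF


def classify_0x7f(word, wide_mode):
    """Classify a 32-bit word whose opcode may be 0x7F, given the fetch mode."""
    if word & 0x7F != 0x7F:
        return "other"
    if not wide_mode:
        # Narrow mode keeps the standard meaning; the longer encoding is
        # not implemented, so the instruction traps.
        return "illegal-instruction"
    # Wide mode reinterprets the word as a 32-bit opcode plus nibble.
    return "wide-escape"
```

The same 32-bit pattern therefore traps when fetched from DDR3 and decodes as an escape-family instruction when fetched from the 36-bit SRAM.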

This is a softer discipline than custom-N (which is RISC-V-blessed). The 0x7F escape is therefore treated with the same allocation rigour as any other opcode in this spec: every funct3/funct4 sub-allocation is tracked, and the family operates in a single coherent design space rather than as a free-for-all.

17.7 Headroom

After all of FireStorm's v0.1 allocations, the following opcode-level capacity remains:

0x77 (reserved) — untouched, available for a future extension
0x57 funct3 slots — Xmath uses all 8, but each funct3 has 128 funct7 slots of which only a small fraction are populated; substantial headroom for additional Xmath operations
0x6B funct3 / fmt slots — fmt=11 is currently reserved for future G1 expansion; the full R4 space at this opcode is mostly empty
Wide-mode 0x7F — multiple funct3 sub-families remain reserved for future wide-mode-only instructions
0x5B Xstack management — funct7 = 1xxxxxx within funct3 = 110 is reserved for future Xstack expansion (e.g. machine-stack push/pop)
The four custom-N opcodes are fully claimed but most are far from full: in 0x5B all eight funct3 slots are assigned, so headroom there exists only at the funct7 sub-level, while the 0x0B, 0x2B, and 0x7B spaces have substantial funct7 / immediate-payload headroom for future Xcrisp instructions.

A v0.2 of the spec could add a substantial set of new instructions without needing to touch a single new opcode — the existing allocations have plenty of room. If V becomes desirable in v0.3+, it will need a new opcode allocation: either reclaiming 0x77, defining a new wide-mode-only family at 0x7F, or trading off some Xmath capacity. The mechanisms exist; the choice can be made later.

18. Glossary

Term	Meaning
Narrow mode	Instruction fetch from a 32-bit memory region; standard RV64GC decoding.
Wide mode	Instruction fetch from a 36-bit SRAM region; RV64GC + Xwide decoding.
Mode comparison	See §3.1 for a comprehensive narrow vs wide difference table.
Slot	The atomic unit of wide-mode instruction memory: a 36-bit word holding either one 32-bit instruction or two 16-bit RVC instructions. Slot N occupies byte addresses N×4 through N×4+3. (§8.6)
Slot-aligned	An address that is a multiple of 4 (i.e., bits[1:0] = 00). In wide mode, all branch and jump targets are slot-aligned. (§8.6)
Slot-indexed PC	The wide-mode convention that PC values are always slot-aligned, enabling branch and jump immediates to use ×4 scaling for double the effective byte range. (§7.3.2, §8.6)
C.NOP padding	Insertion of `c.nop` (`0x0001`) to push a label to a slot boundary when natural code flow would land it at slot+2. (§8.6)
Dual-issue	Execution of two consecutive instructions in a single cycle when they meet independence criteria (no RAW hazard, neither is memory / branch / system / FP / multiply / divide). RVC-pair and 32-bit-pair both supported. (§8.7)
Scoreboarding	An in-order-issue, out-of-order-completion mechanism that lets multi-cycle instructions (MUL, DIV, FP ops, D-cache-miss loads) execute concurrently with subsequent independent instructions. A pending-bit-per-register scoreboard stalls only those decode-stage consumers that actually depend on an in-flight result. (§15.1)
Shallow OoO	The in-order-issue / out-of-order-completion model used by FireStorm (§15.2). Closest industry analogues: ARM Cortex-A55, ARM Cortex-A53, SiFive U74. Estimated to capture ~70–80% of full out-of-order's performance gain at ~20% of the verification complexity.
DSP block	A hardened multiply-add cell in the GoWin GW5AST fabric. Contains a 27×18 multiplier, a 12×12 auxiliary multiplier, and a 48-bit ALU/accumulator with cascade chains. FireStorm uses DSP blocks for integer MUL, FP FMA, and FP ADD/SUB, leaving substantial headroom for v0.2 DSP extensions and vector units (§15.3).
Extension nibble	The 4 extra bits per 36-bit slot, bits [35:32], used for register-index extension, immediate extension, and the Xcond PRED-EN bit.
Extended immediate	A wide-mode immediate that uses former "spare" nibble bits as high-order bits, giving imm14/imm23 in place of standard imm12/imm20 (§7.3). Compressed RVC instructions get a similar one-bit (or two-bit, for CJ) extension where spare nibble bits permit (§8.5).
Bank select	The use of an extension bit in a 3-bit RVC field to choose between x8–x15 (bank 0) and x40–x47 (bank 1).
Cache-bypass alias	An address with bit 63 set, addressing the same physical DDR3 location as `addr & ~(1ULL << 63)` but bypassing the D-cache (§5.2).
LIZ / LIK	Wide-mode-only instructions for direct 16-bit-chunk construction of 64-bit constants (§7.6).
Prefetch buffer	One of N BSRAM-backed instruction holding ranges, replacing the conventional I-cache (§4).
Buffer pinning	M-mode-controlled exclusion of a buffer from LRU eviction; used for trap-vector and real-time deterministic fetch (§4.5).
Harvard restriction	Prohibition on data accesses to the 36-bit SRAM region; wide-mode programs have code in SRAM and data in DDR3/scratchpad (§5.1).
D-cache	8 KB direct-mapped write-through cache covering DDR3 data accesses (§5.2).
Scratchpad	User-addressable BSRAM region for software-managed hot data, single-cycle access (§5.3).
Wide-dirty bit	Per-hart hardware flag indicating that x32–x63 or f32–f63 have been written since the bit was last cleared; used by trap handlers to lazily save the extended register state.
`.fsmode`	Assembler directive that toggles wide / narrow mode locally within a section (§14.2).
`SHF_FIRESTORM_WIDE`	ELF section flag marking a code section as wide-mode (§13.3). The authoritative machine-readable mode indicator, paired with the human-readable section name.
Xwide	The wide-register-file extension; not visible in `misa`, active by virtue of physical fetch address.
Xcrisp	The CRISP-influenced custom-opcode extension; available in both narrow and wide mode.

End of document.