FireStorm Xcrisp Extension — Instruction Encodings
Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
See also: FireStorm Performance Examples for worked comparisons
1. Overview
The Xcrisp extension is FireStorm's set of CRISP-influenced custom instructions, designed to raise the performance and code density of compiler-generated C code without breaking the RV64GC baseline. It is available in both narrow and wide modes (§3 of the parent doc): vanilla DDR3-resident code may use Xcrisp instructions exactly as SRAM-resident wide-mode code may. In wide mode, Xcrisp register operands extend via the standard extension nibble scheme.
The extension contains four instruction families:
| Family | Purpose | Opcode | Format |
|---|---|---|---|
| Auto-increment loads | *p++ / *--p read patterns | 0x0B (custom-0) | I-type |
| Auto-increment stores | *p++ / *--p write patterns | 0x2B (custom-1) | S-type |
| Memory-fused arithmetic | load-op, op-store, block-memory | 0x5B (custom-2) | R-type |
| Compare-mem-branch | sentinel scans, table walks | 0x7B (custom-3) | B-type |
The opcode-to-family mapping is deliberately aligned with the standard RISC-V opcode bit pattern at [6:5]:
| [6:5] | Standard | Xcrisp |
|---|---|---|
| 00 | LOAD (0x03) | auto-inc loads (0x0B) |
| 01 | STORE (0x23) / OP (0x33) | auto-inc stores (0x2B) |
| 10 | reserved | memory-fused R-type (0x5B) |
| 11 | BRANCH (0x63) | compare-mem-branch (0x7B) |
This alignment lets a FireStorm decoder reuse the standard rs1/rs2/rd/imm extract logic for Xcrisp encodings; only the funct3/funct7 decode tables expand.
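The row alignment can be checked mechanically. The following is a minimal sketch using only the opcode values quoted in the tables above; the helper name `opcode_row` is hypothetical and not part of the specification.

```python
# Opcode values copied from the family and alignment tables above.
STANDARD = {"LOAD": 0x03, "STORE": 0x23, "OP": 0x33, "BRANCH": 0x63}
XCRISP = {
    "auto-inc loads": 0x0B,
    "auto-inc stores": 0x2B,
    "memory-fused": 0x5B,
    "compare-mem-branch": 0x7B,
}


def opcode_row(opcode: int) -> int:
    """Return opcode bits [6:5] (hypothetical helper name)."""
    return (opcode >> 5) & 0b11


for name, op in {**STANDARD, **XCRISP}.items():
    print(f"{name:20s} 0x{op:02X}  [6:5] = {opcode_row(op):02b}")
```

Each Xcrisp opcode lands in the same [6:5] row as the standard opcode it is paired with in the table.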
2. Feature Detection
The presence of Xcrisp is indicated by a non-zero value in the implementation-defined CSR mxcrisp (machine custom read-only, suggested address 0xFC1). Bit [0] of mxcrisp is the Xcrisp version (1 for v0.1). A reduced FireStorm variant without Xcrisp returns zero; an Xcrisp instruction issued on such a variant traps as illegal-instruction.
Compilers normally rely on the +xcrisp target-feature flag rather than runtime detection. Detection is reserved for runtime libraries that may be deployed on multiple FireStorm variants.
3. Auto-Increment Loads (custom-0, opcode 0x0B)
3.1 Encoding
Standard I-type layout:
31 20 19 15 14 12 11 7 6 0
+----------------------+--------+-------+--------+-------------+
| imm[11:0] | rs1 | funct3| rd | 0001011 |
+----------------------+--------+-------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| imm[11:0] | [31:20] | Signed 12-bit increment/decrement amount (post-inc forms) or width-extended pre-dec sub-encoding (see §3.3) |
| rs1 | [19:15] | Base address register; also updated by the instruction |
| funct3 | [14:12] | Operation: width + direction (see §3.2) |
| rd | [11:7] | Load destination register |
| opcode | [6:0] | 0x0B (custom-0) |
The instruction writes to two architectural registers: rd (the loaded value) and rs1 (the updated base). If rs1 == rd, the load value wins (the increment to rs1 is suppressed) — this matches the standard RISC-V convention for instructions that would otherwise have ambiguous semantics.
3.2 Post-Increment Loads (funct3 000–110)
For funct3 ≠ 111, the immediate is a 12-bit signed offset/increment, and the operation is:
rd = sext_or_zext_W(mem[rs1])
rs1 = rs1 + sext(imm)
| funct3 | Mnemonic | Width | Sign | Operation |
|---|---|---|---|---|
| 000 | LBPI | byte | signed | rd = sext8(mem8[rs1]); rs1 += sext(imm) |
| 001 | LHPI | half | signed | rd = sext16(mem16[rs1]); rs1 += sext(imm) |
| 010 | LWPI | word | signed | rd = sext32(mem32[rs1]); rs1 += sext(imm) |
| 011 | LDPI | dword | n/a | rd = mem64[rs1]; rs1 += sext(imm) |
| 100 | LBUPI | byte | unsigned | rd = zext8(mem8[rs1]); rs1 += sext(imm) |
| 101 | LHUPI | half | unsigned | rd = zext16(mem16[rs1]); rs1 += sext(imm) |
| 110 | LWUPI | word | unsigned | rd = zext32(mem32[rs1]); rs1 += sext(imm) |
Funct3 assignments match standard RV64I load encodings: bit [14] = unsigned, bits [13:12] = width (00=byte, 01=half, 10=word, 11=dword). This means an existing RV64I load decoder can route through a single funct3 path with only the opcode and "auto-inc" flag changing.
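The post-increment semantics, including the rs1 == rd rule from §3.1, can be modelled in a few lines. This is a sketch under stated assumptions, not the reference behaviour: memory is little-endian (as in standard RISC-V), x0 is hardwired to zero, and the names `sext` and `load_post_inc` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sext(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def load_post_inc(regs, mem, funct3, rd, rs1, imm12):
    """Model one post-increment load (funct3 000-110) on a list of 64-bit regs."""
    if funct3 == 0b111:
        raise ValueError("funct3 = 111 selects the pre-decrement sub-encoding")
    width = 1 << (funct3 & 0b11)       # bits [13:12]: 1, 2, 4 or 8 bytes
    unsigned = bool(funct3 & 0b100)    # bit [14]: zero-extend instead of sign-extend
    addr = regs[rs1]
    raw = int.from_bytes(mem[addr:addr + width], "little")  # assumed little-endian
    value = raw if unsigned else sext(raw, 8 * width)
    if rs1 != rd:                      # rs1 == rd: the loaded value wins
        regs[rs1] = (addr + sext(imm12, 12)) & MASK64
    if rd != 0:                        # assumption: x0 stays hardwired to zero
        regs[rd] = value & MASK64
```

For example, running it with funct3 = 010 on a word of 0xFFFFFFFF leaves rd holding all ones, while funct3 = 110 on the same word leaves 0x00000000FFFFFFFF.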
3.3 Pre-Decrement Loads (funct3 = 111)
For funct3 = 111, the 12-bit immediate field is repurposed as {width[2:0], offset[8:0]}:
31 29 28 20
+--------+-------------------+
| width | offset[8:0] |
+--------+-------------------+
- width[2:0] (imm[11:9]): one of seven width/sign variants, matching the post-inc funct3 numbering.
- offset[8:0] (imm[8:0]): signed 9-bit decrement amount, range −256..+255.
| width | Mnemonic | Operation |
|---|---|---|
| 000 | LBPD | rs1 -= sext(offset); rd = sext8(mem8[rs1]) |
| 001 | LHPD | rs1 -= sext(offset); rd = sext16(mem16[rs1]) |
| 010 | LWPD | rs1 -= sext(offset); rd = sext32(mem32[rs1]) |
| 011 | LDPD | rs1 -= sext(offset); rd = mem64[rs1] |
| 100 | LBUPD | rs1 -= sext(offset); rd = zext8(mem8[rs1]) |
| 101 | LHUPD | rs1 -= sext(offset); rd = zext16(mem16[rs1]) |
| 110 | LWUPD | rs1 -= sext(offset); rd = zext32(mem32[rs1]) |
| 111 | reserved | illegal-instruction |
A negative offset is permitted but produces unusual semantics (rs1 is incremented before the load); compilers should not emit this and disassemblers may flag it.
3.4 Examples
LWPI x10, 4(x11) — read 32-bit word at [x11], sign-extend into x10, advance x11 by 4:
imm = 0x004
rs1 = 11 (0b01011)
funct3 = 010
rd = 10 (0b01010)
opcode = 0x0B
Encoding: 0x004_5A50B = 0000 0000 0100 01011 010 01010 0001011
LDPD x14, 8(x15) — decrement x15 by 8, then read 64-bit dword into x14:
funct3 = 111 (pre-dec marker)
width = 011 (dword)
offset = 0x008
imm[11:0] = {011, 000001000} = 0x608
rs1 = 15
rd = 14
Encoding: 0x608_7F70B = 0110 0000 1000 01111 111 01110 0001011
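The field packing in these examples can be reproduced with a small encoder. This is an illustrative sketch: the function names `encode_itype` and `encode_pre_dec_load` are hypothetical, and only the bit positions given in §3.1 and §3.3 are assumed.

```python
def encode_itype(imm12: int, rs1: int, funct3: int, rd: int, opcode: int = 0x0B) -> int:
    """Pack an I-type word: imm[31:20], rs1[19:15], funct3[14:12], rd[11:7], opcode[6:0]."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_pre_dec_load(width: int, offset9: int, rs1: int, rd: int) -> int:
    """Pre-decrement form: funct3 = 111 and imm[11:0] = {width[2:0], offset[8:0]}."""
    imm12 = ((width & 0b111) << 9) | (offset9 & 0x1FF)
    return encode_itype(imm12, rs1, 0b111, rd)


# Print both worked examples as 32-bit binary strings for comparison.
print(f"{encode_itype(0x004, 11, 0b010, 10):032b}")       # LWPI x10, 4(x11)
print(f"{encode_pre_dec_load(0b011, 8, 15, 14):032b}")    # LDPD x14, 8(x15)
```

With these helpers, `encode_itype(0x004, 11, 0b010, 10)` evaluates to 0x0045A50B, the LWPI x10, 4(x11) word.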
4. Auto-Increment Stores (custom-1, opcode 0x2B)
4.1 Encoding
Standard S-type layout:
31 25 24 20 19 15 14 12 11 7 6 0
+-------------+--------+--------+-------+-----------+-------------+
| imm[11:5] | rs2 | rs1 | funct3| imm[4:0] | 0101011 |
+-------------+--------+--------+-------+-----------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| imm[11:0] | [31:25] ‖ [11:7] | Signed 12-bit increment/decrement amount |
| rs2 | [24:20] | Source register (value to store) |
| rs1 | [19:15] | Base address register; also updated by the instruction |
| funct3 | [14:12] | Width + direction (see §4.2) |
| opcode | [6:0] | 0x2B (custom-1) |
The instruction writes to one register (rs1 updated) and one memory location.
4.2 Funct3 Encoding
The store-side encoding partitions funct3 as {direction, width[1:0]}: bit [14] selects the direction (0 = post-inc, 1 = pre-dec) and bits [13:12] select the width:
| funct3 | Mnemonic | Width | Direction | Operation |
|---|---|---|---|---|
| 000 | SBPI | byte | post-inc | mem8[rs1] = rs2[7:0]; rs1 += sext(imm) |
| 001 | SHPI | half | post-inc | mem16[rs1] = rs2[15:0]; rs1 += sext(imm) |
| 010 | SWPI | word | post-inc | mem32[rs1] = rs2[31:0]; rs1 += sext(imm) |
| 011 | SDPI | dword | post-inc | mem64[rs1] = rs2; rs1 += sext(imm) |
| 100 | SBPD | byte | pre-dec | rs1 -= sext(imm); mem8[rs1] = rs2[7:0] |
| 101 | SHPD | half | pre-dec | rs1 -= sext(imm); mem16[rs1] = rs2[15:0] |
| 110 | SWPD | word | pre-dec | rs1 -= sext(imm); mem32[rs1] = rs2[31:0] |
| 111 | SDPD | dword | pre-dec | rs1 -= sext(imm); mem64[rs1] = rs2 |
Standard stores have no unsigned variants (there is no sign-extension on a store), so the funct3 space is cleanly halved between post-inc and pre-dec — no sub-encoding needed.
4.3 Examples
SDPI x12, 8(x13) — store x12 to mem64[x13], then advance x13 by 8:
imm = 0x008 → imm[11:5]=0x00, imm[4:0]=0x08
rs2 = 12
rs1 = 13
funct3 = 011
Encoding: 0x00C6B42B
= 0000000 01100 01101 011 01000 0101011
SWPD x4, 4(x2) — pre-decrement stack pointer x2 by 4, then store low 32 bits of x4:
imm = 0x004
rs2 = 4
rs1 = 2 (sp)
funct3 = 110
Encoding: 0000000 00100 00010 110 00100 0101011
A common stack-push idiom: SWPD rs2, 4(sp) (sp -= 4, then write).
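The S-type split of the immediate is easy to get wrong by hand, so a small encoder helps when checking bit strings like those above. This is a sketch only; `encode_stype` is a hypothetical name and it assumes nothing beyond the field positions listed in §4.1.

```python
def encode_stype(imm12: int, rs2: int, rs1: int, funct3: int, opcode: int = 0x2B) -> int:
    """Pack an S-type word: imm[11:5] at [31:25] and imm[4:0] at [11:7]."""
    imm12 &= 0xFFF
    return (
        ((imm12 >> 5) << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | ((imm12 & 0x1F) << 7)
        | opcode
    )


# Print the two examples from this section as 32-bit binary strings.
print(f"{encode_stype(8, 12, 13, 0b011):032b}")   # SDPI x12, 8(x13)
print(f"{encode_stype(4, 4, 2, 0b110):032b}")     # SWPD x4, 4(x2)
```

Here `encode_stype(4, 4, 2, 0b110)` yields the SWPD x4, 4(x2) bit pattern listed above.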
5. Memory-Fused Arithmetic (custom-2, opcode 0x5B)
This opcode hosts four sub-families dispatched by funct3:
| funct3 | Sub-family | Section |
|---|---|---|
| 000 | Load-op fusion (rd = mem[rs1] OP rs2) | §5.2 |
| 001 | Op-store fusion (mem[rs1] = rs2 OP rs3) | §5.3 |
| 010 | Block memory operations | §5.5 |
| 011 | Load-op-store fusion (mem[rd] = mem[rs1] OP rs2) | §5.4 |
| 100–111 | reserved | - |
(The funct3 numbering doesn't match the section order: load-op-store at funct3 011 is presented in §5.4 because it is the logical successor of op-store, while block memory at funct3 010 is conceptually distinct and presented in §5.5.)
All four use the R-type encoding:
31 25 24 20 19 15 14 12 11 7 6 0
+-----------+--------+--------+-------+--------+-------------+
| funct7 | rs2 | rs1 | funct3| rd | 1011011 |
+-----------+--------+--------+-------+--------+-------------+
The interpretation of rs2, rs1, and rd varies by sub-family. funct7 selects width and ALU operation within each sub-family.
5.1 Common funct7 Layout (Load-Op, Op-Store, Load-Op-Store)
For the arithmetic sub-families (funct3 000, 001, and 011), funct7 is structured as {width[1:0], aluop[4:0]}:
31 30 29 25
+--------+------------------+
| width | aluop[4:0] |
+--------+------------------+
| width[1:0] | Memory width / sign |
|---|---|
| 00 | 32-bit word, sign-extended on load (load-op) / low 32 bits stored (op-store) |
| 01 | 64-bit dword |
| 10 | 32-bit word, zero-extended on load (load-op only; same as 00 for op-store) |
| 11 | reserved (future 16-bit support) |
| aluop[4:0] | Operation |
|---|---|
| 00000 | ADD |
| 00001 | SUB |
| 00010 | AND |
| 00011 | OR |
| 00100 | XOR |
| 00101 | SLL (shift left logical) |
| 00110 | SRL (shift right logical) |
| 00111 | SRA (shift right arithmetic) |
| 01000 | SLT (set less than, signed) |
| 01001 | SLTU (set less than, unsigned) |
| 01010–11111 | reserved |
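The {width, aluop} packing and the surrounding R-type word can be sketched as follows. The dictionary keys and function names (`WIDTH`, `ALUOP`, `pack_funct7`, `encode_rtype`) are hypothetical labels chosen for this illustration; the numeric values come from the two tables above.

```python
WIDTH = {"W": 0b00, "D": 0b01, "WU": 0b10}          # 0b11 is reserved
ALUOP = {
    "ADD": 0b00000, "SUB": 0b00001, "AND": 0b00010, "OR": 0b00011,
    "XOR": 0b00100, "SLL": 0b00101, "SRL": 0b00110, "SRA": 0b00111,
    "SLT": 0b01000, "SLTU": 0b01001,
}


def pack_funct7(width: str, aluop: str) -> int:
    """funct7 = {width[1:0], aluop[4:0]}: width at bits [6:5], aluop at [4:0]."""
    return (WIDTH[width] << 5) | ALUOP[aluop]


def encode_rtype(funct7: int, rs2: int, rs1: int, funct3: int, rd: int,
                 opcode: int = 0x5B) -> int:
    """Pack an R-type word for the custom-2 opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


# 64-bit load-op ADD with rd = x8, rs1 = x10, rs2 = x12 (funct3 = 000).
word = encode_rtype(pack_funct7("D", "ADD"), 12, 10, 0b000, 8)
print(f"{word:032b}")
```

For instance, `pack_funct7("D", "OR")` gives 0b0100011, a 64-bit OR under this layout.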
5.2 Load-Op Fusion (funct3 = 000)
Operation: rd = (mem[rs1] of selected width) ALUOP rs2.
The memory access uses rs1 directly as the base address; there is no immediate offset (use a separate ADDI first if non-zero offset needed, or compose with the auto-inc load instructions of §3).
| Mnemonic | width | aluop | Operation |
|---|---|---|---|
| LWADD | 00 | 00000 | rd = sext32(mem32[rs1]) + rs2 |
| LWSUB | 00 | 00001 | rd = sext32(mem32[rs1]) - rs2 |
| LWAND | 00 | 00010 | rd = sext32(mem32[rs1]) & rs2 |
| LWOR | 00 | 00011 | rd = sext32(mem32[rs1]) \| rs2 |
| LWXOR | 00 | 00100 | rd = sext32(mem32[rs1]) ^ rs2 |
| LWSLL | 00 | 00101 | rd = sext32(mem32[rs1]) << (rs2 & 63) |
| LWSRL | 00 | 00110 | rd = sext32(mem32[rs1]) >>L (rs2 & 63) |
| LWSRA | 00 | 00111 | rd = sext32(mem32[rs1]) >>A (rs2 & 63) |
| LWSLT | 00 | 01000 | rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (signed) |
| LWSLTU | 00 | 01001 | rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (unsigned) |
| LDADD | 01 | 00000 | rd = mem64[rs1] + rs2 |
| LDSUB | 01 | 00001 | rd = mem64[rs1] - rs2 |
| LDAND | 01 | 00010 | rd = mem64[rs1] & rs2 |
| LDOR | 01 | 00011 | rd = mem64[rs1] \| rs2 |
| LDXOR | 01 | 00100 | rd = mem64[rs1] ^ rs2 |
| LDSLL | 01 | 00101 | rd = mem64[rs1] << (rs2 & 63) |
| LDSRL | 01 | 00110 | rd = mem64[rs1] >>L (rs2 & 63) |
| LDSRA | 01 | 00111 | rd = mem64[rs1] >>A (rs2 & 63) |
| LDSLT | 01 | 01000 | rd = (mem64[rs1] < rs2) ? 1 : 0 (signed) |
| LDSLTU | 01 | 01001 | rd = (mem64[rs1] < rs2) ? 1 : 0 (unsigned) |
| LWUADD | 10 | 00000 | rd = zext32(mem32[rs1]) + rs2 |
| ... | 10 | 00001–01001 | unsigned-word variants of the above |
(The full unsigned-word table mirrors the signed-word table line-for-line.)
5.3 Op-Store Fusion (funct3 = 001)
Operation: mem[rs1] = rs2 ALUOP rs3. The rd field of the R-type encoding is repurposed as rs3 (a third source register); no architectural register is written by this class.
| Mnemonic | width | aluop | Operation |
|---|---|---|---|
| ADDSW | 00 | 00000 | mem32[rs1] = (rs2 + rs3)[31:0] |
| SUBSW | 00 | 00001 | mem32[rs1] = (rs2 - rs3)[31:0] |
| ANDSW | 00 | 00010 | mem32[rs1] = (rs2 & rs3)[31:0] |
| ORSW | 00 | 00011 | mem32[rs1] = (rs2 \| rs3)[31:0] |
| XORSW | 00 | 00100 | mem32[rs1] = (rs2 ^ rs3)[31:0] |
| SLLSW | 00 | 00101 | mem32[rs1] = (rs2 << (rs3 & 31))[31:0] |
| SRLSW | 00 | 00110 | mem32[rs1] = (rs2 >>L (rs3 & 31))[31:0] |
| SRASW | 00 | 00111 | mem32[rs1] = (rs2 >>A (rs3 & 31))[31:0] |
| ADDSD | 01 | 00000 | mem64[rs1] = rs2 + rs3 |
| SUBSD | 01 | 00001 | mem64[rs1] = rs2 - rs3 |
| ANDSD | 01 | 00010 | mem64[rs1] = rs2 & rs3 |
| ORSD | 01 | 00011 | mem64[rs1] = rs2 \| rs3 |
| XORSD | 01 | 00100 | mem64[rs1] = rs2 ^ rs3 |
| SLLSD | 01 | 00101 | mem64[rs1] = rs2 << (rs3 & 63) |
| SRLSD | 01 | 00110 | mem64[rs1] = rs2 >>L (rs3 & 63) |
| SRASD | 01 | 00111 | mem64[rs1] = rs2 >>A (rs3 & 63) |
SLT/SLTU forms are omitted for op-store (storing a 0/1 flag to memory is unusual; existing slt + sw is preferable for clarity).
Assembler convention. The op-store mnemonics are written with the memory destination in brackets to clarify that the third operand is read, not written:
ADDSW [x5], x6, x7 ; mem32[x5] = x6 + x7
Disassemblers should follow the same convention.
5.4 Load-Op-Store Fusion (funct3 = 011)
Operation: mem[rd] = (mem[rs1] of selected width) ALUOP rs2. Two memory operands plus one register operand. No architectural register is written.
This is the closest a 32-bit RISC-V slot can get to true memory-to-memory operation in the CRISP/Hobbit tradition: one fetch, one decode, both the load result and the ALU result flow directly through the pipeline without ever entering the register file, then the result is stored. The fused encoding replaces a three-instruction sequence (lw t0, (rs1); add/op t0, t0, rs2; sw t0, (rd)) without ever materialising the temporary.
The encoding repurposes the R-type rd field as the destination memory base address (read, not written). The rs1 field is the source memory base address. Funct7 partitioning is identical to load-op (§5.1).
Encoding:
31 25 24 20 19 15 14 12 11 7 6 0
+-----------+--------+--------+-------+--------+-------------+
| funct7 | rs2 | rs1 | 011 | rd | 1011011 |
+-----------+--------+--------+-------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| funct7[6:5] | [31:30] | Width (00=word signed load, 01=dword, 10=word unsigned load, 11=reserved) |
| funct7[4:0] | [29:25] | ALU operation (same encoding as §5.1) |
| rs2 | [24:20] | Register-held ALU operand |
| rs1 | [19:15] | Source memory base address |
| funct3 | [14:12] | 011 |
| rd | [11:7] | Destination memory base address (read-only with respect to the register file) |
| opcode | [6:0] | 0x5B |
5.4.1 Variant Table
| Mnemonic | width | aluop | Operation |
|---|---|---|---|
| MMWADD | 00 | 00000 | mem32[rd] = (sext32(mem32[rs1]) + rs2)[31:0] |
| MMWSUB | 00 | 00001 | mem32[rd] = (sext32(mem32[rs1]) - rs2)[31:0] |
| MMWAND | 00 | 00010 | mem32[rd] = (sext32(mem32[rs1]) & rs2)[31:0] |
| MMWOR | 00 | 00011 | mem32[rd] = (sext32(mem32[rs1]) \| rs2)[31:0] |
| MMWXOR | 00 | 00100 | mem32[rd] = (sext32(mem32[rs1]) ^ rs2)[31:0] |
| MMWSLL | 00 | 00101 | mem32[rd] = (sext32(mem32[rs1]) << (rs2 & 31))[31:0] |
| MMWSRL | 00 | 00110 | mem32[rd] = (sext32(mem32[rs1]) >>L (rs2 & 31))[31:0] |
| MMWSRA | 00 | 00111 | mem32[rd] = (sext32(mem32[rs1]) >>A (rs2 & 31))[31:0] |
| MMWSLT | 00 | 01000 | mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (signed) |
| MMWSLTU | 00 | 01001 | mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (unsigned) |
| MMDADD | 01 | 00000 | mem64[rd] = mem64[rs1] + rs2 |
| MMDSUB | 01 | 00001 | mem64[rd] = mem64[rs1] - rs2 |
| MMDAND | 01 | 00010 | mem64[rd] = mem64[rs1] & rs2 |
| MMDOR | 01 | 00011 | mem64[rd] = mem64[rs1] \| rs2 |
| MMDXOR | 01 | 00100 | mem64[rd] = mem64[rs1] ^ rs2 |
| MMDSLL | 01 | 00101 | mem64[rd] = mem64[rs1] << (rs2 & 63) |
| MMDSRL | 01 | 00110 | mem64[rd] = mem64[rs1] >>L (rs2 & 63) |
| MMDSRA | 01 | 00111 | mem64[rd] = mem64[rs1] >>A (rs2 & 63) |
| MMDSLT | 01 | 01000 | mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0 (signed) |
| MMDSLTU | 01 | 01001 | mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0 (unsigned) |
| MMWUADD | 10 | 00000 | mem32[rd] = (zext32(mem32[rs1]) + rs2)[31:0] |
| ... | 10 | 00001–01001 | unsigned-word-load variants of the above |
The unsigned-word variants (width = 10) only differ from the signed-word forms (width = 00) for operations where the load sign-extension matters: MMWSRA, MMWSRL, MMWSLT, MMWSLTU. For ADD, SUB, AND, OR, XOR, and SLL the result bits are identical; the assembler may accept the unsigned form as a synonym or flag it as redundant.
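The claim about which operations are sensitive to the load extension can be checked by brute force over a few boundary values. The sketch below models the variant table directly; it is an illustration rather than a proof, and the names `OPS`, `WORDS`, `RS2_VALUES` and `differing` are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def to_signed64(x: int) -> int:
    return x - (1 << 64) if x >> 63 else x


def sext32(w: int) -> int:
    """Sign-extend a 32-bit word to an unsigned 64-bit representation."""
    return (w | ~MASK32) & MASK64 if w >> 31 else w


def zext32(w: int) -> int:
    return w & MASK32


# a is the extended load result, b is rs2; word forms shift by (rs2 & 31).
OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "SLL": lambda a, b: a << (b & 31),
    "SRL": lambda a, b: a >> (b & 31),
    "SRA": lambda a, b: to_signed64(a) >> (b & 31),
    "SLT": lambda a, b: int(to_signed64(a) < to_signed64(b)),
    "SLTU": lambda a, b: int(a < b),
}

WORDS = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
RS2_VALUES = [0, 1, 31, 1 << 32, MASK64]

# Operations whose stored low 32 bits differ between width = 00 and width = 10.
differing = {
    name
    for name, op in OPS.items()
    if any(
        (op(sext32(w), r) & MASK32) != (op(zext32(w), r) & MASK32)
        for w in WORDS
        for r in RS2_VALUES
    )
}
print(sorted(differing))
```

On these inputs only the right shifts and the two comparisons produce different stored words, in line with the list above.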
5.4.2 Assembler Convention
The instruction takes three register operands. The first and second name memory locations (the destination and source base addresses, both bracketed in source code); the third names a register-held ALU operand:
MMWADD [x10], [x11], x12 ; mem32[x10] = mem32[x11] + x12
MMDOR [x10], [x10], x12 ; mem64[x10] |= x12 (in-place, rd == rs1)
Both bracketed operands are read from the register file as pointers; neither is modified by the instruction. Compose with auto-increment loads/stores (§3, §4) on the surrounding code if pointer advance is needed.
5.4.3 In-Place Updates (rd == rs1)
When rd and rs1 name the same register, the instruction performs an in-place memory update: the load reads from the location, the ALU computes the new value, and the store writes back to the same location. This is well-defined: the load completes before the store begins, and there is exactly one memory location involved. The pattern matches C idioms like:
arr[i] &= mask; // MMWAND [p], [p], mask where p = &arr[i]
counter += step; // MMDADD [p], [p], step
buf[i] ^= 0x80; // MMWXOR [p], [p], t where register t holds 0x80
When rd != rs1, the load and store target distinct memory locations and the operation moves data with transformation:
dst[i] = src[i] + bias; // MMWADD [d], [s], bias
5.4.4 Trap Restart
Unlike block memory (§5.5), load-op-store carries no partial progress across traps. The instruction completes atomically or not at all from an architectural perspective:
- Trap before the load completes (e.g., load page fault): no architectural state has changed; PC remains at the instruction; retry re-executes from the beginning.
- Trap between the load and the store (e.g., timer interrupt mid-instruction): the loaded value lives only in pipeline internal state, never in any architectural register; discarding it is safe. PC remains at the instruction; retry re-executes the load (idempotent — the source memory has not been written), the ALU op, and the store.
- Trap during the store (e.g., store page fault on the destination): the load has already completed but its result is internal; the store has not committed to architectural memory; retry from the beginning.
- No trap: PC advances normally past the instruction.
The implementation must guarantee that the store does not become visible to other harts or to the memory system until the instruction commits architecturally. Standard store-buffer pipelining with commit-at-retire satisfies this.
5.4.5 Wide-Mode Extension
In wide mode, the extension nibble extends rd, rs1, and rs2 exactly as for a normal R-type:
| Bit | Extends |
|---|---|
| bit[32] | rd (destination memory base) |
| bit[33] | rs1 (source memory base) |
| bit[34] | rs2 (register-held ALU operand) |
| bit[35] | spare (reserved) |
A wide-mode load-op-store may name any combination of x0–x63 for all three operands.
5.4.6 Worked Example
MMWADD [x20], [x21], x12 — read 32-bit word at [x21], add x12, store to [x20]:
width = 00
aluop = 00000
funct7 = 0000000
rs2 = 12, rs1 = 21, funct3 = 011, rd = 20
Encoding: 0000000 01100 10101 011 10100 1011011
MMDOR [x10], [x10], x14 — in-place 64-bit OR of mem64[x10] with x14:
width = 01
aluop = 00011
funct7 = 0100011
rs2 = 14, rs1 = 10, funct3 = 011, rd = 10
Encoding: 0100011 01110 01010 011 01010 1011011
5.4.7 Implementation Cost
Load-op-store is the most expensive Xcrisp instruction and the one most likely to drive microarchitectural complexity. The instruction requires:
- One memory read from [rs1]
- One ALU op on the load result and rs2
- One memory write to [rd]
The architecture permits any of three implementation strategies:
1. Sequential micro-ops. Decode into three internal operations (load, ALU, store) and issue them sequentially. Simplest to implement; latency 3+ cycles, throughput 1 per 3 cycles. Suitable for compact pipelines where load-op-store is rare.
2. Pipelined load → ALU → store. Treat the instruction as occupying three pipeline stages in sequence, but allow successive load-op-store instructions to overlap (one in load, one in ALU, one in store). Steady-state throughput one per cycle; latency 3 cycles per instruction. Requires separate memory read and write ports on the load/store unit (or a dual-pumped path to main memory). The natural FireStorm target.
3. Same-cycle load/ALU/store. A wide single-cycle implementation reads the source, computes the result, and issues the store all in one cycle, completing in 1 cycle of latency. Requires a very fast critical path and may bottleneck on the memory port count. Likely impractical without substantial pipelining work.
Strategy 2 captures most of the performance win at modest implementation cost and is the recommended baseline. An implementation may freely choose to fall back to strategy 1 for unaligned accesses or other corner cases.
5.5 Block Memory Operations (funct3 = 010)
Block memory operations come in two flavours: synchronous (BMCPY, BMSET) execute on the CPU's load/store ports and are interruptible per §5.5.1; asynchronous (DMACPY, DMASET) hand the work to a hardware DMA queue and return immediately per §5.5.2. The choice is made per call site: small copies use the synchronous path (no DMA setup overhead, predictable latency); large transfers use DMA to overlap with CPU work.
For this sub-family, funct7 directly selects the operation; width bits are reserved (must be zero).
| funct7 | Mnemonic | Operands | Operation | Section |
|---|---|---|---|---|
| 0000000 | BMCPY | rd, rs1, rs2 | Synchronous copy: rs2 bytes from rs1 to rd. rd and rs1 advance and rs2 counts down as bytes are copied; on completion rs2 = 0. | §5.5.1 |
| 0000001 | BMSET | rd, rs1, rs2 | Synchronous fill: write rs1[7:0] to mem8[rd] × rs2. rd advances, rs2 counts down. | §5.5.1 |
| 0000010 | DMACPY | rd, rs1, rs2 | Asynchronous copy: enqueue copy of rs2 bytes from rs1 to rd on the DMA queue; CPU continues. rs2 becomes DMA-tagged. | §5.5.2 |
| 0000011 | DMASET | rd, rs1, rs2 | Asynchronous fill: enqueue fill of rs2 bytes at rd with byte rs1[7:0]. rs2 becomes DMA-tagged. | §5.5.2 |
| 0000100 | BMCMP | rd, rs1, rs2 | Compare rs2 bytes at rd vs rs1. Reserved for v0.2. | - |
| 0000101–1111111 | reserved | - | illegal-instruction | - |
5.5.1 Synchronous Block Operations (BMCPY, BMSET)
Synchronous block operations are interruptible: they update their register operands as they progress, and a trap mid-execution leaves the registers in a consistent restartable state.
Restart semantics. On an asynchronous trap mid-block-op, the architectural state holds:
- rd, rs1 advanced past completed bytes
- rs2 reduced to the byte count remaining
- PC pointing at the block instruction (not past it)
Returning from the trap re-executes the instruction with the partially-advanced register state, resuming from where it stopped. This requires the trap return path to use mret with the same PC, which is standard behaviour.
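The restart rule can be illustrated with a toy model in which each call to a step function stands for one execution attempt that may be cut short by a trap. This is a sketch under assumptions: `bmcpy_step` and its `max_bytes` parameter (bytes copied before the simulated interruption) are hypothetical, and the copy is modelled byte by byte.

```python
def bmcpy_step(regs, mem, rd, rs1, rs2, max_bytes):
    """Run BMCPY until done or until max_bytes have been copied (simulated trap).

    Registers are updated after every byte, so the state is always restartable.
    Returns True once rs2 has reached zero.
    """
    for _ in range(min(regs[rs2], max_bytes)):
        mem[regs[rd]] = mem[regs[rs1]]
        regs[rd] += 1      # destination pointer advances
        regs[rs1] += 1     # source pointer advances
        regs[rs2] -= 1     # remaining byte count shrinks
    return regs[rs2] == 0


mem = bytearray(b"abcdefgh" + bytes(8))
regs = {10: 8, 11: 0, 12: 8}           # x10 = dst, x11 = src, x12 = count

attempts = 0
while not bmcpy_step(regs, mem, 10, 11, 12, max_bytes=3):
    attempts += 1                      # a trap handler would run here, then mret

print(bytes(mem[8:16]), regs, attempts)
```

Each re-execution uses the same instruction with the partially advanced registers, so no extra bookkeeping is needed in the trap handler.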
Overlap behaviour. For BMCPY, overlapping source and destination regions where rd > rs1 (forward overlap that would corrupt the source) is implementation-defined: a conservative implementation falls back to byte-at-a-time copy; an optimistic implementation uses wider transfers and is undefined for overlapping ranges. Code requiring guaranteed forward overlap (memmove semantics) should test and either swap direction or use a library routine.
Width hint (future). Although width[1:0] in funct7 is reserved for v0.1, a future revision may use it to express "minimum guaranteed alignment of the operands" (e.g., 01 = both pointers 8-byte aligned, allowing the implementation to issue 64-bit transfers). For v0.1, the implementation infers alignment from the runtime values.
5.5.2 Asynchronous DMA Operations (DMACPY, DMASET)
DMA operations enqueue a memory transfer onto a hardware DMA queue and return in one cycle, leaving the CPU free to execute other instructions while the transfer proceeds in parallel. The CPU and the DMA engine synchronise through the count register, which is hardware-tagged "DMA-pending" until the operation completes.
Operand Capture at Issue
When DMACPY or DMASET issues, the DMA engine captures the values of rs1 (source pointer or fill byte), rd (destination pointer), and rs2 (byte count). The DMA engine owns these copies until the operation completes; the CPU's source and destination registers (rs1, rd) are not subsequently modified and may be freely reused for other work on the next cycle.
The count register (rs2) is given special treatment — it remains architecturally bound to the DMA's live progress counter for the duration of the operation. See Register Tagging below.
Register Tagging Semantics
At issue, the named rs2 register is recorded in a small hardware DMA tag table (one entry per outstanding DMA, sized to the queue depth). The tag binds the register to the DMA's internal byte-count counter.
While the tag is active:
- CPU reads of the tagged register return the live remaining byte count from the DMA engine. The register-file read port has a forwarding path from the DMA engine's count register; reads complete in the normal load-use cycles. This permits progress polling without stalling.
- CPU writes to the tagged register stall the pipeline until the DMA completes and the tag clears. The pending write then takes effect on the (now-untagged) register. This is the canonical wait-for-completion mechanism.
When the DMA finishes, the engine drains its final byte count (0) into the register file, clears the tag, and any blocked write proceeds.
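The read/write asymmetry of a tagged register can be captured in a small behavioural model. This is only a sketch of the rules stated above: the class `DmaTagModel` and its method names are hypothetical, a stalled write is represented by returning False, and queue depth is not modelled.

```python
class DmaTagModel:
    """Toy register file plus DMA tag table (hypothetical names throughout)."""

    def __init__(self, num_regs: int = 32):
        self.regs = [0] * num_regs
        self.tags = {}  # register index -> live remaining byte count

    def issue(self, rs2: int) -> None:
        """DMACPY/DMASET issue: bind rs2 to the engine's byte counter."""
        count = self.regs[rs2]
        if count:  # a zero-byte transfer completes at once and leaves no tag
            self.tags[rs2] = count

    def engine_progress(self, rs2: int, nbytes: int) -> None:
        """The engine moved nbytes; on completion drain 0 and clear the tag."""
        remaining = max(0, self.tags[rs2] - nbytes)
        if remaining:
            self.tags[rs2] = remaining
        else:
            del self.tags[rs2]
            self.regs[rs2] = 0

    def read(self, reg: int) -> int:
        """Reads never stall: a tagged register returns the live count."""
        return self.tags.get(reg, self.regs[reg])

    def try_write(self, reg: int, value: int) -> bool:
        """Return False if the write would stall because reg is still tagged."""
        if reg in self.tags:
            return False
        self.regs[reg] = value
        return True
```

In a usage sequence, `read` can be called at any point to poll progress, while `try_write` keeps returning False until `engine_progress` has drained the count to zero.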
DMAWAIT Idiom
The assembler provides a pseudo-instruction:
DMAWAIT rs ; expands to: ADDI rs, rs, 0
When rs is currently DMA-tagged, this stalls the CPU until the DMA completes. When rs is not tagged, it is a no-op. The pseudo makes the intent explicit in source code without consuming a separate encoding.
Common usage:
DMACPY a0, a1, a2 ; queue 1MB copy; a2 holds count (becomes tagged)
; ... CPU does other useful work for hundreds of cycles ...
DMAWAIT a2 ; block until copy complete
ld t0, 0(a0) ; safe to read destination now
Progress Polling
Because reads of the tagged register do not stall, software can monitor DMA progress for use-cases like watchdog timeouts or partial-completion processing:
DMACPY a0, a1, a2 ; large copy, a2 = 1048576
.Lwait:
mv t0, a2 ; t0 = live remaining count (no stall)
bnez t0, .Lcheck ; ... or process partial data, etc.
j .Ldone
.Lcheck:
; ... do some work that does NOT depend on the destination region ...
j .Lwait
.Ldone:
Queue Depth and Back-Pressure
The DMA engine has a queue of pending operations whose depth is implementation-defined. Suggested values:
| FireStorm variant | DMA queue depth |
|---|---|
| All models | 8 |
When the queue is full, a new DMACPY/DMASET issue stalls until a slot becomes free. The queue capacity is reported in the mxcrisp CSR (§12.3, DMA_QUEUE_DEPTH field).
The DMA tag table has the same number of entries as the queue, so each outstanding DMA can tag a distinct register. Issuing a new DMA naming a still-tagged register stalls until that register's tag clears — this is the natural back-pressure mechanism for serialised DMA dispatch from a single producer.
Cache Coherence
FireStorm has a small 8 KB direct-mapped write-through D-cache covering DDR3 data accesses (§5.2 of the parent doc); the hot data structures (Xstack frames, Xctx contexts, scratchpad-resident voice state, etc.) live in dedicated BSRAM and do not pass through the cache.
DMA coherence is handled automatically at two levels:
- D-cache lines covering DMA-target addresses are auto-invalidated. For each DMA write to address A, the cache index (A >> 5) & 0xFF is computed and that line's valid bit is cleared. No software flush is needed.
- Prefetch buffer ranges overlapping DMA-target addresses are auto-invalidated (§4.7 of the parent doc). This handles DMA-to-code coherence for JIT compilers and dynamic loaders.
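The index expression implies 32-byte lines and 256 lines, which matches the stated 8 KB direct-mapped geometry. The sketch below applies it to a byte range; the helper `invalidate_for_dma_write` and its range-based interface are assumptions made for illustration, since the text only describes the per-write index computation.

```python
LINE_BYTES = 32     # implied by the >> 5 in the index expression
NUM_LINES = 256     # implied by the & 0xFF mask; 256 * 32 = 8 KB


def cache_index(addr: int) -> int:
    """Direct-mapped line index for address addr, as given in the text."""
    return (addr >> 5) & 0xFF


def invalidate_for_dma_write(valid, addr: int, length: int) -> None:
    """Clear the valid bit of every line touched by a DMA write of `length` bytes."""
    first_line = addr & ~(LINE_BYTES - 1)
    for line_addr in range(first_line, addr + length, LINE_BYTES):
        valid[cache_index(line_addr)] = False


valid = [True] * NUM_LINES
invalidate_for_dma_write(valid, 0x3F, 2)   # bytes 0x3F and 0x40 straddle two lines
print([i for i, v in enumerate(valid) if not v])
```

Because the cache is direct-mapped, addresses 8 KB apart share an index, so a line may be invalidated even when it currently holds a different tag; that is conservative but safe.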
Scratchpad-targeted DMA and BSRAM-region DMA (Xstack, Xctx) need no special coherence handling — they bypass the cache entirely.
DMA reads from DRAM see current data because the write-through D-cache always streams stores to DRAM.
External DMA agents that bypass the FireStorm DMA engine (e.g., a DMA controller on the chipset writing through a different path) must coordinate explicitly via mxdcache_flush_addr and mxbuf_flush_addr.
Trap and Interrupt Behaviour
DMA operations are independent of CPU traps and continue running through interrupts. A trap handler may freely issue DMACPY/DMASET (subject to the queue depth). If the CPU traps while stalled on a tagged-register write, the trap is taken normally and the still-pending DMA continues — the handler observes the still-tagged register and may either ignore it (the original stall resumes after return) or itself wait on it via DMAWAIT.
If the DMA encounters a memory fault mid-transfer (unmapped page, write to read-only region, etc.), the engine raises an asynchronous trap and reports the faulting address and DMA ID in a status CSR (TBD; see open items §13). The associated count register retains the remaining byte count at the fault; the tag is not cleared until software acknowledges. This permits a recovery handler to inspect and either retry or abort.
Operand Edge Cases
- rs2 = 0 (or rs2 = x0): zero-byte DMA, no-op. The DMA may still consume a queue slot briefly; tags clear immediately.
- rs1 = rd for DMACPY: defined as a no-op copy (copy region overlaps itself trivially).
- Overlapping source/destination regions for DMACPY: same implementation-defined behaviour as BMCPY (§5.5.1).
- DMA targeting memory-mapped I/O: permitted and useful (e.g., streaming audio buffers to a DAC FIFO). Implementation-defined whether the DMA engine respects MMIO ordering constraints; the suggested behaviour is "treat MMIO writes as strictly ordered, identical to CPU MMIO stores."
Worked Examples
1 MB memory clear, overlapped with CPU work:
li a2, 1048576 ; count = 1 MB
li a1, 0 ; fill byte = 0
la a0, buffer ; destination pointer (load the symbol's address)
DMASET a0, a1, a2 ; queue the clear, a2 now tagged
; --- CPU does ~hundreds of microseconds of other work here ---
jal ra, prepare_next_frame
jal ra, run_audio_callback
; --- finally need the buffer ---
DMAWAIT a2 ; ensure clear is complete
; buffer now zeroed; safe to use
If the "other work" takes longer than the DMA, the DMAWAIT is a no-op; if it takes less, the wait stalls for the remainder. Either way the total wall time is max(DMA_time, CPU_work_time) rather than their sum.
Double-buffered audio block render:
render_loop:
DMACPY out_a, render_a, blk_bytes_a ; flush previous block; blk_bytes_a tagged
; while DMA copies block A to output:
jal ra, render_block_b ; CPU renders block B into render_b
DMAWAIT blk_bytes_a ; wait for A's copy to finish
; swap A/B pointers
mv tmp, render_a
mv render_a, render_b
mv render_b, tmp
mv tmp, out_a
mv out_a, out_b
mv out_b, tmp
j render_loop
The DMA copy of the previous block overlaps with the CPU's rendering of the next. Throughput is determined by the slower of (DMA bandwidth, CPU render time) rather than their sum.
Encoding example — DMACPY x10, x11, x12 (copy x12 bytes from [x11] to [x10]):
funct7 = 0000010
rs2 = 12, rs1 = 11, funct3 = 010, rd = 10
Encoding: 0000010 01100 01011 010 01010 1011011
5.6 Examples
LDADD x8, (x10), x12 — read 64-bit dword from [x10], add x12, write to x8:
width = 01
aluop = 00000
funct7 = 0100000
rs2 = 12, rs1 = 10, funct3 = 000, rd = 8
Encoding: 0100000 01100 01010 000 01000 1011011
ADDSW [x5], x6, x7 — write x6 + x7 (low 32 bits) to mem32[x5]:
width = 00
aluop = 00000
funct7 = 0000000
rs2 = 6 (the first register operand after the bracketed destination)
rs1 = 5 (the bracketed destination)
funct3 = 001
rd = 7 (interpreted as rs3, the last-named operand)
Encoding: 0000000 00110 00101 001 00111 1011011
Assembler operand order: ADDSW [rs1], rs2, rs3. The encoded bit positions place rs2 at [24:20] and rs3 (in the rd field) at [11:7], so the operation is mem[rs1] = rs2 OP rs3 as defined in §5.3. The order is immaterial for ADDSW but matters for non-commutative operations such as SUBSW and the shifts.
BMCPY x10, x11, x12 — copy x12 bytes from [x11] to [x10]:
funct7 = 0000000
rs2 = 12
rs1 = 11
funct3 = 010
rd = 10
Encoding: 0000000 01100 01011 010 01010 1011011
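The four custom-2 encodings in this section all use the standard R-type field positions, so they can be cross-checked mechanically. The following is a minimal sketch in C; the helper name `encode_rtype` and the macro `XCRISP_CUSTOM2` are illustrative and not part of any FireStorm toolchain.

```c
#include <stdint.h>

#define XCRISP_CUSTOM2 0x5Bu  /* custom-2 opcode used by the examples above */

/* Pack the six standard R-type fields into a 32-bit instruction word.
 * Each field is masked to its architectural width before shifting. */
uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return ((funct7 & 0x7Fu) << 25) |   /* bits [31:25] */
           ((rs2    & 0x1Fu) << 20) |   /* bits [24:20] */
           ((rs1    & 0x1Fu) << 15) |   /* bits [19:15] */
           ((funct3 & 0x07u) << 12) |   /* bits [14:12] */
           ((rd     & 0x1Fu) << 7)  |   /* bits [11:7]  */
            (opcode & 0x7Fu);           /* bits [6:0]   */
}
```

For example, `encode_rtype(0x20, 12, 10, 0, 8, XCRISP_CUSTOM2)` packs the LDADD x8, (x10), x12 fields listed above, and `encode_rtype(0x00, 7, 5, 1, 6, XCRISP_CUSTOM2)` packs ADDSW with x6 in the rd position.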
6. Compare-Mem-Branch (custom-3, opcode 0x7B)
6.1 Encoding
Standard B-type layout:
31 25 24 20 19 15 14 12 11 7 6 0
+--------------+--------+--------+-------+--------------+-------------+
| imm[12|10:5] | rs2 | rs1 | funct3| imm[4:1|11] | 1111011 |
+--------------+--------+--------+-------+--------------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| imm[12:1] | [31:25] ‖ [11:7] (scrambled, standard B-type pattern) | Signed 13-bit branch offset (bit 0 is implicit zero, half-word alignment) |
| rs1 | [19:15] | First operand (a register value) |
| rs2 | [24:20] | Second operand (interpreted as a base address; memory is read from mem[rs2]) |
| funct3 | [14:12] | Condition + width |
| opcode | [6:0] | 0x7B (custom-3) |
The branch range follows the underlying B-type encoding rules:
- Narrow mode: standard B-type ±4 KiB (imm12 with ×2 byte scaling, bit[0] implicit zero).
- Wide mode: imm14 with ×4 slot scaling (bits[1:0] implicit zero), giving ±32 KiB. The compare-mem-branch instruction inherits the wide-mode immediate extension and slot-indexed PC convention described in §7.3.2 and §8.6 of ee_cpu.
6.2 Funct3 Encoding
The condition encoding mirrors standard branches; the funct3 = 010 and 011 slots (unused in standard RV) host dword variants:
| funct3 | Mnemonic | Condition |
|---|---|---|
| 000 | BEQM | (rs1 as 32-bit) == sext32(mem32[rs2]) |
| 001 | BNEM | (rs1 as 32-bit) != sext32(mem32[rs2]) |
| 010 | BEQMD | rs1 == mem64[rs2] |
| 011 | BNEMD | rs1 != mem64[rs2] |
| 100 | BLTM | (int32)rs1 < sext32(mem32[rs2]) |
| 101 | BGEM | (int32)rs1 >= sext32(mem32[rs2]) |
| 110 | BLTUM | (uint32)rs1 < zext32(mem32[rs2]) |
| 111 | BGEUM | (uint32)rs1 >= zext32(mem32[rs2]) |
Ordered comparisons (BLTM, BGEM, BLTUM, BGEUM) are word-only in v0.1; dword ordered compare-mem-branch is reserved for v0.2 (would require an additional opcode partition or width-modifier bit).
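The condition table can be read as a small reference model. The sketch below is one possible C rendering, assuming the word variants compare only the low 32 bits of rs1 and that a host pointer stands in for the address held in rs2; the function name `cmb_taken` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Evaluate a compare-mem-branch condition.
 * funct3: selector from the table; rs1: register operand;
 * mem: host pointer standing in for the address held in rs2. */
bool cmb_taken(unsigned funct3, uint64_t rs1, const void *mem)
{
    uint32_t w;   /* mem32[rs2] */
    uint64_t d;   /* mem64[rs2] */

    if (funct3 == 2 || funct3 == 3) {              /* BEQMD / BNEMD */
        memcpy(&d, mem, sizeof d);
        return (funct3 == 2) ? (rs1 == d) : (rs1 != d);
    }

    memcpy(&w, mem, sizeof w);                     /* word variants read 4 bytes */
    switch (funct3) {
    case 0:  return (uint32_t)rs1 == w;                    /* BEQM  */
    case 1:  return (uint32_t)rs1 != w;                    /* BNEM  */
    case 4:  return (int32_t)(uint32_t)rs1 <  (int32_t)w;  /* BLTM  */
    case 5:  return (int32_t)(uint32_t)rs1 >= (int32_t)w;  /* BGEM  */
    case 6:  return (uint32_t)rs1 <  w;                    /* BLTUM */
    case 7:  return (uint32_t)rs1 >= w;                    /* BGEUM */
    default: return false;   /* not reachable for a 3-bit funct3 */
    }
}
```

Comparing the low 32 bits directly gives the same result as sign- or zero-extending both sides first, which is why the model needs no explicit sext32/zext32 step.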
6.3 Examples
BEQM x10, (x11), .L1 — branch to .L1 if x10[31:0] equals mem32[x11]:
funct3 = 000
rs1 = 10, rs2 = 11
imm = offset to .L1
Encoding (with offset = +16): imm field scrambled per B-type:
imm[12]=0, imm[10:5]=000000, imm[4:1]=1000, imm[11]=0
→ bits [31:25] = 0_000000 = 0x00
→ bits [11:7] = 1000_0 = 0x10
Full: 0000000 01011 01010 000 10000 1111011
BNEMD x4, (x5), .L_end — loop exit when x4 != mem64[x5]:
funct3 = 011
rs1 = 4, rs2 = 5
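The B-type immediate scramble is easy to get wrong by hand, so a small encoder is a useful cross-check for the examples above. This is a sketch for the narrow-mode layout only; the function name `encode_cmb` is an assumption made for illustration.

```c
#include <stdint.h>

/* Pack a narrow-mode compare-mem-branch instruction.
 * offset is a signed byte offset whose bit 0 must be zero. */
uint32_t encode_cmb(int32_t offset, uint32_t rs2, uint32_t rs1, uint32_t funct3)
{
    uint32_t imm = (uint32_t)offset;   /* two's-complement bit pattern */

    return (((imm >> 12) & 0x01u) << 31) |   /* imm[12]   -> bit 31      */
           (((imm >> 5)  & 0x3Fu) << 25) |   /* imm[10:5] -> bits 30:25  */
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           (((imm >> 1)  & 0x0Fu) << 8)  |   /* imm[4:1]  -> bits 11:8   */
           (((imm >> 11) & 0x01u) << 7)  |   /* imm[11]   -> bit 7       */
           0x7Bu;                             /* custom-3 opcode          */
}
```

`encode_cmb(16, 11, 10, 0)` packs the BEQM x10, (x11), .L1 example with offset +16. A backward branch such as `encode_cmb(-8, 5, 4, 3)` (BNEMD x4, (x5) with an assumed offset of -8) exercises the sign bits imm[12] and imm[11].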
6.4 Common Idioms
Null-terminated string scan:
; x10 = string pointer, x11 = address of a 32-bit word holding the terminator value (0), x12 = current char
loop:
LBPI x12, 1(x10) ; read byte, advance pointer
BNEM x12, (x11), loop ; loop while not terminator
; ... at end: x10 points past terminator, x12 = 0
A two-instruction inner loop, one cycle per byte on a forwarding implementation.
Lookup-table walk:
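; x10 = key, x11 = table base, x12 = entry stride, x13 = table end pointer
loop:
BEQM x10, (x11), found
ADD x11, x11, x12
BNE x11, x13, loop ; standard branch on table-end pointer
j not_found ; table exhausted without a match (not_found is a placeholder label)
found:
; x11 = pointer to matching entry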
7. B-tree Primitives (custom-2, opcode 0x5B)
B-trees and related sorted-array data structures (sorted vectors, sorted hash buckets, ordered indexes) are fundamental to databases, key-value stores, set/map containers, and any code that maintains a sorted collection. The hot operation in every B-tree is find the first key ≥ target within a node — a sequential or branchy binary search through a small sorted array, typically 16–64 keys.
Standard RV64GC implements this as a comparison loop with conditional branches, which:
- Mispredicts roughly half the time (the comparison outcome depends on the data),
- Takes one cycle per key compared,
- Pollutes the branch predictor with high-entropy branches.
For a 16-key B-tree node, software search is 16 compares + 8–16 branches + ~5 mispredicts × ~15 cycles each = ~100 cycles per node visit. With a tree depth of 5–7, a single B-tree lookup costs 500–700 cycles, dominated by branch mispredicts.
The Xcrisp B-tree primitives provide fixed-width parallel search of a sorted array of keys, returning the first position satisfying key[i] >= target. The operation is parallelised across an entire cache line (or two) per instruction, with no branches.
7.1 BSRCH Family — Parallel Sorted-Array Search
| Mnemonic | Key width | Keys per instruction | rd width | Operation |
|---|---|---|---|---|
| BSRCH.B rd, rs1, rs2 | 8-bit | 64 | 7-bit position (0–64) | Find first 8-bit key in mem[rs2..rs2+63] ≥ low byte of rs1 |
| BSRCH.H rd, rs1, rs2 | 16-bit | 32 | 6-bit position (0–32) | Find first 16-bit key ≥ low halfword of rs1 |
| BSRCH.W rd, rs1, rs2 | 32-bit | 16 | 5-bit position (0–16) | Find first 32-bit key ≥ low word of rs1 |
| BSRCH.D rd, rs1, rs2 | 64-bit | 8 | 4-bit position (0–8) | Find first 64-bit key ≥ rs1 |
All variants:
- Read 64 bytes (one cache line) starting at the address in rs2. The address must be 64-byte aligned; misaligned addresses trap.
- Compare each key against the search target in rs1 in parallel.
- Return in rd the index of the lowest position where key[i] >= target. If no key satisfies, return the total count (sentinel "not found within this node").
- Keys are assumed sorted ascending. If unsorted, the result is the first matching position but no ordering is implied.
Latency: 4 cycles (load 64 bytes from D-cache + 8/16/32/64 parallel compares + priority encoder + writeback). On a cache miss the load latency dominates and the operation effectively takes the cache-fill time.
Throughput: 1 per cycle pipelined; 1 per 4 cycles in dependency chain.
Mode availability: both narrow and wide. Narrow mode addresses up to 32 keys per call (BSRCH.B/H/W return position fitting in 5 bits); BSRCH.B returning position 33–64 requires wide mode's 6-bit return register width.
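The BSRCH.W behaviour described above (all lanes compared, lowest qualifying position returned, 16 as the not-found sentinel) can be sketched as a reference model in C. The signed comparison is an assumption, since the text does not state key signedness, and `bsrch_w_model` is a hypothetical name.

```c
#include <stdint.h>

#define BSRCH_W_KEYS 16u  /* 64-byte line / 4-byte keys */

/* Reference model of BSRCH.W: index of the first key >= target,
 * or 16 when no key qualifies. Signed comparison is assumed. */
unsigned bsrch_w_model(int32_t target, const int32_t keys[BSRCH_W_KEYS])
{
    uint32_t mask = 0;  /* bit i is set when keys[i] >= target */

    /* Comparator array: every lane is evaluated, with no early exit. */
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++)
        mask |= (uint32_t)(keys[i] >= target) << i;

    /* Priority encoder: position of the lowest set bit, 16 if none. */
    unsigned pos = 0;
    while (pos < BSRCH_W_KEYS && !((mask >> pos) & 1u))
        pos++;
    return pos;
}
```

The two-phase structure mirrors the hardware description (comparators, then priority encoder). On an unsorted line the model still returns the first qualifying lane, matching the last bullet above.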
Encoding
BSRCH.X: opcode = 0x5B, funct3 = 010, funct7 = 0010xxx
funct7[2:0] = width selector
000 = .B (64 keys × 8-bit)
001 = .H (32 keys × 16-bit)
010 = .W (16 keys × 32-bit)
011 = .D ( 8 keys × 64-bit)
100–111 = reserved (future widths: 128 keys × 4-bit for sub-byte indexes, etc.)
Example: B-tree Node Search
// C version: linear search
int find_position(int32_t target, int32_t *keys, int n) {
int i = 0;
while (i < n && keys[i] < target) i++;
return i;
}
Standard RV64GC with n=16 mispredicts roughly half the iterations. Average cost: ~50 cycles for n=16, dominated by mispredicts.
With BSRCH:
; a0 = target, a1 = key array (aligned 64 bytes)
BSRCH.W a0, a0, a1 ; a0 = position (0..16), returned to the caller; 4 cycles
ret
One instruction, 4 cycles, no branches, no mispredicts. ~12× speedup on the inner search; per full B-tree lookup at depth 5, total speedup ~10× since the search dominates.
7.2 BSCAN Family — First-Match Search
A variant of BSRCH that searches for an exact-match key (returns position of first key[i] == target, or N if not found). Used in hash table chaining, dictionary lookups within a small bucket, and validation paths in indexed structures.
| Mnemonic | Key width | Keys per instruction |
|---|---|---|
| BSCAN.B rd, rs1, rs2 | 8-bit | 64 |
| BSCAN.H rd, rs1, rs2 | 16-bit | 32 |
| BSCAN.W rd, rs1, rs2 | 32-bit | 16 |
| BSCAN.D rd, rs1, rs2 | 64-bit | 8 |
Same encoding family as BSRCH, with funct7[6:3] distinguishing the operation:
- BSRCH: funct7 = 0010xxx
- BSCAN: funct7 = 0011xxx
Latency and semantics identical to BSRCH except the comparison is equality rather than ≥.
Use Case: Hash Bucket Probe
// 8-way hash bucket with 16-bit fingerprints; probe for match
int probe_bucket(uint16_t fingerprint, uint16_t *bucket) {
BSCAN.H pos, fingerprint, bucket; // 4 cycles
if (pos < 8) return bucket_values[pos];
return MISS;
}
For an in-memory hash table with cuckoo or chaining within bucket-sized arrays, BSCAN.H + BSCAN.B replace the multi-cycle scan and branch sequence with single-instruction, 4-cycle probes.
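A portable model of the bucket probe makes the not-found handling explicit. The sketch below assumes the 8-way bucket occupies the first 8 lanes of a 64-byte line and that the remaining lanes may hold arbitrary data; `bscan_h_model`, `BUCKET_WAYS`, the `values` array, and the `MISS` value are illustrative names, not defined by the extension.

```c
#include <stdint.h>

#define BSCAN_H_KEYS 32u  /* 64-byte line / 2-byte keys */
#define BUCKET_WAYS   8u  /* lanes actually used by the bucket (assumed) */
#define MISS        (-1)  /* hypothetical miss marker */

/* Reference model of BSCAN.H: index of the first key == target, else 32. */
unsigned bscan_h_model(uint16_t target, const uint16_t keys[BSCAN_H_KEYS])
{
    for (unsigned i = 0; i < BSCAN_H_KEYS; i++)
        if (keys[i] == target)
            return i;
    return BSCAN_H_KEYS;
}

/* Probe an 8-way bucket stored in the first 8 lanes of a 64-byte line. */
int probe_bucket(uint16_t fingerprint, const uint16_t line[BSCAN_H_KEYS],
                 const int values[BUCKET_WAYS])
{
    unsigned pos = bscan_h_model(fingerprint, line);
    /* A match in lanes 8..31 lies outside the bucket and counts as a miss. */
    return (pos < BUCKET_WAYS) ? values[pos] : MISS;
}
```

The `pos < BUCKET_WAYS` check covers both the sentinel (32) and accidental matches in the unused lanes, so the padding does not need to be cleared.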
7.3 BSHIFT — Block Shift for Insert/Delete
When inserting a new key into a sorted B-tree node, all keys at and after the insertion position must shift right by one slot. When deleting, the keys after the deletion position shift left. This is fundamentally a memmove operation on a small fixed-size range.
Existing Xcrisp BMCPY (§5.5) handles general overlap-aware block memory copy and is the right tool for shifts on multi-cache-line nodes. For the common case of B-tree node shifts within a 64-byte cache line, BSHIFT is a single-instruction primitive:
| Mnemonic | Operation |
|---|---|
| BSHIFTR.X rd, rs1, rs2 | Shift keys in mem[rs1] right (toward higher addresses) by rs2 slots, starting at position 0 |
| BSHIFTL.X rd, rs1, rs2 | Shift keys in mem[rs1] left (toward lower addresses) by rs2 slots, starting at position rs2 |
Variants for width .X ∈ {B, H, W, D} match the key sizes of BSRCH.
rs2 is the shift count (typically 1 for insert-one or delete-one operations). rd returns the actual number of slots shifted (lower than rs2 if the shift would have moved data outside the 64-byte window — useful for chain insertion across nodes).
Encoding
BSHIFT: opcode = 0x5B, funct3 = 010, funct7 = 0100xxd
funct7[2:1] = width selector (00=.B, 01=.H, 10=.W, 11=.D)
funct7[0] = direction (0 = right/insert, 1 = left/delete)
Latency: 5 cycles (load 64 bytes + barrel shift + store 64 bytes back). Throughput: 1 per 5 cycles.
Example: B-tree Insert at Found Position
; Find insertion position
BSRCH.W pos, key, node_keys ; 4 cycles
; Compute the address of slot pos
slli offset, pos, 2 ; pos × 4 = byte offset
add slot_addr, node_keys, offset
; Shift keys at pos..end right by 1 (assumes BSHIFTR accepts a base inside the node's 64-byte line)
li count, 1
BSHIFTR.W rd_count, slot_addr, count ; 5 cycles
; Slot at pos is now free: write new key
sw key, 0(slot_addr)
; Done: 9 cycles in the two B-tree primitives (excluding scalar setup and cache effects)
vs standard RV64GC (which uses scalar memmove with conditional branches): ~60 cycles for the equivalent operation.
~7× speedup on B-tree insertion, sustained at every level of the tree during an insert path.
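For comparison, the same search, shift, and store sequence can be written in portable C. This is a sketch of the software baseline for a 16-key node of 32-bit keys, not the author's benchmark code; `node_insert` and `NODE_KEYS` are illustrative names, and the caller is assumed to guarantee that the node is not full.

```c
#include <stdint.h>
#include <string.h>

#define NODE_KEYS 16u  /* 32-bit keys in one 64-byte node */

/* Insert key into a sorted node holding n < NODE_KEYS keys.
 * Returns the slot the key was written to. */
unsigned node_insert(int32_t keys[NODE_KEYS], unsigned n, int32_t key)
{
    /* Step 1 (BSRCH.W in the listing): first position with keys[pos] >= key. */
    unsigned pos = 0;
    while (pos < n && keys[pos] < key)
        pos++;

    /* Step 2 (BSHIFTR.W): move keys[pos..n-1] one slot toward higher addresses. */
    memmove(&keys[pos + 1], &keys[pos], (n - pos) * sizeof keys[0]);

    /* Step 3 (sw): store the new key in the freed slot. */
    keys[pos] = key;
    return pos;
}
```

The data-dependent loop in step 1 and the byte-count-dependent `memmove` in step 2 are the two parts that the B-tree primitives replace with fixed-latency instructions.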
7.4 Performance Impact on Database / Index Workloads
For a typical in-memory ordered index (B+ tree with 32-key nodes, 5-level tree, 10M-entry index):
| Operation | Standard RV64GC | With BSRCH/BSHIFT | Speedup |
|---|---|---|---|
| Point lookup (find one key) | ~600 cycles | ~60 cycles | 10× |
| Sequential range scan (init) | ~600 cycles (find start) | ~60 cycles | 10× |
| Insert one key | ~1200 cycles | ~150 cycles | 8× |
| Delete one key | ~1100 cycles | ~140 cycles | 8× |
| Bulk-load 1M entries | ~10 s at 380 MHz | ~1.3 s | 8× |
For workloads dominated by index access — relational query engines, key-value stores, sorted-set caches, ordered-merge joins — these speedups translate directly to overall throughput improvements.
The ~4.2K LUTs + 2 BSRAM blocks of dedicated B-tree hardware (§7.5) represent one of the highest speedup-per-LUT ratios in any FireStorm extension. For a system that hosts a serious database or indexed query engine, the B-tree primitives are likely the single most valuable instruction family.
7.5 Implementation Cost
Hardware structure:
- 64-byte staging register (one cache line, 512 bits) — read from D-cache or scratchpad in one cycle.
- 64 parallel 8-bit comparators, partitionable into 32×16-bit, 16×32-bit, or 8×64-bit modes for the different BSRCH widths.
- Priority encoder producing the lowest set position from the comparator outputs.
- Barrel shifter (64-byte rotate) for BSHIFT.
- Write-back path to D-cache or scratchpad (for BSHIFT only).
| Component | LUTs |
|---|---|
| 64 × 8-bit comparator array (with width-mode mux) | ~1500 |
| Priority encoder (with width-mode mask) | ~300 |
| 64-byte barrel shifter | ~2000 |
| Register-file interface and result formatting | ~200 |
| Decoder and dispatcher | ~150 |
| Total | ~4150 LUTs |
Plus 2 BSRAM blocks for the staging register (one for load, one for store-back).
On the GW5AST-138: ~3% of LUT budget.
7.6 Mode Behaviour and Composability
Both BSRCH and BSHIFT are available in narrow and wide modes. In wide mode:
- Register fields use the extension nibble for 6-bit register indices (rd, rs1, rs2).
- Xcond predication (bit 35 = PRED-EN) applies normally — predicated B-tree search is useful for skipping work in deleted/empty nodes.
- The 6-bit rd accommodates BSRCH.B's full 0–64 result range; in narrow mode, BSRCH.B is restricted to a 32-key staging area (returning 0–32) since the 5-bit narrow rd cannot represent positions 33–64.
In narrow mode, BSRCH.B should be used with caution — the 32-key restriction is a structural constraint, not a hardware capability difference. Code that needs to search the full 64-byte staging area uses wide mode, or uses BSRCH.H (32×16-bit) instead.
The B-tree primitives compose naturally with:
- Xcrisp BMCPY for cross-node shifts (when an insert overflows the 64-byte BSHIFT window).
- Xcond predication for conditional searches in compressed/sparse indexes.
- Xcrisp X-type indexed loads (wide mode) for following child pointers after a search.
8. Position-Independent Code (Wide-Mode Only)
8.1 Motivation
Standard RV64GC requires two-instruction sequences for every PC-relative access: AUIPC rd, hi20; ADDI rd, rd, lo12 for address materialisation, AUIPC t, hi20; LD rd, lo12(t) for global loads, and AUIPC t, hi20; JALR rd, lo12(t) for long-range direct calls. For a modular system where hot-path code dispatches through GOTs, vtables, or PLT trampolines, these pairs dominate the cross-module instruction count.
The Xcrisp PIC family compresses each of these patterns into a single 36-bit wide-mode instruction, available only in wide mode (36-bit SRAM fetch). The encoding uses the 0x7F escape mechanism (§8.2) rather than consuming one of the four remaining funct3 slots in custom-2, which keeps that space available for future narrow-mode extensions and gives the PIC instructions a much larger immediate field than they could obtain within standard RV64 opcode layout.
8.2 The 0x7F Escape Mechanism
The standard RISC-V instruction-length encoding reserves bits[6:0] = 1111111 (0x7F) for instructions longer than 64 bits. In a 32-bit slot, a value of bits[6:0] = 0x7F is therefore unused and traps as illegal-instruction in any standard RV64 implementation.
FireStorm wide mode (36-bit fetch only) repurposes 0x7F as a wide-mode extension marker:
35 34 7 6 0
+---+----------------------------------+-------------+
| F | 28-bit custom payload | 1111111 |
+---+----------------------------------+-------------+
- Bit [35] selects between two top-level wide-PIC formats (§8.3, §8.4).
- Bits [34:7] are the 28-bit instruction payload.
- Bits [6:0] = 0x7F mark the wide-extension instruction.
In narrow mode (DDR3 fetch), 0x7F remains illegal-instruction — the escape is invisible to standard RV64 code. In wide mode (36-bit SRAM fetch), the decoder sees the marker and dispatches to the wide-extension decode path.
The escape mechanism is a general-purpose lane for wide-mode-only instructions. The current Xcrisp PIC family uses it; future FireStorm extensions (DSP primitives, accelerators, etc.) may reuse the same mechanism by allocating sub-encodings within the 28-bit payload, coordinated through the format bit at [35] and a per-format dispatch sub-field.
8.3 W-Type Format (bit[35] = 0) — PC-Relative
For PC-relative loads, address materialisation, and direct calls:
35 34 16 15 13 12 7 6 0
+---+-------------+------+--------+-------------+
| 0 | imm[18:0] |funct3| rd | 1111111 |
+---+-------------+------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| format | [35] | 0 = W-type |
| imm[18:0] | [34:16] | 19-bit signed PC-relative offset (scaled per-instruction) |
| funct3 | [15:13] | Variant selector |
| rd | [12:7] | 6-bit destination register (wide-register-file access) |
| opcode | [6:0] | 0x7F |
The W-type provides 19 bits of signed immediate, scaled by the natural unit of the operation (byte / halfword / word / dword) to give an effective reach of ±256 KiB to ±2 MiB depending on variant.
| funct3 | Mnemonic | Scaling | Effective range | Operation |
|---|---|---|---|---|
| 000 | LDPC | ×8 (dword) | ±2 MiB | rd = mem64[PC + sext(imm) × 8] |
| 001 | LWPC | ×4 (word) | ±1 MiB | rd = sext32(mem32[PC + sext(imm) × 4]) |
| 010 | LWUPC | ×4 (word) | ±1 MiB | rd = zext32(mem32[PC + sext(imm) × 4]) |
| 011 | LAPC | ×1 (byte) | ±256 KiB | rd = PC + sext(imm) (address materialisation) |
| 100 | JALPC | ×2 (hword) | ±512 KiB | rd = PC + 4; PC = PC + sext(imm) × 2 |
| 101 | JALXPC | ×8 (dword) | ±32 KiB | Indexed PC-relative jump-and-link; see §9.4 |
| 110 | X-type dispatch | — | — | Indexed memory load; imm field repurposed (§9) |
| 111 | reserved | — | — | illegal-instruction |
The scaling factors match each operation's natural alignment: dwords are 8-byte aligned in the global data area, words are 4-byte aligned, byte-precision is needed for &char_data patterns, halfword precision is sufficient for RVC-aware call targets.
The rd field is 6 bits wide, giving direct access to all 64 wide-mode integer registers (x0–x63) without any further extension mechanism. This is a property of the W-type's roomier encoding compared to standard RV64 instructions.
Semantics note. The PC value used for relative addressing is the address of the PIC instruction itself, matching AUIPC convention. The immediate is sign-extended from 19 bits to 64 bits and then scaled.
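The sign-extend-then-scale rule can be sketched in C as follows. The helper names are hypothetical; the scale factors are taken from the table above for funct3 = 000 through 100, and the model returns a scale of 0 for the variants that do not use the plain 19-bit immediate.

```c
#include <stdint.h>

/* Sign-extend the 19-bit W-type immediate field to 64 bits. */
int64_t wtype_sext19(uint32_t imm_field)
{
    uint32_t v = imm_field & 0x7FFFFu;
    return (v & 0x40000u) ? (int64_t)v - 0x80000 : (int64_t)v;
}

/* Bytes per immediate unit for funct3 = 0..4 (LDPC, LWPC, LWUPC, LAPC, JALPC).
 * Returns 0 for funct3 values that do not use the plain 19-bit immediate. */
unsigned wtype_scale(unsigned funct3)
{
    static const unsigned scale[8] = { 8, 4, 4, 1, 2, 0, 0, 0 };
    return scale[funct3 & 7u];
}

/* Address referenced by a W-type instruction located at pc. */
uint64_t wtype_target(uint64_t pc, unsigned funct3, uint32_t imm_field)
{
    int64_t off = wtype_sext19(imm_field) * (int64_t)wtype_scale(funct3);
    return pc + (uint64_t)off;   /* wraps modulo 2^64 for negative offsets */
}
```

With the most negative field value (0x40000) the LDPC offset is -262144 × 8 bytes, which is where the ±2 MiB figure in the table comes from.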
8.4 WI-Type Format (bit[35] = 1) — Register-Indirect
For indirect calls (vtable dispatch, PLT, function pointer through a structure):
35 34 23 22 17 16 13 12 7 6 0
+---+-----------+-------+------+--------+-------------+
| 1 | imm[11:0] | rs1 |funct4| rd | 1111111 |
+---+-----------+-------+------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| format | [35] | 1 = WI-type |
| imm[11:0] | [34:23] | 12-bit signed byte-precise offset |
| rs1 | [22:17] | 6-bit base register |
| funct4 | [16:13] | Variant selector |
| rd | [12:7] | 6-bit destination register |
| opcode | [6:0] | 0x7F |
Both rs1 and rd are 6-bit fields, giving full x0–x63 access without an extension nibble.
| funct4 | Mnemonic | Operation | Replaces |
|---|---|---|---|
| 0000 | CALLM | rd = PC + 4; PC = mem64[rs1 + sext(imm)] | ld t, off(rs1); jalr rd, t |
| 0001 | JMPM | PC = mem64[rs1 + sext(imm)] (no return-address save) | ld t, off(rs1); jr t |
| 0010–1111 | reserved | — | — |
CALLM is the canonical vtable / PLT dispatch instruction. JMPM is the tail-call variant (no return address saved); compilers emit it for goto *fp patterns and for the final hop of trampolines.
The 12-bit byte-precise offset gives ±2 KiB of reach within the indirected table, comfortably covering vtables of up to 256 entries (8 bytes per slot) or PLT-sized dispatch tables.
8.5 Linker Relaxation
Toolchains targeting +xfirestorm emit standard AUIPC + ADDI/LD/JALR pairs against PIC relocations. The linker examines each pair after final layout:
- If the target address is within the W-type reach for the operation, the linker relaxes the pair into a single W-type instruction (replacing 8 bytes of pair with 4 bytes of PIC instruction plus 4 bytes of NOP, or compacting the section if alignment permits).
- If the resulting code is in a wide section (.text.wide), relaxation is permitted; in narrow sections (.text, .text.crisp), the standard pair is kept.
- If the target is out of range, the pair is left as-is.
This means existing PIC-aware code recompiled with +xfirestorm and placed in wide sections automatically gains the density and performance wins, without source-level changes. The linker's PIC relaxation pass operates per-section after layout, similar to existing RISC-V relaxation for JAL ↔ AUIPC+JALR.
8.6 Wide-Mode-Only Restriction
The 0x7F escape, and therefore the entire PIC instruction family, is unavailable in narrow mode. A standard RV64 implementation receiving a 0x7F instruction will trap as illegal-instruction (the spec-defined behaviour for unallocated 32-bit opcodes). FireStorm's narrow-mode decoder follows the spec.
Consequences:
- PIC instructions live only in .text.wide. A linker that attempts to place a W-type or WI-type instruction in a narrow section is in error; toolchains must enforce this.
- DDR3-resident code keeps using standard AUIPC + ADDI/LD/JALR. Modular dynamic-loaded code where modules are in DDR3 is unaffected by this extension and continues to work exactly as on any RV64 implementation.
- Module trampolines benefit asymmetrically. A trampoline placed in 36-bit SRAM that bridges DDR3 modules can use CALLM to dispatch into the target module in one instruction, while the modules themselves remain narrow-PIC. This is a clean fit for AntOS-style module dispatch.
8.7 Wide Register Extension
Unlike Xcrisp instructions in standard RV64 opcode space (§3–§6), the PIC family does not use the extension-nibble scheme. The W-type and WI-type encodings already provide 6-bit register fields natively (rd at [12:7], rs1 at [22:17]), addressing x0–x63 directly.
The 36-bit fetch still carries 4 bits beyond the standard 32-bit instruction word, but for PIC instructions those bits are entirely consumed by the format bit [35] and the larger immediate/funct4 fields; there are no spare bits available for hints or sub-encoding.
8.8 Worked Examples
LDPC x40, .Lglobal_table — load a global pointer from a table 1024 bytes ahead:
PC offset = +1024 bytes = +128 dwords = imm 0x080
funct3 = 000 (LDPC)
rd = 40 (6-bit field = 0b101000)
imm = 0x00080
format = 0
Encoding bits [35:0]: 0 0000000000010000000 000 101000 1111111
= 0x00080147F (36-bit slot)
LAPC x12, .Lstring_const — materialise the address of a string constant 7 bytes ahead (byte-precise):
imm = 7
funct3 = 011 (LAPC)
rd = 12
format = 0
Encoding: 0 0000000000000000111 011 001100 1111111
CALLM x1, 24(x10) — vtable dispatch: load function pointer at offset 24 from x10, call with ra = x1:
imm = 24
rs1 = 10
funct4 = 0000 (CALLM)
rd = 1
format = 1
Encoding: 1 000000011000 001010 0000 000001 1111111
JALPC x1, .Lfar_function — direct call to a target 200 KiB ahead, within JALPC's ±512 KiB range (which is narrower than standard JAL's ±1 MiB):
PC offset = +204800 bytes / 2 = 102400 = imm 0x19000
funct3 = 100 (JALPC)
rd = 1 (ra)
format = 0
Encoding: 0 0011001000000000000 100 000001 1111111
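The field packing used in these worked examples can be checked with a pair of small encoders. This is a sketch under the layouts given in §8.3 and §8.4, holding each 36-bit slot in a `uint64_t`; the function names `encode_wtype` and `encode_witype` are assumptions for illustration.

```c
#include <stdint.h>

#define WIDE_ESCAPE 0x7Full  /* bits [6:0] of every wide-extension instruction */

/* W-type: bit 35 = 0, imm[18:0] at [34:16], funct3 at [15:13], rd at [12:7].
 * imm is the already-scaled immediate (e.g. dwords for LDPC). */
uint64_t encode_wtype(int32_t imm, unsigned funct3, unsigned rd)
{
    return (((uint64_t)(uint32_t)imm & 0x7FFFFull) << 16) |
           ((uint64_t)(funct3 & 0x7u)  << 13) |
           ((uint64_t)(rd     & 0x3Fu) << 7)  |
           WIDE_ESCAPE;
}

/* WI-type: bit 35 = 1, imm[11:0] at [34:23], rs1 at [22:17],
 * funct4 at [16:13], rd at [12:7]. imm is a byte offset. */
uint64_t encode_witype(int32_t imm, unsigned rs1, unsigned funct4, unsigned rd)
{
    return (1ull << 35) |
           (((uint64_t)(uint32_t)imm & 0xFFFull) << 23) |
           ((uint64_t)(rs1    & 0x3Fu) << 17) |
           ((uint64_t)(funct4 & 0xFu)  << 13) |
           ((uint64_t)(rd     & 0x3Fu) << 7)  |
           WIDE_ESCAPE;
}
```

For instance, `encode_wtype(128, 0, 40)` packs the LDPC x40 fields, `encode_wtype(0x19000, 4, 1)` packs JALPC x1, and `encode_witype(24, 10, 0, 1)` packs CALLM x1, 24(x10); printing the results in binary should reproduce the field groupings shown in the listings.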
8.9 Compiler Patterns
The compiler should recognise and emit PIC family instructions for:
| Source pattern | Emitted (wide section, target in range) |
|---|---|
| &global_var | LAPC rd, global_var |
| global_long_var | LDPC rd, global_long_var |
| global_int_var (signed) | LWPC rd, global_int_var |
| global_uint_var | LWUPC rd, global_uint_var |
| extern_function() (direct call, in range) | JALPC ra, extern_function |
| vtbl->method() (virtual call) | CALLM ra, off(vtbl) |
| goto *fp (computed goto, indirect) | JMPM zero, 0(fp) |
| tail_call_thunk() (tail call through pointer) | JMPM zero, off(rs1) |
Out-of-range targets fall back to the standard AUIPC + ADDI/LD/JALR pair, which the linker leaves un-relaxed.
In wide mode, Xcrisp instructions participate in the standard extension nibble scheme. The nibble bits map to register fields according to which format the instruction uses:
| Family | Format | rd ext | rs1 ext | rs2 ext | rs3 ext (op-store) | Spare |
|---|---|---|---|---|---|---|
| Auto-inc loads (§3) | I-type | bit[32] | bit[33] | — | — | bits[35:34] |
| Auto-inc stores (§4) | S-type | — | bit[33] | bit[34] | — | bit[35], bit[32] |
| Load-op (§5.2) | R-type | bit[32] | bit[33] | bit[34] | — | bit[35] |
| Op-store (§5.3) | R-type | — | bit[33] | bit[34] | bit[32] (rd-as-rs3) | bit[35] |
| Load-op-store (§5.4) | R-type | bit[32] (rd-as-dest-addr) | bit[33] | bit[34] | — | bit[35] |
| Block memory (§5.5) | R-type | bit[32] | bit[33] | bit[34] | — | bit[35] |
| Compare-mem-branch (§6) | B-type | — | bit[33] | bit[34] | — | bit[35], bit[32] |
(For op-store, bit[32] extends the field interpreted as rs3 rather than a destination. For load-op-store, the same bit[32] extends the destination memory base address; the field is read from the register file but the register itself is not written.)
In narrow mode, all register operands are restricted to x0–x31 / f0–f31 as usual. The nibble bits do not exist — the instruction occupies a standard 32-bit slot in DDR3, and the assembler rejects any operand naming x32–x63.
9. Indexed Addressing (Wide-Mode Only)
9.1 Motivation
Standard RV64GC has no scaled-indexed addressing mode. Every non-sequential array access requires an explicit shift-and-add sequence:
slli t0, idx, 2 ; idx * 4 (byte offset for 32-bit elements)
add t0, base, t0
lw val, 0(t0)
The Zba extension's sh1add, sh2add, sh3add collapse the shift-add pair into one instruction for the common ×2, ×4, ×8 scales, reducing the sequence to:
sh2add t0, idx, base
lw val, 0(t0)
This is good for one-dimensional arrays, but Zba does not directly cover the load itself, and its scales stop at ×8. Multi-dimensional access and larger element strides remain multi-instruction sequences:
slli t0, row, 6 ; row * 64 (row stride for int matrix[16][16])
add t0, base, t0
slli t1, col, 2
add t0, t0, t1
lw val, 0(t0)
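Five instructions for a 2D array access. Zba can fuse only the column shift-add pair (into sh2add); the ×64 row stride is beyond its ×8 limit and still needs an explicit slli/add.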
The Xcrisp indexed addressing family (this section) provides single-instruction load + scale + add for the full set of integer widths, with scales ×1 through ×128. The primary workloads served are:
- Hash table probes and sparse array access — non-sequential reads where the index doesn't fit a loop induction pattern.
- Jump table dispatch for switch statements, state machines, and interpreters, covered by JALXPC (§9.4) rather than the load family.
- 2D matrix and struct-array access where stride is a power of two larger than 8.
- Generated code (JIT, dynamic linkers, byte-code interpreters) that resolves addresses at runtime through tables.
The family is wide-mode only because it uses encoding bits in the 0x7F escape (§8.2) that do not exist in narrow-mode 32-bit fetches. Narrow-mode code continues to use Zba shift-add sequences. Code that wants the indexed forms places itself in .text.wide.
9.2 X-Type Format
The indexed-load instructions occupy a sub-encoding of the W-type format (§8.3) gated by funct3 = 110. When the W-type decoder sees funct3 = 110, the 19-bit imm field is reinterpreted as the X-type payload:
35 34 29 28 23 22 20 19 17 16 15 13 12 7 6 0
+---+------+------+------+------+----+------+--------+-------------+
| 0 | rs1 | rs2 |scale | w+s | r | 110 | rd | 1111111 |
+---+------+------+------+------+----+------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| format | [35] | 0 = W-type family |
| rs1 | [34:29] | 6-bit base register (x0–x63) |
| rs2 | [28:23] | 6-bit index register (x0–x63) |
| scale | [22:20] | 3-bit scale selector (×1, ×2, ×4, ×8, ×16, ×32, ×64, ×128) |
| w+s | [19:17] | 3-bit width-and-sign selector (see §9.3) |
| r | [16] | reserved, must be zero in v0.1 |
| funct3 | [15:13] | 110 (X-type dispatch within W-type) |
| rd | [12:7] | 6-bit destination register |
| opcode | [6:0] | 0x7F |
The effective address is computed as:
addr = rs1 + zext64(rs2) × (1 << scale_log2)
where scale_log2 is the value of the scale field (0–7). The index register is read as an unsigned 64-bit value and no implicit sign extension of a narrower index takes place: array indices are unsigned by convention, and code holding a signed 32-bit index must sign-extend it into the index register first.
The scale set covers:
- ×1: byte-precise random access (rare; mostly for symmetry).
- ×2, ×4, ×8: matches Zba's sh1add/sh2add/sh3add scales and the natural element sizes for halfword/word/dword arrays.
- ×16: 16-byte structures (a common C struct stride for AoS data).
- ×32: 32-byte cache-line-aligned records.
- ×64, ×128: row strides for 16- and 32-wide matrix layouts.
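The effective-address formula can be modelled in a few lines of C. This sketch assumes 64-bit wrap-around arithmetic, which is how a pre-sign-extended negative index still lands below the base; the function name `xtype_addr` is illustrative.

```c
#include <stdint.h>

/* Effective address of an X-type access: base + index * 2^scale_log2.
 * Arithmetic wraps modulo 2^64, as unsigned 64-bit arithmetic does in C. */
uint64_t xtype_addr(uint64_t base, uint64_t index, unsigned scale_log2)
{
    return base + (index << (scale_log2 & 7u));  /* scale field is 3 bits */
}
```

`xtype_addr(0x1000, 3, 2)` gives 0x100C (element 3 of a word array), and `xtype_addr(0x1000, (uint64_t)-1, 2)` gives 0x0FFC, the word just below the base.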
9.3 Indexed Loads
The width-and-sign field selects the access:
| w+s | Mnemonic | Width | Sign | Operation |
|---|---|---|---|---|
| 000 | LBX | byte | signed | rd = sext8(mem8[addr]) |
| 001 | LBUX | byte | unsigned | rd = zext8(mem8[addr]) |
| 010 | LHX | half | signed | rd = sext16(mem16[addr]) |
| 011 | LHUX | half | unsigned | rd = zext16(mem16[addr]) |
| 100 | LWX | word | signed | rd = sext32(mem32[addr]) |
| 101 | LWUX | word | unsigned | rd = zext32(mem32[addr]) |
| 110 | LDX | dword | (n/a) | rd = mem64[addr] |
| 111 | reserved | — | — | illegal-instruction |
Each instruction occupies a single wide-mode slot and completes in the same number of cycles as a standard load (2 cycles latency, 1-cycle throughput in steady state on the reference pipeline).
The reserved w+s = 111 slot is held for a possible future 128-bit load (LQX) or a sign-extended halfword-into-32-bit form. No specific allocation in v0.1.
Assembler Syntax
LWX rd, (rs1, rs2, scale) ; canonical
LWX rd, scale(rs1, rs2) ; 68k-style alternative
LWX rd, [rs1 + rs2 * scale_factor] ; verbose form (scale_factor = 1, 2, 4, ..., 128)
The assembler accepts any of these forms and normalises to the canonical representation.
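Whichever syntax is used, the assembler ends up packing the same five fields. The following C sketch shows that packing under the X-type layout of §9.2; `encode_xtype` and the `xtype_ws` enumerators are hypothetical names, and the scale is passed as its log2 (2 for ×4).

```c
#include <stdint.h>

/* w+s selector values, in the order of the indexed-load table (000..110). */
enum xtype_ws { XT_LBX, XT_LBUX, XT_LHX, XT_LHUX, XT_LWX, XT_LWUX, XT_LDX };

/* Pack an X-type indexed load into a 36-bit slot held in a uint64_t.
 * Bit 35 (format) and bit 16 (reserved) are left at zero. */
uint64_t encode_xtype(unsigned rd, unsigned rs1, unsigned rs2,
                      unsigned scale_log2, enum xtype_ws ws)
{
    return ((uint64_t)(rs1 & 0x3Fu)        << 29) |  /* base register  */
           ((uint64_t)(rs2 & 0x3Fu)        << 23) |  /* index register */
           ((uint64_t)(scale_log2 & 7u)    << 20) |  /* scale selector */
           ((uint64_t)((unsigned)ws & 7u)  << 17) |  /* width and sign */
           (6ull                           << 13) |  /* funct3 = 110   */
           ((uint64_t)(rd & 0x3Fu)         << 7)  |  /* destination    */
           0x7Full;                                   /* wide escape    */
}
```

As a usage example, `encode_xtype(10, 10, 11, 2, XT_LWX)` corresponds to LWX a0, (a0, a1, ×4), with a0 = x10 and a1 = x11 under the standard RISC-V ABI register names.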
9.4 JALXPC — Indexed PC-Relative Jump
For switch dispatch and other table-indexed jumps, JALXPC reads a target address from a PC-relative jump table and transfers control to it. Encoded as W-type funct3 = 101:
35 34 29 28 16 15 13 12 7 6 0
+---+------------+--------------+------+--------+-------------+
| 0 | rs2 | imm[12:0] | 101 | rd | 1111111 |
+---+------------+--------------+------+--------+-------------+
| Field | Bits | Meaning |
|---|---|---|
| format | [35] | 0 = W-type family |
| rs2 | [34:29] | 6-bit index register (zero-extended) |
| imm[12:0] | [28:16] | 13-bit signed PC-relative offset, scaled ×8 (±32 KiB) |
| funct3 | [15:13] | 101 |
| rd | [12:7] | Link register; x0 for plain jump, otherwise receives PC+4 |
| opcode | [6:0] | 0x7F |
Operation:
table_addr = PC + sext(imm) × 8
target = mem64[table_addr + zext64(rs2) × 8]
if (rd != x0):
    rd = PC + 4
PC = target
When rd = x0, the assembler emits the mnemonic JMPXPC (no-link form). When rd != x0, the mnemonic is JALXPC and the link is captured.
The 13-bit ×8-scaled immediate gives ±32 KiB of reach from the dispatching instruction to the base of the jump table. Switch tables are typically allocated in .rodata near the function emitting the dispatch, and 32 KiB comfortably covers most cases; for very large code modules with distant tables, the compiler falls back to LAPC + LDX + JALR.
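The address arithmetic behind JALXPC can be sketched in C to make the reach calculation concrete. The helpers below compute only the address of the table entry that would be loaded, not the jump itself; the names `jalxpc_sext13` and `jalxpc_entry_addr` are assumptions for illustration.

```c
#include <stdint.h>

/* Sign-extend the 13-bit JALXPC immediate field. */
int64_t jalxpc_sext13(uint32_t imm_field)
{
    uint32_t v = imm_field & 0x1FFFu;
    return (v & 0x1000u) ? (int64_t)v - 0x2000 : (int64_t)v;
}

/* Address of the jump-table entry that JALXPC loads its target from:
 * table_addr = pc + sext(imm) * 8, entry = table_addr + index * 8. */
uint64_t jalxpc_entry_addr(uint64_t pc, uint32_t imm_field, uint64_t index)
{
    uint64_t table_addr = pc + (uint64_t)(jalxpc_sext13(imm_field) * 8);
    return table_addr + index * 8u;
}
```

The extreme field values 0x1000 and 0x0FFF scale to -32768 and +32760 bytes, which is the ±32 KiB reach quoted above.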
9.5 Examples
9.5.1 Single-Dimensional Array Load
int load_elem(int *arr, size_t idx) {
return arr[idx];
}
Standard RV64GC (Zba):
load_elem:
sh2add t0, a1, a0
lw a0, 0(t0)
ret
3 instructions.
FireStorm (wide mode, X-type):
load_elem:
LWX a0, (a0, a1, ×4) ; a0 = sext32(mem32[a0 + a1 * 4])
ret
2 instructions. One fewer instruction, no temporary register consumed.
9.5.2 2D Matrix Access
int load_cell(int matrix[16][16], size_t row, size_t col) {
return matrix[row][col];
}
The row stride is 16 ints = 64 bytes; the column scale is ×4.
Standard RV64GC:
load_cell:
slli t0, a1, 6 ; row * 64
add t0, a0, t0 ; &matrix[row][0]
sh2add t0, a2, t0 ; + col * 4
lw a0, 0(t0) ; load
ret
5 instructions (using Zba sh2add for the inner stride).
FireStorm (wide mode, X-type final column step):
load_cell:
slli t0, a1, 6 ; row * 64 (no Zba scale for 64; must use slli)
add t0, a0, t0 ; &matrix[row][0]
LWX a0, (t0, a2, ×4) ; a0 = sext32(mem32[t0 + col * 4])
ret
4 instructions. One fewer instruction than Zba; the X-type fuses the column scale-and-load.
For 2D access X-type captures only part of the win because there is no indexed address-materialise form (no "LAX") — the row stride still requires an explicit slli/add pair to compute the row base. Adding LAX is a v0.2 candidate (see §13). With LAX:
load_cell:
LAX t0, (a0, a1, ×64) ; t0 = a0 + row * 64 — hypothetical v0.2
LWX a0, (t0, a2, ×4) ; load column-scaled — single instruction
ret
would bring this to 3 instructions. v0.1 stops at the 4-instruction form.
9.5.3 Switch Statement Dispatch
int dispatch(int op, int arg) {
switch (op) {
case 0: return op_add(arg);
case 1: return op_sub(arg);
case 2: return op_mul(arg);
case 3: return op_div(arg);
default: return 0;
}
}
Compiler emits a jump table and dispatches via it (assuming bounds-checked op).
Standard RV64GC (PIC, no Xcrisp PIC):
dispatch:
li t0, 4
bgeu a0, t0, .Ldefault
.La:
auipc t0, %pcrel_hi(jump_table)
addi t0, t0, %pcrel_lo(.La)
sh3add t0, a0, t0
ld t0, 0(t0)
jr t0
.Ldefault:
li a0, 0
ret
The dispatch path: 7 instructions total (bounds check + 5 for the table jump).
FireStorm with JMPXPC:
dispatch:
li t0, 4
bgeu a0, t0, .Ldefault
JMPXPC a0, jump_table ; PC = mem64[jump_table + a0 * 8]
.Ldefault:
li a0, 0
ret
Dispatch path: 3 instructions, down from 7 (a ~57% reduction). In addition, the entries in jump_table can themselves use JALPC for the inner-handler relative jumps.
This pattern is hot in interpreters (Z-machine, bytecode VMs), parsers, OS syscall dispatch, and any code with a high-fan-out conditional. For the AntOS syscall path, cutting the dispatch instruction count by more than half is a measurable kernel-entry latency improvement.
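The bounds check plus table jump can be modelled in C with a function-pointer table. This is a sketch, not the compiler's actual lowering: the handler bodies are invented placeholders, and a real switch table would point at case labels rather than functions.

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

/* Placeholder handlers; the bodies are invented for illustration only. */
static int op_add(int arg) { return arg + 1; }
static int op_sub(int arg) { return arg - 1; }
static int op_mul(int arg) { return arg * 2; }
static int op_div(int arg) { return arg / 2; }

/* Four 8-byte entries, matching the x8 scale in the JMPXPC comment. */
static const handler_fn jump_table[4] = { op_add, op_sub, op_mul, op_div };

static int dispatch(int op, int arg)
{
    /* li t0, 4 ; bgeu a0, t0, .Ldefault
     * The unsigned compare also rejects negative op values. */
    if ((unsigned)op >= 4u)
        return 0;
    /* JMPXPC: fetch the entry at jump_table + op * 8 and jump to it. */
    return jump_table[op](arg);
}
```

The single unsigned comparison is what lets the bounds check stay at two instructions in both listings.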
9.5.4 Hash Table Probe
entry_t *probe(entry_t *table, uint64_t hash, uint64_t mask) {
    return &table[hash & mask];
}
Assuming entry_t is 16 bytes:
Standard RV64GC:
probe:
and t0, a1, a2
slli t0, t0, 4 ; *16; no Zba ×16
add a0, a0, t0
ret
4 instructions.
FireStorm:
probe:
and t0, a1, a2
; need an "LAX" address-compute, which doesn't exist in v0.1.
; Fallback: compute address explicitly then load if needed.
slli t0, t0, 4
add a0, a0, t0
ret
Same 4 instructions — X-type doesn't help if we want the address rather than the loaded value.
If the caller does entry->field right after, the load can be folded:
int probe_load(entry_t *table, uint64_t hash, uint64_t mask) {
    return table[hash & mask].first_field; /* assume first_field is int at offset 0 */
}
probe_load:
and t0, a1, a2
LWX a0, (a0, t0, ×16) ; a0 = sext32(mem32[a0 + t0 * 16])
ret
3 instructions versus 5 for standard (and/slli/add/lw/ret with Zba — sh4add is not in Zba, so the slli is mandatory).
1–2 instructions saved per probe, depending on Zba availability and exact element size.
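A C model of the folded probe makes the ×16 scale explicit. The sketch assumes the 16-byte entry stated above; the `pad` and `key` members and the `probe_offset` helper are hypothetical, added only so that the layout is concrete.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t  first_field;   /* int at offset 0, as the text assumes */
    int32_t  pad;           /* hypothetical filler */
    uint64_t key;           /* hypothetical second member */
} entry_t;

_Static_assert(sizeof(entry_t) == 16, "example assumes 16-byte entries");

/* Byte offset of the probed entry: the and, then the x16 scale. */
static uint64_t probe_offset(uint64_t hash, uint64_t mask)
{
    return (hash & mask) << 4;
}

/* Model of: and t0, a1, a2 ; LWX a0, (a0, t0, x16) */
static int32_t probe_load(const entry_t *table, uint64_t hash, uint64_t mask)
{
    const unsigned char *base = (const unsigned char *)table;
    const entry_t *e = (const entry_t *)(base + probe_offset(hash, mask));
    return e->first_field;
}
```

Because the field sits at offset 0, the scaled index alone forms the load address; a field at a non-zero offset would need an extra add on the base first.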
9.6 Compiler Patterns
The compiler emits X-type instructions for:
- Loop-variant array access (not loop-induction): a[idx] where idx is computed inside the loop body but a is loop-invariant. Loop-induction patterns (a[i++]) prefer LWPI / LDPI for the auto-inc.
- Hash table and dictionary probes where the entry size is power-of-two ≤ 128 bytes.
- Lookup table access with scaled-index addressing (sine tables, cosine tables, log/exp LUTs).
- switch statement dispatch via JALXPC when the jump table is within ±32 KiB.
- Function pointer table dispatch via JALXPC when the table is PC-resident.
The compiler does not emit X-type for:
- Sequential array iteration (LWPI / LDPI families do this better — single-cycle, no index register needed).
- 2D matrix access with non-power-of-two strides (the indexed form doesn't apply; falls back to mul/shift + add).
- Narrow-mode code (the entire 0x7F escape is wide-mode only).
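The selection rules above can be summarised as a small predicate. This is a sketch of the decision only, not compiler code: `xtype_scale_ok` and `use_xtype` are hypothetical names, and applying the 128-byte power-of-two limit to every indexed access (not only hash probes) is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Does an element size fit the assumed X-type scale set? */
static bool xtype_scale_ok(size_t elem_size)
{
    bool power_of_two = elem_size != 0 && (elem_size & (elem_size - 1)) == 0;
    return power_of_two && elem_size <= 128;
}

/* Should an indexed access a[idx] be emitted as an X-type load? */
static bool use_xtype(bool wide_mode, bool sequential_iteration,
                      size_t elem_size)
{
    if (!wide_mode)
        return false;               /* the 0x7F escape is wide-mode only */
    if (sequential_iteration)
        return false;               /* LWPI / LDPI handle this better */
    return xtype_scale_ok(elem_size);
}
```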
9.7 Wide-Mode-Only Restriction
The X-type and JALXPC instructions live in the 0x7F escape (§8.2). They are undefined behaviour in narrow mode and the assembler rejects them in narrow sections. Code that targets both modes must provide a narrow-mode fallback using Zba shift-add + standard load sequences.
The wide-register-file access (x0–x63) is available natively because all register fields in the X-type and JALXPC encodings are 6 bits wide.
9.8 Indexed Stores (Deferred to v0.2)
Indexed stores (SWX rs2_index, val, base) face an encoding challenge: the destination is memory, not a register, so the X-type's rd field has no natural use. Repurposing rd as the source value (i.e., "S-X-type") is straightforward, but the instruction must then read three registers (base, index, and stored value), which adds a third register-file read port and raises the implementation cost.
The use case for indexed stores is scatter writes (e.g., bucket-sort placement, hash table insertion, sparse vector update). These are real patterns but less hot than the gather (load) case in the workloads FireStorm targets. v0.1 omits indexed stores; v0.2 will revisit based on workload data.
For now, scatter writes use the standard Zba shift-add followed by a store:
sh2add t0, idx, base
sw val, 0(t0)
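For reference, the C-level scatter pattern that this two-instruction sequence serves looks like the following. It is a minimal sketch; `scatter_i32` is a hypothetical helper, and the 32-bit element type is chosen to match the sh2add/sw pair above.

```c
#include <stddef.h>
#include <stdint.h>

/* Write val[i] to base[idx[i]] for each i. Each store is the pattern
 * the sh2add + sw pair implements: address = base + idx * 4. */
static void scatter_i32(int32_t *base, const size_t *idx,
                        const int32_t *val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        base[idx[i]] = val[i];
}
```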
10. ABI and Calling-Convention Interaction
Xcrisp instructions do not change the calling convention. They are pure local operations within a function:
- Auto-inc forms update rs1 (a register); the caller/callee category of rs1 is unchanged from the standard RV64 ABI.
- Load-op, op-store, load-op-store, and block memory write back to standard registers (or memory); same ABI rules apply.
- Load-op-store writes only to memory; the rd field names a register that is read (as a base address), not written. ABI roles of pointer registers are unaffected.
- Block memory instructions are interruptible and may take many cycles, but they hold no hidden architectural state; only the named registers carry progress. Function-call semantics are unaffected.
A function compiled with +xcrisp is fully ABI-compatible with one compiled without; both observe the standard lp64d calling convention. The only externally visible consequence of +xcrisp is that the emitted code may contain Xcrisp instructions, which require an Xcrisp-aware decoder to execute.
A vanilla RV64GC implementation receiving Xcrisp code will trap on the first Xcrisp instruction (illegal-instruction in the custom opcode space). Mixed-mode deployment requires either runtime feature gating (test mxcrisp CSR before entering an Xcrisp code path) or build-time selection (separate object files).
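Runtime gating could be structured as below. This is a sketch under assumptions: mxcrisp is an M-mode CSR, so the value is assumed to have been read by firmware and handed to the library, and `code_path` and `select_path` are hypothetical names. Only the non-zero test from §2 is used.

```c
#include <stdint.h>

typedef enum {
    PATH_BASELINE = 0,   /* plain RV64GC object code */
    PATH_XCRISP   = 1    /* object code built with +xcrisp */
} code_path;

/* Decide once, before any Xcrisp instruction can be reached.
 * mxcrisp_value is the CSR value as supplied by M-mode firmware. */
static code_path select_path(uint64_t mxcrisp_value)
{
    return mxcrisp_value != 0 ? PATH_XCRISP : PATH_BASELINE;
}
```

A library would typically call this once at start-up and store the result (or a matching function pointer), so that the check never sits on a hot path.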
11. Compiler and Toolchain Integration
11.1 Target Flags
The +xcrisp target feature, alone or as part of +xfirestorm (= +xwide,+xcrisp), enables Xcrisp emission. With +xcrisp alone, the compiler emits Xcrisp instructions in standard .text (DDR3); with +xfirestorm, functions marked or detected as wide-eligible are placed in .text.wide (SRAM) and use both Xcrisp and the wide register file.
Per-function annotations: __attribute__((target("xcrisp"))) enables Xcrisp emission for a single function regardless of global flags.
11.2 Auto-Vectorization Patterns
The compiler should recognise the following C patterns and emit the corresponding Xcrisp sequences:
| Source pattern | Emitted sequence |
|---|---|
| *p++ = v (post-inc store) | SDPI v, k(p) for the natural element width |
| v = *p++ (post-inc load) | LDPI v, k(p) |
| *--p = v (pre-dec store) | SDPD v, k(p) |
| sum += *p (accumulator) | LDADD sum, (p), sum |
| *p &= mask (mask-in-place) | MMDAND [p], [p], mask (load-op-store, true in-place) |
| *p ^= v (xor-in-place) | MMDXOR [p], [p], v |
| dst[i] = src[i] + bias (element-wise transform) | MMWADD [d], [s], bias then advance pointers |
| dst[i] = src[i] << k (scaling pass) | MMWSLL [d], [s], k |
| while (*p != term) p++ | LBPI v, 1(p); BNEM v, (term_reg), loop |
| memcpy(d, s, n) | BMCPY d, s, n |
| memset(d, c, n) | BMSET d, c, n |
The accumulator and string-scan patterns yield the largest density wins per inner-loop iteration. The load-op-store patterns yield the largest performance wins on memory-bound kernels (audio, image, vector) where the inner loop is dst[i] = f(src[i], constant): each iteration's load → ALU → store collapses into one instruction with no register-file traffic.
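For concreteness, three of the source patterns from the table are written out below as complete C functions. These are only the C-side shapes a compiler would have to recognise; the function names are hypothetical, and the comments naming the matching Xcrisp instruction restate the table rather than describe any actual compiler output.

```c
#include <stddef.h>
#include <stdint.h>

/* sum += *p : accumulator pattern (LDADD candidate). */
static int64_t accumulate(const int64_t *p, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += *p++;
    return sum;
}

/* dst[i] = src[i] + bias : element-wise transform (MMWADD candidate). */
static void add_bias(int32_t *dst, const int32_t *src, size_t n, int32_t bias)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + bias;
}

/* while (*p != term) p++ : sentinel scan (compare-mem-branch candidate).
 * The caller must guarantee that term occurs in the buffer. */
static const char *scan_to(const char *p, char term)
{
    while (*p != term)
        p++;
    return p;
}
```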
11.3 Inline Assembly Constraints
GCC/Clang inline-assembly constraints for Xcrisp:
- =r (register destination), r (register operand): unchanged from base RV64.
- Q: memory operand allowed for Xcrisp memory-operand instructions; expanded to (rs1) form without offset.
- A new Xc constraint may be added for "this operand must be a register addressable by Xcrisp"; in practice equivalent to r since all Xcrisp register operands are unrestricted within their bank.
Compilers may emit Xcrisp instructions even when not explicitly requested via inline asm, if the C-level operation matches one of the patterns in §11.2 and +xcrisp is enabled.
12. Implementation Guidance
12.1 Pipeline Considerations
The fused instructions are designed to deliver both code density and performance wins. An implementation that captures only the density side leaves significant performance on the table; the notes below identify the microarchitectural paths that turn each fused form into a real cycle saving.
- Load-op fusion is most valuable when the implementation forwards the load result directly into the ALU input without writing the intermediate to the register file. A two-cycle issue (load-result available at cycle +1, ALU op at cycle +2) is the typical pattern; the fused instruction occupies one issue slot but two execution slots, which the implementation may schedule freely. Win sources: front-end throughput (one decode instead of two), register-file write port freed for an independent op, one fewer live temporary for the allocator, smaller prefetch-buffer footprint.
- Op-store fusion delivers latency and density wins simultaneously. The ALU result should be forwarded directly into the store-buffer entry, bypassing the register file entirely; the temporary never exists architecturally and need not exist physically. On a simple in-order pipeline this saves one cycle versus separate add; sw (no register-file write-then-read on the temp); on a multi-issue pipeline it relieves write-port pressure and frees an issue slot. The allocator also wins one fewer live temporary per pattern, which on a tight basic block may avoid a spill. The store-buffer entry should be allocated at decode time so the ALU result writes directly into it.
- Load-op-store fusion is the most aggressive Xcrisp instruction and the one with the largest possible performance win. The instruction does the work of three (lw; add/op; sw) in one instruction-fetch slot, with no architectural temporary, no register-file read/write of the intermediate value, and one fewer architectural register live across the operation. Recommended implementation: a three-stage internal pipeline (load → ALU → store) that allows back-to-back load-op-store instructions to overlap and achieve one-per-cycle throughput in steady state. The store buffer entry should be allocated at decode time so the ALU forwards directly into it (same path as op-store), and the load result should be forwarded into the ALU input without ever being written to a physical register (same path as load-op). A minimal implementation that decodes load-op-store into three sequential micro-ops still gets the density win and avoids the architectural-temporary write; the performance win scales with how much of the pipeline overlap is implemented.
- Auto-increment loads write two registers (rd and rs1); auto-increment stores update only rs1, alongside the memory write. A single-write-port register file must serialise the two-register update across two cycles; a two-write-port file (already required by some standard extensions) takes it in stride. The density win is unconditional; the performance win arrives when the second write port is available, otherwise it's a wash on cycles but still a win on fetch/decode bandwidth.
- Compare-mem-branch issues a load and a branch in one instruction. The load-latency critical path is unchanged from a separate load + branch (the branch must still wait for the loaded value), so the raw cycle count for this single pattern is identical. The wins come from the surrounding context: one fewer issue slot consumed, one fewer register-file read (the loaded value is consumed directly by the compare unit, never written back), one fewer architectural register live across the load (helpful for register pressure in tight loops), and the smaller prefetch-buffer footprint. On loops that are fetch-bound or register-pressured (as sentinel scans and table walks often are), this is a real throughput win, not just a code-size one.
- Block memory instructions are intended as DMA-like primitives. A minimal implementation iterates byte-by-byte (slow but correct); a serious implementation uses wider transfers when alignment permits and the operands don't overlap unsafely. A well-tuned BMCPY should approach one DRAM-burst per cycle of throughput on aligned operands, dwarfing any libc inline-expanded loop.
12.2 Trap Restart for Block Operations
A block-memory instruction trapped mid-execution must:
- Update rd, rs1, rs2 to reflect progress (bytes completed).
- Leave mepc pointing at the block instruction (not past it).
- Discard any uncommitted internal transfer state.
The trap handler does nothing special — mret returns to the same mepc, the instruction re-executes with the partially-advanced register state, and the byte-loop resumes from where it stopped. A buggy implementation that fails to update the registers atomically with the memory write will cause silent data corruption; this is the single most important invariant for block-memory correctness.
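The restart invariant can be exercised with a small software model. This is a sketch under assumptions: `bmcpy_regs`, `bmcpy_step`, and `bmcpy_run` are hypothetical names, progress is modelled as advanced pointers plus a remaining-byte count (the exact register roles are defined elsewhere in the spec), and a "trap" is simulated by a per-execution byte budget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural state named by a modelled BMCPY d, s, n. */
typedef struct {
    uint8_t       *d;   /* destination pointer register */
    const uint8_t *s;   /* source pointer register */
    size_t         n;   /* bytes remaining */
} bmcpy_regs;

/* Execute the byte loop for at most `budget` bytes, then "trap".
 * Returns 1 if the instruction retired, 0 if it must be re-executed. */
static int bmcpy_step(bmcpy_regs *r, size_t budget)
{
    while (r->n != 0) {
        if (budget == 0)
            return 0;           /* trap taken; mepc still names the BMCPY */
        *r->d = *r->s;          /* memory write ... */
        r->d++;                 /* ... and the register updates are */
        r->s++;                 /* committed together, so a restart */
        r->n--;                 /* never repeats or skips a byte */
        budget--;
    }
    return 1;
}

/* The handler does nothing special: re-execute until the copy retires. */
static void bmcpy_run(uint8_t *d, const uint8_t *s, size_t n, size_t budget)
{
    bmcpy_regs r = { d, s, n };
    assert(budget > 0);         /* a zero budget would never make progress */
    while (!bmcpy_step(&r, budget))
        ;
}
```

If the pointer and count updates were moved outside the per-byte step, a trap between the write and the update would replay or drop a byte on restart, which is the corruption case the text warns about.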
12.3 CSR Allocation
| CSR | Address | Type | Description |
|---|---|---|---|
| mxcrisp | 0xFC1 (suggested) | M-mode RO | Xcrisp version & feature bits |
Bit layout of mxcrisp (proposed):
| Bits | Field | Meaning |
|---|---|---|
| [0] | PRESENT | 1 if Xcrisp implemented |
| [7:1] | VERSION | Xcrisp version (1 = v0.1) |
| [8] | HAS_BMOPS | 1 if synchronous block memory ops (BMCPY/BMSET) implemented |
| [9] | HAS_MMOPS | 1 if load-op-store ops implemented |
| [10] | HAS_PIC | 1 if PC-relative PIC instructions implemented (wide-mode only) |
| [11] | HAS_DMA | 1 if asynchronous DMA ops (DMACPY/DMASET) implemented |
| [15:12] | DMA_QUEUE_DEPTH | log₂ of DMA queue depth, 0 if no DMA (0010 = 4 entries, 0011 = 8 entries) |
| [16] | HAS_INDEXED | 1 if X-type indexed addressing implemented (wide-mode only) |
| [63:17] | reserved | (none) |
A reduced FireStorm variant lacking block-memory hardware may set HAS_BMOPS = 0 while keeping the rest of Xcrisp. Variants lacking load-op-store, PIC, DMA, or indexed-load support clear the corresponding bits similarly. A variant implementing synchronous block ops but no DMA engine sets HAS_BMOPS = 1, HAS_DMA = 0, and DMA_QUEUE_DEPTH = 0.
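A decoder for the proposed bit layout might look like this. It is a sketch only: the layout is still marked as proposed, and `mxcrisp_info` and `mxcrisp_decode` are hypothetical names. The queue depth is reported as an entry count and forced to zero when HAS_DMA is clear.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     present;
    unsigned version;
    bool     has_bmops;
    bool     has_mmops;
    bool     has_pic;
    bool     has_dma;
    bool     has_indexed;
    unsigned dma_queue_entries;   /* 0 when has_dma is false */
} mxcrisp_info;

/* Split a raw mxcrisp value into the fields of the proposed layout. */
static mxcrisp_info mxcrisp_decode(uint64_t v)
{
    mxcrisp_info i;
    unsigned depth_log2 = (unsigned)((v >> 12) & 0xF);   /* [15:12] */

    i.present     = (v >> 0) & 1;                        /* [0]    */
    i.version     = (unsigned)((v >> 1) & 0x7F);         /* [7:1]  */
    i.has_bmops   = (v >> 8) & 1;                        /* [8]    */
    i.has_mmops   = (v >> 9) & 1;                        /* [9]    */
    i.has_pic     = (v >> 10) & 1;                       /* [10]   */
    i.has_dma     = (v >> 11) & 1;                       /* [11]   */
    i.has_indexed = (v >> 16) & 1;                       /* [16]   */
    i.dma_queue_entries = i.has_dma ? (1u << depth_log2) : 0;
    return i;
}
```

For example, the value 0x13903 decodes as present, version 1, block ops, DMA with an 8-entry queue, and indexed addressing, with load-op-store and PIC absent.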
13. Encoding Summary
13.1 At-a-Glance Opcode Map
| Mnemonic | opcode | funct3 | funct7 / sub | Format | Mode |
|---|---|---|---|---|---|
| LBPI..LWUPI | 0x0B | 000–110 | (none) | I-type | both |
| LBPD..LWUPD | 0x0B | 111 | imm[11:9]=width | I-type sub | both |
| SBPI..SDPD | 0x2B | 000–111 | (none) | S-type | both |
| LWADD..LWUSLTU | 0x5B | 000 | width+aluop | R-type | both |
| LDADD..LDSLTU | 0x5B | 000 | width+aluop | R-type | both |
| ADDSW..SRASD | 0x5B | 001 | width+aluop | R-type (rd=rs3) | both |
| BMCPY, BMSET | 0x5B | 010 | 0000000, 0000001 | R-type | both |
| DMACPY, DMASET | 0x5B | 010 | 0000010, 0000011 | R-type | both |
| BSRCH.B/H/W/D | 0x5B | 010 | 0010xxx (width selector) | R-type | both |
| BSCAN.B/H/W/D | 0x5B | 010 | 0011xxx (width selector) | R-type | both |
| BSHIFTR/L.B/H/W/D | 0x5B | 010 | 0100xxd (width + direction) | R-type | both |
| MMWADD..MMWUSLTU | 0x5B | 011 | width+aluop | R-type (rd=dest-addr) | both |
| MMDADD..MMDSLTU | 0x5B | 011 | width+aluop | R-type (rd=dest-addr) | both |
| BEQM..BGEUM | 0x7B | 000–111 | (none) | B-type | both |
| LDPC, LWPC, LWUPC, LAPC, JALPC | 0x7F | 000–100 | bit[35]=0 | W-type | wide only |
| JALXPC, JMPXPC | 0x7F | 101 | bit[35]=0 | W-type | wide only |
| LBX..LDX | 0x7F | 110 | bit[35]=0, scale+w+s | X-type (sub of W) | wide only |
| CALLM, JMPM | 0x7F | (funct4 at [16:13]) | bit[35]=1 | WI-type | wide only |
13.2 Reserved Spaces
The following encoding spaces are reserved for future Xcrisp expansion and must not be used by any other FireStorm extension or vendor implementation:
- 0x0B with funct3 = 111 and imm[11:9] = 111
- 0x2B: no reserved space in v0.1 (all funct3 slots in use)
- 0x5B funct3 = 100–111 (reserved sub-families)
- 0x5B funct3 = 000/001/011 with width = 11 or aluop[4:0] ≥ 01010
- 0x5B funct3 = 010 with funct7 ≥ 0101000 (B-tree expansion space; BMCMP at 0000100 reserved for v0.2; BSRCH/BSCAN/BSHIFT occupy 0010xxx–0100xxx)
- 0x7B: no reserved space in v0.1 (all funct3 slots in use; dword ordered comparisons require new opcode in v0.2)
- 0x7F W-type with funct3 = 111 (reserved PC-relative variant)
- 0x7F W-type funct3 = 110 (X-type) with w+s = 111 (reserved indexed-load width/sign)
- 0x7F W-type funct3 = 110 (X-type) with bit [16] = 1 (reserved sub-encoding bit)
- 0x7F WI-type with funct4 = 0010–1111 (reserved register-indirect variants)
The 0x7F opcode itself (the wide-mode escape mechanism, §8.2) is the general-purpose lane for any wide-mode-only extension. Future FireStorm extensions may allocate sub-encodings within the 29-bit payload by coordinating with the format bit at [35] and choosing dispatch sub-fields that don't conflict with allocated W-type funct3 or WI-type funct4 values.
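For the four 32-bit custom opcodes, family classification from the §13.1 map can be sketched as follows. The field positions are the standard RISC-V ones (opcode at [6:0], funct3 at [14:12], funct7 at [31:25]), which §1 says Xcrisp reuses; the enum and function names are hypothetical, and the wide-mode 0x7F encodings are deliberately left out because their layout extends beyond 32 bits.

```c
#include <stdint.h>

typedef enum {
    XCRISP_NONE,
    XCRISP_AUTOINC_LOAD,     /* opcode 0x0B */
    XCRISP_AUTOINC_STORE,    /* opcode 0x2B */
    XCRISP_MEM_FUSED,        /* opcode 0x5B */
    XCRISP_CMP_MEM_BRANCH    /* opcode 0x7B */
} xcrisp_family;

static unsigned insn_opcode(uint32_t insn) { return insn & 0x7F; }
static unsigned insn_funct3(uint32_t insn) { return (insn >> 12) & 0x7; }
static unsigned insn_funct7(uint32_t insn) { return (insn >> 25) & 0x7F; }

/* Map a 32-bit instruction word to its Xcrisp family, if any. */
static xcrisp_family xcrisp_classify(uint32_t insn)
{
    switch (insn_opcode(insn)) {
    case 0x0B: return XCRISP_AUTOINC_LOAD;
    case 0x2B: return XCRISP_AUTOINC_STORE;
    case 0x5B: return XCRISP_MEM_FUSED;
    case 0x7B: return XCRISP_CMP_MEM_BRANCH;
    default:   return XCRISP_NONE;
    }
}

/* BMCPY (funct7 0000000) or BMSET (funct7 0000001): opcode 0x5B, funct3 010. */
static int is_sync_block_op(uint32_t insn)
{
    return insn_opcode(insn) == 0x5B
        && insn_funct3(insn) == 2
        && insn_funct7(insn) <= 1;
}
```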
14. Open Items
- mxcrisp CSR address. Suggested 0xFC1; needs to align with the wide-dirty CSR allocation (open item §16 of the parent doc) and any other FireStorm-specific CSRs.
- Dword ordered compare-mem-branch. Requires either an additional opcode or a width-modifier bit somewhere; deferred to v0.2.
- 16-bit memory width for load-op and op-store. The width = 11 slot is reserved; semantics TBD.
- BMCMP semantics. Result encoding (remaining-count, flag, or both) TBD.
- Overlapping BMCPY. Implementation-defined in v0.1; may be tightened to memmove-compatible in v0.2 if cost is acceptable.
- Alignment hints for block memory. Future width[1:0] use in BM-family funct7.
- Sub-word load-op forms. Byte and halfword load-op (e.g. LBADD, LHUADD) are not in v0.1; if added later they'd consume width = 10/11 in load-op funct7 with redefined semantics.
- PIC and Zcmp interaction. Zcmp's cm.popret is a tail-return; combining with JMPM for full tail-call sequences is an optimisation worth exploring.
- PIC linker relaxation pass. Specification of the relaxation algorithm and the new relocation types needed to mark AUIPC + LD/JALR/ADDI pairs as relaxable.
- Indexed stores (SBX..SDX). v0.1 omits indexed stores; the S-X-type encoding repurposing rd as the stored-value source is sketched in §9.8 but not finalised. v0.2 to revisit based on workload data.
- LAX (indexed address materialise). An "X-type but result is the computed address, not the loaded value" form would help 2D access and address-of-array-element patterns. Encoding space exists (X-type w+s = 111 is reserved). Deferred to v0.2.
- Index sign-extension option. Current X-type zero-extends rs2 (the index). Some patterns (e.g., signed offsets from a midpoint) want sign-extension. A scale-field bit could select, at the cost of halving the scale set. Open for v0.2.
- JALXPC reach extension. ±32 KiB is sufficient for most switch tables but not for very large generated dispatch tables (e.g., character-class tables in regex engines). A wider-immediate variant would consume a second funct3 slot in W-type. Open for v0.2.
- DMA fault reporting. The mechanism for surfacing DMA-engine memory faults (unmapped page, write to RO, bus error) is sketched in §5.5.2 but not finalised. Candidate design: a status CSR mxdma_status holding the faulting DMA's tag-register name, fault address, and fault cause; M-mode trap with cause = DMA_FAULT. Final encoding TBD.
- DMA priority. Should the DMA engine support priority levels so audio-critical transfers preempt bulk ones? A 2-bit priority field could be carried in the spare funct7 bits of DMACPY/DMASET. Deferred to v0.2 pending workload data.
- DMA-to-MMIO ordering. §5.5.2 suggests MMIO DMAs are strictly ordered relative to CPU MMIO stores; the exact memory-ordering model (RVWMO interaction, fences required) needs formalisation.
- Multiple-consumer DMA tag table. Currently one tag-table entry per outstanding DMA. If two DMAs name the same rs2, the second stalls until the first completes. An alternative (allocate fresh tag entry on each issue, with a "scoreboard" mapping register names to multiple entries) is more flexible but costlier. v0.1 takes the simpler one-tag-per-register approach.
15. Glossary
| Term | Meaning |
|---|---|
| Auto-increment | Load or store that updates its base register as a side effect (post-inc or pre-dec). |
| Load-op fusion | Combined load and ALU operation: rd = mem[rs1] OP rs2. |
| Op-store fusion | Combined ALU operation and store: mem[rs1] = rs2 OP rs3. No register destination. |
| Compare-mem-branch | Branch on the result of comparing a register with a memory value. |
| Block memory | Variable-length memory transfer instruction (BMCPY, BMSET); interruptible with register-held restart state. |
| rs3 (in op-store) | The R-type rd field [11:7] repurposed as a third source register; no architectural register is written. |
| mxcrisp | M-mode CSR exposing Xcrisp implementation and feature bits. |
End of document. See also: FireStorm CPU ISA — base architecture and wide-mode encoding.