FireStorm Xcrisp Extension — Instruction Encodings

Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
See also: FireStorm Performance Examples for worked comparisons

1. Overview

The Xcrisp extension is FireStorm's set of CRISP-influenced custom instructions, designed to raise the performance and code density of compiler-generated C code without breaking the RV64GC baseline. It is available in both narrow and wide modes (§3 of the parent doc): vanilla DDR3-resident code may use Xcrisp instructions exactly as SRAM-resident wide-mode code may. In wide mode, Xcrisp register operands extend via the standard extension nibble scheme.

The extension contains four instruction families:

Family	Purpose	Opcode	Format
Auto-increment loads	`p++` / `--p` read patterns	`0x0B` (custom-0)	I-type
Auto-increment stores	`p++` / `--p` write patterns	`0x2B` (custom-1)	S-type
Memory-fused arithmetic	load-op, op-store, block-memory	`0x5B` (custom-2)	R-type
Compare-mem-branch	sentinel scans, table walks	`0x7B` (custom-3)	B-type

The opcode-to-family mapping is deliberately aligned with the standard RISC-V opcode bit pattern at [6:5]:

`[6:5]`	Standard	Xcrisp
`00`	LOAD (`0x03`)	auto-inc loads (`0x0B`)
`01`	STORE (`0x23`) / OP (`0x33`)	auto-inc stores (`0x2B`)
`10`	MADD/MSUB/NMSUB/NMADD, OP-FP (`0x43`–`0x53`)	memory-fused R-type (`0x5B`)
`11`	BRANCH (`0x63`)	compare-mem-branch (`0x7B`)

This alignment lets a FireStorm decoder reuse the standard rs1/rs2/rd/imm extract logic for Xcrisp encodings; only the funct3/funct7 decode tables expand.
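The alignment can be checked mechanically. The sketch below extracts bits [6:5] from the opcode values listed in the two tables above; the dictionary and function names are illustrative only and are not part of the specification.

```python
# Major opcodes taken from the tables above (7-bit values).
STANDARD = {"LOAD": 0x03, "STORE": 0x23, "OP": 0x33, "BRANCH": 0x63}
XCRISP = {"auto-inc loads": 0x0B, "auto-inc stores": 0x2B,
          "memory-fused": 0x5B, "compare-mem-branch": 0x7B}

def bits_6_5(opcode: int) -> int:
    """Return opcode bits [6:5] as an integer in 0..3."""
    return (opcode >> 5) & 0b11

for name, opcode in {**STANDARD, **XCRISP}.items():
    print(f"{name:20s} {opcode:#04x} [6:5]={bits_6_5(opcode):02b}")
```

Each Xcrisp family shares its [6:5] value with the standard class whose field layout it borrows (I-type with LOAD, S-type with STORE, B-type with BRANCH).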

2. Feature Detection

The presence of Xcrisp is indicated by a non-zero value in the implementation-defined CSR mxcrisp (a custom machine-level read-only CSR; suggested address 0xFC1). Bit [0] of mxcrisp is the Xcrisp version (1 for v0.1). A reduced FireStorm variant without Xcrisp returns zero; an Xcrisp instruction issued on such a variant traps as illegal-instruction.

Compilers normally rely on the +xcrisp target-feature flag rather than runtime detection. Detection is reserved for runtime libraries that may be deployed on multiple FireStorm variants.

3. Auto-Increment Loads (custom-0, opcode `0x0B`)

3.1 Encoding

Standard I-type layout:

 31                  20 19    15 14   12 11     7 6           0
+----------------------+--------+-------+--------+-------------+
|       imm[11:0]      |  rs1   | funct3|   rd   |  0001011    |
+----------------------+--------+-------+--------+-------------+

Field	Bits	Meaning
`imm[11:0]`	[31:20]	Signed 12-bit increment/decrement amount (post-inc forms) or width-extended pre-dec sub-encoding (see §3.3)
`rs1`	[19:15]	Base address register; also updated by the instruction
`funct3`	[14:12]	Operation: width + direction (see §3.2)
`rd`	[11:7]	Load destination register
opcode	[6:0]	`0x0B` (custom-0)

The instruction writes to two architectural registers: rd (the loaded value) and rs1 (the updated base). If rs1 == rd, the load value wins (the increment to rs1 is suppressed) — this matches the standard RISC-V convention for instructions that would otherwise have ambiguous semantics.

3.2 Post-Increment Loads (`funct3` 000–110)

For funct3 ≠ 111, the immediate is a 12-bit signed offset/increment, and the operation is:

rd  = sext_or_zext_W(mem[rs1])
rs1 = rs1 + sext(imm)

funct3	Mnemonic	Width	Sign	Operation
`000`	LBPI	byte	signed	`rd = sext8(mem8[rs1]); rs1 += sext(imm)`
`001`	LHPI	half	signed	`rd = sext16(mem16[rs1]); rs1 += sext(imm)`
`010`	LWPI	word	signed	`rd = sext32(mem32[rs1]); rs1 += sext(imm)`
`011`	LDPI	dword	n/a	`rd = mem64[rs1]; rs1 += sext(imm)`
`100`	LBUPI	byte	unsigned	`rd = zext8(mem8[rs1]); rs1 += sext(imm)`
`101`	LHUPI	half	unsigned	`rd = zext16(mem16[rs1]); rs1 += sext(imm)`
`110`	LWUPI	word	unsigned	`rd = zext32(mem32[rs1]); rs1 += sext(imm)`

Funct3 assignments match standard RV64I load encodings: bit [14] = unsigned, bits [13:12] = width (00=byte, 01=half, 10=word, 11=dword). This means an existing RV64I load decoder can route through a single funct3 path with only the opcode and "auto-inc" flag changing.
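As an illustration of the post-increment semantics, the following is a minimal behavioural sketch, not a reference model. The toy machine (a Python list of registers and a byte-addressed dict) and the names `post_inc_load` and `sext` are assumptions made for the example; little-endian memory and a hard-wired x0 are assumed from the RV64 baseline.

```python
MASK64 = (1 << 64) - 1

def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a 64-bit register image."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & MASK64

def post_inc_load(regs, mem, funct3, rd, rs1, imm12):
    """Execute one post-increment load (funct3 000..110).

    regs is a list of 64-bit register values; mem maps byte address -> byte.
    """
    if funct3 == 0b111:
        raise ValueError("funct3 111 selects the pre-decrement forms")
    size = 1 << (funct3 & 0b11)        # bits [13:12]: 1, 2, 4 or 8 bytes
    unsigned = bool(funct3 & 0b100)    # bit [14]
    addr = regs[rs1]
    raw = int.from_bytes(bytes(mem.get(addr + i, 0) for i in range(size)), "little")
    value = raw if unsigned else sext(raw, 8 * size)
    if rs1 != 0:                       # x0 stays zero, as in base RISC-V
        regs[rs1] = (addr + sext(imm12, 12)) & MASK64
    if rd != 0:
        regs[rd] = value               # written last, so the load wins when rd == rs1
```

Writing rd after the base update is one simple way to obtain the rs1 == rd rule of §3.1 in a software model; hardware would more likely suppress the second write port.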

3.3 Pre-Decrement Loads (`funct3 = 111`)

For funct3 = 111, the 12-bit immediate field is repurposed as {width[2:0], offset[8:0]}:

 31    29 28              20
+--------+-------------------+
| width  |    offset[8:0]    |
+--------+-------------------+

width[2:0] (imm[11:9]): one of seven width/sign variants, matching the post-inc funct3 numbering.
offset[8:0] (imm[8:0]): signed 9-bit decrement amount, range −256..+255.

width	Mnemonic	Operation
`000`	LBPD	`rs1 -= sext(offset); rd = sext8(mem8[rs1])`
`001`	LHPD	`rs1 -= sext(offset); rd = sext16(mem16[rs1])`
`010`	LWPD	`rs1 -= sext(offset); rd = sext32(mem32[rs1])`
`011`	LDPD	`rs1 -= sext(offset); rd = mem64[rs1]`
`100`	LBUPD	`rs1 -= sext(offset); rd = zext8(mem8[rs1])`
`101`	LHUPD	`rs1 -= sext(offset); rd = zext16(mem16[rs1])`
`110`	LWUPD	`rs1 -= sext(offset); rd = zext32(mem32[rs1])`
`111`	reserved	illegal-instruction

A negative offset is permitted but produces unusual semantics (rs1 is incremented before the load); compilers should not emit this and disassemblers may flag it.
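The {width, offset} split of the immediate can be sketched as a pair of helper functions. The names `pack_predec_imm` and `unpack_predec_imm` are hypothetical; an assembler or disassembler for Xcrisp might structure this differently.

```python
def pack_predec_imm(width: int, offset: int) -> int:
    """Build imm[11:0] = {width[2:0], offset[8:0]} for a funct3 = 111 load."""
    if not 0 <= width <= 0b110:
        raise ValueError("width 111 is reserved")
    if not -256 <= offset <= 255:
        raise ValueError("offset must fit in a signed 9-bit field")
    return (width << 9) | (offset & 0x1FF)

def unpack_predec_imm(imm12: int):
    """Split imm[11:0] into (width, signed offset)."""
    width = (imm12 >> 9) & 0b111
    offset = imm12 & 0x1FF
    if offset & 0x100:                 # sign bit of the 9-bit field
        offset -= 0x200
    return width, offset
```

A disassembler that wants to flag the unusual negative-offset case only needs to test the sign of the second tuple element.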

3.4 Examples

LWPI x10, 4(x11) — read 32-bit word at [x11], sign-extend into x10, advance x11 by 4:

imm  = 0x004
rs1  = 11   (0b01011)
funct3 = 010
rd   = 10   (0b01010)
opcode = 0x0B
Encoding: 0x004_5A50B = 0000 0000 0100 01011 010 01010 0001011

LDPD x14, 8(x15) — decrement x15 by 8, then read 64-bit dword into x14:

funct3 = 111 (pre-dec marker)
width  = 011 (dword)
offset = 0x008
imm[11:0] = {011, 000001000} = 0x608
rs1  = 15
rd   = 14
Encoding: 0x608_7F70B = 0110 0000 1000 01111 111 01110 0001011
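The hex words follow mechanically from the field values and the binary field strings. A small sketch of an I-type assembler helper (the name `encode_i_type` is an assumption for illustration) recomputes both examples:

```python
def encode_i_type(imm12: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Assemble a 32-bit I-type word from its fields."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

CUSTOM_0 = 0x0B

# LWPI x10, 4(x11): plain 12-bit immediate, funct3 = 010
lwpi = encode_i_type(0x004, 11, 0b010, 10, CUSTOM_0)
# LDPD x14, 8(x15): imm = {width=011, offset=8}, funct3 = 111
ldpd = encode_i_type((0b011 << 9) | 8, 15, 0b111, 14, CUSTOM_0)

print(f"{lwpi:#010x} {ldpd:#010x}")   # prints 0x0045a50b 0x6087f70b
```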

4. Auto-Increment Stores (custom-1, opcode `0x2B`)

4.1 Encoding

Standard S-type layout:

 31         25 24    20 19    15 14   12 11        7 6           0
+-------------+--------+--------+-------+-----------+-------------+
| imm[11:5]   |  rs2   |  rs1   | funct3| imm[4:0]  |  0101011    |
+-------------+--------+--------+-------+-----------+-------------+

Field	Bits	Meaning
`imm[11:0]`	[31:25] ‖ [11:7]	Signed 12-bit increment/decrement amount
`rs2`	[24:20]	Source register (value to store)
`rs1`	[19:15]	Base address register; also updated by the instruction
`funct3`	[14:12]	Width + direction (see §4.2)
opcode	[6:0]	`0x2B` (custom-1)

The instruction writes to one register (rs1 updated) and one memory location.

4.2 Funct3 Encoding

The store-side encoding partitions funct3 as {direction, width[1:0]}: funct3[2] selects post-increment (0) or pre-decrement (1), and funct3[1:0] selects the width:

funct3	Mnemonic	Width	Direction	Operation
`000`	SBPI	byte	post-inc	`mem8[rs1] = rs2[7:0]; rs1 += sext(imm)`
`001`	SHPI	half	post-inc	`mem16[rs1] = rs2[15:0]; rs1 += sext(imm)`
`010`	SWPI	word	post-inc	`mem32[rs1] = rs2[31:0]; rs1 += sext(imm)`
`011`	SDPI	dword	post-inc	`mem64[rs1] = rs2; rs1 += sext(imm)`
`100`	SBPD	byte	pre-dec	`rs1 -= sext(imm); mem8[rs1] = rs2[7:0]`
`101`	SHPD	half	pre-dec	`rs1 -= sext(imm); mem16[rs1] = rs2[15:0]`
`110`	SWPD	word	pre-dec	`rs1 -= sext(imm); mem32[rs1] = rs2[31:0]`
`111`	SDPD	dword	pre-dec	`rs1 -= sext(imm); mem64[rs1] = rs2`

Standard stores have no unsigned variants (there is no sign-extension on a store), so the funct3 space is cleanly halved between post-inc and pre-dec — no sub-encoding needed.
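A behavioural sketch of the store side, under the same toy-machine assumptions as any software model (register list, byte-addressed dict, little-endian memory). The function name `auto_inc_store` is hypothetical, and reading rs2 before rs1 is updated is an assumption: the text above does not define the rs1 == rs2 case.

```python
MASK64 = (1 << 64) - 1

def sext12(imm: int) -> int:
    """Interpret a 12-bit field as a signed integer."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm

def auto_inc_store(regs, mem, funct3, rs1, rs2, imm12):
    """Execute one auto-increment store; mem maps byte address -> byte."""
    pre_dec = bool(funct3 & 0b100)     # funct3[2]: 0 = post-inc, 1 = pre-dec
    size = 1 << (funct3 & 0b11)        # funct3[1:0]: 1, 2, 4 or 8 bytes
    data = regs[rs2] & ((1 << (8 * size)) - 1)   # source read before rs1 changes (assumed)
    step = sext12(imm12)
    addr = regs[rs1]
    if pre_dec:
        addr = (addr - step) & MASK64  # decrement first, then store
    for i, byte in enumerate(data.to_bytes(size, "little")):
        mem[(addr + i) & MASK64] = byte
    regs[rs1] = addr if pre_dec else (addr + step) & MASK64
```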

4.3 Examples

SDPI x12, 8(x13) — store x12 to mem64[x13], then advance x13 by 8:

imm     = 0x008  → imm[11:5]=0x00, imm[4:0]=0x08
rs2     = 12
rs1     = 13
funct3  = 011
Encoding: 0x00C6B42B
        = 0000000 01100 01101 011 01000 0101011

SWPD x4, 4(x2) — pre-decrement stack pointer x2 by 4, then store low 32 bits of x4:

imm     = 0x004
rs2     = 4
rs1     = 2  (sp)
funct3  = 110
Encoding: 0000000 00100 00010 110 00100 0101011

A common stack-push idiom: SWPD rs2, 4(sp) (sp -= 4, then write).
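The split immediate is the only part of the S-type layout that is easy to get wrong by hand. The sketch below (helper name `encode_s_type` is an assumption) assembles both store examples from their fields:

```python
def encode_s_type(imm12: int, rs2: int, rs1: int, funct3: int, opcode: int) -> int:
    """Assemble a 32-bit S-type word; imm[11:5] and imm[4:0] are placed separately."""
    imm12 &= 0xFFF
    return (((imm12 >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | ((imm12 & 0x1F) << 7) | opcode)

CUSTOM_1 = 0x2B

sdpi = encode_s_type(8, 12, 13, 0b011, CUSTOM_1)   # SDPI x12, 8(x13)
swpd = encode_s_type(4, 4, 2, 0b110, CUSTOM_1)     # SWPD x4, 4(x2)

print(f"{sdpi:#010x} {swpd:#010x}")   # prints 0x00c6b42b 0x0041622b
```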

5. Memory-Fused Arithmetic (custom-2, opcode `0x5B`)

This opcode hosts four sub-families dispatched by funct3:

funct3	Sub-family	Section
`000`	Load-op fusion (`rd = mem[rs1] OP rs2`)	§5.2
`001`	Op-store fusion (`mem[rs1] = rs2 OP rs3`)	§5.3
`010`	Block memory operations	§5.5
`011`	Load-op-store fusion (`mem[rd] = mem[rs1] OP rs2`)	§5.4
`100`–`111`	reserved	—

(The funct3 numbering doesn't match the section order: load-op-store at funct3 011 is presented in §5.4 because it is the logical successor of op-store, while block memory at funct3 010 is conceptually distinct and presented in §5.5.)

All four use the R-type encoding:

 31        25 24    20 19    15 14   12 11     7 6           0
+-----------+--------+--------+-------+--------+-------------+
|  funct7   |  rs2   |  rs1   | funct3|   rd   |  1011011    |
+-----------+--------+--------+-------+--------+-------------+

The interpretation of rs2, rs1, and rd varies by sub-family. funct7 selects width and ALU operation within each sub-family.

5.1 Common funct7 Layout (Load-Op, Op-Store, Load-Op-Store)

For the arithmetic sub-families (funct3 000, 001, and 011), funct7 is structured as {width[1:0], aluop[4:0]}:

 31    30 29              25
+--------+------------------+
| width  |   aluop[4:0]     |
+--------+------------------+

`width[1:0]`	Memory width / sign
`00`	32-bit word, sign-extended on load (load-op) / low 32 bits stored (op-store)
`01`	64-bit dword
`10`	32-bit word, zero-extended on load (load-op only; same as `00` for op-store)
`11`	reserved (future 16-bit support)

`aluop[4:0]`	Operation
`00000`	ADD
`00001`	SUB
`00010`	AND
`00011`	OR
`00100`	XOR
`00101`	SLL (shift left logical)
`00110`	SRL (shift right logical)
`00111`	SRA (shift right arithmetic)
`01000`	SLT (set less than, signed)
`01001`	SLTU (set less than, unsigned)
`01010`–`11111`	reserved
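The two tables above combine into a single 7-bit field. A minimal sketch of packing and unpacking it, with hypothetical helper names and table constants copied from this section:

```python
ALUOPS = ["ADD", "SUB", "AND", "OR", "XOR", "SLL", "SRL", "SRA", "SLT", "SLTU"]
WIDTHS = {0b00: "word, sign-extended", 0b01: "dword", 0b10: "word, zero-extended"}

def pack_funct7(width: int, aluop: int) -> int:
    """Build funct7 = {width[1:0], aluop[4:0]}, rejecting reserved values."""
    if width not in WIDTHS:
        raise ValueError("width 11 is reserved")
    if not 0 <= aluop < len(ALUOPS):
        raise ValueError("aluop 01010-11111 is reserved")
    return (width << 5) | aluop

def unpack_funct7(funct7: int):
    """Split funct7 into (width, aluop)."""
    return (funct7 >> 5) & 0b11, funct7 & 0b11111

width, aluop = unpack_funct7(0b0100011)
print(WIDTHS[width], ALUOPS[aluop])   # prints: dword OR
```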

5.2 Load-Op Fusion (funct3 = `000`)

Operation: rd = (mem[rs1] of selected width) ALUOP rs2.

The memory access uses rs1 directly as the base address; there is no immediate offset (use a separate ADDI first if non-zero offset needed, or compose with the auto-inc load instructions of §3).

Mnemonic	width	aluop	Operation
LWADD	`00`	`00000`	`rd = sext32(mem32[rs1]) + rs2`
LWSUB	`00`	`00001`	`rd = sext32(mem32[rs1]) - rs2`
LWAND	`00`	`00010`	`rd = sext32(mem32[rs1]) & rs2`
LWOR	`00`	`00011`	`rd = sext32(mem32[rs1]) | rs2`
LWXOR	`00`	`00100`	`rd = sext32(mem32[rs1]) ^ rs2`
LWSLL	`00`	`00101`	`rd = sext32(mem32[rs1]) << (rs2 & 63)`
LWSRL	`00`	`00110`	`rd = sext32(mem32[rs1]) >>L (rs2 & 63)`
LWSRA	`00`	`00111`	`rd = sext32(mem32[rs1]) >>A (rs2 & 63)`
LWSLT	`00`	`01000`	`rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0` signed
LWSLTU	`00`	`01001`	`rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0` unsigned
LDADD	`01`	`00000`	`rd = mem64[rs1] + rs2`
LDSUB	`01`	`00001`	`rd = mem64[rs1] - rs2`
LDAND	`01`	`00010`	`rd = mem64[rs1] & rs2`
LDOR	`01`	`00011`	`rd = mem64[rs1] | rs2`
LDXOR	`01`	`00100`	`rd = mem64[rs1] ^ rs2`
LDSLL	`01`	`00101`	`rd = mem64[rs1] << (rs2 & 63)`
LDSRL	`01`	`00110`	`rd = mem64[rs1] >>L (rs2 & 63)`
LDSRA	`01`	`00111`	`rd = mem64[rs1] >>A (rs2 & 63)`
LDSLT	`01`	`01000`	`rd = (mem64[rs1] < rs2) ? 1 : 0` signed
LDSLTU	`01`	`01001`	`rd = (mem64[rs1] < rs2) ? 1 : 0` unsigned
LWUADD	`10`	`00000`	`rd = zext32(mem32[rs1]) + rs2`
...	`10`	`00001`–`01001`	unsigned-word variants of the above

(The full unsigned-word table mirrors the signed-word table line-for-line.)

5.3 Op-Store Fusion (funct3 = `001`)

Operation: mem[rs1] = rs2 ALUOP rs3. The rd field of the R-type encoding is repurposed as rs3 (a third source register); no architectural register is written by this class.

Mnemonic	width	aluop	Operation
ADDSW	`00`	`00000`	`mem32[rs1] = (rs2 + rs3)[31:0]`
SUBSW	`00`	`00001`	`mem32[rs1] = (rs2 - rs3)[31:0]`
ANDSW	`00`	`00010`	`mem32[rs1] = (rs2 & rs3)[31:0]`
ORSW	`00`	`00011`	`mem32[rs1] = (rs2 | rs3)[31:0]`
XORSW	`00`	`00100`	`mem32[rs1] = (rs2 ^ rs3)[31:0]`
SLLSW	`00`	`00101`	`mem32[rs1] = (rs2 << (rs3 & 31))[31:0]`
SRLSW	`00`	`00110`	`mem32[rs1] = (rs2 >>L (rs3 & 31))[31:0]`
SRASW	`00`	`00111`	`mem32[rs1] = (rs2 >>A (rs3 & 31))[31:0]`
ADDSD	`01`	`00000`	`mem64[rs1] = rs2 + rs3`
SUBSD	`01`	`00001`	`mem64[rs1] = rs2 - rs3`
ANDSD	`01`	`00010`	`mem64[rs1] = rs2 & rs3`
ORSD	`01`	`00011`	`mem64[rs1] = rs2 | rs3`
XORSD	`01`	`00100`	`mem64[rs1] = rs2 ^ rs3`
SLLSD	`01`	`00101`	`mem64[rs1] = rs2 << (rs3 & 63)`
SRLSD	`01`	`00110`	`mem64[rs1] = rs2 >>L (rs3 & 63)`
SRASD	`01`	`00111`	`mem64[rs1] = rs2 >>A (rs3 & 63)`

SLT/SLTU forms are omitted for op-store (storing a 0/1 flag to memory is unusual; existing slt + sw is preferable for clarity).

Assembler convention. The op-store mnemonics are written with the memory destination in brackets to clarify that the third operand is read, not written:

ADDSW [x5], x6, x7    ; mem32[x5] = x6 + x7

Disassemblers should follow the same convention.

5.4 Load-Op-Store Fusion (funct3 = `011`)

Operation: mem[rd] = (mem[rs1] of selected width) ALUOP rs2. Two memory operands plus one register operand. No architectural register is written.

This is the closest a 32-bit RISC-V slot can get to true memory-to-memory operation in the CRISP/Hobbit tradition: one fetch, one decode, both the load result and the ALU result flow directly through the pipeline without ever entering the register file, then the result is stored. The fused encoding replaces a three-instruction sequence (lw t0, (rs1); add/op t0, t0, rs2; sw t0, (rd)) without ever materialising the temporary.

The encoding repurposes the R-type rd field as the destination memory base address (read, not written). The rs1 field is the source memory base address. Funct7 partitioning is identical to load-op (§5.1).

Encoding:

 31        25 24    20 19    15 14   12 11     7 6           0
+-----------+--------+--------+-------+--------+-------------+
|  funct7   |  rs2   |  rs1   |  011  |   rd   |  1011011    |
+-----------+--------+--------+-------+--------+-------------+

Field	Bits	Meaning
`funct7[6:5]`	[31:30]	Width (`00`=word signed load, `01`=dword, `10`=word unsigned load, `11`=reserved)
`funct7[4:0]`	[29:25]	ALU operation (same encoding as §5.1)
`rs2`	[24:20]	Register-held ALU operand
`rs1`	[19:15]	Source memory base address
funct3	[14:12]	`011`
`rd`	[11:7]	Destination memory base address (read-only with respect to the register file)
opcode	[6:0]	`0x5B`

5.4.1 Variant Table

Mnemonic	width	aluop	Operation
MMWADD	`00`	`00000`	`mem32[rd] = (sext32(mem32[rs1]) + rs2)[31:0]`
MMWSUB	`00`	`00001`	`mem32[rd] = (sext32(mem32[rs1]) - rs2)[31:0]`
MMWAND	`00`	`00010`	`mem32[rd] = (sext32(mem32[rs1]) & rs2)[31:0]`
MMWOR	`00`	`00011`	`mem32[rd] = (sext32(mem32[rs1]) | rs2)[31:0]`
MMWXOR	`00`	`00100`	`mem32[rd] = (sext32(mem32[rs1]) ^ rs2)[31:0]`
MMWSLL	`00`	`00101`	`mem32[rd] = (sext32(mem32[rs1]) << (rs2 & 31))[31:0]`
MMWSRL	`00`	`00110`	`mem32[rd] = (sext32(mem32[rs1]) >>L (rs2 & 31))[31:0]`
MMWSRA	`00`	`00111`	`mem32[rd] = (sext32(mem32[rs1]) >>A (rs2 & 31))[31:0]`
MMWSLT	`00`	`01000`	`mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0` (signed)
MMWSLTU	`00`	`01001`	`mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0` (unsigned)
MMDADD	`01`	`00000`	`mem64[rd] = mem64[rs1] + rs2`
MMDSUB	`01`	`00001`	`mem64[rd] = mem64[rs1] - rs2`
MMDAND	`01`	`00010`	`mem64[rd] = mem64[rs1] & rs2`
MMDOR	`01`	`00011`	`mem64[rd] = mem64[rs1] | rs2`
MMDXOR	`01`	`00100`	`mem64[rd] = mem64[rs1] ^ rs2`
MMDSLL	`01`	`00101`	`mem64[rd] = mem64[rs1] << (rs2 & 63)`
MMDSRL	`01`	`00110`	`mem64[rd] = mem64[rs1] >>L (rs2 & 63)`
MMDSRA	`01`	`00111`	`mem64[rd] = mem64[rs1] >>A (rs2 & 63)`
MMDSLT	`01`	`01000`	`mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0` (signed)
MMDSLTU	`01`	`01001`	`mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0` (unsigned)
MMWUADD	`10`	`00000`	`mem32[rd] = (zext32(mem32[rs1]) + rs2)[31:0]`
...	`10`	`00001`–`01001`	unsigned-word-load variants of the above

The unsigned-word variants (width = 10) only differ from the signed-word forms (width = 00) for operations where the load sign-extension matters: MMWSRA, MMWSRL, MMWSLT, MMWSLTU. For ADD, SUB, AND, OR, XOR, and SLL the result bits are identical; the assembler may accept the unsigned form as a synonym or flag it as redundant.
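The claim can be spot-checked in software. The sketch below computes the 32-bit value each MMW* operation would store, once with the loaded word sign-extended and once zero-extended, over a small hand-picked set of inputs. The function names and the sample vectors are assumptions for illustration; this is a sanity check over those vectors, not a proof.

```python
def signed(x: int) -> int:
    """Interpret a 64-bit register image as a signed integer."""
    return x - (1 << 64) if x >> 63 else x

def mmw_store_value(aluop: str, loaded: int, rs2: int) -> int:
    """Low 32 bits stored by an MMW* op, given the 64-bit extended load value."""
    sh = rs2 & 31
    results = {
        "ADD": loaded + rs2, "SUB": loaded - rs2,
        "AND": loaded & rs2, "OR": loaded | rs2, "XOR": loaded ^ rs2,
        "SLL": loaded << sh, "SRL": loaded >> sh, "SRA": signed(loaded) >> sh,
        "SLT": int(signed(loaded) < signed(rs2)), "SLTU": int(loaded < rs2),
    }
    return results[aluop] & 0xFFFFFFFF

# Sample words all have bit 31 set, so sext32 and zext32 differ.
WORDS = [0x80000000, 0xFFFFFFFF, 0x80001234, 0xDEADBEEF]
OPERANDS = [0, 1, 5, 31, 0x7FFFFFFF, 0xFFFFFFFF, 1 << 40, (1 << 64) - 1]

def extension_matters(aluop: str) -> bool:
    """True if any sample input stores a different value for sext vs zext."""
    for word in WORDS:
        for rs2 in OPERANDS:
            sext = word | 0xFFFFFFFF00000000
            if mmw_store_value(aluop, sext, rs2) != mmw_store_value(aluop, word, rs2):
                return True
    return False

for op in ["ADD", "SUB", "AND", "OR", "XOR", "SLL", "SRL", "SRA", "SLT", "SLTU"]:
    print(op, "differs" if extension_matters(op) else "identical")
```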

5.4.2 Assembler Convention

The instruction takes three register operands. The first and second name memory locations (the destination and source base addresses, both bracketed in source code); the third names a register-held ALU operand:

MMWADD [x10], [x11], x12   ; mem32[x10] = mem32[x11] + x12
MMDOR  [x10], [x10], x12   ; mem64[x10] |= x12 (in-place, rd == rs1)

Both bracketed operands are read from the register file as pointers; neither is modified by the instruction. Compose with auto-increment loads/stores (§3, §4) on the surrounding code if pointer advance is needed.

5.4.3 In-Place Updates (rd == rs1)

When rd and rs1 name the same register, the instruction performs an in-place memory update: the load reads from the location, the ALU computes the new value, and the store writes back to the same location. This is well-defined: the load completes before the store begins, and there is exactly one memory location involved. The pattern matches C idioms like:

arr[i] &= mask;     // MMWAND [p], [p], mask     where p = &arr[i]
counter += step;    // MMDADD [p], [p], step
buf[i] ^= 0x80;     // MMWXOR [p], [p], t0       where t0 holds 0x80

When rd != rs1, the load and store target distinct memory locations and the operation moves data with transformation:

dst[i] = src[i] + bias;     // MMWADD [d], [s], bias

5.4.4 Trap Restart

Unlike block memory (§5.5), load-op-store carries no partial progress across traps. The instruction completes atomically or not at all from an architectural perspective:

Trap before the load completes (e.g., load page fault): no architectural state has changed; PC remains at the instruction; retry re-executes from the beginning.
Trap between the load and the store (e.g., timer interrupt mid-instruction): the loaded value lives only in pipeline internal state, never in any architectural register; discarding it is safe. PC remains at the instruction; retry re-executes the load (idempotent — the source memory has not been written), the ALU op, and the store.
Trap during the store (e.g., store page fault on the destination): the load has already completed but its result is internal; the store has not committed to architectural memory; retry from the beginning.
No trap: PC advances normally past the instruction.

The implementation must guarantee that the store does not become visible to other harts or to the memory system until the instruction is the next to commit architecturally. Standard store-buffer pipelining with commit-at-retire satisfies this.

5.4.5 Wide-Mode Extension

In wide mode, the extension nibble extends rd, rs1, and rs2 exactly as for a normal R-type:

Bit	Extends
bit[32]	`rd` (destination memory base)
bit[33]	`rs1` (source memory base)
bit[34]	`rs2` (register-held ALU operand)
bit[35]	spare (reserved)

A wide-mode load-op-store may name any combination of x0–x63 for all three operands.

5.4.6 Worked Example

MMWADD [x20], [x21], x12 — read 32-bit word at [x21], add x12, store to [x20]:

width  = 00
aluop  = 00000
funct7 = 0000000
rs2    = 12, rs1 = 21, funct3 = 011, rd = 20
Encoding: 0000000 01100 10101 011 10100 1011011

MMDOR [x10], [x10], x14 — in-place 64-bit OR of mem64[x10] with x14:

width  = 01
aluop  = 00011
funct7 = 0100011
rs2    = 14, rs1 = 10, funct3 = 011, rd = 10
Encoding: 0100011 01110 01010 011 01010 1011011
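Both field strings can be regenerated from the operand numbers. In the sketch below, `encode_r_type` and `fields` are hypothetical helper names; the field order matches the R-type diagram in §5.4.

```python
def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int,
                  opcode: int = 0x5B) -> int:
    """Assemble a 32-bit R-type word (custom-2 opcode by default)."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def fields(word: int) -> str:
    """Render a word as funct7 rs2 rs1 funct3 rd opcode, space-separated."""
    b = f"{word:032b}"
    return " ".join([b[0:7], b[7:12], b[12:17], b[17:20], b[20:25], b[25:32]])

mmwadd = encode_r_type(0b0000000, 12, 21, 0b011, 20)   # MMWADD [x20], [x21], x12
mmdor = encode_r_type(0b0100011, 14, 10, 0b011, 10)    # MMDOR  [x10], [x10], x14

print(fields(mmwadd))
print(fields(mmdor))
```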

5.4.7 Implementation Cost

Load-op-store is the most expensive Xcrisp instruction and the one most likely to drive microarchitectural complexity. The instruction requires:

One memory read from [rs1]
One ALU op on the load result and rs2
One memory write to [rd]

The architecture permits any of three implementation strategies:

1. Sequential micro-ops. Decode into three internal operations (load, ALU, store) and issue them sequentially. Simplest to implement; latency 3+ cycles, throughput 1 per 3 cycles. Suitable for compact pipelines where load-op-store is rare.
2. Pipelined load → ALU → store. Treat the instruction as occupying three pipeline stages in sequence, but allow successive load-op-store instructions to overlap (one in load, one in ALU, one in store). Steady-state throughput one per cycle; latency 3 cycles per instruction. Requires separate memory read and write ports on the load/store unit (or a dual-pumped path to main memory). The natural FireStorm target.
3. Same-cycle load/ALU/store. A wide single-cycle implementation reads the source, computes the result, and issues the store all in one cycle, completing in 1 cycle of latency. Requires a very fast critical path and may bottleneck on the memory port count. Likely impractical without substantial pipelining work.

Strategy 2 captures most of the performance win at modest implementation cost and is the recommended baseline. An implementation may freely choose to fall back to strategy 1 for unaligned accesses or other corner cases.
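To make the difference between the sequential and pipelined strategies concrete, here is a back-of-envelope cycle model. It assumes each of the three stages takes exactly one cycle and that nothing stalls (no cache misses, no port conflicts); real figures will differ, and the function names are illustrative.

```python
def sequential_cycles(n: int) -> int:
    """Sequential micro-ops: three operations per instruction, no overlap."""
    return 3 * n

def pipelined_cycles(n: int) -> int:
    """Pipelined load -> ALU -> store: one instruction enters per cycle,
    and the last one needs two more cycles to drain."""
    return 0 if n == 0 else n + 2

n = 1000
print(sequential_cycles(n), pipelined_cycles(n))   # prints 3000 1002
```

For a long run of back-to-back load-op-store instructions the pipelined form approaches a 3x advantage; for an isolated instruction the two cost the same three cycles.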

5.5 Block Memory Operations (funct3 = `010`)

Block memory operations come in two flavours: synchronous (BMCPY, BMSET) execute on the CPU's load/store ports and are interruptible per §5.5.1; asynchronous (DMACPY, DMASET) hand the work to a hardware DMA queue and return immediately per §5.5.2. The choice is made per call site: small copies use the synchronous path (no DMA setup overhead, predictable latency); large transfers use DMA to overlap with CPU work.

For this sub-family, funct7 directly selects the operation; width bits are reserved (must be zero).

funct7	Mnemonic	Operands	Operation	Section
`0000000`	BMCPY	`rd, rs1, rs2`	Synchronous copy: `rs2` bytes from `rs1` to `rd`. All three registers advance to reflect progress; on completion `rs2 = 0`.	§5.5.1
`0000001`	BMSET	`rd, rs1, rs2`	Synchronous fill: write `rs1[7:0]` to `mem8[rd]` × `rs2`. `rd` advances, `rs2` counts down.	§5.5.1
`0000010`	DMACPY	`rd, rs1, rs2`	Asynchronous copy: enqueue copy of `rs2` bytes from `rs1` to `rd` on the DMA queue; CPU continues. `rs2` becomes DMA-tagged.	§5.5.2
`0000011`	DMASET	`rd, rs1, rs2`	Asynchronous fill: enqueue fill of `rs2` bytes at `rd` with byte `rs1[7:0]`. `rs2` becomes DMA-tagged.	§5.5.2
`0000100`	BMCMP	`rd, rs1, rs2`	Compare `rs2` bytes at `rd` vs `rs1`. Reserved for v0.2.	—
`0000101`–`1111111`	reserved	—	illegal-instruction	—

5.5.1 Synchronous Block Operations (BMCPY, BMSET)

Synchronous block operations are interruptible: they update their register operands as they progress, and a trap mid-execution leaves the registers in a consistent restartable state.

Restart semantics. On an asynchronous trap mid-block-op, the architectural state holds:

rd, rs1 advanced past completed bytes
rs2 reduced to the byte count remaining
PC pointing at the block instruction (not past it)

Returning from the trap re-executes the instruction with the partially-advanced register state, resuming from where it stopped. This requires the trap return path to use mret with the same PC, which is standard behaviour.
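The restart behaviour can be modelled in a few lines. The sketch below assumes a byte-at-a-time implementation and uses a `budget` parameter to stand in for "a trap arrives after this many bytes"; both the function name `bmcpy_step` and that parameter are inventions for the example.

```python
def bmcpy_step(regs, mem, rd, rs1, rs2, budget):
    """Run BMCPY until it finishes or `budget` bytes have been copied.

    Returns True when the instruction completes (PC would advance) and
    False when it is interrupted (PC stays on the BMCPY for re-execution).
    regs maps register name -> value; mem maps byte address -> byte.
    """
    while regs[rs2] > 0:
        if budget == 0:
            return False                   # trap taken; registers already consistent
        mem[regs[rd]] = mem.get(regs[rs1], 0)
        regs[rd] += 1                      # destination advances
        regs[rs1] += 1                     # source advances
        regs[rs2] -= 1                     # remaining count shrinks
        budget -= 1
    return True

regs = {"a0": 0x200, "a1": 0x100, "a2": 10}
mem = {0x100 + i: i + 1 for i in range(10)}
done = bmcpy_step(regs, mem, "a0", "a1", "a2", budget=4)    # interrupted after 4 bytes
done = bmcpy_step(regs, mem, "a0", "a1", "a2", budget=100)  # re-executed, completes
```

Because all progress lives in the three registers, re-executing the same instruction needs no hidden state.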

Overlap behaviour. For BMCPY, the result for overlapping source and destination regions where rd > rs1 (forward overlap that would corrupt the source) is implementation-defined: a conservative implementation falls back to byte-at-a-time copy; an optimistic implementation uses wider transfers and its result is undefined for overlapping ranges. Code that needs memmove semantics for overlapping regions should test for overlap and either copy in the opposite direction or call a library routine.

Width hint (future). Although width[1:0] in funct7 is reserved for v0.1, a future revision may use it to express "minimum guaranteed alignment of the operands" (e.g., 01 = both pointers 8-byte aligned, allowing the implementation to issue 64-bit transfers). For v0.1, the implementation infers alignment from the runtime values.

5.5.2 Asynchronous DMA Operations (DMACPY, DMASET)

DMA operations enqueue a memory transfer onto a hardware DMA queue and return in one cycle, leaving the CPU free to execute other instructions while the transfer proceeds in parallel. The CPU and the DMA engine synchronise through the count register, which is hardware-tagged "DMA-pending" until the operation completes.

Operand Capture at Issue

When DMACPY or DMASET issues, the DMA engine captures the values of rs1 (source pointer or fill byte), rd (destination pointer), and rs2 (byte count). The DMA engine owns these copies until the operation completes; the CPU's source and destination registers (rs1, rd) are not subsequently modified and may be freely reused for other work on the next cycle.

The count register (rs2) is given special treatment — it remains architecturally bound to the DMA's live progress counter for the duration of the operation. See Register Tagging below.

Register Tagging Semantics

At issue, the named rs2 register is recorded in a small hardware DMA tag table (one entry per outstanding DMA, sized to the queue depth). The tag binds the register to the DMA's internal byte-count counter.

While the tag is active:

CPU reads of the tagged register return the live remaining byte count from the DMA engine. The register-file read port has a forwarding path from the DMA engine's count register; reads complete in the normal load-use cycles. This permits progress polling without stalling.
CPU writes to the tagged register stall the pipeline until the DMA completes and the tag clears. The pending write then takes effect on the (now-untagged) register. This is the canonical wait-for-completion mechanism.

When the DMA finishes, the engine drains its final byte count (0) into the register file, clears the tag, and any blocked write proceeds.
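A toy software model may help fix the read/write rules. Everything here (the class `DmaTaggedRegs`, its method names, and modelling a stall as a `False` return) is an assumption for illustration; real hardware stalls the pipeline rather than returning a status.

```python
class DmaTaggedRegs:
    """Toy register file with a DMA tag table (register name -> bytes remaining)."""

    def __init__(self):
        self.regs = {}
        self.tags = {}

    def issue_dma(self, count_reg):
        """Tag the count register at DMACPY/DMASET issue."""
        if self.regs.get(count_reg, 0) > 0:      # zero-byte DMA clears immediately
            self.tags[count_reg] = self.regs[count_reg]

    def dma_progress(self, count_reg, nbytes):
        """The engine reports nbytes transferred; clear the tag on completion."""
        remaining = max(0, self.tags[count_reg] - nbytes)
        if remaining == 0:
            del self.tags[count_reg]
            self.regs[count_reg] = 0             # final count drained to the register
        else:
            self.tags[count_reg] = remaining

    def read(self, reg):
        """Reads never stall: a tagged register returns the live count."""
        return self.tags.get(reg, self.regs.get(reg, 0))

    def write(self, reg, value):
        """Return False (would stall) while tagged, else perform the write."""
        if reg in self.tags:
            return False
        self.regs[reg] = value
        return True
```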

DMAWAIT Idiom

The assembler provides a pseudo-instruction:

DMAWAIT  rs    ; expands to:  ADDI rs, rs, 0

When rs is currently DMA-tagged, this stalls the CPU until the DMA completes. When rs is not tagged, it is a no-op. The pseudo makes the intent explicit in source code without consuming a separate encoding.

Common usage:

DMACPY  a0, a1, a2      ; queue 1MB copy; a2 holds count (becomes tagged)
; ... CPU does other useful work for hundreds of cycles ...
DMAWAIT a2              ; block until copy complete
ld      t0, 0(a0)       ; safe to read destination now

Progress Polling

Because reads of the tagged register do not stall, software can monitor DMA progress for use-cases like watchdog timeouts or partial-completion processing:

DMACPY  a0, a1, a2      ; large copy, a2 = 1048576
.Lwait:
    mv      t0, a2          ; t0 = live remaining count (no stall)
    bnez    t0, .Lcheck     ; ... or process partial data, etc.
    j       .Ldone
.Lcheck:
    ; ... do some work that does NOT depend on the destination region ...
    j       .Lwait
.Ldone:

Queue Depth and Back-Pressure

The DMA engine has a queue of pending operations whose depth is implementation-defined. Suggested values:

FireStorm variant	DMA queue depth
All models	8

When the queue is full, a new DMACPY/DMASET issue stalls until a slot becomes free. The queue capacity is reported in the mxcrisp CSR (§12.3, DMA_QUEUE_DEPTH field).

The DMA tag table has the same number of entries as the queue, so each outstanding DMA can tag a distinct register. Issuing a new DMA naming a still-tagged register stalls until that register's tag clears — this is the natural back-pressure mechanism for serialised DMA dispatch from a single producer.

Cache Coherence

FireStorm has a small 8 KB direct-mapped write-through D-cache covering DDR3 data accesses (§5.2 of the parent doc); the hot data structures (Xstack frames, Xctx contexts, scratchpad-resident voice state, etc.) live in dedicated BSRAM and do not pass through the cache.

DMA coherence is handled automatically at two levels:

D-cache lines covering DMA-target addresses are auto-invalidated. For each DMA write to address A, the cache index (A >> 5) & 0xFF is computed and that line's valid bit is cleared. No software flush is needed.
Prefetch buffer ranges overlapping DMA-target addresses are auto-invalidated (§4.7 of the parent doc). This handles DMA-to-code coherence for JIT compilers and dynamic loaders.
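The index formula implies 32-byte lines (the shift by 5) and 256 lines (the 0xFF mask), which matches the 8 KB cache size. The sketch below lists which line indices a DMA write would invalidate; the function names are illustrative, and the real hardware performs this per write rather than per range.

```python
LINE_SHIFT = 5      # 32-byte lines, from the >> 5 in the index formula
NUM_LINES = 256     # from the & 0xFF mask; 256 * 32 B = 8 KB

def cache_index(addr: int) -> int:
    """D-cache line index for a byte address."""
    return (addr >> LINE_SHIFT) & 0xFF

def lines_invalidated(dest: int, nbytes: int) -> set:
    """Set of line indices whose valid bit is cleared by a DMA write range."""
    if nbytes <= 0:
        return set()
    first = dest >> LINE_SHIFT
    last = (dest + nbytes - 1) >> LINE_SHIFT
    count = min(last - first + 1, NUM_LINES)   # beyond 256 lines every index is hit
    return {(first + i) & 0xFF for i in range(count)}

print(sorted(lines_invalidated(0x1FE0, 64)))   # prints [0, 255]
```

Any transfer of 8 KB or more clears the whole cache under this scheme, since the index wraps.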

Scratchpad-targeted DMA and BSRAM-region DMA (Xstack, Xctx) need no special coherence handling — they bypass the cache entirely.

DMA reads from DRAM see current data because the write-through D-cache always streams stores to DRAM.

External DMA agents that bypass the FireStorm DMA engine (e.g., a DMA controller on the chipset writing through a different path) must coordinate explicitly via mxdcache_flush_addr and mxbuf_flush_addr.

Trap and Interrupt Behaviour

DMA operations are independent of CPU traps and continue running through interrupts. A trap handler may freely issue DMACPY/DMASET (subject to the queue depth). If the CPU traps while stalled on a tagged-register write, the trap is taken normally and the still-pending DMA continues — the handler observes the still-tagged register and may either ignore it (the original stall resumes after return) or itself wait on it via DMAWAIT.

If the DMA encounters a memory fault mid-transfer (unmapped page, write to read-only region, etc.), the engine raises an asynchronous trap and reports the faulting address and DMA ID in a status CSR (TBD; see open items §13). The associated count register retains the remaining byte count at the fault; the tag is not cleared until software acknowledges. This permits a recovery handler to inspect and either retry or abort.

Operand Edge Cases

rs2 = 0 (or rs2 = x0): zero-byte DMA, no-op. The DMA may still consume a queue slot briefly; tags clear immediately.
rs1 = rd for DMACPY: defined as a no-op copy (copy region overlaps itself trivially).
Overlapping source/destination regions for DMACPY: same implementation-defined behaviour as BMCPY (§5.5.1).
DMA targeting memory-mapped I/O: permitted and useful (e.g., streaming audio buffers to a DAC FIFO). Implementation-defined whether the DMA engine respects MMIO ordering constraints; the suggested behaviour is "treat MMIO writes as strictly ordered, identical to CPU MMIO stores."

Worked Examples

1 MB memory clear, overlapped with CPU work:

        li      a2, 1048576              ; count = 1 MB
        li      a1, 0                    ; fill byte = 0
        la      a0, buffer               ; destination pointer
        DMASET  a0, a1, a2               ; queue the clear, a2 now tagged
        ; --- CPU does ~hundreds of microseconds of other work here ---
        jal     ra, prepare_next_frame
        jal     ra, run_audio_callback
        ; --- finally need the buffer ---
        DMAWAIT a2                       ; ensure clear is complete
        ; buffer now zeroed; safe to use

If the "other work" takes longer than the DMA, the DMAWAIT is a no-op; if it takes less, the wait stalls for the remainder. Either way the total wall time is max(DMA_time, CPU_work_time) rather than their sum.

Double-buffered audio block render:

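render_loop:
        li      blk_bytes_a, BLK_BYTES          ; reload count each pass; it reads 0 after the previous DMA (BLK_BYTES is an assumed constant)
        DMACPY  out_a, render_a, blk_bytes_a    ; flush previous block; blk_bytes_a tagged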
        ; while DMA copies block A to output:
        jal     ra, render_block_b              ; CPU renders block B into render_b
        DMAWAIT blk_bytes_a                     ; wait for A's copy to finish
        ; swap A/B pointers
        mv      tmp, render_a
        mv      render_a, render_b
        mv      render_b, tmp
        mv      tmp, out_a
        mv      out_a, out_b
        mv      out_b, tmp
        j       render_loop

The DMA copy of the previous block overlaps with the CPU's rendering of the next. Throughput is determined by the slower of (DMA bandwidth, CPU render time) rather than their sum.

Encoding example — DMACPY x10, x11, x12 (copy x12 bytes from [x11] to [x10]):

funct7 = 0000010
rs2    = 12, rs1 = 11, funct3 = 010, rd = 10
Encoding: 0000010 01100 01011 010 01010 1011011

5.6 Examples

LDADD x8, (x10), x12 — read 64-bit dword from [x10], add x12, write to x8:

width  = 01
aluop  = 00000
funct7 = 0100000
rs2    = 12, rs1 = 10, funct3 = 000, rd = 8
Encoding: 0100000 01100 01010 000 01000 1011011

ADDSW [x5], x6, x7 — write x6 + x7 (low 32 bits) to mem32[x5]:

width  = 00
aluop  = 00000
funct7 = 0000000
rs2    = 6  (the first register operand after the bracketed destination)
rs1    = 5  (the bracketed destination)
funct3 = 001
rd     = 7  (interpreted as rs3, the last-named operand)
Encoding: 0000000 00110 00101 001 00111 1011011

Assembler operand order: ADDSW [rs1], rs2, rs3. The encoding places rs2 at [24:20] and rs3 (in the rd field) at [11:7]. For commutative operations such as ADD the two sources are interchangeable, but the order is significant for SUBSW and the shift forms.

BMCPY x10, x11, x12 — copy x12 bytes from [x11] to [x10]:

funct7 = 0000000
rs2    = 12
rs1    = 11
funct3 = 010
rd     = 10
Encoding: 0000000 01100 01011 010 01010 1011011
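
The three encodings above share the standard R-type field placement, so they can be cross-checked mechanically. The following is a minimal sketch in C, not part of any FireStorm toolchain; the helper names (xcrisp_encode_rtype and the example_* wrappers) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define XCRISP_CUSTOM2 0x5Bu  /* opcode for the memory-fused family */

/* Hypothetical helper: pack R-type fields into a 32-bit custom-2 word. */
static uint32_t xcrisp_encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                                    uint32_t funct3, uint32_t rd)
{
    assert(funct7 < 128u && rs2 < 32u && rs1 < 32u);
    assert(funct3 < 8u && rd < 32u);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | XCRISP_CUSTOM2;
}

/* LDADD x8, (x10), x12: funct7 = width 01 followed by aluop 00000 */
static uint32_t example_ldadd(void) { return xcrisp_encode_rtype(0x20u, 12u, 10u, 0u, 8u); }

/* ADDSW [x5], x6, x7: x7 in rs2, x6 in the rd field (read as rs3) */
static uint32_t example_addsw(void) { return xcrisp_encode_rtype(0x00u, 7u, 5u, 1u, 6u); }

/* BMCPY x10, x11, x12 */
static uint32_t example_bmcpy(void) { return xcrisp_encode_rtype(0x00u, 12u, 11u, 2u, 10u); }
```

Packed this way, the three examples come out as 0x40C5045B, 0x0072935B, and 0x00C5A55B, which match the bit strings listed above.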

6. Compare-Mem-Branch (custom-3, opcode `0x7B`)

6.1 Encoding

Standard B-type layout:

 31           25 24    20 19    15 14   12 11           7 6           0
+--------------+--------+--------+-------+--------------+-------------+
| imm[12|10:5] |  rs2   |  rs1   | funct3| imm[4:1|11]  |  1111011    |
+--------------+--------+--------+-------+--------------+-------------+

Field	Bits	Meaning
`imm[12:1]`	[31:25] ‖ [11:7] (scrambled, standard B-type pattern)	Signed 13-bit branch offset (bit 0 is implicit zero, half-word alignment)
`rs1`	[19:15]	First operand (a register value)
`rs2`	[24:20]	Second operand (interpreted as a base address — memory is read from `mem[rs2]`)
`funct3`	[14:12]	Condition + width
opcode	[6:0]	`0x7B` (custom-3)

The branch range follows the underlying B-type encoding rules:

Narrow mode: standard B-type ±4 KiB (imm12 with ×2 byte scaling, bit[0] implicit zero).
Wide mode: imm14 with ×4 slot scaling (bits[1:0] implicit zero), giving ±32 KiB. The compare-mem-branch instruction inherits the wide-mode immediate extension and slot-indexed PC convention described in §7.3.2 and §8.6 of ee_cpu.

6.2 Funct3 Encoding

The condition encoding mirrors standard branches; the funct3 = 010 and 011 slots (unused in standard RV) host dword variants:

funct3	Mnemonic	Condition
`000`	BEQM	`(rs1 as 32-bit) == sext32(mem32[rs2])`
`001`	BNEM	`(rs1 as 32-bit) != sext32(mem32[rs2])`
`010`	BEQMD	`rs1 == mem64[rs2]`
`011`	BNEMD	`rs1 != mem64[rs2]`
`100`	BLTM	`(int32)rs1 < sext32(mem32[rs2])`
`101`	BGEM	`(int32)rs1 >= sext32(mem32[rs2])`
`110`	BLTUM	`(uint32)rs1 < zext32(mem32[rs2])`
`111`	BGEUM	`(uint32)rs1 >= zext32(mem32[rs2])`

Ordered comparisons (BLTM, BGEM, BLTUM, BGEUM) are word-only in v0.1; dword ordered compare-mem-branch is reserved for v0.2 (would require an additional opcode partition or width-modifier bit).
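
The condition table can be restated as a small software reference model. This is a sketch under stated assumptions: the function name cmb_taken is hypothetical, and the caller is assumed to pass in the value already read from mem[rs2] (only the low 32 bits are used by the word variants).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reference model (sketch): would a compare-mem-branch be taken?
 * rs1 is the full 64-bit register value; mem is the value read at [rs2]. */
static bool cmb_taken(unsigned funct3, uint64_t rs1, uint64_t mem)
{
    assert(funct3 < 8u);
    uint32_t u1 = (uint32_t)rs1, um = (uint32_t)mem;   /* low words */
    int32_t  s1 = (int32_t)u1,   sm = (int32_t)um;     /* signed views */
    switch (funct3) {
    case 0:  return u1 == um;     /* BEQM  */
    case 1:  return u1 != um;     /* BNEM  */
    case 2:  return rs1 == mem;   /* BEQMD */
    case 3:  return rs1 != mem;   /* BNEMD */
    case 4:  return s1 <  sm;     /* BLTM  */
    case 5:  return s1 >= sm;     /* BGEM  */
    case 6:  return u1 <  um;     /* BLTUM */
    default: return u1 >= um;     /* BGEUM (funct3 == 7) */
    }
}
```

The model makes the word/dword distinction visible: a register holding 0xFFFFFFFF00000005 matches a memory word of 5 under BEQM but not under BEQMD.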

6.3 Examples

BEQM x10, (x11), .L1 — branch to .L1 if x10[31:0] equals mem32[x11]:

funct3 = 000
rs1    = 10, rs2 = 11
imm    = offset to .L1
Encoding (with offset = +16): imm field scrambled per B-type:
   imm[12]=0, imm[10:5]=000000, imm[4:1]=1000, imm[11]=0
   → bits [31:25] = 0_000000 = 0x00
   → bits [11:7]  = 1000_0   = 0x10
   Full: 0000000 01011 01010 000 10000 1111011
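
The B-type scramble in the example above is easy to get wrong by hand. As an illustration only, the following C sketch packs a narrow-mode compare-mem-branch from a byte offset; xcrisp_encode_cmb is a hypothetical name, and wide-mode slot scaling is deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: pack a compare-mem-branch (opcode 0x7B) using the standard
 * B-type immediate scramble. offset is the byte offset to the target. */
static uint32_t xcrisp_encode_cmb(uint32_t funct3, uint32_t rs1, uint32_t rs2,
                                  int32_t offset)
{
    assert(funct3 < 8u && rs1 < 32u && rs2 < 32u);
    assert(offset >= -4096 && offset <= 4094 && (offset & 1) == 0);
    uint32_t imm = (uint32_t)offset & 0x1FFFu;   /* 13-bit two's complement */
    /* bits [31:25] = imm[12] followed by imm[10:5] */
    uint32_t hi = (((imm >> 12) & 1u) << 6) | ((imm >> 5) & 0x3Fu);
    /* bits [11:7] = imm[4:1] followed by imm[11] */
    uint32_t lo = (((imm >> 1) & 0xFu) << 1) | ((imm >> 11) & 1u);
    return (hi << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (lo << 7) | 0x7Bu;
}
```

With funct3 = 000, rs1 = 10, rs2 = 11 and offset = +16 this yields 0x00B5087B, the same bits as the Full line above.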

BNEMD x4, (x5), .L_end — loop exit when x4 != mem64[x5]:

funct3 = 011
rs1    = 4, rs2 = 5

6.4 Common Idioms

Null-terminated string scan:

    ; x10 = pointer, x11 = address of a 32-bit word holding 0 (the terminator value)
loop:
    LBPI    x12, 1(x10)       ; read byte, advance pointer
    BNEM    x12, (x11), loop   ; loop while not terminator
    ; ... at end: x10 points past terminator, x12 = 0

A two-instruction inner loop, one cycle per byte on a forwarding implementation.

Lookup-table walk:

    ; x10 = key, x11 = table base, x12 = entry stride, x13 = table end pointer
loop:
    BEQM    x10, (x11), found
    ADD     x11, x11, x12
    BNE     x11, x13, loop      ; standard branch on table-end pointer
found:
    ; x11 = pointer to matching entry, or x11 == x13 if no entry matched

7. B-tree Primitives (custom-2, opcode `0x5B`)

B-trees and related sorted-array data structures (sorted vectors, sorted hash buckets, ordered indexes) are fundamental to databases, key-value stores, set/map containers, and any code that maintains a sorted collection. The hot operation in every B-tree is find the first key ≥ target within a node — a sequential or branchy binary search through a small sorted array, typically 16–64 keys.

Standard RV64GC implements this as a comparison loop with conditional branches, which:

Mispredicts roughly half the time (the comparison outcome depends on the data),
Takes one cycle per key compared,
Pollutes the branch predictor with high-entropy branches.

For a 16-key B-tree node, software search is 16 compares + 8–16 branches + ~5 mispredicts × ~15 cycles each = ~100 cycles per node visit. With a tree depth of 5–7, a single B-tree lookup costs 500–700 cycles, dominated by branch mispredicts.

The Xcrisp B-tree primitives provide fixed-width parallel search of a sorted array of keys, returning the first position satisfying key[i] >= target. The operation is parallelised across an entire 64-byte cache line per instruction, with no branches.

7.1 BSRCH Family — Parallel Sorted-Array Search

Mnemonic	Key width	Keys per instruction	rd width	Operation
`BSRCH.B rd, rs1, rs2`	8-bit	64	7-bit position (0–64)	Find first 8-bit key in `mem[rs2..rs2+63]` ≥ low byte of rs1
`BSRCH.H rd, rs1, rs2`	16-bit	32	6-bit position (0–32)	Find first 16-bit key ≥ low halfword of rs1
`BSRCH.W rd, rs1, rs2`	32-bit	16	5-bit position (0–16)	Find first 32-bit key ≥ low word of rs1
`BSRCH.D rd, rs1, rs2`	64-bit	8	4-bit position (0–8)	Find first 64-bit key ≥ rs1

All variants:

Read 64 bytes (one cache line) starting at the address in rs2. The address must be 64-byte aligned; misaligned addresses trap.
Compare each key against the search target in rs1 in parallel.
Return in rd the index of the lowest position where key[i] >= target. If no key satisfies, return the total count (sentinel "not found within this node").
Keys are assumed sorted ascending. If unsorted, the result is the first matching position but no ordering is implied.
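
The shared semantics listed above can be modelled in portable C, which could also serve as a fallback on variants without Xcrisp. The sketch below covers BSRCH.W only. The names bsrch_w_model and bsrch_w_demo are hypothetical, and the use of a signed comparison is an assumption, since the table above does not state signedness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BSRCH_W_KEYS 16u  /* 64-byte line / 4-byte keys */

/* Software model (sketch) of BSRCH.W: index of the first key >= target,
 * or BSRCH_W_KEYS when no key qualifies. Signed compare is an assumption.
 * The hardware additionally requires keys to be 64-byte aligned. */
static unsigned bsrch_w_model(int32_t target, const int32_t *keys)
{
    assert(keys != NULL);
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++) {
        if (keys[i] >= target)
            return i;
    }
    return BSRCH_W_KEYS;
}

/* Demo node holding the sorted keys 10, 20, ..., 160. */
static unsigned bsrch_w_demo(int32_t target)
{
    int32_t keys[BSRCH_W_KEYS];
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++)
        keys[i] = (int32_t)(10u * (i + 1u));
    return bsrch_w_model(target, keys);
}
```

On the demo node, a target of 35 returns position 3 (the key 40), and a target larger than every key returns the sentinel 16.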

Latency: 4 cycles (load 64 bytes from D-cache + 8/16/32/64 parallel compares + priority encoder + writeback). On a cache miss the load latency dominates and the operation effectively takes the cache-fill time.

Throughput: 1 per cycle pipelined; 1 per 4 cycles in dependency chain.

Mode availability: both narrow and wide. Narrow mode addresses up to 32 keys per call (BSRCH.B/H/W return position fitting in 5 bits); BSRCH.B returning position 33–64 requires wide mode's 6-bit return register width.

Encoding

BSRCH.X: opcode = 0x5B, funct3 = 010, funct7 = 0010xxx
   funct7[2:0] = width selector
      000 = .B (64 keys × 8-bit)
      001 = .H (32 keys × 16-bit)
      010 = .W (16 keys × 32-bit)
      011 = .D ( 8 keys × 64-bit)
      100–111 = reserved (future widths: 128 keys × 4-bit for sub-byte indexes, etc.)

Example: B-tree Node Search

// C version: linear search
int find_position(int32_t target, int32_t *keys, int n) {
    int i = 0;
    while (i < n && keys[i] < target) i++;
    return i;
}

Standard RV64GC with n=16 mispredicts roughly half the iterations. Average cost: ~50 cycles for n=16, dominated by mispredicts.

With BSRCH:

    ; a0 = target, a1 = key array (64-byte aligned, n = 16)
    BSRCH.W  a0, a0, a1                ; a0 = position (0..16), 4 cycles
    ret

One instruction, 4 cycles, no branches, no mispredicts. ~12× speedup on the inner search; per full B-tree lookup at depth 5, total speedup ~10× since the search dominates.

7.2 BSCAN Family — First-Match Search

A variant of BSRCH that searches for an exact-match key (returns position of first key[i] == target, or N if not found). Used in hash table chaining, dictionary lookups within a small bucket, and validation paths in indexed structures.

Mnemonic	Key width	Keys per instruction
`BSCAN.B rd, rs1, rs2`	8-bit	64
`BSCAN.H rd, rs1, rs2`	16-bit	32
`BSCAN.W rd, rs1, rs2`	32-bit	16
`BSCAN.D rd, rs1, rs2`	64-bit	8

Same encoding family as BSRCH, with funct7[5:3] distinguishing operation:

BSRCH: funct7 = 0010xxx
BSCAN: funct7 = 0011xxx

Latency and semantics identical to BSRCH except the comparison is equality rather than ≥.

Use Case: Hash Bucket Probe

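// 8-way hash bucket with 16-bit fingerprints; probe for match.
// BSCAN.H is written as a pseudo-intrinsic; bucket must be 64-byte aligned,
// and bucket_values and MISS are assumed to be defined elsewhere.
int probe_bucket(uint16_t fingerprint, uint16_t *bucket) {
    int pos;
    BSCAN.H pos, fingerprint, bucket;    // 4 cycles
    if (pos < 8) return bucket_values[pos];
    return MISS;
}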

For an in-memory hash table with cuckoo or chaining within bucket-sized arrays, BSCAN.H + BSCAN.B replace the multi-cycle scan and branch sequence with single-instruction, branch-free probes.

7.3 BSHIFT — Block Shift for Insert/Delete

When inserting a new key into a sorted B-tree node, all keys at and after the insertion position must shift right by one slot. When deleting, the keys after the deletion position shift left. This is fundamentally a memmove operation on a small fixed-size range.

Existing Xcrisp BMCPY (§5.5) handles general overlap-aware block memory copy and is the right tool for shifts on multi-cache-line nodes. For the common case of B-tree node shifts within a 64-byte cache line, BSHIFT is a single-instruction primitive:

Mnemonic	Operation
`BSHIFTR.X rd, rs1, rs2`	Shift keys in `mem[rs1]` right (toward higher addresses) by `rs2` slots, starting at position 0
`BSHIFTL.X rd, rs1, rs2`	Shift keys in `mem[rs1]` left (toward lower addresses) by `rs2` slots, starting at position rs2

Variants for width .X ∈ {B, H, W, D} match the key sizes of BSRCH.

rs2 is the shift count (typically 1 for insert-one or delete-one operations). rd returns the actual number of slots shifted (lower than rs2 if the shift would have moved data outside the 64-byte window — useful for chain insertion across nodes).

Encoding

BSHIFT: opcode = 0x5B, funct3 = 010, funct7 = 0100xxd
   funct7[2:1] = width selector (00=.B, 01=.H, 10=.W, 11=.D)
   funct7[0]   = direction (0 = right/insert, 1 = left/delete)
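
The funct7 layouts of the three B-tree instruction groups (BSRCH in §7.1, BSCAN in §7.2, BSHIFT here) can be summarised in code. This is an illustrative sketch; the function names are hypothetical, and the width argument uses the selector values 0 to 3 for .B, .H, .W, .D.

```c
#include <assert.h>
#include <stdint.h>

/* BSRCH: funct7 = 0010xxx, xxx = width selector (100-111 reserved) */
static uint32_t bsrch_funct7(unsigned width)
{
    assert(width < 4u);
    return 0x10u | width;
}

/* BSCAN: funct7 = 0011xxx, same width selector as BSRCH */
static uint32_t bscan_funct7(unsigned width)
{
    assert(width < 4u);
    return 0x18u | width;
}

/* BSHIFT: funct7 = 0100xxd, xx = width, d = direction (0 right, 1 left) */
static uint32_t bshift_funct7(unsigned width, int left)
{
    assert(width < 4u);
    return 0x20u | (width << 1) | (left ? 1u : 0u);
}
```

For example, BSRCH.W uses funct7 = 0010010 (0x12) and BSHIFTR.W uses funct7 = 0100100 (0x24); all three groups share opcode 0x5B and funct3 = 010.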

Latency: 5 cycles (load 64 bytes + barrel shift + store 64 bytes back). Throughput: 1 per 5 cycles.

Example: B-tree Insert at Found Position

    ; Find insertion position
    BSRCH.W  pos, key, node_keys           ; 4 cycles
    ; Compute the address of slot pos
    slli     offset, pos, 2                 ; pos × 4 = byte offset
    add      slot_addr, node_keys, offset
    ; Shift keys at pos..end right by 1. BSHIFTR starts at position 0 of its
    ; base, so the base is slot pos (assumes a slot-aligned base is accepted).
    li       count, 1
    BSHIFTR.W rd_count, slot_addr, count   ; 5 cycles
    ; Slot at pos is now free: write the new key
    sw       key, 0(slot_addr)
    ; Done: 4 + 5 = 9 cycles in BSRCH + BSHIFTR (excluding cache effects)

vs standard RV64GC (which uses scalar memmove with conditional branches): ~60 cycles for the equivalent operation.

~7× speedup on B-tree insertion, sustained at every level of the tree during an insert path.

7.4 Performance Impact on Database / Index Workloads

For a typical in-memory ordered index (B+ tree with 32-key nodes, 5-level tree, 10M-entry index):

Operation	Standard RV64GC	With BSRCH/BSHIFT	Speedup
Point lookup (find one key)	~600 cycles	~60 cycles	10×
Sequential range scan (init)	~600 cycles (find start)	~60 cycles	10×
Insert one key	~1200 cycles	~150 cycles	8×
Delete one key	~1100 cycles	~140 cycles	8×
Bulk-load 1M entries	~10 s at 380 MHz	~1.3 s	8×

For workloads dominated by index access — relational query engines, key-value stores, sorted-set caches, ordered-merge joins — these speedups translate directly to overall throughput improvements.

The ~4.2K LUTs + 2 BSRAM blocks of dedicated B-tree hardware (§7.5) represent one of the highest speedup-per-LUT ratios in any FireStorm extension. For a system that hosts a serious database or indexed query engine, the B-tree primitives are likely the single most valuable instruction family.

7.5 Implementation Cost

Hardware structure:

64-byte staging register (one cache line, 512 bits) — read from D-cache or scratchpad in one cycle.
64 parallel 8-bit comparators, partitionable into 32×16-bit, 16×32-bit, or 8×64-bit modes for the different BSRCH widths.
Priority encoder producing the lowest set position from the comparator outputs.
Barrel shifter (64-byte rotate) for BSHIFT.
Write-back path to D-cache or scratchpad (for BSHIFT only).

Component	LUTs
64 × 8-bit comparator array (with width-mode mux)	~1500
Priority encoder (with width-mode mask)	~300
64-byte barrel shifter	~2000
Register-file interface and result formatting	~200
Decoder and dispatcher	~150
Total	~4150 LUTs

Plus 2 BSRAM blocks for the staging register (one for load, one for store-back).

On the GW5AST-138: ~3% of LUT budget.

7.6 Mode Behaviour and Composability

Both BSRCH and BSHIFT are available in narrow and wide modes. In wide mode:

Register fields use the extension nibble for 6-bit register indices (rd, rs1, rs2).
Xcond predication (bit 35 = PRED-EN) applies normally — predicated B-tree search is useful for skipping work in deleted/empty nodes.
The 6-bit rd accommodates BSRCH.B's full 0–64 result range; in narrow mode, BSRCH.B is restricted to a 32-key staging area (returning 0–32) since the 5-bit narrow rd cannot represent positions 33–64.

In narrow mode, BSRCH.B should be used with caution — the 32-key restriction is a structural constraint, not a hardware capability difference. Code that needs to search the full 64-byte staging area uses wide mode, or uses BSRCH.H (32×16-bit) instead.

The B-tree primitives compose naturally with:

Xcrisp BMCPY for cross-node shifts (when an insert overflows the 64-byte BSHIFT window).
Xcond predication for conditional searches in compressed/sparse indexes.
Xcrisp X-type indexed loads (wide mode) for following child pointers after a search.

8. Position-Independent Code (Wide-Mode Only)

8.1 Motivation

Standard RV64GC requires two-instruction sequences for every PC-relative access: AUIPC rd, hi20; ADDI rd, rd, lo12 for address materialisation, AUIPC t, hi20; LD rd, lo12(t) for global loads, and AUIPC t, hi20; JALR rd, lo12(t) for long-range direct calls. For a modular system where hot-path code dispatches through GOTs, vtables, or PLT trampolines, these pairs dominate the cross-module instruction count.

The Xcrisp PIC family compresses each of these patterns into a single instruction occupying one 36-bit slot, available only in wide mode (36-bit SRAM fetch). The encoding uses the 0x7F escape mechanism (§8.2) rather than consuming one of the four remaining funct3 slots in custom-2, which keeps that space available for future narrow-mode extensions and gives the PIC instructions a much larger immediate field than they could obtain within the standard RV64 opcode layout.

8.2 The 0x7F Escape Mechanism

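The standard RISC-V instruction-length encoding marks instructions longer than 32 bits with bits[4:2] = 111 (on top of bits[1:0] = 11); within that space, bits[6:0] = 1111111 belongs to the longest formats (80 bits and up). In a 32-bit slot, a value of bits[6:0] = 0x7F is therefore unused and traps as illegal-instruction in any standard RV64 implementation.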

FireStorm wide mode (36-bit fetch only) repurposes 0x7F as a wide-mode extension marker:

 35 34                                7 6           0
+---+----------------------------------+-------------+
| F |       28-bit custom payload      |  1111111    |
+---+----------------------------------+-------------+

Bit [35] selects between two top-level wide-PIC formats (§8.3, §8.4).
Bits [34:7] are the 28-bit instruction payload.
Bits [6:0] = 0x7F mark the wide-extension instruction.

In narrow mode (DDR3 fetch), 0x7F remains illegal-instruction — the escape is invisible to standard RV64 code. In wide mode (36-bit SRAM fetch), the decoder sees the marker and dispatches to the wide-extension decode path.

The escape mechanism is a general-purpose lane for wide-mode-only instructions. The current Xcrisp PIC family uses it; future FireStorm extensions (DSP primitives, accelerators, etc.) may reuse the same mechanism by allocating sub-encodings within the 28-bit payload, coordinated through the format bit at [35] and a per-format dispatch sub-field.

8.3 W-Type Format (bit[35] = 0) — PC-Relative

For PC-relative loads, address materialisation, and direct calls:

 35 34         16 15  13 12     7 6           0
+---+-------------+------+--------+-------------+
| 0 |  imm[18:0]  |funct3|   rd   |  1111111    |
+---+-------------+------+--------+-------------+

Field	Bits	Meaning
format	[35]	`0` = W-type
`imm[18:0]`	[34:16]	19-bit signed PC-relative offset (scaled per-instruction)
`funct3`	[15:13]	Variant selector
`rd`	[12:7]	6-bit destination register (wide-register-file access)
opcode	[6:0]	`0x7F`

The W-type provides 19 bits of signed immediate, scaled by the natural unit of the operation (byte / halfword / word / dword) to give an effective reach of ±256 KiB to ±2 MiB depending on variant.

funct3	Mnemonic	Scaling	Effective range	Operation
`000`	LDPC	×8 (dword)	±2 MiB	`rd = mem64[PC + sext(imm) × 8]`
`001`	LWPC	×4 (word)	±1 MiB	`rd = sext32(mem32[PC + sext(imm) × 4])`
`010`	LWUPC	×4 (word)	±1 MiB	`rd = zext32(mem32[PC + sext(imm) × 4])`
`011`	LAPC	×1 (byte)	±256 KiB	`rd = PC + sext(imm)` (address materialisation)
`100`	JALPC	×2 (hword)	±512 KiB	`rd = PC + 4; PC = PC + sext(imm) × 2`
`101`	JALXPC	×8 (dword)	±32 KiB	Indexed PC-relative jump-and-link; see §9.4
`110`	X-type dispatch	—	—	Indexed memory load; imm field repurposed (§9.2)
`111`	reserved	—	—	illegal-instruction

The scaling factors match each operation's natural alignment: dwords are 8-byte aligned in the global data area, words are 4-byte aligned, byte-precision is needed for &char_data patterns, halfword precision is sufficient for RVC-aware call targets.

The rd field is 6 bits wide, giving direct access to all 64 wide-mode integer registers (x0–x63) without any further extension mechanism. This is a property of the W-type's roomier encoding compared to standard RV64 instructions.

Semantics note. The PC value used for relative addressing is the address of the PIC instruction itself, matching AUIPC convention. The immediate is sign-extended from 19 bits to 64 bits and then scaled.
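
To make the scaling and sign-extension rules concrete, the sketch below decodes a W-type slot and computes the address it refers to for the five plain PC-relative variants. The function names wtype_imm and wtype_target are hypothetical, and the slot is assumed to be held in the low 36 bits of a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the 19-bit W-type immediate held in bits [34:16]. */
static int64_t wtype_imm(uint64_t slot)
{
    int64_t imm = (int64_t)((slot >> 16) & 0x7FFFFu);
    return (imm & 0x40000) ? imm - 0x80000 : imm;
}

/* Sketch: address referenced by funct3 000..100 (LDPC, LWPC, LWUPC, LAPC,
 * JALPC). pc is the address of the PIC instruction itself. */
static uint64_t wtype_target(uint64_t slot, uint64_t pc)
{
    static const int64_t scale[5] = { 8, 4, 4, 1, 2 };
    unsigned funct3 = (unsigned)((slot >> 13) & 7u);
    assert((slot & 0x7Fu) == 0x7Fu);     /* 0x7F escape marker */
    assert(((slot >> 35) & 1u) == 0u);   /* format bit 0 = W-type */
    assert(funct3 <= 4u);                /* 101 and 110 reuse the imm field */
    return pc + (uint64_t)(wtype_imm(slot) * scale[funct3]);
}
```

An LDPC with imm = 128 issued at PC 0x2000 refers to 0x2400 (128 dwords ahead), while an LAPC with an all-ones immediate field refers to PC - 1.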

8.4 WI-Type Format (bit[35] = 1) — Register-Indirect

For indirect calls (vtable dispatch, PLT, function pointer through a structure):

 35 34       23 22  17 16  13 12     7 6           0
+---+-----------+-------+------+--------+-------------+
| 1 | imm[11:0] |  rs1  |funct4|   rd   |  1111111    |
+---+-----------+-------+------+--------+-------------+

Field	Bits	Meaning
format	[35]	`1` = WI-type
`imm[11:0]`	[34:23]	12-bit signed byte-precise offset
`rs1`	[22:17]	6-bit base register
`funct4`	[16:13]	Variant selector
`rd`	[12:7]	6-bit destination register
opcode	[6:0]	`0x7F`

Both rs1 and rd are 6-bit fields, giving full x0–x63 access without an extension nibble.

funct4	Mnemonic	Operation	Replaces
`0000`	CALLM	`rd = PC + 4; PC = mem64[rs1 + sext(imm)]`	`ld t, off(rs1); jalr rd, t`
`0001`	JMPM	`PC = mem64[rs1 + sext(imm)]` (no return-address save)	`ld t, off(rs1); jr t`
`0010`–`1111`	reserved	—	—

CALLM is the canonical vtable / PLT dispatch instruction. JMPM is the tail-call variant (no return address saved); compilers emit it for goto *fp patterns and for the final hop of trampolines.

The 12-bit byte-precise offset gives ±2 KiB of reach within the indirected table, comfortably covering vtables of up to 256 entries (8 bytes per slot) or PLT-sized dispatch tables.

8.5 Linker Relaxation

Toolchains targeting +xfirestorm emit standard AUIPC + ADDI/LD/JALR pairs against PIC relocations. The linker examines each pair after final layout:

If the target address is within the W-type reach for the operation, the linker relaxes the pair into a single W-type instruction (replacing 8 bytes of pair with 4 bytes of PIC instruction plus 4 bytes of NOP, or compacting the section if alignment permits).
If the resulting code is in a wide section (.text.wide), relaxation is permitted; in narrow sections (.text, .text.crisp), the standard pair is kept.
If the target is out of range, the pair is left as-is.

This means existing PIC-aware code recompiled with +xfirestorm and placed in wide sections automatically gains the density and performance wins, without source-level changes. The linker's PIC relaxation pass operates per-section after layout, similar to existing RISC-V relaxation for JAL ↔ AUIPC+JALR.
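
The range test in the first relaxation step can be sketched as follows. This is not the FireStorm linker's implementation; wtype_in_reach is a hypothetical helper that only checks whether a byte displacement fits a 19-bit signed immediate at a given scale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: can disp (bytes, relative to the PIC instruction) be expressed
 * as a 19-bit signed immediate multiplied by scale (1, 2, 4 or 8)? */
static bool wtype_in_reach(int64_t disp, int64_t scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    if (disp % scale != 0)
        return false;                    /* not aligned to the scaling unit */
    int64_t imm = disp / scale;
    return imm >= -(1 << 18) && imm <= (1 << 18) - 1;
}
```

Note the asymmetry of two's complement: at ×8 a displacement of exactly -2 MiB is reachable, but +2 MiB is not, because the largest positive immediate is 2^18 - 1.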

8.6 Wide-Mode-Only Restriction

The 0x7F escape, and therefore the entire PIC instruction family, is not available in narrow mode. A standard RV64 implementation receiving a 0x7F instruction will trap as illegal-instruction (the spec-defined behaviour for unallocated 32-bit opcodes), and FireStorm's narrow-mode decoder does the same.

Consequences:

PIC instructions live only in .text.wide. A linker that attempts to place a W-type or WI-type instruction in a narrow section is in error; toolchains must enforce this.
DDR3-resident code keeps using standard AUIPC + ADDI/LD/JALR. Modular dynamic-loaded code where modules are in DDR3 is unaffected by this extension and continues to work exactly as on any RV64 implementation.
Module trampolines benefit asymmetrically. A trampoline placed in 36-bit SRAM that bridges DDR3 modules can use CALLM to dispatch into the target module in one instruction, while the modules themselves remain narrow-PIC. This is a clean fit for AntOS-style module dispatch.

8.7 Wide Register Extension

Unlike Xcrisp instructions in standard RV64 opcode space (§3–§6), the PIC family does not use the extension-nibble scheme. The W-type and WI-type encodings already provide 6-bit register fields natively (rd at [12:7], rs1 at [22:17]), addressing x0–x63 directly.

The 36-bit fetch still carries 4 bits beyond the standard 32-bit instruction word, but for PIC instructions those bits are entirely consumed by the format bit [35] and the larger immediate/funct4 fields; there are no spare bits available for hints or sub-encoding.

8.8 Worked Examples

LDPC x40, .Lglobal_table — load a global pointer from a table 1024 bytes ahead:

PC offset = +1024 bytes = +128 dwords = imm 0x080
funct3 = 000 (LDPC)
rd     = 40   (6-bit field = 0b101000)
imm    = 0x00080
format = 0
Encoding bits [35:0]: 0 0000000000010000000 000 101000 1111111
                    = 0x00080147F (36-bit slot)

LAPC x12, .Lstring_const — materialise the address of a string constant 7 bytes ahead (byte-precise):

imm    = 7
funct3 = 011 (LAPC)
rd     = 12
format = 0
Encoding: 0 0000000000000000111 011 001100 1111111

CALLM x1, 24(x10) — vtable dispatch: load function pointer at offset 24 from x10, call with ra = x1:

imm    = 24
rs1    = 10
funct4 = 0000 (CALLM)
rd     = 1
format = 1
Encoding: 1 000000011000 001010 0000 000001 1111111

JALPC x1, .Lfar_function — direct call to a target 200 KiB ahead, well within JALPC's ±512 KiB range:

PC offset = +204800 bytes / 2 = 102400 = imm 0x19000
funct3 = 100 (JALPC)
rd     = 1   (ra)
format = 0
Encoding: 0 0011001000000000000 100 000001 1111111
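
The four encodings above can be reproduced with two small packing functions. The sketch below is illustrative only; encode_wtype and encode_witype are hypothetical names, and each slot is returned in the low 36 bits of a uint64_t.

```c
#include <assert.h>
#include <stdint.h>

#define WIDE_ESCAPE 0x7Full

/* Sketch: pack a W-type slot (format bit [35] = 0). imm is the byte
 * displacement already divided by the variant's scale. */
static uint64_t encode_wtype(int32_t imm, unsigned funct3, unsigned rd)
{
    assert(imm >= -(1 << 18) && imm < (1 << 18));
    assert(funct3 < 8u && rd < 64u);
    return (((uint64_t)imm & 0x7FFFFull) << 16) |
           ((uint64_t)funct3 << 13) | ((uint64_t)rd << 7) | WIDE_ESCAPE;
}

/* Sketch: pack a WI-type slot (format bit [35] = 1). imm is in bytes. */
static uint64_t encode_witype(int32_t imm, unsigned rs1, unsigned funct4,
                              unsigned rd)
{
    assert(imm >= -2048 && imm <= 2047);
    assert(rs1 < 64u && funct4 < 16u && rd < 64u);
    return (1ull << 35) | (((uint64_t)imm & 0xFFFull) << 23) |
           ((uint64_t)rs1 << 17) | ((uint64_t)funct4 << 13) |
           ((uint64_t)rd << 7) | WIDE_ESCAPE;
}
```

Under these definitions LDPC x40 with imm 128 packs to 0x00080147F, LAPC x12 with imm 7 to 0x00007667F, JALPC x1 with imm 0x19000 to 0x1900080FF, and CALLM x1, 24(x10) to 0x80C1400FF.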

8.9 Compiler Patterns

The compiler should recognise and emit PIC family instructions for:

Source pattern	Emitted (wide section, target in range)
`&global_var`	`LAPC rd, global_var`
`global_long_var`	`LDPC rd, global_long_var`
`global_int_var` (signed)	`LWPC rd, global_int_var`
`global_uint_var`	`LWUPC rd, global_uint_var`
`extern_function()` (direct call, in range)	`JALPC ra, extern_function`
`vtbl->method()` (virtual call)	`CALLM ra, off(vtbl)`
`goto *fp` (computed goto, indirect)	`JMPM zero, 0(fp)`
`tail_call_thunk()` (tail call through pointer)	`JMPM zero, off(rs1)`

Out-of-range targets fall back to the standard AUIPC + ADDI/LD/JALR pair, which the linker leaves un-relaxed.

In wide mode, Xcrisp instructions participate in the standard extension nibble scheme. The nibble bits map to register fields according to which format the instruction uses:

Family	Format	rd ext	rs1 ext	rs2 ext	rs3 ext (op-store)	Spare
Auto-inc loads (§3)	I-type	bit[32]	bit[33]	—	—	bits[35:34]
Auto-inc stores (§4)	S-type	—	bit[33]	bit[34]	—	bits[35], bit[32]
Load-op (§5.2)	R-type	bit[32]	bit[33]	bit[34]	—	bit[35]
Op-store (§5.3)	R-type	—	bit[33]	bit[34]	bit[32] (rd-as-rs3)	bit[35]
Load-op-store (§5.4)	R-type	bit[32] (rd-as-dest-addr)	bit[33]	bit[34]	—	bit[35]
Block memory (§5.5)	R-type	bit[32]	bit[33]	bit[34]	—	bit[35]
Compare-mem-branch (§6)	B-type	—	bit[33]	bit[34]	—	bits[35], bit[32]

(For op-store, bit[32] extends the field interpreted as rs3 rather than a destination. For load-op-store, the same bit[32] extends the destination memory base address; the field is read from the register file but the register itself is not written.)

In narrow mode, all register operands are restricted to x0–x31 / f0–f31 as usual. The nibble bits do not exist — the instruction occupies a standard 32-bit slot in DDR3, and the assembler rejects any operand naming x32–x63.

9. Indexed Addressing (Wide-Mode Only)

9.1 Motivation

Standard RV64GC has no scaled-indexed addressing mode. Every non-sequential array access requires an explicit shift-and-add sequence:

slli   t0, idx, 2          ; idx * 4 (byte offset for 32-bit elements)
add    t0, base, t0
lw     val, 0(t0)

The Zba extension's sh1add, sh2add, sh3add collapse the shift-add pair into one instruction for the common ×2, ×4, ×8 scales, reducing the sequence to:

sh2add t0, idx, base
lw     val, 0(t0)

This is good for one-dimensional arrays, but Zba does not directly cover the load itself, and its scales stop at ×8. Multi-dimensional access and larger element strides remain multi-instruction sequences:

slli   t0, row, 6          ; row * 64 (row stride for int matrix[16][16])
add    t0, base, t0
slli   t1, col, 2
add    t0, t0, t1
lw     val, 0(t0)

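Five instructions for a 2D array access. Zba's sh2add can fold the column shift-and-add into one instruction, but the ×64 row stride is beyond its scales and the load remains separate.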

The Xcrisp indexed addressing family (this section) provides single-instruction load + scale + add for the full set of integer widths, with scales ×1 through ×128. The primary workloads served are:

Hash table probes and sparse array access — non-sequential reads where the index doesn't fit a loop induction pattern.
Jump table dispatch for switch statements, state machines, and interpreters (covered by JALXPC, §9.4, rather than the load family).
2D matrix and struct-array access where stride is a power of two larger than 8.
Generated code (JIT, dynamic linkers, byte-code interpreters) that resolves addresses at runtime through tables.

The family is wide-mode only because it uses encoding bits in the 0x7F escape (§8.2) that do not exist in narrow-mode 32-bit fetches. Narrow-mode code continues to use Zba shift-add sequences. Code that wants the indexed forms places itself in .text.wide.

9.2 X-Type Format

The indexed-load instructions occupy a sub-encoding of the W-type format (§8.3) gated by funct3 = 110. When the W-type decoder sees funct3 = 110, the 19-bit imm field is reinterpreted as the X-type payload:

 35 34   29 28   23 22  20 19  17 16   15  13 12     7 6           0
+---+------+------+------+------+----+------+--------+-------------+
| 0 | rs1  | rs2  |scale | w+s  | r  | 110  |   rd   |  1111111    |
+---+------+------+------+------+----+------+--------+-------------+

Field	Bits	Meaning
format	[35]	`0` = W-type family
`rs1`	[34:29]	6-bit base register (x0–x63)
`rs2`	[28:23]	6-bit index register (x0–x63)
`scale`	[22:20]	3-bit scale selector (×1, ×2, ×4, ×8, ×16, ×32, ×64, ×128)
`w+s`	[19:17]	3-bit width-and-sign selector (see §9.3)
`r`	[16]	reserved, must be zero in v0.1
`funct3`	[15:13]	`110` (X-type dispatch within W-type)
`rd`	[12:7]	6-bit destination register
opcode	[6:0]	`0x7F`

The effective address is computed as:

addr = rs1 + zext64(rs2) × (1 << scale_log2)

where scale_log2 is the value of the scale field (0–7). The index is zero-extended to 64 bits before scaling — array indices are unsigned by convention; negative indices require the user to pre-sign-extend into the index register.

The scale set covers:

×1: byte-precise random access (rare; mostly for symmetry).
×2, ×4, ×8: matches Zba's sh1add/sh2add/sh3add scales and the natural element sizes for halfword/word/dword arrays.
×16: 16-byte structures (a common C struct stride for AoS data).
×32: 32-byte cache-line-aligned records.
×64, ×128: row strides for 16- and 32-wide matrix layouts.
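
The address computation is small enough to state directly in C. In this sketch xtype_addr is a hypothetical name, and the arithmetic is assumed to wrap modulo 2^64 in the way ordinary RV64 address arithmetic does.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: X-type effective address, addr = rs1 + rs2 * 2^scale_log2.
 * base and index are the 64-bit register values of rs1 and rs2. */
static uint64_t xtype_addr(uint64_t base, uint64_t index, unsigned scale_log2)
{
    assert(scale_log2 < 8u);                 /* 3-bit scale field: x1..x128 */
    return base + (index << scale_log2);     /* wraps modulo 2^64 */
}
```

Because the sum wraps, an index register holding a sign-extended negative value (for example -1 at scale ×4) lands 4 bytes below the base, which is why a negative index has to be sign-extended to the full 64 bits beforehand.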

9.3 Indexed Loads

The width-and-sign field selects the access:

w+s	Mnemonic	Width	Sign	Operation
`000`	LBX	byte	signed	`rd = sext8(mem8[addr])`
`001`	LBUX	byte	unsigned	`rd = zext8(mem8[addr])`
`010`	LHX	half	signed	`rd = sext16(mem16[addr])`
`011`	LHUX	half	unsigned	`rd = zext16(mem16[addr])`
`100`	LWX	word	signed	`rd = sext32(mem32[addr])`
`101`	LWUX	word	unsigned	`rd = zext32(mem32[addr])`
`110`	LDX	dword	(n/a)	`rd = mem64[addr]`
`111`	reserved	—	—	illegal-instruction
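
As a cross-check of the field layout in §9.2 against the selector values in this table, the following sketch packs an X-type load into a 36-bit slot. The names encode_xtype and the X_* enumerators are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* w+s selector values from the table above (111 is reserved). */
enum { X_LBX = 0, X_LBUX = 1, X_LHX = 2, X_LHUX = 3,
       X_LWX = 4, X_LWUX = 5, X_LDX = 6 };

/* Sketch: pack an X-type indexed load; result is in the low 36 bits. */
static uint64_t encode_xtype(unsigned rd, unsigned rs1, unsigned rs2,
                             unsigned scale_log2, unsigned ws)
{
    assert(rd < 64u && rs1 < 64u && rs2 < 64u);
    assert(scale_log2 < 8u && ws < 7u);
    return ((uint64_t)rs1 << 29) | ((uint64_t)rs2 << 23) |
           ((uint64_t)scale_log2 << 20) | ((uint64_t)ws << 17) |
           (6ull << 13) |          /* funct3 = 110; reserved bit [16] = 0 */
           ((uint64_t)rd << 7) | 0x7Full;
}
```

For instance, an LWX with rd = x10, rs1 = x10, rs2 = x11 and scale ×4 (scale field 2) packs to 0x145A8C57F.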

Each instruction occupies a single 36-bit wide-mode slot and completes in the same number of cycles as a standard load (2 cycles latency, 1-cycle throughput in steady state on the reference pipeline).

The reserved w+s = 111 slot is held for a possible future 128-bit load (LQX) or a sign-extended halfword-into-32-bit form. No specific allocation in v0.1.

Assembler Syntax

LWX    rd, (rs1, rs2, scale)            ; canonical
LWX    rd, scale(rs1, rs2)              ; 68k-style alternative
LWX    rd, [rs1 + rs2 * scale_factor]   ; verbose form (scale_factor = 1, 2, 4, ..., 128)

The assembler accepts any of these forms and normalises to the canonical representation.

9.4 JALXPC — Indexed PC-Relative Jump

For switch dispatch and other table-indexed jumps, JALXPC reads a target address from a PC-relative jump table and transfers control to it. Encoded as W-type funct3 = 101:

 35 34         29 28          16 15  13 12     7 6           0
+---+------------+--------------+------+--------+-------------+
| 0 |    rs2     | imm[12:0]    | 101  |   rd   |  1111111    |
+---+------------+--------------+------+--------+-------------+

Field	Bits	Meaning
format	[35]	`0` = W-type family
`rs2`	[34:29]	6-bit index register (zero-extended)
`imm[12:0]`	[28:16]	13-bit signed PC-relative offset, scaled ×8 (±32 KiB)
`funct3`	[15:13]	`101`
`rd`	[12:7]	Link register; `x0` for plain jump, otherwise receives PC+4
opcode	[6:0]	`0x7F`

Operation:

table_addr = PC + sext(imm) × 8
target     = mem64[table_addr + zext64(rs2) × 8]
if (rd != x0):
    rd = PC + 4
PC = target

When rd = x0, the assembler emits the mnemonic JMPXPC (no-link form). When rd != x0, the mnemonic is JALXPC and the link is captured.

The 13-bit ×8-scaled immediate gives ±32 KiB of reach from the dispatching instruction to the base of the jump table. Switch tables are typically allocated in .rodata near the function emitting the dispatch, and 32 KiB comfortably covers most cases; for very large code modules with distant tables, the compiler falls back to LAPC + LDX + JALR.
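
The table lookup performed by JALXPC can be modelled as below. This is a sketch: jalxpc_imm and jalxpc_entry_addr are hypothetical names, and the model returns only the address of the table entry that would be read, leaving the memory access and the PC update to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the 13-bit JALXPC immediate held in bits [28:16]. */
static int64_t jalxpc_imm(uint64_t slot)
{
    int64_t imm = (int64_t)((slot >> 16) & 0x1FFFu);
    return (imm & 0x1000) ? imm - 0x2000 : imm;
}

/* Sketch: address of the jump-table entry read by JALXPC / JMPXPC.
 * pc is the address of the instruction; index is the value of rs2. */
static uint64_t jalxpc_entry_addr(uint64_t slot, uint64_t pc, uint64_t index)
{
    assert((slot & 0x7Fu) == 0x7Fu);          /* 0x7F escape marker */
    assert(((slot >> 13) & 7u) == 5u);        /* funct3 = 101 */
    uint64_t table = pc + (uint64_t)(jalxpc_imm(slot) * 8);
    return table + index * 8u;                /* 8-byte entries */
}
```

With imm = 4 the table base sits 32 bytes after the instruction, so index 2 selects the entry at PC + 48.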

9.5 Examples

9.5.1 Single-Dimensional Array Load

int load_elem(int *arr, size_t idx) {
    return arr[idx];
}

Standard RV64GC (Zba):

load_elem:
    sh2add  t0, a1, a0
    lw      a0, 0(t0)
    ret

3 instructions.

FireStorm (wide mode, X-type):

load_elem:
    LWX    a0, (a0, a1, ×4)         ; a0 = sext32(mem32[a0 + a1 * 4])
    ret

2 instructions. One fewer instruction, no temporary register consumed.

9.5.2 2D Matrix Access

int load_cell(int matrix[16][16], size_t row, size_t col) {
    return matrix[row][col];
}

The row stride is 16 ints = 64 bytes; the column scale is ×4.

Standard RV64GC:

load_cell:
    slli    t0, a1, 6                ; row * 64
    add     t0, a0, t0               ; &matrix[row][0]
    sh2add  t0, a2, t0               ; + col * 4
    lw      a0, 0(t0)                ; load
    ret

5 instructions (using Zba sh2add for the inner stride).

FireStorm (wide mode, X-type final column step):

load_cell:
    slli    t0, a1, 6                ; row * 64 (no Zba scale for 64; must use slli)
    add     t0, a0, t0               ; &matrix[row][0]
    LWX     a0, (t0, a2, ×4)         ; a0 = sext32(mem32[t0 + col * 4])
    ret

4 instructions. One fewer instruction than Zba; the X-type fuses the column scale-and-load.

For 2D access X-type captures only part of the win because there is no indexed address-materialise form (no "LAX") — the row stride still requires an explicit slli/add pair to compute the row base. Adding LAX is a v0.2 candidate (see §13). With LAX:

load_cell:
    LAX     t0, (a0, a1, ×64)        ; t0 = a0 + row * 64        — hypothetical v0.2
    LWX     a0, (t0, a2, ×4)         ; load column-scaled         — single instruction
    ret

would bring this to 3 instructions. v0.1 stops at the 4-instruction form.

9.5.3 Switch Statement Dispatch

int dispatch(int op, int arg) {
    switch (op) {
        case 0: return op_add(arg);
        case 1: return op_sub(arg);
        case 2: return op_mul(arg);
        case 3: return op_div(arg);
        default: return 0;
    }
}

Compiler emits a jump table and dispatches via it (assuming bounds-checked op).

Standard RV64GC (PIC, no Xcrisp PIC):

dispatch:
    li      t0, 4
    bgeu    a0, t0, .Ldefault
.La:
    auipc   t0, %pcrel_hi(jump_table)
    addi    t0, t0, %pcrel_lo(.La)
    sh3add  t0, a0, t0
    ld      t0, 0(t0)
    jr      t0
.Ldefault:
    li      a0, 0
    ret

The dispatch path is 7 instructions in total: 2 for the bounds check and 5 for the table jump.

FireStorm with JMPXPC (JALXPC family, §9.4):

dispatch:
    li      t0, 4
    bgeu    a0, t0, .Ldefault
    JMPXPC  a0, jump_table              ; PC = mem64[jump_table + a0 * 8]
.Ldefault:
    li      a0, 0
    ret

Dispatch path: 3 instructions instead of 7, a ~57% reduction for the whole path (the table jump itself shrinks from 5 instructions to 1). In addition, the handlers reached through jump_table can themselves use JALPC for their PC-relative jumps.

This pattern is hot in interpreters (Z-machine, bytecode VMs), parsers, OS syscall dispatch, and any code with a high-fan-out conditional. For the AntOS syscall path, cutting the dispatch sequence from 7 instructions to 3 should translate into lower kernel-entry latency.
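
The C-level shape of this dispatch can be sketched with an explicit function-pointer table. This is only a model of the control flow: the handler bodies are placeholders invented for the example, and the call through the table stands in for the indexed load plus indirect jump that JMPXPC performs in one instruction.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

/* Placeholder handlers standing in for op_add .. op_div. */
int op_add(int arg) { return arg + 1; }
int op_sub(int arg) { return arg - 1; }
int op_mul(int arg) { return arg * 2; }
int op_div(int arg) { return arg / 2; }

/* The jump table: one 8-byte entry per case on RV64. */
static const handler_fn jump_table[] = { op_add, op_sub, op_mul, op_div };
static_assert(sizeof jump_table / sizeof jump_table[0] == 4, "one entry per case");

int dispatch(int op, int arg) {
    const size_t n = sizeof jump_table / sizeof jump_table[0];
    /* Unsigned compare mirrors bgeu: negative op values also take the default path. */
    if ((unsigned)op >= n)
        return 0;
    /* Indexed table load plus indirect transfer: the step the fused form replaces. */
    return jump_table[op](arg);
}
```

The single unsigned comparison covers both out-of-range directions, which is why the assembly needs only one li/bgeu pair ahead of the table jump.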

9.5.4 Hash Table Probe

entry_t *probe(entry_t *table, uint64_t hash, uint64_t mask) {
    return &table[hash & mask];
}

Assuming entry_t is 16 bytes:

Standard RV64GC:

probe:
    and     t0, a1, a2
    slli    t0, t0, 4               ; *16; no Zba ×16
    add     a0, a0, t0
    ret

4 instructions.

FireStorm:

probe:
    and     t0, a1, a2
    ; need an "LAX" address-compute, which doesn't exist in v0.1.
    ; Fallback: compute address explicitly then load if needed.
    slli    t0, t0, 4
    add     a0, a0, t0
    ret

Same 4 instructions — X-type doesn't help if we want the address rather than the loaded value.

If the caller does entry->field right after, the load can be folded:

int probe_load(entry_t *table, uint64_t hash, uint64_t mask) {
    return table[hash & mask].first_field;     /* assume first_field is int at offset 0 */
}

probe_load:
    and     t0, a1, a2
    LWX     a0, (a0, t0, ×16)        ; a0 = sext32(mem32[a0 + t0 * 16])
    ret

3 instructions versus 5 for the standard sequence (and/slli/add/lw/ret). Zba does not help here: it has no sh4add, so the slli is mandatory.

1–2 instructions saved per probe, depending on Zba availability and exact element size.
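
For reference, both probe variants can be written out with the scaling made explicit. This is a sketch under stated assumptions: the `entry_t` layout below is hypothetical (the text only fixes its size at 16 bytes and `first_field` at offset 0), and `int32_t` stands in for `int`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte entry; first_field sits at offset 0 as the text assumes. */
typedef struct {
    int32_t  first_field;
    int32_t  pad;
    uint64_t key;
} entry_t;

static_assert(sizeof(entry_t) == 16, "entry_t must match the x16 scale");

/* Address-only probe: and + slli + add, with no load to fuse. */
entry_t *probe(entry_t *table, uint64_t hash, uint64_t mask) {
    return &table[hash & mask];
}

/* Probe followed by a load: the scale-by-16 and the load are what LWX folds. */
int32_t probe_load(entry_t *table, uint64_t hash, uint64_t mask) {
    uint64_t index = hash & mask;                          /* and  t0, a1, a2    */
    const char *addr = (const char *)table + index * 16;   /* base + t0 * 16     */
    return *(const int32_t *)addr;                         /* sext32(mem32[addr]) */
}
```

Only `probe_load` benefits from the indexed form, because the fused instruction produces the loaded value rather than the computed address.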

9.6 Compiler Patterns

The compiler emits X-type instructions for:

Loop-variant array access (not loop-induction): a[idx] where idx is computed inside the loop body but a is loop-invariant. Loop-induction patterns (a[i++]) prefer LWPI / LDPI for the auto-inc.
Hash table and dictionary probes where the entry size is power-of-two ≤ 128 bytes.
Lookup table access with scaled-index addressing (sine tables, cosine tables, log/exp LUTs).
switch statement dispatch via JALXPC when the jump table is within ±32 KiB.
Function pointer table dispatch via JALXPC when the table is within PC-relative reach of the call site.

The compiler does not emit X-type for:

Sequential array iteration (LWPI / LDPI families do this better — single-cycle, no index register needed).
2D matrix access with non-power-of-two strides (the indexed form doesn't apply; falls back to mul/shift + add).
Narrow-mode code (the entire 0x7F escape is wide-mode only).

9.7 Wide-Mode-Only Restriction

The X-type and JALXPC instructions live in the 0x7F escape (§8.2). Their behaviour is undefined in narrow mode, and the assembler rejects them in narrow sections. Code that targets both modes must provide a narrow-mode fallback using Zba shift-add + standard load sequences.

The wide-register-file access (x0–x63) is available natively because all register fields in the X-type and JALXPC encodings are 6 bits wide.

9.8 Indexed Stores (Deferred to v0.2)

Indexed stores (SWX rs2_index, val, base) face an encoding challenge: the destination is memory, not a register, so the X-type's rd field has no natural use. Repurposing rd as the source value (an "S-X-type") is straightforward to encode, but the instruction then reads three registers (base, index, and value), which requires a third register-file read port and raises the implementation cost.

The use case for indexed stores is scatter writes (e.g., bucket-sort placement, hash table insertion, sparse vector update). These are real patterns but less hot than the gather (load) case in the workloads FireStorm targets. v0.1 omits indexed stores; v0.2 will revisit based on workload data.

For now, scatter writes use the standard Zba shift-add followed by a store:

sh2add  t0, idx, base
sw      val, 0(t0)

10. ABI and Calling-Convention Interaction

Xcrisp instructions do not change the calling convention. They are pure local operations within a function:

Auto-inc forms update rs1 (a register); the caller/callee category of rs1 is unchanged from the standard RV64 ABI.
Load-op, op-store, load-op-store, and block memory write back to standard registers (or memory); same ABI rules apply.
Load-op-store writes only to memory; the rd field names a register that is read (as a base address), not written. ABI roles of pointer registers are unaffected.
Block memory instructions are interruptible and may take many cycles, but they hold no hidden architectural state — only the named registers carry progress. Function-call semantics are unaffected.

A function compiled with +xcrisp is fully ABI-compatible with one compiled without; both observe the standard lp64d calling convention. The only externally visible consequence of +xcrisp is that the emitted code may contain Xcrisp instructions, which require an Xcrisp-aware decoder to execute.

A vanilla RV64GC implementation receiving Xcrisp code will trap on the first Xcrisp instruction (illegal-instruction in the custom opcode space). Mixed-mode deployment requires either runtime feature gating (test mxcrisp CSR before entering an Xcrisp code path) or build-time selection (separate object files).
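
Runtime gating can be sketched as a one-time choice between two builds of the same routine. The code below is an illustration, not a prescribed mechanism: all function names are hypothetical, both routines simply call memcpy as stand-ins, and the mxcrisp value is assumed to have been obtained elsewhere (it is an M-mode CSR, so less-privileged code would need the value handed to it, with 0 reported on cores that lack Xcrisp). Bit positions follow the proposed layout in §12.3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for a routine compiled with +xcrisp (e.g. one built around BMCPY). */
void *copy_xcrisp(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

/* Baseline routine compiled without +xcrisp. */
void *copy_generic(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

#define MXCRISP_PRESENT   (UINT64_C(1) << 0)
#define MXCRISP_HAS_BMOPS (UINT64_C(1) << 8)

/* Choose an implementation from a previously obtained mxcrisp value. */
copy_fn select_copy(uint64_t mxcrisp) {
    uint64_t need = MXCRISP_PRESENT | MXCRISP_HAS_BMOPS;
    return (mxcrisp & need) == need ? copy_xcrisp : copy_generic;
}

static copy_fn active_copy;   /* set once at start-up */

void copy_init(uint64_t mxcrisp) { active_copy = select_copy(mxcrisp); }

void *copy_bytes(void *dst, const void *src, size_t n) {
    assert(active_copy != NULL);   /* copy_init must have run */
    return active_copy(dst, src, n);
}
```

Selecting once at start-up keeps the feature test off the hot path; the build-time alternative avoids even the indirect call.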

11. Compiler and Toolchain Integration

11.1 Target Flags

The +xcrisp target feature, alone or as part of +xfirestorm (= +xwide,+xcrisp), enables Xcrisp emission. With +xcrisp alone, the compiler emits Xcrisp instructions in standard .text (DDR3); with +xfirestorm, functions marked or detected as wide-eligible are placed in .text.wide (SRAM) and use both Xcrisp and the wide register file.

Per-function annotations: __attribute__((target("xcrisp"))) enables Xcrisp emission for a single function regardless of global flags.

11.2 Auto-Vectorization Patterns

The compiler should recognise the following C patterns and emit the corresponding Xcrisp sequences:

Source pattern	Emitted sequence
`*p++ = v` (post-inc store)	`SDPI v, k(p)` for the natural element width
`v = *p++` (post-inc load)	`LDPI v, k(p)`
`*--p = v` (pre-dec store)	`SDPD v, k(p)`
`sum += *p` (accumulator)	`LDADD sum, (p), sum`
`*p &= mask` (mask-in-place)	`MMDAND [p], [p], mask` (load-op-store, true in-place)
`*p ^= v` (xor-in-place)	`MMDXOR [p], [p], v`
`dst[i] = src[i] + bias` (element-wise transform)	`MMWADD [d], [s], bias` then advance pointers
`dst[i] = src[i] << k` (scaling pass)	`MMWSLL [d], [s], k`
`while (*p != term) p++`	`LBPI v, 1(p); BNEM v, (term_reg), loop`
`memcpy(d, s, n)`	`BMCPY d, s, n`
`memset(d, c, n)`	`BMSET d, c, n`

The accumulator and string-scan patterns yield the largest density wins per inner-loop iteration. The load-op-store patterns yield the largest performance wins on memory-bound kernels (audio, image, vector) where the inner loop is dst[i] = f(src[i], constant): each iteration's load → ALU → store collapses into one instruction with no register-file traffic.
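
For concreteness, the C loops below have the shapes listed in the table. The function names are invented for this sketch, and whether a given compiler actually maps each loop to the Xcrisp sequence named in its comment depends on the pattern matcher; the comments only identify the intended candidate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[i] + bias: element-wise transform (MMWADD candidate). */
void add_bias(int32_t *dst, const int32_t *src, size_t n, int32_t bias) {
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + bias;
}

/* *p &= mask over a buffer: in-place update (MMDAND candidate). */
void mask_in_place(uint64_t *p, size_t n, uint64_t mask) {
    assert(n == 0 || p != NULL);
    while (n--)
        *p++ &= mask;
}

/* sum += *p with a walking pointer: accumulator (LDADD candidate). */
int64_t sum64(const int64_t *p, size_t n) {
    int64_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}

/* while (*p != term) p++: sentinel scan (LBPI + BNEM candidate).
 * The caller must guarantee that term occurs in the buffer. */
const char *scan_to(const char *p, char term) {
    assert(p != NULL);
    while (*p != term)
        p++;
    return p;
}
```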

11.3 Inline Assembly Constraints

GCC/Clang inline-assembly constraints for Xcrisp:

=r (register destination), r (register operand): unchanged from base RV64.
Q: memory operand allowed for Xcrisp memory-operand instructions; expanded to (rs1) form without offset.
A new Xc constraint may be added for "this operand must be a register addressable by Xcrisp" — in practice equivalent to r since all Xcrisp register operands are unrestricted within their bank.

Compilers may emit Xcrisp instructions even when not explicitly requested via inline asm, if the C-level operation matches one of the patterns in §11.2 and +xcrisp is enabled.

12. Implementation Guidance

12.1 Pipeline Considerations

The fused instructions are designed to deliver both code density and performance wins. An implementation that captures only the density side leaves significant performance on the table; the notes below identify the microarchitectural paths that turn each fused form into a real cycle saving.

Load-op fusion is most valuable when the implementation forwards the load result directly into the ALU input without writing the intermediate to the register file. A two-cycle issue (load-result available at cycle +1, ALU op at cycle +2) is the typical pattern; the fused instruction occupies one issue slot but two execution slots, which the implementation may schedule freely. Win sources: front-end throughput (one decode instead of two), register-file write port freed for an independent op, one fewer live temporary for the allocator, smaller prefetch-buffer footprint.
Op-store fusion delivers latency and density wins simultaneously. The ALU result should be forwarded directly into the store-buffer entry, bypassing the register file entirely — the temporary never exists architecturally and need not exist physically. On a simple in-order pipeline this saves one cycle versus separate add; sw (no register-file write-then-read on the temp); on a multi-issue pipeline it relieves write-port pressure and frees an issue slot. The allocator also wins one fewer live temporary per pattern, which on a tight basic block may avoid a spill. The store-buffer entry should be allocated at decode time so the ALU result writes directly into it.
Load-op-store fusion is the most aggressive Xcrisp instruction and the one with the largest possible performance win. The instruction does the work of three (lw; add/op; sw) in one instruction-fetch slot, with no architectural temporary, no register-file read/write of the intermediate value, and one fewer architectural register live across the operation. Recommended implementation: a three-stage internal pipeline (load → ALU → store) that allows back-to-back load-op-store instructions to overlap and achieve one-per-cycle throughput in steady state. The store buffer entry should be allocated at decode time so the ALU forwards directly into it (same path as op-store), and the load result should be forwarded into the ALU input without ever being written to a physical register (same path as load-op). A minimal implementation that decodes load-op-store into three sequential micro-ops still gets the density win and avoids the architectural-temporary write; the performance win scales with how much of the pipeline overlap is implemented.
Auto-increment loads write two registers (rd and the updated rs1); auto-increment stores write only rs1, alongside the memory write. For loads, a single-write-port register file must serialise the two-register update across two cycles; a two-write-port file (already required by some standard extensions) takes it in stride. The density win is unconditional; the performance win arrives when the second write port is available, otherwise it's a wash on cycles but still a win on fetch/decode bandwidth.
Compare-mem-branch issues a load and a branch in one instruction. The load-latency critical path is unchanged from a separate load + branch (the branch must still wait for the loaded value), so the raw cycle count for this single pattern is identical. The wins come from the surrounding context: one fewer issue slot consumed, one fewer register-file read (the loaded value is consumed directly by the compare unit, never written back), one fewer architectural register live across the load (helpful for register pressure in tight loops), and the smaller prefetch-buffer footprint. On loops that are fetch-bound or register-pressured — which sentinel scans and table walks often are — this is a real throughput win, not just a code-size one.
Block memory instructions are intended as DMA-like primitives. A minimal implementation iterates byte-by-byte (slow but correct); a serious implementation uses wider transfers when alignment permits and the operands don't overlap unsafely. A well-tuned BMCPY should approach one DRAM-burst per cycle of throughput on aligned operands, dwarfing any libc inline-expanded loop.
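
The difference between the overlapped and the minimal load-op-store implementations can be put into numbers with a back-of-envelope model. The sketch assumes one cycle per stage and no stalls (cache hits, no hazards); neither assumption is specified in this document, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to retire n back-to-back load-op-store instructions on an idealised
 * three-stage (load, ALU, store) pipeline: 3 cycles to fill, then one per cycle. */
uint64_t cycles_overlapped(uint64_t n) {
    assert(n <= UINT64_MAX - 2);
    return n == 0 ? 0 : n + 2;
}

/* Cycles for the minimal implementation: three sequential micro-ops per instruction. */
uint64_t cycles_sequential(uint64_t n) {
    assert(n <= UINT64_MAX / 3);
    return 3 * n;
}
```

Under these assumptions a 1000-element in-place update takes 1002 cycles overlapped versus 3000 sequential, so the ratio approaches 3x for long runs; real memory latency will reduce it.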

12.2 Trap Restart for Block Operations

A block-memory instruction trapped mid-execution must:

Update rd, rs1, rs2 to reflect progress (bytes completed).
Leave mepc pointing at the block instruction (not past it).
Discard any uncommitted internal transfer state.

The trap handler does nothing special — mret returns to the same mepc, the instruction re-executes with the partially-advanced register state, and the byte-loop resumes from where it stopped. A buggy implementation that fails to update the registers atomically with the memory write will cause silent data corruption; this is the single most important invariant for block-memory correctness.
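
The restart invariant can be illustrated with a small software model of a byte-wise block copy. This is a sketch, not the hardware algorithm: `bm_regs`, `bmcpy_step`, and the `budget` parameter (which simulates a trap arriving after a number of bytes) are hypothetical, and which struct field corresponds to rd, rs1, or rs2 is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural state named by a block copy: destination, source, remaining count. */
typedef struct {
    uint8_t       *dst;
    const uint8_t *src;
    size_t         remaining;
} bm_regs;

/* Run the copy for at most `budget` bytes, then "trap".
 * Returns 1 if the instruction completed, 0 if it must be re-executed. */
int bmcpy_step(bm_regs *r, size_t budget) {
    assert(r != NULL);
    while (r->remaining > 0) {
        if (budget == 0)
            return 0;            /* trap: mepc still points at this instruction */
        *r->dst = *r->src;       /* memory write ...                            */
        r->dst++;                /* ... and register updates commit together    */
        r->src++;
        r->remaining--;
        budget--;
    }
    return 1;
}
```

Calling `bmcpy_step` again with the same `bm_regs` plays the role of mret returning to the same mepc: because progress lives only in the named registers, the second call simply continues where the first stopped.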

12.3 CSR Allocation

CSR	Address	Type	Description
`mxcrisp`	`0xFC1` (suggested)	M-mode RO	Xcrisp version & feature bits

Bit layout of mxcrisp (proposed):

Bits	Field	Meaning
`[0]`	PRESENT	1 if Xcrisp implemented
`[7:1]`	VERSION	Xcrisp version (1 = v0.1)
`[8]`	HAS_BMOPS	1 if synchronous block memory ops (BMCPY/BMSET) implemented
`[9]`	HAS_MMOPS	1 if load-op-store ops implemented
`[10]`	HAS_PIC	1 if PC-relative PIC instructions implemented (wide-mode only)
`[11]`	HAS_DMA	1 if asynchronous DMA ops (DMACPY/DMASET) implemented
`[15:12]`	DMA_QUEUE_DEPTH	log₂ of DMA queue depth, 0 if no DMA. (`0010` = 4 entries, `0011` = 8 entries)
`[16]`	HAS_INDEXED	1 if X-type indexed addressing implemented (wide-mode only)
`[63:17]`	reserved	—

A reduced FireStorm variant lacking block-memory hardware may set HAS_BMOPS = 0 while keeping the rest of Xcrisp. Variants lacking load-op-store, PIC, DMA, or indexed-load support clear the corresponding bits similarly. A variant implementing synchronous block ops but no DMA engine sets HAS_BMOPS = 1, HAS_DMA = 0, and DMA_QUEUE_DEPTH = 0.
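
A decoder for the proposed layout might look like the sketch below. The struct and function names are hypothetical, the bit positions are taken from the table above (which is itself marked as proposed), and the sanity check reflects the statement that a variant without a DMA engine reports DMA_QUEUE_DEPTH = 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     present;
    unsigned version;
    bool     has_bmops, has_mmops, has_pic, has_dma, has_indexed;
    unsigned dma_queue_entries;   /* 0 when there is no DMA engine */
} xcrisp_features;

xcrisp_features decode_mxcrisp(uint64_t v) {
    xcrisp_features f;
    unsigned log2_depth = (unsigned)((v >> 12) & 0xF);   /* [15:12] */

    f.present     = (v >> 0) & 1;                        /* [0]     */
    f.version     = (unsigned)((v >> 1) & 0x7F);         /* [7:1]   */
    f.has_bmops   = (v >> 8) & 1;
    f.has_mmops   = (v >> 9) & 1;
    f.has_pic     = (v >> 10) & 1;
    f.has_dma     = (v >> 11) & 1;
    f.has_indexed = (v >> 16) & 1;

    /* A variant without DMA is expected to report depth 0. */
    assert(f.has_dma || log2_depth == 0);
    f.dma_queue_entries = f.has_dma ? (1u << log2_depth) : 0;
    return f;
}
```

For example, the value 0x2903 decodes as present, version 1, synchronous block ops and DMA implemented with a 4-entry queue, and no load-op-store, PIC, or indexed support.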

13. Encoding Summary

13.1 At-a-Glance Opcode Map

Mnemonic	opcode	funct3	funct7 / sub	Format	Mode
LBPI..LWUPI	`0x0B`	000–110	(none)	I-type	both
LBPD..LWUPD	`0x0B`	111	imm[11:9]=width	I-type sub	both
SBPI..SDPD	`0x2B`	000–111	(none)	S-type	both
LWADD..LWUSLTU	`0x5B`	000	width+aluop	R-type	both
LDADD..LDSLTU	`0x5B`	000	width+aluop	R-type	both
ADDSW..SRASD	`0x5B`	001	width+aluop	R-type (rd=rs3)	both
BMCPY, BMSET	`0x5B`	010	0000000, 0000001	R-type	both
DMACPY, DMASET	`0x5B`	010	0000010, 0000011	R-type	both
BSRCH.B/H/W/D	`0x5B`	010	0010xxx (width selector)	R-type	both
BSCAN.B/H/W/D	`0x5B`	010	0011xxx (width selector)	R-type	both
BSHIFTR/L.B/H/W/D	`0x5B`	010	0100xxd (width + direction)	R-type	both
MMWADD..MMWUSLTU	`0x5B`	011	width+aluop	R-type (rd=dest-addr)	both
MMDADD..MMDSLTU	`0x5B`	011	width+aluop	R-type (rd=dest-addr)	both
BEQM..BGEUM	`0x7B`	000–111	(none)	B-type	both
LDPC, LWPC, LWUPC, LAPC, JALPC	`0x7F`	000–100	bit[35]=0	W-type	wide only
JALXPC, JMPXPC	`0x7F`	101	bit[35]=0	W-type	wide only
LBX..LDX	`0x7F`	110	bit[35]=0, scale+w+s	X-type (sub of W)	wide only
CALLM, JMPM	`0x7F`	(funct4 at [16:13])	bit[35]=1	WI-type	wide only
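
The first level of the map can be expressed as a small classifier over the opcode and funct3 fields. This is a sketch with hypothetical names; it assumes funct3 sits at bits [14:12] as in the standard formats the families reuse, inspects only the low 32 bits, and treats 0x7F purely as a marker, since the wide-mode W/WI-type decode is not modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    XC_NOT_XCRISP,
    XC_AUTOINC_LOAD,     /* 0x0B */
    XC_AUTOINC_STORE,    /* 0x2B */
    XC_LOAD_OP,          /* 0x5B, funct3 = 000 */
    XC_OP_STORE,         /* 0x5B, funct3 = 001 */
    XC_BLOCK_OR_BTREE,   /* 0x5B, funct3 = 010; funct7 selects the operation */
    XC_LOAD_OP_STORE,    /* 0x5B, funct3 = 011 */
    XC_RESERVED,         /* 0x5B, funct3 = 100-111 */
    XC_CMP_MEM_BRANCH,   /* 0x7B */
    XC_WIDE_ESCAPE       /* 0x7F; W/WI-type decode not modelled here */
} xc_family;

xc_family xc_classify(uint32_t insn) {
    uint32_t opcode = insn & 0x7F;          /* bits [6:0]   */
    uint32_t funct3 = (insn >> 12) & 0x7;   /* bits [14:12] */
    switch (opcode) {
    case 0x0B: return XC_AUTOINC_LOAD;
    case 0x2B: return XC_AUTOINC_STORE;
    case 0x7B: return XC_CMP_MEM_BRANCH;
    case 0x7F: return XC_WIDE_ESCAPE;
    case 0x5B:
        switch (funct3) {
        case 0:  return XC_LOAD_OP;
        case 1:  return XC_OP_STORE;
        case 2:  return XC_BLOCK_OR_BTREE;
        case 3:  return XC_LOAD_OP_STORE;
        default: return XC_RESERVED;
        }
    default:
        return XC_NOT_XCRISP;
    }
}

/* Small self-check using hand-built encodings (only opcode and funct3 are set). */
void xc_classify_examples(void) {
    assert(xc_classify(0x0000305B) == XC_LOAD_OP_STORE);
    assert(xc_classify(0x00000033) == XC_NOT_XCRISP);   /* standard OP opcode */
}
```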

13.2 Reserved Spaces

The following encoding spaces are reserved for future Xcrisp expansion and must not be used by any other FireStorm extension or vendor implementation:

0x0B with funct3 = 111 and imm[11:9] = 111
0x2B: no reserved space in v0.1 (all funct3 slots in use)
0x5B funct3 = 100–111 (reserved sub-families)
0x5B funct3 = 000/001/011 with width = 11 or aluop[4:0] ≥ 01010
0x5B funct3 = 010 with funct7 ≥ 0101000 (B-tree expansion space — BMCMP at 0000100 reserved for v0.2; BSRCH/BSCAN/BSHIFT occupy 0010xxx–0100xxx)
0x7B: no reserved space in v0.1 (all funct3 slots in use; dword ordered comparisons require new opcode in v0.2)
0x7F W-type with funct3 = 111 (reserved PC-relative variant)
0x7F W-type funct3 = 110 (X-type) with w+s = 111 (reserved indexed-load width/sign)
0x7F W-type funct3 = 110 (X-type) with bit [16] = 1 (reserved sub-encoding bit)
0x7F WI-type with funct4 = 0010–1111 (reserved register-indirect variants)

The 0x7F opcode itself (the wide-mode escape mechanism, §8.2) is the general-purpose lane for any wide-mode-only extension. Future FireStorm extensions may allocate sub-encodings within the 29-bit payload by coordinating with the format bit at [35] and choosing dispatch sub-fields that don't conflict with allocated W-type funct3 or WI-type funct4 values.

14. Open Items

mxcrisp CSR address. Suggested 0xFC1; needs to align with the wide-dirty CSR allocation (open item §16 of the parent doc) and any other FireStorm-specific CSRs.
Dword ordered compare-mem-branch. Requires either an additional opcode or a width-modifier bit somewhere; deferred to v0.2.
16-bit memory width for load-op and op-store. The width = 11 slot is reserved; semantics TBD.
BMCMP semantics. Result encoding (remaining-count, flag, or both) TBD.
Overlapping BMCPY. Implementation-defined in v0.1; may be tightened to memmove-compatible in v0.2 if cost is acceptable.
Alignment hints for block memory. Future width[1:0] use in BM-family funct7.
Sub-word load-op forms. Byte and halfword load-op (e.g. LBADD, LHUADD) are not in v0.1; if added later they'd consume width = 10/11 in load-op funct7 with redefined semantics.
PIC and Zcmp interaction. Zcmp's cm.popret is a tail-return; combining with JMPM for full tail-call sequences is an optimisation worth exploring.
PIC linker relaxation pass. Specification of the relaxation algorithm and the new relocation types needed to mark AUIPC + LD/JALR/ADDI pairs as relaxable.
Indexed stores (SBX..SDX). v0.1 omits indexed stores; the S-X-type encoding repurposing rd as the stored-value source is sketched in §9.8 but not finalised. v0.2 to revisit based on workload data.
LAX (indexed address materialise). An "X-type but result is the computed address, not the loaded value" form would help 2D access and address-of-array-element patterns. Encoding space exists (X-type w+s = 111 is reserved). Deferred to v0.2.
Index sign-extension option. Current X-type zero-extends rs2 (the index). Some patterns (e.g., signed offsets from a midpoint) want sign-extension. A scale-field bit could select, at the cost of halving the scale set. Open for v0.2.
JALXPC reach extension. ±32 KiB is sufficient for most switch tables but not for very large generated dispatch tables (e.g., character-class tables in regex engines). A wider-immediate variant would consume a second funct3 slot in W-type. Open for v0.2.
DMA fault reporting. The mechanism for surfacing DMA-engine memory faults (unmapped page, write to RO, bus error) is sketched in §5.5.2 but not finalised. Candidate design: a status CSR mxdma_status holding the faulting DMA's tag-register name, fault address, and fault cause; M-mode trap with cause = DMA_FAULT. Final encoding TBD.
DMA priority. Should the DMA engine support priority levels so audio-critical transfers preempt bulk ones? A 2-bit priority field could be carried in the spare funct7 bits of DMACPY/DMASET. Deferred to v0.2 pending workload data.
DMA-to-MMIO ordering. §5.5.2 suggests MMIO DMAs are strictly ordered relative to CPU MMIO stores; the exact memory-ordering model (RVWMO interaction, fences required) needs formalisation.
Multiple-consumer DMA tag table. Currently one tag-table entry per outstanding DMA. If two DMAs name the same rs2, the second stalls until the first completes. An alternative (allocate fresh tag entry on each issue, with a "scoreboard" mapping register names to multiple entries) is more flexible but costlier. v0.1 takes the simpler one-tag-per-register approach.

15. Glossary

Term	Meaning
Auto-increment	Load or store that updates its base register as a side effect (post-inc or pre-dec).
Load-op fusion	Combined load and ALU operation: `rd = mem[rs1] OP rs2`.
Op-store fusion	Combined ALU operation and store: `mem[rs1] = rs2 OP rs3`. No register destination.
Compare-mem-branch	Branch on the result of comparing a register with a memory value.
Block memory	Variable-length memory transfer instruction (BMCPY, BMSET); interruptible with register-held restart state.
rs3 (in op-store)	The R-type `rd` field [11:7] repurposed as a third source register; no architectural register is written.
mxcrisp	M-mode CSR exposing Xcrisp implementation and feature bits.

End of document. See also: FireStorm CPU ISA — base architecture and wide-mode encoding.

