FireStorm Xcrisp Extension — Instruction Encodings

Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
See also: FireStorm Performance Examples for worked comparisons


1. Overview

The Xcrisp extension is FireStorm's set of CRISP-influenced custom instructions, designed to raise the performance and code density of compiler-generated C code without breaking the RV64GC baseline. It is available in both narrow and wide modes (§3 of the parent doc): vanilla DDR3-resident code may use Xcrisp instructions exactly as SRAM-resident wide-mode code may. In wide mode, Xcrisp register operands extend via the standard extension nibble scheme.

The extension contains four instruction families:

Family                   Purpose                          Opcode           Format
Auto-increment loads     *p++ / *--p read patterns        0x0B (custom-0)  I-type
Auto-increment stores    *p++ / *--p write patterns       0x2B (custom-1)  S-type
Memory-fused arithmetic  load-op, op-store, block-memory  0x5B (custom-2)  R-type
Compare-mem-branch       sentinel scans, table walks      0x7B (custom-3)  B-type

The opcode-to-family mapping is deliberately aligned with the standard RISC-V opcode bit pattern at [6:5]:

[6:5]  Standard                          Xcrisp
00     LOAD (0x03)                       auto-inc loads (0x0B)
01     STORE (0x23) / OP (0x33)          auto-inc stores (0x2B)
10     MADD..NMADD / OP-FP (0x43–0x53)   memory-fused R-type (0x5B)
11     BRANCH (0x63)                     compare-mem-branch (0x7B)

This alignment lets a FireStorm decoder reuse the standard rs1/rs2/rd/imm extract logic for Xcrisp encodings; only the funct3/funct7 decode tables expand.
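The alignment can be checked mechanically. The following is a minimal sketch, not part of the specification; the helper name opcode_group is hypothetical and exists only for this illustration. It extracts bits [6:5] from the major opcodes listed above and groups them.

```python
# Major opcodes taken from the tables above.
XCRISP_OPCODES = {
    "auto-inc loads": 0x0B,
    "auto-inc stores": 0x2B,
    "memory-fused": 0x5B,
    "compare-mem-branch": 0x7B,
}
STANDARD_OPCODES = {"LOAD": 0x03, "STORE": 0x23, "OP": 0x33, "BRANCH": 0x63}


def opcode_group(opcode: int) -> int:
    """Return bits [6:5] of a 7-bit major opcode (hypothetical helper)."""
    return (opcode >> 5) & 0b11


for name, op in {**STANDARD_OPCODES, **XCRISP_OPCODES}.items():
    print(f"{name:20s} 0x{op:02X}  [6:5] = {opcode_group(op):02b}")
```

Each Xcrisp family lands in the same [6:5] group as the standard opcode it is paired with in the table.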


2. Feature Detection

The presence of Xcrisp is indicated by a non-zero value in the implementation-defined CSR mxcrisp (machine custom read-only, address 0xFC1, suggested). Bit [0] of mxcrisp is the Xcrisp version (1 for v0.1). A reduced FireStorm variant without Xcrisp returns zero; an Xcrisp instruction issued on such a variant traps as illegal-instruction.

Compilers normally rely on the +xcrisp target-feature flag rather than runtime detection. Detection is reserved for runtime libraries that may be deployed on multiple FireStorm variants.


3. Auto-Increment Loads (custom-0, opcode 0x0B)

3.1 Encoding

Standard I-type layout:

 31                  20 19    15 14   12 11     7 6           0
+----------------------+--------+-------+--------+-------------+
|       imm[11:0]      |  rs1   | funct3|   rd   |  0001011    |
+----------------------+--------+-------+--------+-------------+

Field      Bits     Meaning
imm[11:0]  [31:20]  Signed 12-bit increment/decrement amount (post-inc forms) or width-extended pre-dec sub-encoding (see §3.3)
rs1        [19:15]  Base address register; also updated by the instruction
funct3     [14:12]  Operation: width + direction (see §3.2)
rd         [11:7]   Load destination register
opcode     [6:0]    0x0B (custom-0)

The instruction writes to two architectural registers: rd (the loaded value) and rs1 (the updated base). If rs1 == rd, the load value wins (the increment to rs1 is suppressed) — this matches the standard RISC-V convention for instructions that would otherwise have ambiguous semantics.

3.2 Post-Increment Loads (funct3 000–110)

For funct3 ≠ 111, the immediate is a 12-bit signed offset/increment, and the operation is:

rd  = sext_or_zext_W(mem[rs1])
rs1 = rs1 + sext(imm)

funct3  Mnemonic  Width  Sign      Operation
000     LBPI      byte   signed    rd = sext8(mem8[rs1]); rs1 += sext(imm)
001     LHPI      half   signed    rd = sext16(mem16[rs1]); rs1 += sext(imm)
010     LWPI      word   signed    rd = sext32(mem32[rs1]); rs1 += sext(imm)
011     LDPI      dword  n/a       rd = mem64[rs1]; rs1 += sext(imm)
100     LBUPI     byte   unsigned  rd = zext8(mem8[rs1]); rs1 += sext(imm)
101     LHUPI     half   unsigned  rd = zext16(mem16[rs1]); rs1 += sext(imm)
110     LWUPI     word   unsigned  rd = zext32(mem32[rs1]); rs1 += sext(imm)

Funct3 assignments match standard RV64I load encodings: bit [14] = unsigned, bits [13:12] = width (00=byte, 01=half, 10=word, 11=dword). This means an existing RV64I load decoder can route through a single funct3 path with only the opcode and "auto-inc" flag changing.
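The post-increment semantics, including the rs1 == rd rule from §3.1, can be modelled in a few lines. This is an illustrative sketch only: the function post_inc_load, the list-based register file, the bytearray memory, and little-endian byte order are all assumptions made for the example, and x0 handling is ignored.

```python
MASK64 = (1 << 64) - 1
WIDTH_BYTES = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}  # funct3 bits [13:12]


def sext(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def post_inc_load(regs, mem, rd, rs1, funct3, imm12):
    """Model of the LxPI forms: load from mem[rs1], then rs1 += sext(imm)."""
    if not 0 <= funct3 <= 0b110:
        raise ValueError("funct3 = 111 selects the pre-decrement forms")
    unsigned = bool(funct3 & 0b100)          # bit [14]
    nbytes = WIDTH_BYTES[funct3 & 0b011]     # bits [13:12]
    addr = regs[rs1]
    raw = int.from_bytes(mem[addr:addr + nbytes], "little")  # assumed byte order
    value = raw if unsigned else sext(raw, 8 * nbytes)
    if rd != rs1:                            # if rs1 == rd the loaded value wins
        regs[rs1] = (addr + sext(imm12, 12)) & MASK64
    regs[rd] = value & MASK64


regs = [0] * 32
mem = bytearray(16)
mem[0:4] = (0xFFFFFFFE).to_bytes(4, "little")
post_inc_load(regs, mem, rd=10, rs1=11, funct3=0b010, imm12=4)   # LWPI x10, 4(x11)
print(hex(regs[10]), regs[11])
```

Calling the model with funct3 = 110 (LWUPI) on the same memory yields the zero-extended value instead, which is the only difference between the signed and unsigned rows of the table.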

3.3 Pre-Decrement Loads (funct3 = 111)

For funct3 = 111, the 12-bit immediate field is repurposed as {width[2:0], offset[8:0]}:

 31    29 28              20
+--------+-------------------+
| width  |    offset[8:0]    |
+--------+-------------------+

  • width[2:0] (imm[11:9]): one of seven width/sign variants, matching the post-inc funct3 numbering.
  • offset[8:0] (imm[8:0]): signed 9-bit decrement amount, range −256..+255.

width  Mnemonic  Operation
000    LBPD      rs1 -= sext(offset); rd = sext8(mem8[rs1])
001    LHPD      rs1 -= sext(offset); rd = sext16(mem16[rs1])
010    LWPD      rs1 -= sext(offset); rd = sext32(mem32[rs1])
011    LDPD      rs1 -= sext(offset); rd = mem64[rs1]
100    LBUPD     rs1 -= sext(offset); rd = zext8(mem8[rs1])
101    LHUPD     rs1 -= sext(offset); rd = zext16(mem16[rs1])
110    LWUPD     rs1 -= sext(offset); rd = zext32(mem32[rs1])
111    reserved  illegal-instruction

A negative offset is permitted but produces unusual semantics (rs1 is incremented before the load); compilers should not emit this and disassemblers may flag it.
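A disassembler has to split the repurposed immediate back into its two sub-fields. The sketch below shows one way to do that under the layout given above; the function name decode_predec_imm is hypothetical, and raising ValueError for the reserved width is a modelling choice standing in for the illegal-instruction trap.

```python
PREDEC_MNEMONICS = ["LBPD", "LHPD", "LWPD", "LDPD", "LBUPD", "LHUPD", "LWUPD"]


def decode_predec_imm(imm12: int):
    """Split a funct3 = 111 immediate into (mnemonic, signed decrement)."""
    width = (imm12 >> 9) & 0b111      # imm[11:9]
    offset = imm12 & 0x1FF            # imm[8:0]
    if offset & 0x100:                # sign-extend the 9-bit field
        offset -= 0x200
    if width == 0b111:
        raise ValueError("width = 111 is reserved (illegal-instruction)")
    return PREDEC_MNEMONICS[width], offset


mnemonic, dec = decode_predec_imm(0x608)
print(mnemonic, dec)
if dec < 0:
    print("note: negative decrement, rs1 is incremented before the load")
```

A disassembler that wants to flag the unusual negative-offset case can test the sign of the returned decrement, as the last two lines do.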

3.4 Examples

LWPI x10, 4(x11) — read 32-bit word at [x11], sign-extend into x10, advance x11 by 4:

imm  = 0x004
rs1  = 11   (0b01011)
funct3 = 010
rd   = 10   (0b01010)
opcode = 0x0B
Encoding: 0x004_5A50B = 0000 0000 0100 01011 010 01010 0001011

LDPD x14, 8(x15) — decrement x15 by 8, then read 64-bit dword into x14:

funct3 = 111 (pre-dec marker)
width  = 011 (dword)
offset = 0x008
imm[11:0] = {011, 000001000} = 0x608
rs1  = 15
rd   = 14
Encoding: 0x608_7F70B = 0110 0000 1000 01111 111 01110 0001011
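The two worked encodings can be reproduced with a small field packer. This is a sketch under the layout of §3.1 to §3.3, restricted to narrow-mode register numbers; the function names encode_itype, encode_post_inc_load, encode_pre_dec_load, and show are hypothetical helpers, not part of any FireStorm toolchain.

```python
XCRISP_LOAD_OPCODE = 0x0B  # custom-0


def encode_itype(imm12, rs1, funct3, rd, opcode=XCRISP_LOAD_OPCODE):
    """Pack I-type fields into a 32-bit instruction word."""
    for name, reg in (("rs1", rs1), ("rd", rd)):
        if not 0 <= reg <= 31:
            raise ValueError(f"{name} out of narrow-mode range: {reg}")
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_post_inc_load(funct3, rd, rs1, increment):
    """Post-increment forms: funct3 000..110, signed 12-bit increment."""
    if funct3 not in range(0b111) or not -2048 <= increment <= 2047:
        raise ValueError("bad funct3 or increment")
    return encode_itype(increment, rs1, funct3, rd)


def encode_pre_dec_load(width, rd, rs1, decrement):
    """Pre-decrement forms: funct3 = 111, imm = {width[2:0], offset[8:0]}."""
    if width not in range(0b111) or not -256 <= decrement <= 255:
        raise ValueError("bad width or decrement")
    return encode_itype((width << 9) | (decrement & 0x1FF), rs1, 0b111, rd)


def show(word):
    """Format a word as imm | rs1 | funct3 | rd | opcode bit groups."""
    return (f"{word >> 20:012b} {(word >> 15) & 31:05b} {(word >> 12) & 7:03b} "
            f"{(word >> 7) & 31:05b} {word & 0x7F:07b}")


print(show(encode_post_inc_load(0b010, rd=10, rs1=11, increment=4)))  # LWPI x10, 4(x11)
print(show(encode_pre_dec_load(0b011, rd=14, rs1=15, decrement=8)))   # LDPD x14, 8(x15)
```

The printed bit groups can be compared field by field against the binary expansions in the two examples.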

4. Auto-Increment Stores (custom-1, opcode 0x2B)

4.1 Encoding

Standard S-type layout:

 31         25 24    20 19    15 14   12 11        7 6           0
+-------------+--------+--------+-------+-----------+-------------+
| imm[11:5]   |  rs2   |  rs1   | funct3| imm[4:0]  |  0101011    |
+-------------+--------+--------+-------+-----------+-------------+

Field      Bits              Meaning
imm[11:0]  [31:25] ‖ [11:7]  Signed 12-bit increment/decrement amount
rs2        [24:20]           Source register (value to store)
rs1        [19:15]           Base address register; also updated by the instruction
funct3     [14:12]           Width + direction (see §4.2)
opcode     [6:0]             0x2B (custom-1)

The instruction writes to one register (rs1 updated) and one memory location.

4.2 Funct3 Encoding

The store-side encoding partitions funct3 as {direction, width[1:0]}: bit [14] selects post-increment (0) or pre-decrement (1), and bits [13:12] select the width:

funct3  Mnemonic  Width  Direction  Operation
000     SBPI      byte   post-inc   mem8[rs1] = rs2[7:0]; rs1 += sext(imm)
001     SHPI      half   post-inc   mem16[rs1] = rs2[15:0]; rs1 += sext(imm)
010     SWPI      word   post-inc   mem32[rs1] = rs2[31:0]; rs1 += sext(imm)
011     SDPI      dword  post-inc   mem64[rs1] = rs2; rs1 += sext(imm)
100     SBPD      byte   pre-dec    rs1 -= sext(imm); mem8[rs1] = rs2[7:0]
101     SHPD      half   pre-dec    rs1 -= sext(imm); mem16[rs1] = rs2[15:0]
110     SWPD      word   pre-dec    rs1 -= sext(imm); mem32[rs1] = rs2[31:0]
111     SDPD      dword  pre-dec    rs1 -= sext(imm); mem64[rs1] = rs2

Standard stores have no unsigned variants (there is no sign-extension on a store), so the funct3 space is cleanly halved between post-inc and pre-dec — no sub-encoding needed.

4.3 Examples

SDPI x12, 8(x13) — store x12 to mem64[x13], then advance x13 by 8:

imm     = 0x008  → imm[11:5]=0x00, imm[4:0]=0x08
rs2     = 12
rs1     = 13
funct3  = 011
Encoding: 0x00C6B42B
        = 0000000 01100 01101 011 01000 0101011

SWPD x4, 4(x2) — pre-decrement stack pointer x2 by 4, then store low 32 bits of x4:

imm     = 0x004
rs2     = 4
rs1     = 2  (sp)
funct3  = 110
Encoding: 0000000 00100 00010 110 00100 0101011

A common stack-push idiom: SWPD rs2, 4(sp) (sp -= 4, then write).
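The store examples follow the same pattern with the S-type split immediate. The sketch below is illustrative only; encode_stype, store_funct3, and show are hypothetical names, and register numbers are limited to the narrow-mode range.

```python
XCRISP_STORE_OPCODE = 0x2B  # custom-1


def store_funct3(pre_dec: bool, width: int) -> int:
    """funct3 for the store family: direction in bit 2, width in bits [1:0]."""
    if width not in range(4):
        raise ValueError("width must be 0 (byte) .. 3 (dword)")
    return (int(pre_dec) << 2) | width


def encode_stype(imm12, rs2, rs1, funct3, opcode=XCRISP_STORE_OPCODE):
    """Pack S-type fields; the immediate is split across [31:25] and [11:7]."""
    if not -2048 <= imm12 <= 2047:
        raise ValueError("immediate out of signed 12-bit range")
    for name, reg in (("rs1", rs1), ("rs2", rs2)):
        if not 0 <= reg <= 31:
            raise ValueError(f"{name} out of narrow-mode range: {reg}")
    imm = imm12 & 0xFFF
    return ((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) \
        | ((imm & 0x1F) << 7) | opcode


def show(word):
    """Format a word as imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode."""
    return (f"{word >> 25:07b} {(word >> 20) & 31:05b} {(word >> 15) & 31:05b} "
            f"{(word >> 12) & 7:03b} {(word >> 7) & 31:05b} {word & 0x7F:07b}")


sdpi = encode_stype(8, rs2=12, rs1=13, funct3=store_funct3(False, 0b11))  # SDPI x12, 8(x13)
swpd = encode_stype(4, rs2=4, rs1=2, funct3=store_funct3(True, 0b10))     # SWPD x4, 4(x2)
print(show(sdpi))
print(show(swpd))
```

The two printed lines correspond to the binary expansions given for SDPI and SWPD in §4.3.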


5. Memory-Fused Arithmetic (custom-2, opcode 0x5B)

This opcode hosts four sub-families dispatched by funct3:

funct3   Sub-family                                            Section
000      Load-op fusion (rd = mem[rs1] OP rs2)                 §5.2
001      Op-store fusion (mem[rs1] = rs2 OP rs3)               §5.3
010      Block memory operations                               §5.5
011      Load-op-store fusion (mem[rd] = mem[rs1] OP rs2)      §5.4
100–111  reserved

(The funct3 numbering does not follow the section order: load-op-store at funct3 011 is presented in §5.4 because it is the logical successor of op-store, while block memory at funct3 010 is conceptually distinct and is presented in §5.5.)

All four use the R-type encoding:

 31        25 24    20 19    15 14   12 11     7 6           0
+-----------+--------+--------+-------+--------+-------------+
|  funct7   |  rs2   |  rs1   | funct3|   rd   |  1011011    |
+-----------+--------+--------+-------+--------+-------------+

The interpretation of rs2, rs1, and rd varies by sub-family. funct7 selects width and ALU operation within each sub-family.

5.1 Common funct7 Layout (Load-Op, Op-Store, Load-Op-Store)

For the arithmetic sub-families (funct3 000, 001, and 011), funct7 is structured as {width[1:0], aluop[4:0]}:

 31    30 29              25
+--------+------------------+
| width  |   aluop[4:0]     |
+--------+------------------+

width[1:0]  Memory width / sign
00          32-bit word, sign-extended on load (load-op) / low 32 bits stored (op-store)
01          64-bit dword
10          32-bit word, zero-extended on load (load-op only; same as 00 for op-store)
11          reserved (future 16-bit support)

aluop[4:0]   Operation
00000        ADD
00001        SUB
00010        AND
00011        OR
00100        XOR
00101        SLL (shift left logical)
00110        SRL (shift right logical)
00111        SRA (shift right arithmetic)
01000        SLT (set less than, signed)
01001        SLTU (set less than, unsigned)
01010–11111  reserved
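The funct7 split is simple enough to express as a pair of helpers. This is a minimal sketch assuming the {width[1:0], aluop[4:0]} layout above; pack_funct7 and unpack_funct7 are hypothetical names, and ValueError stands in for the reserved encodings.

```python
ALUOPS = ["ADD", "SUB", "AND", "OR", "XOR", "SLL", "SRL", "SRA", "SLT", "SLTU"]
WIDTHS = {0b00: "word (signed load)", 0b01: "dword", 0b10: "word (unsigned load)"}


def pack_funct7(width: int, aluop: int) -> int:
    """Combine width[1:0] and aluop[4:0] into the 7-bit funct7 field."""
    if width not in WIDTHS or not 0 <= aluop < len(ALUOPS):
        raise ValueError("reserved width or aluop")
    return (width << 5) | aluop


def unpack_funct7(funct7: int):
    """Split funct7 back into (width description, ALU operation name)."""
    width, aluop = (funct7 >> 5) & 0b11, funct7 & 0x1F
    if width not in WIDTHS or aluop >= len(ALUOPS):
        raise ValueError("reserved width or aluop")
    return WIDTHS[width], ALUOPS[aluop]


print(f"{pack_funct7(0b01, 0b00011):07b}")   # dword OR
print(unpack_funct7(0b0100000))              # dword ADD
```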

5.2 Load-Op Fusion (funct3 = 000)

Operation: rd = (mem[rs1] of selected width) ALUOP rs2.

The memory access uses rs1 directly as the base address; there is no immediate offset (use a separate ADDI first if non-zero offset needed, or compose with the auto-inc load instructions of §3).

Mnemonic width aluop Operation
LWADD 00 00000 rd = sext32(mem32[rs1]) + rs2
LWSUB 00 00001 rd = sext32(mem32[rs1]) - rs2
LWAND 00 00010 rd = sext32(mem32[rs1]) & rs2
LWOR 00 00011 rd = sext32(mem32[rs1]) | rs2
LWXOR 00 00100 rd = sext32(mem32[rs1]) ^ rs2
LWSLL 00 00101 rd = sext32(mem32[rs1]) << (rs2 & 63)
LWSRL 00 00110 rd = sext32(mem32[rs1]) >>L (rs2 & 63)
LWSRA 00 00111 rd = sext32(mem32[rs1]) >>A (rs2 & 63)
LWSLT 00 01000 rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0 signed
LWSLTU 00 01001 rd = (sext32(mem32[rs1]) < rs2) ? 1 : 0 unsigned
LDADD 01 00000 rd = mem64[rs1] + rs2
LDSUB 01 00001 rd = mem64[rs1] - rs2
LDAND 01 00010 rd = mem64[rs1] & rs2
LDOR 01 00011 rd = mem64[rs1] | rs2
LDXOR 01 00100 rd = mem64[rs1] ^ rs2
LDSLL 01 00101 rd = mem64[rs1] << (rs2 & 63)
LDSRL 01 00110 rd = mem64[rs1] >>L (rs2 & 63)
LDSRA 01 00111 rd = mem64[rs1] >>A (rs2 & 63)
LDSLT 01 01000 signed compare
LDSLTU 01 01001 unsigned compare
LWUADD 10 00000 rd = zext32(mem32[rs1]) + rs2
... 10 00001–01001 unsigned-word variants of the above

(The full unsigned-word table mirrors the signed-word table line-for-line.)

5.3 Op-Store Fusion (funct3 = 001)

Operation: mem[rs1] = rs2 ALUOP rs3. The rd field of the R-type encoding is repurposed as rs3 (a third source register); no architectural register is written by this class.

Mnemonic width aluop Operation
ADDSW 00 00000 mem32[rs1] = (rs2 + rs3)[31:0]
SUBSW 00 00001 mem32[rs1] = (rs2 - rs3)[31:0]
ANDSW 00 00010 mem32[rs1] = (rs2 & rs3)[31:0]
ORSW 00 00011 mem32[rs1] = (rs2 | rs3)[31:0]
XORSW 00 00100 mem32[rs1] = (rs2 ^ rs3)[31:0]
SLLSW 00 00101 mem32[rs1] = (rs2 << (rs3 & 31))[31:0]
SRLSW 00 00110 mem32[rs1] = (rs2 >>L (rs3 & 31))[31:0]
SRASW 00 00111 mem32[rs1] = (rs2 >>A (rs3 & 31))[31:0]
ADDSD 01 00000 mem64[rs1] = rs2 + rs3
SUBSD 01 00001 mem64[rs1] = rs2 - rs3
ANDSD 01 00010 mem64[rs1] = rs2 & rs3
ORSD 01 00011 mem64[rs1] = rs2 | rs3
XORSD 01 00100 mem64[rs1] = rs2 ^ rs3
SLLSD 01 00101 mem64[rs1] = rs2 << (rs3 & 63)
SRLSD 01 00110 mem64[rs1] = rs2 >>L (rs3 & 63)
SRASD 01 00111 mem64[rs1] = rs2 >>A (rs3 & 63)

SLT/SLTU forms are omitted for op-store (storing a 0/1 flag to memory is unusual; existing slt + sw is preferable for clarity).

Assembler convention. The op-store mnemonics are written with the memory destination in brackets to clarify that the third operand is read, not written:

ADDSW [x5], x6, x7    ; mem32[x5] = x6 + x7

Disassemblers should follow the same convention.

5.4 Load-Op-Store Fusion (funct3 = 011)

Operation: mem[rd] = (mem[rs1] of selected width) ALUOP rs2. Two memory operands plus one register operand. No architectural register is written.

This is the closest a 32-bit RISC-V slot can get to true memory-to-memory operation in the CRISP/Hobbit tradition: one fetch, one decode, both the load result and the ALU result flow directly through the pipeline without ever entering the register file, then the result is stored. The fused encoding replaces a three-instruction sequence (lw t0, (rs1); add/op t0, t0, rs2; sw t0, (rd)) without ever materialising the temporary.

The encoding repurposes the R-type rd field as the destination memory base address (read, not written). The rs1 field is the source memory base address. Funct7 partitioning is identical to load-op (§5.1).

Encoding:

 31        25 24    20 19    15 14   12 11     7 6           0
+-----------+--------+--------+-------+--------+-------------+
|  funct7   |  rs2   |  rs1   |  011  |   rd   |  1011011    |
+-----------+--------+--------+-------+--------+-------------+

Field        Bits     Meaning
funct7[6:5]  [31:30]  Width (00=word signed load, 01=dword, 10=word unsigned load, 11=reserved)
funct7[4:0]  [29:25]  ALU operation (same encoding as §5.1)
rs2          [24:20]  Register-held ALU operand
rs1          [19:15]  Source memory base address
funct3       [14:12]  011
rd           [11:7]   Destination memory base address (read-only with respect to the register file)
opcode       [6:0]    0x5B

5.4.1 Variant Table

Mnemonic width aluop Operation
MMWADD 00 00000 mem32[rd] = (sext32(mem32[rs1]) + rs2)[31:0]
MMWSUB 00 00001 mem32[rd] = (sext32(mem32[rs1]) - rs2)[31:0]
MMWAND 00 00010 mem32[rd] = (sext32(mem32[rs1]) & rs2)[31:0]
MMWOR 00 00011 mem32[rd] = (sext32(mem32[rs1]) | rs2)[31:0]
MMWXOR 00 00100 mem32[rd] = (sext32(mem32[rs1]) ^ rs2)[31:0]
MMWSLL 00 00101 mem32[rd] = (sext32(mem32[rs1]) << (rs2 & 31))[31:0]
MMWSRL 00 00110 mem32[rd] = (sext32(mem32[rs1]) >>L (rs2 & 31))[31:0]
MMWSRA 00 00111 mem32[rd] = (sext32(mem32[rs1]) >>A (rs2 & 31))[31:0]
MMWSLT 00 01000 mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (signed)
MMWSLTU 00 01001 mem32[rd] = (sext32(mem32[rs1]) < rs2) ? 1 : 0 (unsigned)
MMDADD 01 00000 mem64[rd] = mem64[rs1] + rs2
MMDSUB 01 00001 mem64[rd] = mem64[rs1] - rs2
MMDAND 01 00010 mem64[rd] = mem64[rs1] & rs2
MMDOR 01 00011 mem64[rd] = mem64[rs1] | rs2
MMDXOR 01 00100 mem64[rd] = mem64[rs1] ^ rs2
MMDSLL 01 00101 mem64[rd] = mem64[rs1] << (rs2 & 63)
MMDSRL 01 00110 mem64[rd] = mem64[rs1] >>L (rs2 & 63)
MMDSRA 01 00111 mem64[rd] = mem64[rs1] >>A (rs2 & 63)
MMDSLT 01 01000 mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0 (signed)
MMDSLTU 01 01001 mem64[rd] = (mem64[rs1] < rs2) ? 1 : 0 (unsigned)
MMWUADD 10 00000 mem32[rd] = (zext32(mem32[rs1]) + rs2)[31:0]
... 10 00001–01001 unsigned-word-load variants of the above

The unsigned-word variants (width = 10) only differ from the signed-word forms (width = 00) for operations where the load sign-extension matters: MMWSRA, MMWSRL, MMWSLT, MMWSLTU. For ADD, SUB, AND, OR, XOR, and SLL the result bits are identical; the assembler may accept the unsigned form as a synonym or flag it as redundant.
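The claim can be checked by brute force over a handful of operand values. The sketch below models the word-width load-op-store result as described in the variant table (64-bit ALU, shift amount rs2 & 31, low 32 bits stored); the function stored_word and the sample sets are assumptions chosen for this illustration, not an exhaustive proof.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def sext32(word):
    """Sign-extend a 32-bit value to its 64-bit two's-complement bit pattern."""
    return ((word ^ 0x80000000) - 0x80000000) & MASK64


def signed64(x):
    """Interpret a 64-bit pattern as a signed integer."""
    return x - (1 << 64) if x >> 63 else x


ALU = {
    "ADD":  lambda a, b: a + b,
    "SUB":  lambda a, b: a - b,
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "SLL":  lambda a, b: a << (b & 31),
    "SRL":  lambda a, b: a >> (b & 31),
    "SRA":  lambda a, b: signed64(a) >> (b & 31),
    "SLT":  lambda a, b: int(signed64(a) < signed64(b)),
    "SLTU": lambda a, b: int(a < b),
}


def stored_word(op, word, rs2, unsigned_load):
    """Low 32 bits written to memory by the word-width variant of `op`."""
    a = word if unsigned_load else sext32(word)
    return ALU[op](a, rs2) & MASK32


WORDS = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x12345678, 0x89ABCDEF]
OPERANDS = [0, 1, 4, 31, 0x80000000, 0xFFFFFFFF, 0x100000000, 1 << 63, MASK64]

differing = {
    op for op in ALU
    if any(stored_word(op, w, r, False) != stored_word(op, w, r, True)
           for w in WORDS for r in OPERANDS)
}
print(sorted(differing))
```

For the remaining operations the low 32 result bits depend only on the low 32 bits of the loaded value, which is why the two load forms cannot be told apart after truncation.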

5.4.2 Assembler Convention

The instruction takes three register operands. The first and second name memory locations (the destination and source base addresses, both bracketed in source code); the third names a register-held ALU operand:

MMWADD [x10], [x11], x12   ; mem32[x10] = mem32[x11] + x12
MMDOR  [x10], [x10], x12   ; mem64[x10] |= x12 (in-place, rd == rs1)

Both bracketed operands are read from the register file as pointers; neither is modified by the instruction. Compose with auto-increment loads/stores (§3, §4) on the surrounding code if pointer advance is needed.

5.4.3 In-Place Updates (rd == rs1)

When rd and rs1 name the same register, the instruction performs an in-place memory update: the load reads from the location, the ALU computes the new value, and the store writes back to the same location. This is well-defined: the load completes before the store begins, and there is exactly one memory location involved. The pattern matches C idioms like:

arr[i] &= mask;     // MMWAND [p], [p], mask     where p = &arr[i]
counter += step;    // MMDADD [p], [p], step
buf[i] ^= 0x80;     // MMWXOR [p], [p], t0       where t0 = 0x80

When rd != rs1, the load and store target distinct memory locations and the operation moves data with transformation:

dst[i] = src[i] + bias;     // MMWADD [d], [s], bias

5.4.4 Trap Restart

Unlike block memory (§5.5), load-op-store carries no partial progress across traps. The instruction completes atomically or not at all from an architectural perspective:

  • Trap before the load completes (e.g., load page fault): no architectural state has changed; PC remains at the instruction; retry re-executes from the beginning.
  • Trap between the load and the store (e.g., timer interrupt mid-instruction): the loaded value lives only in pipeline internal state, never in any architectural register; discarding it is safe. PC remains at the instruction; retry re-executes the load (idempotent — the source memory has not been written), the ALU op, and the store.
  • Trap during the store (e.g., store page fault on the destination): the load has already completed but its result is internal; the store has not committed to architectural memory; retry from the beginning.
  • No trap: PC advances normally past the instruction.

The implementation must guarantee that the store is not visible to other harts or to the memory system until the instruction is the next to commit architecturally. Standard store-buffer pipelining with commit-at-retire satisfies this.

5.4.5 Wide-Mode Extension

In wide mode, the extension nibble extends rd, rs1, and rs2 exactly as for a normal R-type:

Bit      Extends
bit[32]  rd (destination memory base)
bit[33]  rs1 (source memory base)
bit[34]  rs2 (register-held ALU operand)
bit[35]  spare (reserved)

A wide-mode load-op-store may name any combination of x0–x63 for all three operands.

5.4.6 Worked Example

MMWADD [x20], [x21], x12 — read 32-bit word at [x21], add x12, store to [x20]:

width  = 00
aluop  = 00000
funct7 = 0000000
rs2    = 12, rs1 = 21, funct3 = 011, rd = 20
Encoding: 0000000 01100 10101 011 10100 1011011

MMDOR [x10], [x10], x14 — in-place 64-bit OR of mem64[x10] with x14:

width  = 01
aluop  = 00011
funct7 = 0100011
rs2    = 14, rs1 = 10, funct3 = 011, rd = 10
Encoding: 0100011 01110 01010 011 01010 1011011
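Both worked examples can be regenerated from their operands. The following sketch packs the narrow-mode R-type fields for the load-op-store sub-family; encode_rtype, encode_load_op_store, and show are hypothetical helper names, and the wide-mode extension nibble is deliberately left out.

```python
XCRISP_FUSED_OPCODE = 0x5B  # custom-2
LOAD_OP_STORE = 0b011       # funct3 for this sub-family


def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode=XCRISP_FUSED_OPCODE):
    """Pack R-type fields into a 32-bit instruction word (narrow mode only)."""
    for name, reg in (("rs2", rs2), ("rs1", rs1), ("rd", rd)):
        if not 0 <= reg <= 31:
            raise ValueError(f"{name} out of narrow-mode range: {reg}")
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_load_op_store(width, aluop, dst_base, src_base, operand):
    """mem[dst_base] = mem[src_base] ALUOP operand; funct7 = {width, aluop}."""
    if width not in (0b00, 0b01, 0b10) or not 0 <= aluop <= 0b01001:
        raise ValueError("reserved width or aluop")
    return encode_rtype((width << 5) | aluop, operand, src_base, LOAD_OP_STORE, dst_base)


def show(word):
    """Format a word as funct7 | rs2 | rs1 | funct3 | rd | opcode bit groups."""
    return (f"{word >> 25:07b} {(word >> 20) & 31:05b} {(word >> 15) & 31:05b} "
            f"{(word >> 12) & 7:03b} {(word >> 7) & 31:05b} {word & 0x7F:07b}")


mmwadd = encode_load_op_store(0b00, 0b00000, dst_base=20, src_base=21, operand=12)
mmdor = encode_load_op_store(0b01, 0b00011, dst_base=10, src_base=10, operand=14)
print(show(mmwadd))   # MMWADD [x20], [x21], x12
print(show(mmdor))    # MMDOR  [x10], [x10], x14
```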

5.4.7 Implementation Cost

Load-op-store is the most expensive Xcrisp instruction and the one most likely to drive microarchitectural complexity. The instruction requires:

  • One memory read from [rs1]
  • One ALU op on the load result and rs2
  • One memory write to [rd]

The architecture permits any of three implementation strategies:

  1. Sequential micro-ops. Decode into three internal operations (load, ALU, store) and issue them sequentially. Simplest to implement; latency 3+ cycles, throughput 1 per 3 cycles. Suitable for compact pipelines where load-op-store is rare.

  2. Pipelined load → ALU → store. Treat the instruction as occupying three pipeline stages in sequence, but allow successive load-op-store instructions to overlap (one in load, one in ALU, one in store). Steady-state throughput one per cycle; latency 3 cycles per instruction. Requires separate memory read and write ports on the load/store unit (or a dual-pumped path to main memory). The natural FireStorm target.

  3. Same-cycle load/ALU/store. A wide single-cycle implementation reads the source, computes the result, and issues the store all in one cycle, completing in 1 cycle of latency. Requires a very fast critical path and may bottleneck on the memory port count. Likely impractical without substantial pipelining work.

Strategy 2 captures most of the performance win at modest implementation cost and is the recommended baseline. An implementation may freely choose to fall back to strategy 1 for unaligned accesses or other corner cases.

5.5 Block Memory Operations (funct3 = 010)

Block memory operations come in two flavours: synchronous (BMCPY, BMSET) execute on the CPU's load/store ports and are interruptible per §5.5.1; asynchronous (DMACPY, DMASET) hand the work to a hardware DMA queue and return immediately per §5.5.2. The choice is made per call site: small copies use the synchronous path (no DMA setup overhead, predictable latency); large transfers use DMA to overlap with CPU work.

For this sub-family, funct7 directly selects the operation; width bits are reserved (must be zero).

funct7 Mnemonic Operands Operation Section
0000000 BMCPY rd, rs1, rs2 Synchronous copy: rs2 bytes from rs1 to rd. All three registers advance to reflect progress; on completion rs2 = 0. §5.5.1
0000001 BMSET rd, rs1, rs2 Synchronous fill: write rs1[7:0] to mem8[rd] × rs2. rd advances, rs2 counts down. §5.5.1
0000010 DMACPY rd, rs1, rs2 Asynchronous copy: enqueue copy of rs2 bytes from rs1 to rd on the DMA queue; CPU continues. rs2 becomes DMA-tagged. §5.5.2
0000011 DMASET rd, rs1, rs2 Asynchronous fill: enqueue fill of rs2 bytes at rd with byte rs1[7:0]. rs2 becomes DMA-tagged. §5.5.2
0000100 BMCMP rd, rs1, rs2 Compare rs2 bytes at rd vs rs1. Reserved for v0.2.
0000101–1111111 reserved illegal-instruction

5.5.1 Synchronous Block Operations (BMCPY, BMSET)

Synchronous block operations are interruptible: they update their register operands as they progress, and a trap mid-execution leaves the registers in a consistent restartable state.

Restart semantics. On an asynchronous trap mid-block-op, the architectural state holds:

  • rd, rs1 advanced past completed bytes
  • rs2 reduced to the byte count remaining
  • PC pointing at the block instruction (not past it)

Returning from the trap re-executes the instruction with the partially-advanced register state, resuming from where it stopped. This requires the trap return path to use mret with the same PC, which is standard behaviour.
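The restart behaviour can be illustrated with a small software model. This is a sketch only: the function bmcpy_step, the dictionary register file, and the budget parameter (which stands in for a trap arriving after some number of bytes) are all hypothetical, and the byte-at-a-time loop is the conservative strategy rather than a statement about how FireStorm hardware moves data.

```python
def bmcpy_step(regs, mem, rd, rs1, rs2, budget):
    """Copy up to `budget` bytes, then stop as if a trap had been taken.

    Registers are updated after every byte, so the state is always restartable.
    Returns True once the whole copy has completed (rs2 == 0).
    """
    while regs[rs2] > 0 and budget > 0:
        mem[regs[rd]] = mem[regs[rs1]]
        regs[rd] += 1       # destination pointer advances
        regs[rs1] += 1      # source pointer advances
        regs[rs2] -= 1      # remaining byte count shrinks
        budget -= 1
    return regs[rs2] == 0


mem = bytearray(range(16)) + bytearray(16)
regs = {"a0": 16, "a1": 0, "a2": 10}          # dst, src, count

done = bmcpy_step(regs, mem, "a0", "a1", "a2", budget=4)    # interrupted after 4 bytes
print(done, regs)
done = bmcpy_step(regs, mem, "a0", "a1", "a2", budget=100)  # re-executed after the trap
print(done, regs, bytes(mem[16:26]))
```

The second call receives only the partially advanced registers, yet finishes the copy correctly, which is the property the PC-not-advanced rule relies on.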

Overlap behaviour. For BMCPY, overlapping source and destination regions where rd > rs1 (forward overlap that would corrupt the source) is implementation-defined: a conservative implementation falls back to byte-at-a-time copy; an optimistic implementation uses wider transfers and is undefined for overlapping ranges. Code requiring guaranteed forward overlap (memmove semantics) should test and either swap direction or use a library routine.

Width hint (future). Although width[1:0] in funct7 is reserved for v0.1, a future revision may use it to express "minimum guaranteed alignment of the operands" (e.g., 01 = both pointers 8-byte aligned, allowing the implementation to issue 64-bit transfers). For v0.1, the implementation infers alignment from the runtime values.

5.5.2 Asynchronous DMA Operations (DMACPY, DMASET)

DMA operations enqueue a memory transfer onto a hardware DMA queue and return in one cycle, leaving the CPU free to execute other instructions while the transfer proceeds in parallel. The CPU and the DMA engine synchronise through the count register, which is hardware-tagged "DMA-pending" until the operation completes.

Operand Capture at Issue

When DMACPY or DMASET issues, the DMA engine captures the values of rs1 (source pointer or fill byte), rd (destination pointer), and rs2 (byte count). The DMA engine owns these copies until the operation completes; the CPU's source and destination registers (rs1, rd) are not subsequently modified and may be freely reused for other work on the next cycle.

The count register (rs2) is given special treatment — it remains architecturally bound to the DMA's live progress counter for the duration of the operation. See Register Tagging below.

Register Tagging Semantics

At issue, the named rs2 register is recorded in a small hardware DMA tag table (one entry per outstanding DMA, sized to the queue depth). The tag binds the register to the DMA's internal byte-count counter.

While the tag is active:

  • CPU reads of the tagged register return the live remaining byte count from the DMA engine. The register-file read port has a forwarding path from the DMA engine's count register; reads complete in the normal load-use cycles. This permits progress polling without stalling.
  • CPU writes to the tagged register stall the pipeline until the DMA completes and the tag clears. The pending write then takes effect on the (now-untagged) register. This is the canonical wait-for-completion mechanism.

When the DMA finishes, the engine drains its final byte count (0) into the register file, clears the tag, and any blocked write proceeds.
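The tagging rules can be summarised in a toy model. This is a sketch under simplifying assumptions: the class TaggedRegisterFile and its method names are hypothetical, time is advanced explicitly by calling dma_progress, and a stalled write is represented by try_write returning False rather than by blocking.

```python
class TaggedRegisterFile:
    """Toy model of the DMA tag table bound to count registers."""

    def __init__(self):
        self.regs = {}    # architectural register values
        self.tags = {}    # register name -> live remaining byte count

    def issue_dma(self, count_reg):
        """DMACPY/DMASET issue: bind count_reg to the engine's counter."""
        self.tags[count_reg] = self.regs[count_reg]

    def dma_progress(self, count_reg, nbytes):
        """The engine transferred nbytes; on completion drain 0 and clear the tag."""
        remaining = max(0, self.tags[count_reg] - nbytes)
        if remaining == 0:
            self.regs[count_reg] = 0
            del self.tags[count_reg]
        else:
            self.tags[count_reg] = remaining

    def read(self, reg):
        """Reads never stall; a tagged register returns the live count."""
        if reg in self.tags:
            return self.tags[reg]
        return self.regs[reg]

    def try_write(self, reg, value):
        """Writes to a tagged register would stall; report that as False."""
        if reg in self.tags:
            return False
        self.regs[reg] = value
        return True


rf = TaggedRegisterFile()
rf.regs["a2"] = 100
rf.issue_dma("a2")
rf.dma_progress("a2", 60)
print(rf.read("a2"), rf.try_write("a2", 5))   # live count, write would stall
rf.dma_progress("a2", 40)
print(rf.read("a2"), rf.try_write("a2", 5))   # tag cleared, write proceeds
```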

DMAWAIT Idiom

The assembler provides a pseudo-instruction:

DMAWAIT  rs    ; expands to:  ADDI rs, rs, 0

When rs is currently DMA-tagged, this stalls the CPU until the DMA completes. When rs is not tagged, it is a no-op. The pseudo makes the intent explicit in source code without consuming a separate encoding.

Common usage:

DMACPY  a0, a1, a2      ; queue 1MB copy; a2 holds count (becomes tagged)
; ... CPU does other useful work for hundreds of cycles ...
DMAWAIT a2              ; block until copy complete
ld      t0, 0(a0)       ; safe to read destination now

Progress Polling

Because reads of the tagged register do not stall, software can monitor DMA progress for use-cases like watchdog timeouts or partial-completion processing:

DMACPY  a0, a1, a2      ; large copy, a2 = 1048576
.Lwait:
    mv      t0, a2          ; t0 = live remaining count (no stall)
    bnez    t0, .Lcheck     ; ... or process partial data, etc.
    j       .Ldone
.Lcheck:
    ; ... do some work that does NOT depend on the destination region ...
    j       .Lwait
.Ldone:

Queue Depth and Back-Pressure

The DMA engine has a queue of pending operations whose depth is implementation-defined. Suggested values:

FireStorm variant DMA queue depth
All models 8

When the queue is full, a new DMACPY/DMASET issue stalls until a slot becomes free. The queue capacity is reported in the mxcrisp CSR (§12.3, DMA_QUEUE_DEPTH field).

The DMA tag table has the same number of entries as the queue, so each outstanding DMA can tag a distinct register. Issuing a new DMA naming a still-tagged register stalls until that register's tag clears — this is the natural back-pressure mechanism for serialised DMA dispatch from a single producer.

Cache Coherence

FireStorm has a small 8 KB direct-mapped write-through D-cache covering DDR3 data accesses (§5.2 of the parent doc); the hot data structures (Xstack frames, Xctx contexts, scratchpad-resident voice state, etc.) live in dedicated BSRAM and do not pass through the cache.

DMA coherence is handled automatically at two levels:

  • D-cache lines covering DMA-target addresses are auto-invalidated. For each DMA write to address A, the cache index (A >> 5) & 0xFF is computed and that line's valid bit is cleared. No software flush is needed.
  • Prefetch buffer ranges overlapping DMA-target addresses are auto-invalidated (§4.7 of the parent doc). This handles DMA-to-code coherence for JIT compilers and dynamic loaders.
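The index expression implies a 32-byte line and 256 lines, which multiplies out to the stated 8 KB. The sketch below computes the set of line indices a DMA write range would invalidate; the function names dcache_index and lines_to_invalidate are hypothetical, and the model is purely arithmetic.

```python
LINE_BYTES = 1 << 5      # implied by the shift of 5
NUM_LINES = 0xFF + 1     # implied by the 0xFF mask


def dcache_index(addr: int) -> int:
    """Direct-mapped line index for a byte address, per the expression above."""
    return (addr >> 5) & 0xFF


def lines_to_invalidate(start: int, nbytes: int) -> set:
    """Line indices whose valid bit a DMA write of nbytes at start would clear."""
    if nbytes <= 0:
        return set()
    first = start >> 5
    last = (start + nbytes - 1) >> 5
    if last - first + 1 >= NUM_LINES:        # range wraps the whole cache
        return set(range(NUM_LINES))
    return {line & 0xFF for line in range(first, last + 1)}


print(LINE_BYTES * NUM_LINES)                      # total cache size in bytes
print(sorted(lines_to_invalidate(0x1010, 0x40)))   # a 64-byte unaligned write
```

An unaligned 64-byte write touches three lines rather than two, so the invalidation must be driven by the address range and not by the byte count alone.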

Scratchpad-targeted DMA and BSRAM-region DMA (Xstack, Xctx) need no special coherence handling — they bypass the cache entirely.

DMA reads from DRAM see current data because the write-through D-cache always streams stores to DRAM.

External DMA agents that bypass the FireStorm DMA engine (e.g., a DMA controller on the chipset writing through a different path) must coordinate explicitly via mxdcache_flush_addr and mxbuf_flush_addr.

Trap and Interrupt Behaviour

DMA operations are independent of CPU traps and continue running through interrupts. A trap handler may freely issue DMACPY/DMASET (subject to the queue depth). If the CPU traps while stalled on a tagged-register write, the trap is taken normally and the still-pending DMA continues — the handler observes the still-tagged register and may either ignore it (the original stall resumes after return) or itself wait on it via DMAWAIT.

If the DMA encounters a memory fault mid-transfer (unmapped page, write to read-only region, etc.), the engine raises an asynchronous trap and reports the faulting address and DMA ID in a status CSR (TBD; see open items §13). The associated count register retains the remaining byte count at the fault; the tag is not cleared until software acknowledges. This permits a recovery handler to inspect and either retry or abort.

Operand Edge Cases

  • rs2 = 0 (or rs2 = x0): zero-byte DMA, no-op. The DMA may still consume a queue slot briefly; tags clear immediately.
  • rs1 = rd for DMACPY: defined as a no-op copy (copy region overlaps itself trivially).
  • Overlapping source/destination regions for DMACPY: same implementation-defined behaviour as BMCPY (§5.5.1).
  • DMA targeting memory-mapped I/O: permitted and useful (e.g., streaming audio buffers to a DAC FIFO). Implementation-defined whether the DMA engine respects MMIO ordering constraints; the suggested behaviour is "treat MMIO writes as strictly ordered, identical to CPU MMIO stores."

Worked Examples

1 MB memory clear, overlapped with CPU work:

        li      a2, 1048576              ; count = 1 MB
        li      a1, 0                    ; fill byte = 0
        la      a0, buffer               ; destination pointer
        DMASET  a0, a1, a2               ; queue the clear, a2 now tagged
        ; --- CPU does ~hundreds of microseconds of other work here ---
        jal     ra, prepare_next_frame
        jal     ra, run_audio_callback
        ; --- finally need the buffer ---
        DMAWAIT a2                       ; ensure clear is complete
        ; buffer now zeroed; safe to use

If the "other work" takes longer than the DMA, the DMAWAIT is a no-op; if it takes less, the wait stalls for the remainder. Either way the total wall time is max(DMA_time, CPU_work_time) rather than their sum.

Double-buffered audio block render:

render_loop:
        DMACPY  out_a, render_a, blk_bytes_a    ; flush previous block; blk_bytes_a tagged
        ; while DMA copies block A to output:
        jal     ra, render_block_b              ; CPU renders block B into render_b
        DMAWAIT blk_bytes_a                     ; wait for A's copy to finish
        ; swap A/B pointers
        mv      tmp, render_a
        mv      render_a, render_b
        mv      render_b, tmp
        mv      tmp, out_a
        mv      out_a, out_b
        mv      out_b, tmp
        j       render_loop

The DMA copy of the previous block overlaps with the CPU's rendering of the next. Throughput is determined by the slower of (DMA bandwidth, CPU render time) rather than their sum.

Encoding example — DMACPY x10, x11, x12 (copy x12 bytes from [x11] to [x10]):

funct7 = 0000010
rs2    = 12, rs1 = 11, funct3 = 010, rd = 10
Encoding: 0000010 01100 01011 010 01010 1011011

5.6 Examples

LDADD x8, (x10), x12 — read 64-bit dword from [x10], add x12, write to x8:

width  = 01
aluop  = 00000
funct7 = 0100000
rs2    = 12, rs1 = 10, funct3 = 000, rd = 8
Encoding: 0100000 01100 01010 000 01000 1011011

ADDSW [x5], x6, x7 — write x6 + x7 (low 32 bits) to mem32[x5]:

width  = 00
aluop  = 00000
funct7 = 0000000
rs2    = 6  (the first register operand after the bracketed destination)
rs1    = 5  (the bracketed destination)
funct3 = 001
rd     = 7  (interpreted as rs3, the second register operand)
Encoding: 0000000 00110 00101 001 00111 1011011

Assembler operand order: ADDSW [rs1], rs2, rs3. The encoding places rs2 at [24:20] and rs3 (in the rd field) at [11:7]. The order matters for non-commutative operations such as SUBSW and the shifts, where rs2 is the left-hand operand and rs3 the right-hand one. An assembler grammar may choose a different source order, but the encoding fixes which bit position each operand occupies.

BMCPY x10, x11, x12 — copy x12 bytes from [x11] to [x10]:

funct7 = 0000000
rs2    = 12
rs1    = 11
funct3 = 010
rd     = 10
Encoding: 0000000 01100 01011 010 01010 1011011

6. Compare-Mem-Branch (custom-3, opcode 0x7B)

6.1 Encoding

Standard B-type layout:

 31           25 24    20 19    15 14   12 11           7 6           0
+--------------+--------+--------+-------+--------------+-------------+
| imm[12|10:5] |  rs2   |  rs1   | funct3| imm[4:1|11]  |  1111011    |
+--------------+--------+--------+-------+--------------+-------------+
Field Bits Meaning
imm[12:1] [31:25] ‖ [11:7] (scrambled, standard B-type pattern) Signed 13-bit branch offset (bit 0 is implicit zero, half-word alignment)
rs1 [19:15] First operand (a register value)
rs2 [24:20] Second operand (interpreted as a base address — memory is read from mem[rs2])
funct3 [14:12] Condition + width
opcode [6:0] 0x7B (custom-3)

The branch range follows the underlying B-type encoding rules:

  • Narrow mode: standard B-type ±4 KiB (imm12 with ×2 byte scaling, bit[0] implicit zero).
  • Wide mode: imm14 with ×4 slot scaling (bits[1:0] implicit zero), giving ±32 KiB. The compare-mem-branch instruction inherits the wide-mode immediate extension and slot-indexed PC convention described in §7.3.2 and §8.6 of ee_cpu.
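
The two range rules can be expressed as a small predicate of the kind an assembler might apply before emitting a compare-mem-branch. This is a sketch only: the function name is hypothetical, and it assumes the offset is given in bytes in both modes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when `offset` (in bytes, an assumption of this sketch) fits
 * the compare-mem-branch immediate.
 *   Narrow mode: signed 13-bit byte offset with bit 0 zero -> [-4096, +4094]
 *   Wide mode:   signed imm14 scaled by 4                  -> [-32768, +32764] */
static inline bool cmb_offset_encodable(int64_t offset, bool wide_mode)
{
    if (wide_mode)
        return (offset % 4 == 0) && offset >= -32768 && offset <= 32764;
    return (offset % 2 == 0) && offset >= -4096 && offset <= 4094;
}
```

A target that fails this check would need an inverted condition around an unconditional jump, as with standard branches.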

6.2 Funct3 Encoding

The condition encoding mirrors standard branches; the funct3 = 010 and 011 slots (unused in standard RV) host dword variants:

funct3 Mnemonic Condition
000 BEQM (rs1 as 32-bit) == sext32(mem32[rs2])
001 BNEM (rs1 as 32-bit) != sext32(mem32[rs2])
010 BEQMD rs1 == mem64[rs2]
011 BNEMD rs1 != mem64[rs2]
100 BLTM (int32)rs1 < sext32(mem32[rs2])
101 BGEM (int32)rs1 >= sext32(mem32[rs2])
110 BLTUM (uint32)rs1 < zext32(mem32[rs2])
111 BGEUM (uint32)rs1 >= zext32(mem32[rs2])

Ordered comparisons (BLTM, BGEM, BLTUM, BGEUM) are word-only in v0.1; dword ordered compare-mem-branch is reserved for v0.2 (would require an additional opcode partition or width-modifier bit).
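
The condition table can be read as the following reference model. It is a sketch for checking intent, not a description of the hardware; the function name is illustrative, `mem` stands for the location addressed by rs2, and the host's byte order is assumed to match the target's.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Evaluate a compare-mem-branch condition.
 * funct3: condition selector from the table above.
 * rs1:    full 64-bit register value.
 * mem:    pointer to the memory operand (4 bytes for word forms, 8 for dword forms). */
static inline bool cmb_condition(unsigned funct3, uint64_t rs1, const void *mem)
{
    if (funct3 == 2u || funct3 == 3u) {          /* BEQMD / BNEMD: full 64-bit compare */
        uint64_t m64;
        memcpy(&m64, mem, sizeof m64);
        return (funct3 == 2u) ? (rs1 == m64) : (rs1 != m64);
    }

    uint32_t m32;
    memcpy(&m32, mem, sizeof m32);               /* word forms read only 4 bytes */
    uint32_t r32 = (uint32_t)rs1;                /* upper 32 bits of rs1 are ignored */

    switch (funct3) {
    case 0u: return r32 == m32;                       /* BEQM  */
    case 1u: return r32 != m32;                       /* BNEM  */
    case 4u: return (int32_t)r32 <  (int32_t)m32;     /* BLTM  (signed)   */
    case 5u: return (int32_t)r32 >= (int32_t)m32;     /* BGEM  (signed)   */
    case 6u: return r32 <  m32;                       /* BLTUM (unsigned) */
    case 7u: return r32 >= m32;                       /* BGEUM (unsigned) */
    default: return false;
    }
}
```

Note that the word forms compare only the low 32 bits of rs1, so a register holding garbage in its upper half still matches a word in memory with the same low bits.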

6.3 Examples

BEQM x10, (x11), .L1 — branch to .L1 if x10[31:0] equals mem32[x11]:

funct3 = 000
rs1    = 10, rs2 = 11
imm    = offset to .L1
Encoding (with offset = +16): imm field scrambled per B-type:
   imm[12]=0, imm[10:5]=000000, imm[4:1]=1000, imm[11]=0
   → bits [31:25] = 0_000000 = 0x00
   → bits [11:7]  = 1000_0   = 0x10
   Full: 0000000 01011 01010 000 10000 1111011

BNEMD x4, (x5), .L_end — loop exit when x4 != mem64[x5]:

funct3 = 011
rs1    = 4, rs2 = 5
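
The immediate scrambling in the BEQM example follows the standard B-type pattern, which can be sketched as below. The function name is illustrative, and only the narrow-mode 32-bit word is modelled.

```c
#include <stdint.h>

/* Pack a narrow-mode B-type word.
 * offset is a byte offset; bit 0 is implicit and is not encoded. */
static inline uint32_t xcrisp_encode_b(uint32_t funct3, uint32_t rs1,
                                       uint32_t rs2, int32_t offset,
                                       uint32_t opcode)
{
    uint32_t imm = (uint32_t)offset;                 /* two's-complement bit pattern */
    uint32_t hi  = (((imm >> 12) & 0x1u) << 6) |     /* imm[12]   -> bit 31     */
                    ((imm >> 5)  & 0x3Fu);           /* imm[10:5] -> bits 30:25 */
    uint32_t lo  = (((imm >> 1)  & 0xFu) << 1) |     /* imm[4:1]  -> bits 11:8  */
                    ((imm >> 11) & 0x1u);            /* imm[11]   -> bit 7      */

    return (hi << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           (lo << 7) |
           (opcode & 0x7Fu);
}
```

With funct3 = 000, rs1 = 10, rs2 = 11, offset = +16, and opcode 0x7B, this yields 0x00B5087B, the word spelled out bit by bit in the BEQM example.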

6.4 Common Idioms

Null-terminated string scan:

    ; x12 = candidate char, x10 = pointer, x11 = address of a 32-bit zero word (terminator value)
loop:
    LBPI    x12, 1(x10)       ; read byte, advance pointer
    BNEM    x12, (x11), loop   ; loop while not terminator
    ; ... at end: x10 points past terminator, x12 = 0

A two-instruction inner loop, one cycle per byte on a forwarding implementation.

Lookup-table walk:

    ; x10 = key, x11 = table base, x12 = entry stride, x13 = table end pointer
loop:
    BEQM    x10, (x11), found
    ADD     x11, x11, x12
    BNE     x11, x13, loop      ; standard branch on table-end pointer
found:
    ; x11 = pointer to matching entry, or x13 (table end) if no entry matched

7. B-tree Primitives (custom-2, opcode 0x5B)

B-trees and related sorted-array data structures (sorted vectors, sorted hash buckets, ordered indexes) are fundamental to databases, key-value stores, set/map containers, and any code that maintains a sorted collection. The hot operation in every B-tree is find the first key ≥ target within a node — a sequential or branchy binary search through a small sorted array, typically 16–64 keys.

Standard RV64GC implements this as a comparison loop with conditional branches, which:

  • Mispredicts roughly half the time (the comparison outcome depends on the data),
  • Takes one cycle per key compared,
  • Pollutes the branch predictor with high-entropy branches.

For a 16-key B-tree node, software search is 16 compares + 8–16 branches + ~5 mispredicts × ~15 cycles each = ~100 cycles per node visit. With a tree depth of 5–7, a single B-tree lookup costs 500–700 cycles, dominated by branch mispredicts.

The Xcrisp B-tree primitives provide fixed-width parallel search of a sorted array of keys, returning the first position satisfying key[i] >= target. The operation is parallelised across an entire cache line (or two) per instruction, with no branches.

7.1 BSRCH Family — Parallel Sorted-Array Search

Mnemonic              Key width  Keys per instruction  rd width               Operation
BSRCH.B rd, rs1, rs2  8-bit      64                    7-bit position (0–64)  Find first 8-bit key in mem[rs2..rs2+63] ≥ low byte of rs1
BSRCH.H rd, rs1, rs2  16-bit     32                    6-bit position (0–32)  Find first 16-bit key ≥ low halfword of rs1
BSRCH.W rd, rs1, rs2  32-bit     16                    5-bit position (0–16)  Find first 32-bit key ≥ low word of rs1
BSRCH.D rd, rs1, rs2  64-bit     8                     4-bit position (0–8)   Find first 64-bit key ≥ rs1

All variants:

  • Read 64 bytes (one cache line) starting at the address in rs2. The address must be 64-byte aligned; misaligned addresses trap.
  • Compare each key against the search target in rs1 in parallel.
  • Return in rd the index of the lowest position where key[i] >= target. If no key satisfies, return the total count (sentinel "not found within this node").
  • Keys are assumed sorted ascending. If unsorted, the result is the first matching position but no ordering is implied.
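
The behaviour described by these bullets can be captured in a short software model of BSRCH.W. This is a sketch for reasoning about results, not a statement of how the hardware is built; the names are illustrative, and a signed comparison is assumed because the text does not say whether keys compare as signed or unsigned.

```c
#include <stdint.h>

#define BSRCH_W_KEYS 16u   /* 64-byte line / 4 bytes per key */

/* Model of BSRCH.W: index of the lowest position with keys[i] >= target,
 * or BSRCH_W_KEYS when no key qualifies.
 * Assumption: keys are compared as signed 32-bit values. */
static inline unsigned bsrch_w_model(int32_t target,
                                     const int32_t keys[BSRCH_W_KEYS])
{
    for (unsigned i = 0; i < BSRCH_W_KEYS; i++) {
        if (keys[i] >= target)
            return i;            /* the priority encoder picks the lowest match */
    }
    return BSRCH_W_KEYS;         /* sentinel: not found within this node */
}
```

The hardware evaluates all 16 comparisons at once; the loop here only reproduces the result, including the "lowest matching position" rule that applies when the keys are not sorted.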

Latency: 4 cycles (load 64 bytes from D-cache + 8/16/32/64 parallel compares + priority encoder + writeback). On a cache miss the load latency dominates and the operation effectively takes the cache-fill time.

Throughput: 1 per cycle pipelined; 1 per 4 cycles in dependency chain.

Mode availability: both narrow and wide. Narrow mode addresses up to 32 keys per call (BSRCH.B/H/W return position fitting in 5 bits); BSRCH.B returning position 33–64 requires wide mode's 6-bit return register width.

Encoding

BSRCH.X: opcode = 0x5B, funct3 = 010, funct7 = 0010xxx
   funct7[2:0] = width selector
      000 = .B (64 keys × 8-bit)
      001 = .H (32 keys × 16-bit)
      010 = .W (16 keys × 32-bit)
      011 = .D ( 8 keys × 64-bit)
      100–111 = reserved (future widths: 128 keys × 4-bit for sub-byte indexes, etc.)

Example: B-tree Node Search

// C version: linear search
int find_position(int32_t target, int32_t *keys, int n) {
    int i = 0;
    while (i < n && keys[i] < target) i++;
    return i;
}

Standard RV64GC with n=16 mispredicts roughly half the iterations. Average cost: ~50 cycles for n=16, dominated by mispredicts.

With BSRCH:

    ; a0 = target, a1 = key array (aligned 64 bytes)
    BSRCH.W  a2, a0, a1                ; a2 = position (0..16), 4 cycles
    ret

One instruction, 4 cycles, no branches, no mispredicts. ~12× speedup on the inner search; per full B-tree lookup at depth 5, total speedup ~10× since the search dominates.

7.2 BSCAN Family — First-Match Search

A variant of BSRCH that searches for an exact-match key (returns position of first key[i] == target, or N if not found). Used in hash table chaining, dictionary lookups within a small bucket, and validation paths in indexed structures.

Mnemonic Key width Keys per instruction
BSCAN.B rd, rs1, rs2 8-bit 64
BSCAN.H rd, rs1, rs2 16-bit 32
BSCAN.W rd, rs1, rs2 32-bit 16
BSCAN.D rd, rs1, rs2 64-bit 8

Same encoding family as BSRCH, with funct7[5:3] distinguishing operation:

  • BSRCH: funct7 = 0010xxx
  • BSCAN: funct7 = 0011xxx

Latency and semantics identical to BSRCH except the comparison is equality rather than ≥.

Use Case: Hash Bucket Probe

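// 8-way hash bucket with 16-bit fingerprints; probe for match.
// Pseudocode: BSCAN.H denotes the instruction itself, not a C construct.
// BSCAN.H always scans 32 halfwords, so the bucket is assumed to start a
// 64-byte-aligned line; bucket_values is a parallel array of payloads.
int probe_bucket(uint16_t fingerprint, uint16_t *bucket) {
    int pos;
    BSCAN.H pos, fingerprint, bucket;         // 4 cycles
    if (pos < 8) return bucket_values[pos];   // match inside the 8 live slots
    return MISS;
}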

For an in-memory hash table with cuckoo or chaining within bucket-sized arrays, BSCAN.H and BSCAN.B replace the multi-cycle scan and branch sequence with single-instruction probes (4-cycle latency, one issued per cycle when pipelined).

7.3 BSHIFT — Block Shift for Insert/Delete

When inserting a new key into a sorted B-tree node, all keys at and after the insertion position must shift right by one slot. When deleting, the keys after the deletion position shift left. This is fundamentally a memmove operation on a small fixed-size range.

Existing Xcrisp BMCPY (§5.5) handles general overlap-aware block memory copy and is the right tool for shifts on multi-cache-line nodes. For the common case of B-tree node shifts within a 64-byte cache line, BSHIFT is a single-instruction primitive:

Mnemonic Operation
BSHIFTR.X rd, rs1, rs2 Shift keys in mem[rs1] right (toward higher addresses) by rs2 slots, starting at position 0
BSHIFTL.X rd, rs1, rs2 Shift keys in mem[rs1] left (toward lower addresses) by rs2 slots, starting at position rs2

Variants for width .X ∈ {B, H, W, D} match the key sizes of BSRCH.

rs2 is the shift count (typically 1 for insert-one or delete-one operations). rd returns the actual number of slots shifted (lower than rs2 if the shift would have moved data outside the 64-byte window — useful for chain insertion across nodes).

Encoding

BSHIFT: opcode = 0x5B, funct3 = 010, funct7 = 0100xxd
   funct7[2:1] = width selector (00=.B, 01=.H, 10=.W, 11=.D)
   funct7[0]   = direction (0 = right/insert, 1 = left/delete)
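
The funct7 layouts of the three B-tree families can be assembled as follows. This is a sketch under the assumption that rd, rs1, and rs2 occupy the standard R-type field positions; the helper names are illustrative.

```c
#include <stdint.h>

/* Width selector shared by BSRCH, BSCAN and BSHIFT */
enum bt_width { BT_B = 0, BT_H = 1, BT_W = 2, BT_D = 3 };

/* BSRCH: funct7 = 0010xxx, xxx = width selector */
static inline uint32_t bsrch_funct7(enum bt_width w)
{
    return 0x10u | (uint32_t)w;
}

/* BSCAN: funct7 = 0011xxx, xxx = width selector */
static inline uint32_t bscan_funct7(enum bt_width w)
{
    return 0x18u | (uint32_t)w;
}

/* BSHIFT: funct7 = 0100xxd, xx = width selector, d = 1 for left/delete */
static inline uint32_t bshift_funct7(enum bt_width w, unsigned left)
{
    return 0x20u | ((uint32_t)w << 1) | (left & 1u);
}

/* Full narrow-mode word: opcode 0x5B, funct3 = 010.
 * Assumption: operands sit in the standard R-type fields. */
static inline uint32_t btree_encode(uint32_t funct7, unsigned rd,
                                    unsigned rs1, unsigned rs2)
{
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2 & 0x1Fu) << 20) |
           ((rs1 & 0x1Fu) << 15) |
           (0x2u << 12) |
           ((rd & 0x1Fu) << 7) |
           0x5Bu;
}
```

For example, `btree_encode(bsrch_funct7(BT_W), 12, 10, 11)` builds the word for BSRCH.W a2, a0, a1 under these assumptions.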

Latency: 5 cycles (load 64 bytes + barrel shift + store 64 bytes back). Throughput: 1 per 5 cycles.

Example: B-tree Insert at Found Position

    ; Find insertion position
    BSRCH.W  pos, key, node_keys           ; 4 cycles
    ; Compute the address of slot pos
    slli     offset, pos, 2                ; pos × 4 = byte offset
    add      slot_addr, node_keys, offset
    ; Shift keys at pos..end right by 1 (assumes the shift starts at rs1 and
    ; stops at the end of the enclosing 64-byte line; the node must not be full)
    li       count, 1
    BSHIFTR.W rd_count, slot_addr, count   ; 5 cycles
    ; Slot pos is now free: write the new key
    sw       key, 0(slot_addr)
    ; Done: 9 cycles for BSRCH + BSHIFTR (excluding scalar instructions and cache effects)

vs standard RV64GC (which uses scalar memmove with conditional branches): ~60 cycles for the equivalent operation.

~7× speedup on B-tree insertion, sustained at every level of the tree during an insert path.

7.4 Performance Impact on Database / Index Workloads

For a typical in-memory ordered index (B+ tree with 32-key nodes, 5-level tree, 10M-entry index):

Operation                     Standard RV64GC           With BSRCH/BSHIFT  Speedup
Point lookup (find one key)   ~600 cycles               ~60 cycles         10×
Sequential range scan (init)  ~600 cycles (find start)  ~60 cycles         10×
Insert one key                ~1200 cycles              ~150 cycles        ~8×
Delete one key                ~1100 cycles              ~140 cycles        ~8×
Bulk-load 1M entries          ~10 s at 380 MHz          ~1.3 s             ~8×

For workloads dominated by index access — relational query engines, key-value stores, sorted-set caches, ordered-merge joins — these speedups translate directly to overall throughput improvements.

The ~4.2K LUTs + 2 BSRAM blocks of dedicated B-tree hardware (§7.5) represent one of the highest speedup-per-LUT ratios in any FireStorm extension. For a system that hosts a serious database or indexed query engine, the B-tree primitives are likely the single most valuable instruction family.

7.5 Implementation Cost

Hardware structure:

  • 64-byte staging register (one cache line, 512 bits) — read from D-cache or scratchpad in one cycle.
  • 64 parallel 8-bit comparators, partitionable into 32×16-bit, 16×32-bit, or 8×64-bit modes for the different BSRCH widths.
  • Priority encoder producing the lowest set position from the comparator outputs.
  • Barrel shifter (64-byte rotate) for BSHIFT.
  • Write-back path to D-cache or scratchpad (for BSHIFT only).

Component                                          LUTs
64 × 8-bit comparator array (with width-mode mux)  ~1500
Priority encoder (with width-mode mask)            ~300
64-byte barrel shifter                             ~2000
Register-file interface and result formatting      ~200
Decoder and dispatcher                             ~150
Total                                              ~4150 LUTs

Plus 2 BSRAM blocks for the staging register (one for load, one for store-back).

On the GW5AST-138: ~3% of LUT budget.

7.6 Mode Behaviour and Composability

BSRCH, BSCAN, and BSHIFT are all available in narrow and wide modes. In wide mode:

  • Register fields use the extension nibble for 6-bit register indices (rd, rs1, rs2).
  • Xcond predication (bit 35 = PRED-EN) applies normally — predicated B-tree search is useful for skipping work in deleted/empty nodes.
  • The 6-bit rd accommodates BSRCH.B's full 0–64 result range; in narrow mode, BSRCH.B is restricted to a 32-key staging area (returning 0–32) since the 5-bit narrow rd cannot represent positions 33–64.

In narrow mode, BSRCH.B should be used with caution — the 32-key restriction is a structural constraint, not a hardware capability difference. Code that needs to search the full 64-byte staging area uses wide mode, or uses BSRCH.H (32×16-bit) instead.

The B-tree primitives compose naturally with:

  • Xcrisp BMCPY for cross-node shifts (when an insert overflows the 64-byte BSHIFT window).
  • Xcond predication for conditional searches in compressed/sparse indexes.
  • Xcrisp X-type indexed loads (wide mode) for following child pointers after a search.

8. Position-Independent Code (Wide-Mode Only)

8.1 Motivation

Standard RV64GC requires two-instruction sequences for every PC-relative access: AUIPC rd, hi20; ADDI rd, rd, lo12 for address materialisation, AUIPC t, hi20; LD rd, lo12(t) for global loads, and AUIPC t, hi20; JALR rd, lo12(t) for long-range direct calls. For a modular system where hot-path code dispatches through GOTs, vtables, or PLT trampolines, these pairs dominate the cross-module instruction count.

The Xcrisp PIC family compresses each of these patterns into a single instruction slot, available only in wide mode (36-bit SRAM fetch). The encoding uses the 0x7F escape mechanism (§8.2) rather than consuming one of the four remaining funct3 slots in custom-2, which keeps that space available for future narrow-mode extensions and gives the PIC instructions a much larger immediate field than they could obtain within standard RV64 opcode layout.

8.2 The 0x7F Escape Mechanism

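The standard RISC-V instruction-length encoding reserves bits[4:2] = 111 (with bits[1:0] = 11) for instructions wider than 32 bits, and the all-ones pattern bits[6:0] = 0x7F falls in the range set aside for the longest formats. In a 32-bit slot, a value of bits[6:0] = 0x7F is therefore unused and traps as illegal-instruction in any standard RV64 implementation.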

FireStorm wide mode (36-bit fetch only) repurposes 0x7F as a wide-mode extension marker:

 35 34                                7 6           0
+---+----------------------------------+-------------+
| F |       28-bit custom payload      |  1111111    |
+---+----------------------------------+-------------+
  • Bit [35] selects between two top-level wide-PIC formats (§8.3, §8.4).
  • Bits [34:7] are the 28-bit instruction payload.
  • Bits [6:0] = 0x7F mark the wide-extension instruction.

In narrow mode (DDR3 fetch), 0x7F remains illegal-instruction — the escape is invisible to standard RV64 code. In wide mode (36-bit SRAM fetch), the decoder sees the marker and dispatches to the wide-extension decode path.

The escape mechanism is a general-purpose lane for wide-mode-only instructions. The current Xcrisp PIC family uses it; future FireStorm extensions (DSP primitives, accelerators, etc.) may reuse the same mechanism by allocating sub-encodings within the 28-bit payload, coordinated through the format bit at [35] and a per-format dispatch sub-field.

8.3 W-Type Format (bit[35] = 0) — PC-Relative

For PC-relative loads, address materialisation, and direct calls:

 35 34         16 15  13 12     7 6           0
+---+-------------+------+--------+-------------+
| 0 |  imm[18:0]  |funct3|   rd   |  1111111    |
+---+-------------+------+--------+-------------+
Field Bits Meaning
format [35] 0 = W-type
imm[18:0] [34:16] 19-bit signed PC-relative offset (scaled per-instruction)
funct3 [15:13] Variant selector
rd [12:7] 6-bit destination register (wide-register-file access)
opcode [6:0] 0x7F

The W-type provides 19 bits of signed immediate, scaled by the natural unit of the operation (byte / halfword / word / dword) to give an effective reach of ±256 KiB to ±2 MiB depending on variant.

funct3  Mnemonic  Scaling     Effective range  Operation
000     LDPC      ×8 (dword)  ±2 MiB           rd = mem64[PC + sext(imm) × 8]
001     LWPC      ×4 (word)   ±1 MiB           rd = sext32(mem32[PC + sext(imm) × 4])
010     LWUPC     ×4 (word)   ±1 MiB           rd = zext32(mem32[PC + sext(imm) × 4])
011     LAPC      ×1 (byte)   ±256 KiB         rd = PC + sext(imm) (address materialisation)
100     JALPC     ×2 (hword)  ±512 KiB         rd = PC + 4; PC = PC + sext(imm) × 2
101     JALXPC    ×8 (dword)  ±32 KiB          Indexed PC-relative jump-and-link; see §9.4
110     X-type    (n/a)       (n/a)            X-type dispatch: indexed memory load; imm field repurposed (§9.2)
111     reserved  (n/a)       (n/a)            illegal-instruction

The scaling factors match each operation's natural alignment: dwords are 8-byte aligned in the global data area, words are 4-byte aligned, byte-precision is needed for &char_data patterns, halfword precision is sufficient for RVC-aware call targets.

The rd field is 6 bits wide, giving direct access to all 64 wide-mode integer registers (x0–x63) without any further extension mechanism. This is a property of the W-type's roomier encoding compared to standard RV64 instructions.

Semantics note. The PC value used for relative addressing is the address of the PIC instruction itself, matching AUIPC convention. The immediate is sign-extended from 19 bits to 64 bits and then scaled.
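
The sign-extension and scaling order in the semantics note can be sketched as a small decoder for the plain PC-relative forms (funct3 = 000 to 100). The function names are illustrative, and the 36-bit slot is assumed to be held in the low bits of a 64-bit integer.

```c
#include <stdint.h>

/* Sign-extend the 19-bit W-type immediate held in slot bits [34:16]. */
static inline int64_t wtype_imm(uint64_t slot)
{
    uint64_t raw = (slot >> 16) & 0x7FFFFu;        /* imm[18:0] */
    return (int64_t)(raw ^ 0x40000u) - 0x40000;    /* sign-extend from bit 18 */
}

/* Byte scale for LDPC, LWPC, LWUPC, LAPC, JALPC (funct3 = 0..4).
 * Returns 0 for the other funct3 values, which are not plain PC-relative forms. */
static inline int64_t wtype_scale(unsigned funct3)
{
    static const int64_t scale[5] = { 8, 4, 4, 1, 2 };
    return funct3 < 5u ? scale[funct3] : 0;
}

/* Target address: PC of the instruction itself plus the sign-extended,
 * then scaled, immediate. */
static inline uint64_t wtype_target(uint64_t pc, uint64_t slot)
{
    unsigned funct3 = (unsigned)((slot >> 13) & 0x7u);
    return pc + (uint64_t)(wtype_imm(slot) * wtype_scale(funct3));
}
```

Extending first and scaling second means the full 19-bit range is usable at every scale, which is where the ±256 KiB to ±2 MiB reach figures come from.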

8.4 WI-Type Format (bit[35] = 1) — Register-Indirect

For indirect calls (vtable dispatch, PLT, function pointer through a structure):

 35 34       23 22  17 16  13 12     7 6           0
+---+-----------+-------+------+--------+-------------+
| 1 | imm[11:0] |  rs1  |funct4|   rd   |  1111111    |
+---+-----------+-------+------+--------+-------------+
Field Bits Meaning
format [35] 1 = WI-type
imm[11:0] [34:23] 12-bit signed byte-precise offset
rs1 [22:17] 6-bit base register
funct4 [16:13] Variant selector
rd [12:7] 6-bit destination register
opcode [6:0] 0x7F

Both rs1 and rd are 6-bit fields, giving full x0–x63 access without an extension nibble.

funct4 Mnemonic Operation Replaces
0000 CALLM rd = PC + 4; PC = mem64[rs1 + sext(imm)] ld t, off(rs1); jalr rd, t
0001 JMPM PC = mem64[rs1 + sext(imm)] (no return-address save) ld t, off(rs1); jr t
0010–1111 reserved

CALLM is the canonical vtable / PLT dispatch instruction. JMPM is the tail-call variant (no return address saved); compilers emit it for goto *fp patterns and for the final hop of trampolines.

The 12-bit byte-precise offset gives ±2 KiB of reach within the indirected table, comfortably covering vtables of up to 256 entries (8 bytes per slot) or PLT-sized dispatch tables.

8.5 Linker Relaxation

Toolchains targeting +xfirestorm emit standard AUIPC + ADDI/LD/JALR pairs against PIC relocations. The linker examines each pair after final layout:

  • If the target address is within the W-type reach for the operation, the linker relaxes the pair into a single W-type instruction (replacing 8 bytes of pair with 4 bytes of PIC instruction plus 4 bytes of NOP, or compacting the section if alignment permits).
  • If the resulting code is in a wide section (.text.wide), relaxation is permitted; in narrow sections (.text, .text.crisp), the standard pair is kept.
  • If the target is out of range, the pair is left as-is.

This means existing PIC-aware code recompiled with +xfirestorm and placed in wide sections automatically gains the density and performance wins, without source-level changes. The linker's PIC relaxation pass operates per-section after layout, similar to the existing RISC-V relaxation of AUIPC+JALR to JAL.
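
The in-range test at the heart of the relaxation decision might look like the sketch below. It is not taken from any existing linker; the function name is hypothetical, and it assumes the linker already knows the byte distance from the first instruction of the pair to the target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when `byte_offset` can be expressed by a W-type instruction
 * whose immediate is scaled by `scale` bytes (8 for LDPC, 4 for LWPC/LWUPC,
 * 1 for LAPC, 2 for JALPC).
 * The offset must be a multiple of the scale and fit a signed 19-bit count. */
static inline bool wtype_reaches(int64_t byte_offset, int64_t scale)
{
    if (scale <= 0 || byte_offset % scale != 0)
        return false;
    int64_t units = byte_offset / scale;
    return units >= -(INT64_C(1) << 18) && units < (INT64_C(1) << 18);
}
```

A pair whose target fails this test, or that sits in a narrow section, is simply left as the standard two-instruction sequence.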

8.6 Wide-Mode-Only Restriction

The 0x7F escape, and therefore the entire PIC instruction family, is unavailable in narrow mode. A standard RV64 implementation receiving a 0x7F instruction will trap as illegal-instruction (the spec-defined behaviour for unallocated 32-bit opcodes), and FireStorm's narrow-mode decoder does the same.

Consequences:

  • PIC instructions live only in .text.wide. A linker that attempts to place a W-type or WI-type instruction in a narrow section is in error; toolchains must enforce this.
  • DDR3-resident code keeps using standard AUIPC + ADDI/LD/JALR. Modular dynamic-loaded code where modules are in DDR3 is unaffected by this extension and continues to work exactly as on any RV64 implementation.
  • Module trampolines benefit asymmetrically. A trampoline placed in 36-bit SRAM that bridges DDR3 modules can use CALLM to dispatch into the target module in one instruction, while the modules themselves remain narrow-PIC. This is a clean fit for AntOS-style module dispatch.

8.7 Wide Register Extension

Unlike Xcrisp instructions in standard RV64 opcode space (§3–§6), the PIC family does not use the extension-nibble scheme. The W-type and WI-type encodings already provide 6-bit register fields natively (rd at [12:7], rs1 at [22:17]), addressing x0–x63 directly.

The 36-bit fetch still carries 4 bits beyond the standard 32-bit instruction word, but for PIC instructions those bits are entirely consumed by the format bit [35] and the larger immediate/funct4 fields; there are no spare bits available for hints or sub-encoding.

8.8 Worked Examples

LDPC x40, .Lglobal_table — load a global pointer from a table 1024 bytes ahead:

PC offset = +1024 bytes = +128 dwords = imm 0x080
funct3 = 000 (LDPC)
rd     = 40   (6-bit field = 0b101000)
imm    = 0x00080
format = 0
Encoding bits [35:0]: 0 0000000000010000000 000 101000 1111111
                    = 0x00080147F (36-bit slot)

LAPC x12, .Lstring_const — materialise the address of a string constant 7 bytes ahead (byte-precise):

imm    = 7
funct3 = 011 (LAPC)
rd     = 12
format = 0
Encoding: 0 0000000000000000111 011 001100 1111111

CALLM x1, 24(x10) — vtable dispatch: load function pointer at offset 24 from x10, call with ra = x1:

imm    = 24
rs1    = 10
funct4 = 0000 (CALLM)
rd     = 1
format = 1
Encoding: 1 000000011000 001010 0000 000001 1111111

JALPC x1, .Lfar_function — direct call to a target 200 KiB ahead, well within JALPC's ±512 KiB range:

PC offset = +204800 bytes / 2 = 102400 = imm 0x19000
funct3 = 100 (JALPC)
rd     = 1   (ra)
format = 0
Encoding: 0 0011001000000000000 100 000001 1111111
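
The worked examples above can be reproduced with two small packing helpers, one per format. This is a sketch for cross-checking the bit patterns; the function names are illustrative, and the 36-bit slot is returned in the low bits of a 64-bit integer.

```c
#include <stdint.h>

#define WIDE_ESCAPE 0x7Fu   /* bits [6:0] of every wide-extension slot */

/* W-type: format 0 at [35], imm[18:0] at [34:16], funct3 at [15:13], rd at [12:7].
 * imm19 is the already-scaled immediate (for example, a dword count for LDPC). */
static inline uint64_t encode_w(int32_t imm19, unsigned funct3, unsigned rd)
{
    return ((uint64_t)((uint32_t)imm19 & 0x7FFFFu) << 16) |
           ((uint64_t)(funct3 & 0x7u) << 13) |
           ((uint64_t)(rd & 0x3Fu) << 7) |
           WIDE_ESCAPE;
}

/* WI-type: format 1 at [35], imm[11:0] at [34:23], rs1 at [22:17],
 * funct4 at [16:13], rd at [12:7]. */
static inline uint64_t encode_wi(int32_t imm12, unsigned rs1,
                                 unsigned funct4, unsigned rd)
{
    return (UINT64_C(1) << 35) |
           ((uint64_t)((uint32_t)imm12 & 0xFFFu) << 23) |
           ((uint64_t)(rs1 & 0x3Fu) << 17) |
           ((uint64_t)(funct4 & 0xFu) << 13) |
           ((uint64_t)(rd & 0x3Fu) << 7) |
           WIDE_ESCAPE;
}
```

For instance, `encode_w(128, 0, 40)` corresponds to the LDPC example and `encode_wi(24, 10, 0, 1)` to the CALLM example.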

8.9 Compiler Patterns

The compiler should recognise and emit PIC family instructions for:

Source pattern Emitted (wide section, target in range)
&global_var LAPC rd, global_var
global_long_var LDPC rd, global_long_var
global_int_var (signed) LWPC rd, global_int_var
global_uint_var LWUPC rd, global_uint_var
extern_function() (direct call, in range) JALPC ra, extern_function
vtbl->method() (virtual call) CALLM ra, off(vtbl)
goto *fp (computed goto, indirect) JMPM zero, 0(fp)
tail_call_thunk() (tail call through pointer) JMPM zero, off(rs1)

Out-of-range targets fall back to the standard AUIPC + ADDI/LD/JALR pair, which the linker leaves un-relaxed.

In wide mode, Xcrisp instructions participate in the standard extension nibble scheme. The nibble bits map to register fields according to which format the instruction uses:

Family                   Format  rd ext                     rs1 ext  rs2 ext  rs3 ext (op-store)   Spare
Auto-inc loads (§3)      I-type  bit[32]                    bit[33]  -        -                    bits[35:34]
Auto-inc stores (§4)     S-type  -                          bit[33]  bit[34]  -                    bit[35], bit[32]
Load-op (§5.2)           R-type  bit[32]                    bit[33]  bit[34]  -                    bit[35]
Op-store (§5.3)          R-type  -                          bit[33]  bit[34]  bit[32] (rd-as-rs3)  bit[35]
Load-op-store (§5.4)     R-type  bit[32] (rd-as-dest-addr)  bit[33]  bit[34]  -                    bit[35]
Block memory (§5.5)      R-type  bit[32]                    bit[33]  bit[34]  -                    bit[35]
Compare-mem-branch (§6)  B-type  -                          bit[33]  bit[34]  -                    bit[35], bit[32]

(For op-store, bit[32] extends the field interpreted as rs3 rather than a destination. For load-op-store, the same bit[32] extends the destination memory base address; the field is read from the register file but the register itself is not written.)

In narrow mode, all register operands are restricted to x0–x31 / f0–f31 as usual. The nibble bits do not exist — the instruction occupies a standard 32-bit slot in DDR3, and the assembler rejects any operand naming x32–x63.


9. Indexed Addressing (Wide-Mode Only)

9.1 Motivation

Standard RV64GC has no scaled-indexed addressing mode. Every non-sequential array access requires an explicit shift-and-add sequence:

slli   t0, idx, 2          ; idx * 4 (byte offset for 32-bit elements)
add    t0, base, t0
lw     val, 0(t0)

The Zba extension's sh1add, sh2add, sh3add collapse the shift-add pair into one instruction for the common ×2, ×4, ×8 scales, reducing the sequence to:

sh2add t0, idx, base
lw     val, 0(t0)

This is good for one-dimensional arrays, but Zba does not directly cover the load itself, and its scales stop at ×8. Multi-dimensional access and larger element strides remain multi-instruction sequences:

slli   t0, row, 6          ; row * 64 (row stride for int matrix[16][16])
add    t0, base, t0
slli   t1, col, 2
add    t0, t0, t1
lw     val, 0(t0)

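Five instructions for a 2D array access. Zba's sh2add can fuse the column slli/add pair, but the ×64 row stride is beyond its ×8 limit and still needs an explicit slli and add.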

The Xcrisp indexed addressing family (this section) provides single-instruction load + scale + add for the full set of integer widths, with scales ×1 through ×128. The primary workloads served are:

  • Hash table probes and sparse array access — non-sequential reads where the index doesn't fit a loop induction pattern.
  • Jump table dispatch for switch statements, state machines, and interpreters, covered by JALXPC (§9.4) rather than the load family.
  • 2D matrix and struct-array access where stride is a power of two larger than 8.
  • Generated code (JIT, dynamic linkers, byte-code interpreters) that resolves addresses at runtime through tables.

The family is wide-mode only because it uses encoding bits in the 0x7F escape (§8.2) that do not exist in narrow-mode 32-bit fetches. Narrow-mode code continues to use Zba shift-add sequences. Code that wants the indexed forms places itself in .text.wide.

9.2 X-Type Format

The indexed-load instructions occupy a sub-encoding of the W-type format (§8.3) gated by funct3 = 110. When the W-type decoder sees funct3 = 110, the 19-bit imm field is reinterpreted as the X-type payload:

 35 34   29 28   23 22  20 19  17 16   15  13 12     7 6           0
+---+------+------+------+------+----+------+--------+-------------+
| 0 | rs1  | rs2  |scale | w+s  | r  | 110  |   rd   |  1111111    |
+---+------+------+------+------+----+------+--------+-------------+
Field Bits Meaning
format [35] 0 = W-type family
rs1 [34:29] 6-bit base register (x0–x63)
rs2 [28:23] 6-bit index register (x0–x63)
scale [22:20] 3-bit scale selector (×1, ×2, ×4, ×8, ×16, ×32, ×64, ×128)
w+s [19:17] 3-bit width-and-sign selector (see §9.3)
r [16] reserved, must be zero in v0.1
funct3 [15:13] 110 (X-type dispatch within W-type)
rd [12:7] 6-bit destination register
opcode [6:0] 0x7F

The effective address is computed as:

addr = rs1 + zext64(rs2) × (1 << scale_log2)

where scale_log2 is the value of the scale field (0–7). The full 64-bit value of rs2 is treated as an unsigned index, since array indices are unsigned by convention; a narrower signed index must be sign-extended into the index register by the user before the instruction executes.

The scale set covers:

  • ×1: byte-precise random access (rare; mostly for symmetry).
  • ×2, ×4, ×8: matches Zba's sh1add/sh2add/sh3add scales and the natural element sizes for halfword/word/dword arrays.
  • ×16: 16-byte structures (a common C struct stride for AoS data).
  • ×32: 32-byte cache-line-aligned records.
  • ×64, ×128: row strides for 16- and 32-wide matrix layouts.
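
The effective-address rule and the X-type field layout can be sketched together as follows. The function names are illustrative; the slot is returned in the low 36 bits of a 64-bit integer, and address arithmetic is assumed to wrap modulo 2^64.

```c
#include <stdint.h>

/* addr = rs1 + rs2 × (1 << scale_log2), with rs2 taken as an unsigned
 * 64-bit index and scale_log2 limited to the 3-bit field range 0..7. */
static inline uint64_t xtype_addr(uint64_t rs1, uint64_t rs2, unsigned scale_log2)
{
    return rs1 + (rs2 << (scale_log2 & 7u));
}

/* Pack an X-type slot: format 0 at [35], rs1 [34:29], rs2 [28:23],
 * scale [22:20], w+s [19:17], reserved bit [16] = 0, funct3 = 110 at [15:13],
 * rd [12:7], opcode 0x7F. */
static inline uint64_t encode_x(unsigned rs1, unsigned rs2, unsigned scale_log2,
                                unsigned ws, unsigned rd)
{
    return ((uint64_t)(rs1 & 0x3Fu) << 29) |
           ((uint64_t)(rs2 & 0x3Fu) << 23) |
           ((uint64_t)(scale_log2 & 0x7u) << 20) |
           ((uint64_t)(ws & 0x7u) << 17) |
           ((uint64_t)0x6u << 13) |
           ((uint64_t)(rd & 0x3Fu) << 7) |
           0x7Fu;
}
```

As a usage example, a word load from base x10 with index x11 at scale ×4 would be `encode_x(10, 11, 2, 4, rd)`, and `xtype_addr(base, 3, 2)` gives `base + 12`.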

9.3 Indexed Loads

The width-and-sign field selects the access:

w+s Mnemonic Width Sign Operation
000 LBX byte signed rd = sext8(mem8[addr])
001 LBUX byte unsigned rd = zext8(mem8[addr])
010 LHX half signed rd = sext16(mem16[addr])
011 LHUX half unsigned rd = zext16(mem16[addr])
100 LWX word signed rd = sext32(mem32[addr])
101 LWUX word unsigned rd = zext32(mem32[addr])
110 LDX dword (n/a) rd = mem64[addr]
111 reserved illegal-instruction

Each instruction occupies a single wide-mode instruction slot and completes in the same number of cycles as a standard load (2-cycle latency, one per cycle in steady state on the reference pipeline).

The reserved w+s = 111 slot is held for a possible future 128-bit load (LQX) or a sign-extended halfword-into-32-bit form. No specific allocation in v0.1.

Assembler Syntax

LWX    rd, (rs1, rs2, scale)            ; canonical
LWX    rd, scale(rs1, rs2)              ; 68k-style alternative
LWX    rd, [rs1 + rs2 * scale_factor]   ; verbose form (scale_factor = 1, 2, 4, ..., 128)

The assembler accepts any of these forms and normalises to the canonical representation.

9.4 JALXPC — Indexed PC-Relative Jump

For switch dispatch and other table-indexed jumps, JALXPC reads a target address from a PC-relative jump table and transfers control to it. Encoded as W-type funct3 = 101:

 35 34         29 28          16 15  13 12     7 6           0
+---+------------+--------------+------+--------+-------------+
| 0 |    rs2     | imm[12:0]    | 101  |   rd   |  1111111    |
+---+------------+--------------+------+--------+-------------+
Field Bits Meaning
format [35] 0 = W-type family
rs2 [34:29] 6-bit index register (zero-extended)
imm[12:0] [28:16] 13-bit signed PC-relative offset, scaled ×8 (±32 KiB)
funct3 [15:13] 101
rd [12:7] Link register; x0 for plain jump, otherwise receives PC+4
opcode [6:0] 0x7F

Operation:

table_addr = PC + sext(imm) × 8
target     = mem64[table_addr + zext64(rs2) × 8]
if (rd != x0):
    rd = PC + 4
PC = target

When rd = x0, the assembler emits the mnemonic JMPXPC (no-link form). When rd != x0, the mnemonic is JALXPC and the link is captured.

The 13-bit ×8-scaled immediate gives ±32 KiB of reach from the dispatching instruction to the base of the jump table. Switch tables are typically allocated in .rodata near the function emitting the dispatch, and 32 KiB comfortably covers most cases; for very large code modules with distant tables, the compiler falls back to LAPC + LDX + JALR.

9.5 Examples

9.5.1 Single-Dimensional Array Load

int load_elem(int *arr, size_t idx) {
    return arr[idx];
}

Standard RV64GC (Zba):

load_elem:
    sh2add  t0, a1, a0
    lw      a0, 0(t0)
    ret

3 instructions.

FireStorm (wide mode, X-type):

load_elem:
    LWX    a0, (a0, a1, ×4)         ; a0 = sext32(mem32[a0 + a1 * 4])
    ret

2 instructions. One fewer instruction, no temporary register consumed.

9.5.2 2D Matrix Access

int load_cell(int matrix[16][16], size_t row, size_t col) {
    return matrix[row][col];
}

The row stride is 16 ints = 64 bytes; the column scale is ×4.

Standard RV64GC:

load_cell:
    slli    t0, a1, 6                ; row * 64
    add     t0, a0, t0               ; &matrix[row][0]
    sh2add  t0, a2, t0               ; + col * 4
    lw      a0, 0(t0)                ; load
    ret

5 instructions (using Zba sh2add for the inner stride).

FireStorm (wide mode, X-type final column step):

load_cell:
    slli    t0, a1, 6                ; row * 64 (no Zba scale for 64; must use slli)
    add     t0, a0, t0               ; &matrix[row][0]
    LWX     a0, (t0, a2, ×4)         ; a0 = sext32(mem32[t0 + col * 4])
    ret

4 instructions. One fewer instruction than Zba; the X-type fuses the column scale-and-load.

For 2D access X-type captures only part of the win because there is no indexed address-materialise form (no "LAX") — the row stride still requires an explicit slli/add pair to compute the row base. Adding LAX is a v0.2 candidate (see §13). With LAX:

load_cell:
    LAX     t0, (a0, a1, ×64)        ; t0 = a0 + row * 64        — hypothetical v0.2
    LWX     a0, (t0, a2, ×4)         ; load column-scaled         — single instruction
    ret

would bring this to 3 instructions. v0.1 stops at the 4-instruction form.

9.5.3 Switch Statement Dispatch

int dispatch(int op, int arg) {
    switch (op) {
        case 0: return op_add(arg);
        case 1: return op_sub(arg);
        case 2: return op_mul(arg);
        case 3: return op_div(arg);
        default: return 0;
    }
}

Compiler emits a jump table and dispatches via it (assuming bounds-checked op).

Standard RV64GC (PIC, no Xcrisp PIC):

dispatch:
    li      t0, 4
    bgeu    a0, t0, .Ldefault
.La:
    auipc   t0, %pcrel_hi(jump_table)
    addi    t0, t0, %pcrel_lo(.La)
    sh3add  t0, a0, t0
    ld      t0, 0(t0)
    jr      t0
.Ldefault:
    li      a0, 0
    ret

The dispatch path: 7 instructions total (bounds check + 5 for the table jump).

FireStorm with JALXPC:

dispatch:
    li      t0, 4
    bgeu    a0, t0, .Ldefault
    JMPXPC  a0, jump_table              ; PC = mem64[jump_table + a0 * 8]
.Ldefault:
    li      a0, 0
    ret

Dispatch path: 3 instructions. ~57% reduction for the dispatch step, plus all the entries in jump_table can themselves use JALPC for the inner-handler relative jumps.

This pattern is hot in interpreters (Z-machine, bytecode VMs), parsers, OS syscall dispatch, and any code with a high-fan-out conditional. For the AntOS syscall path, cutting the dispatch sequence from 7 instructions to 3 is a measurable kernel-entry latency improvement.
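
At the C level, the bounds check plus indexed jump is equivalent to a guarded function-pointer table. The sketch below is only a model of that data flow: the handler bodies are hypothetical (the source names op_add and friends but does not define them), and dispatch_model calls through the table where the compiled switch would jump.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_t)(int);

/* Hypothetical handler bodies; the text only names these functions. */
static int op_add(int arg) { return arg + 1; }
static int op_sub(int arg) { return arg - 1; }
static int op_mul(int arg) { return arg * 2; }
static int op_div(int arg) { return arg / 2; }

/* One pointer-sized entry per case; 8 bytes on RV64, hence the a0 * 8 scaling. */
static const handler_t jump_table[] = { op_add, op_sub, op_mul, op_div };

static int dispatch_model(int op, int arg)
{
    size_t n = sizeof jump_table / sizeof jump_table[0];

    /* One unsigned compare rejects both op < 0 and op >= n, like bgeu. */
    if ((size_t)(unsigned)op >= n)
        return 0;                   /* .Ldefault */

    /* Indexed load of the target followed by the transfer of control. */
    return jump_table[op](arg);
}
```

For example, dispatch_model(2, 10) selects op_mul, and dispatch_model(-1, 10) falls through to the default because the cast makes a negative op a large unsigned value.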

8.5.4 Hash Table Probe

entry_t *probe(entry_t *table, uint64_t hash, uint64_t mask) {
    return &table[hash & mask];
}

Assuming entry_t is 16 bytes:

Standard RV64GC:

probe:
    and     t0, a1, a2
    slli    t0, t0, 4               ; *16; no Zba ×16
    add     a0, a0, t0
    ret

4 instructions.

FireStorm:

probe:
    and     t0, a1, a2
    ; need an "LAX" address-compute, which doesn't exist in v0.1.
    ; Fallback: compute address explicitly then load if needed.
    slli    t0, t0, 4
    add     a0, a0, t0
    ret

Same 4 instructions — X-type doesn't help if we want the address rather than the loaded value.

If the caller does entry->field right after, the load can be folded:

int probe_load(entry_t *table, uint64_t hash, uint64_t mask) {
    return table[hash & mask].first_field;     /* assume first_field is int at offset 0 */
}
probe_load:
    and     t0, a1, a2
    LWX     a0, (a0, t0, ×16)        ; a0 = sext32(mem32[a0 + t0 * 16])
    ret

3 instructions versus 5 for standard (and/slli/add/lw/ret). Zba does not shorten the standard sequence here: it has no sh4add, so the slli is mandatory.

1–2 instructions saved per probe, depending on Zba availability and exact element size.
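
A compilable version of the two probe functions makes the 16-byte assumption explicit. The layout of entry_t below is hypothetical; only first_field being an int at offset 0 comes from the text, and the pad and key members exist solely to reach 16 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte entry. Only first_field at offset 0 is from the text. */
typedef struct {
    int32_t  first_field;
    uint32_t pad;
    uint64_t key;
} entry_t;

/* The x16 scale in the LWX form only applies if this holds. */
_Static_assert(sizeof(entry_t) == 16, "entry_t must be 16 bytes");

/* Address only: and + slli + add, no X-type benefit in v0.1. */
static entry_t *probe(entry_t *table, uint64_t hash, uint64_t mask)
{
    return &table[hash & mask];
}

/* Address and load together: and + LWX with a x16 scale. */
static int probe_load(entry_t *table, uint64_t hash, uint64_t mask)
{
    return table[hash & mask].first_field;
}
```

The static assertion is the useful part: if the entry grows to a size that is not a supported scale, the fused form no longer applies and the build fails instead of silently falling back.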

9.6 Compiler Patterns

The compiler emits X-type instructions for:

  1. Loop-variant array access (not loop-induction): a[idx] where idx is computed inside the loop body but a is loop-invariant. Loop-induction patterns (a[i++]) prefer LWPI / LDPI for the auto-inc.
  2. Hash table and dictionary probes where the entry size is power-of-two ≤ 128 bytes.
  3. Lookup table access with scaled-index addressing (sine tables, cosine tables, log/exp LUTs).
  4. switch statement dispatch via JALXPC when the jump table is within ±32 KiB.
  5. Function pointer table dispatch via JALXPC when the table is PC-resident.

The compiler does not emit X-type for:

  • Sequential array iteration (LWPI / LDPI families do this better — single-cycle, no index register needed).
  • 2D matrix access with non-power-of-two strides (the indexed form doesn't apply; falls back to mul/shift + add).
  • Narrow-mode code (the entire 0x7F escape is wide-mode only).

9.7 Wide-Mode-Only Restriction

The X-type and JALXPC instructions live in the 0x7F escape (§8.2). They are undefined behaviour in narrow mode and the assembler rejects them in narrow sections. Code that targets both modes must provide a narrow-mode fallback using Zba shift-add + standard load sequences.

The wide-register-file access (x0–x63) is available natively because all register fields in the X-type and JALXPC encodings are 6 bits wide.

9.8 Indexed Stores (Deferred to v0.2)

Indexed stores (SWX rs2_index, val, base) face an encoding challenge: the destination is memory, not a register, so the X-type's rd field has no natural use. Repurposing rd as the source value (i.e., "S-X-type") is straightforward, but the instruction then reads three registers (base, index, and value), which adds a third pipeline read port for the indexed form and raises the implementation cost.

The use case for indexed stores is scatter writes (e.g., bucket-sort placement, hash table insertion, sparse vector update). These are real patterns but less hot than the gather (load) case in the workloads FireStorm targets. v0.1 omits indexed stores; v0.2 will revisit based on workload data.

For now, scatter writes use the standard Zba shift-add followed by a store:

sh2add  t0, idx, base
sw      val, 0(t0)
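
In C terms the scatter pattern looks like the loop below. It is a minimal sketch: scatter_w and its parameters are names chosen for the example, and the bounds assertion is a model-level check, not something the instruction sequence performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scatter: out[idx[i]] = val[i].
 * Each store needs a base, an index and a value, which is the
 * three-source shape an indexed store would have to encode. */
static void scatter_w(int32_t *out, size_t out_len,
                      const uint32_t *idx, const int32_t *val, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < out_len);   /* model-level bounds check only */
        out[idx[i]] = val[i];       /* v0.1: sh2add t0, idx, base ; sw val, 0(t0) */
    }
}
```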

10. ABI and Calling-Convention Interaction

Xcrisp instructions do not change the calling convention. They are pure local operations within a function:

  • Auto-inc forms update rs1 (a register); the caller-saved or callee-saved status of rs1 is unchanged from the standard RV64 ABI.
  • Load-op, op-store, load-op-store, and block memory write back to standard registers (or memory); same ABI rules apply.
  • Load-op-store writes only to memory; the rd field names a register that is read (as a base address), not written. ABI roles of pointer registers are unaffected.
  • Block memory instructions are interruptible and may take many cycles, but they hold no hidden architectural state — only the named registers carry progress. Function-call semantics are unaffected.

A function compiled with +xcrisp is fully ABI-compatible with one compiled without; both observe the standard lp64d calling convention. The only externally visible consequence of +xcrisp is that the emitted code may contain Xcrisp instructions, which require an Xcrisp-aware decoder to execute.

A vanilla RV64GC implementation receiving Xcrisp code will trap on the first Xcrisp instruction (illegal-instruction in the custom opcode space). Mixed-mode deployment requires either runtime feature gating (test mxcrisp CSR before entering an Xcrisp code path) or build-time selection (separate object files).
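
Runtime feature gating can be sketched as a one-time function-pointer selection. Everything named here is an assumption for illustration: copy_fn, copy_baseline, copy_xcrisp and select_copy are hypothetical, and the mxcrisp value is passed in as a parameter because the CSR is machine-mode and must be read by whatever privileged hook the platform provides. The Xcrisp routine is a stand-in that forwards to memcpy so the sketch runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline path: safe on any RV64GC implementation. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Stand-in for a routine built with +xcrisp in a separate object file.
 * It forwards to memcpy here so the sketch is runnable anywhere. */
static void *copy_xcrisp(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Choose once at start-up. A zero mxcrisp value means Xcrisp is absent,
 * so no Xcrisp instruction is ever reached on such a variant. */
static copy_fn select_copy(uint64_t mxcrisp)
{
    return mxcrisp != 0 ? copy_xcrisp : copy_baseline;
}
```

Selecting once and calling through the pointer keeps the CSR test off the hot path; build-time selection removes even the indirect call at the cost of shipping two binaries.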


11. Compiler and Toolchain Integration

11.1 Target Flags

The +xcrisp target feature, alone or as part of +xfirestorm (= +xwide,+xcrisp), enables Xcrisp emission. With +xcrisp alone, the compiler emits Xcrisp instructions in standard .text (DDR3); with +xfirestorm, functions marked or detected as wide-eligible are placed in .text.wide (SRAM) and use both Xcrisp and the wide register file.

Per-function annotations: __attribute__((target("xcrisp"))) enables Xcrisp emission for a single function regardless of global flags.

11.2 Auto-Vectorization Patterns

The compiler should recognise the following C patterns and emit the corresponding Xcrisp sequences:

Source pattern                                    Emitted sequence
*p++ = v (post-inc store)                         SDPI v, k(p) for the natural element width
v = *p++ (post-inc load)                          LDPI v, k(p)
*--p = v (pre-dec store)                          SDPD v, k(p)
sum += *p (accumulator)                           LDADD sum, (p), sum
*p &= mask (mask-in-place)                        MMDAND [p], [p], mask (load-op-store, true in-place)
*p ^= v (xor-in-place)                            MMDXOR [p], [p], v
dst[i] = src[i] + bias (element-wise transform)   MMWADD [d], [s], bias then advance pointers
dst[i] = src[i] << k (scaling pass)               MMWSLL [d], [s], k
while (*p != term) p++                            LBPI v, 1(p); BNEM v, (term_reg), loop
memcpy(d, s, n)                                   BMCPY d, s, n
memset(d, c, n)                                   BMSET d, c, n

The accumulator and string-scan patterns yield the largest density wins per inner-loop iteration. The load-op-store patterns yield the largest performance wins on memory-bound kernels (audio, image, vector) where the inner loop is dst[i] = f(src[i], constant): each iteration's load → ALU → store collapses into one instruction with no register-file traffic.
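
For reference, the source-level shapes the table refers to look like this in ordinary C. The function names are chosen for the example, and nothing here guarantees which instructions a given compiler will select; the comments only name the table row each loop corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulator row: sum += *p, with the pointer post-incremented. */
static int64_t sum_i64(const int64_t *p, size_t n)
{
    int64_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}

/* Element-wise transform row: dst[i] = src[i] + bias. */
static void add_bias(int32_t *dst, const int32_t *src, size_t n, int32_t bias)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + bias;
}

/* Mask-in-place row: *p &= mask. */
static void mask_in_place(uint64_t *p, size_t n, uint64_t mask)
{
    for (size_t i = 0; i < n; i++)
        p[i] &= mask;
}

/* Sentinel-scan row: while (*p != term) p++.
 * The caller must guarantee that term occurs in the buffer. */
static const char *scan_to(const char *p, char term)
{
    while (*p != term)
        p++;
    return p;
}
```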

11.3 Inline Assembly Constraints

GCC/Clang inline-assembly constraints for Xcrisp:

  • =r (register destination), r (register operand): unchanged from base RV64.
  • Q: memory operand allowed for Xcrisp memory-operand instructions; expanded to (rs1) form without offset.
  • A new Xc constraint may be added for "this operand must be a register addressable by Xcrisp" — in practice equivalent to r since all Xcrisp register operands are unrestricted within their bank.

Compilers may emit Xcrisp instructions even when not explicitly requested via inline asm, if the C-level operation matches one of the patterns in §11.2 and +xcrisp is enabled.


12. Implementation Guidance

12.1 Pipeline Considerations

The fused instructions are designed to deliver both code density and performance wins. An implementation that captures only the density side leaves significant performance on the table; the notes below identify the microarchitectural paths that turn each fused form into a real cycle saving.

  • Load-op fusion is most valuable when the implementation forwards the load result directly into the ALU input without writing the intermediate to the register file. A two-cycle issue (load-result available at cycle +1, ALU op at cycle +2) is the typical pattern; the fused instruction occupies one issue slot but two execution slots, which the implementation may schedule freely. Win sources: front-end throughput (one decode instead of two), register-file write port freed for an independent op, one fewer live temporary for the allocator, smaller prefetch-buffer footprint.

  • Op-store fusion delivers latency and density wins simultaneously. The ALU result should be forwarded directly into the store-buffer entry, bypassing the register file entirely — the temporary never exists architecturally and need not exist physically. On a simple in-order pipeline this saves one cycle versus separate add; sw (no register-file write-then-read on the temp); on a multi-issue pipeline it relieves write-port pressure and frees an issue slot. The allocator also wins one fewer live temporary per pattern, which on a tight basic block may avoid a spill. The store-buffer entry should be allocated at decode time so the ALU result writes directly into it.

  • Load-op-store fusion is the most aggressive Xcrisp instruction and the one with the largest possible performance win. The instruction does the work of three (lw; add/op; sw) in one instruction-fetch slot, with no architectural temporary, no register-file read/write of the intermediate value, and one fewer architectural register live across the operation. Recommended implementation: a three-stage internal pipeline (load → ALU → store) that allows back-to-back load-op-store instructions to overlap and achieve one-per-cycle throughput in steady state. The store buffer entry should be allocated at decode time so the ALU forwards directly into it (same path as op-store), and the load result should be forwarded into the ALU input without ever being written to a physical register (same path as load-op). A minimal implementation that decodes load-op-store into three sequential micro-ops still gets the density win and avoids the architectural-temporary write; the performance win scales with how much of the pipeline overlap is implemented.

  • Auto-increment loads write two registers (rd and the updated rs1); auto-increment stores write only the updated rs1, alongside the memory write. A single-write-port register file must serialise the two-register update of the load forms across two cycles; a two-write-port file (already required by some standard extensions) takes it in stride. The density win is unconditional; the performance win arrives when the second write port is available, otherwise it's a wash on cycles but still a win on fetch/decode bandwidth.

  • Compare-mem-branch issues a load and a branch in one instruction. The load-latency critical path is unchanged from a separate load + branch (the branch must still wait for the loaded value), so the raw cycle count for this single pattern is identical. The wins come from the surrounding context: one fewer issue slot consumed, one fewer register-file read (the loaded value is consumed directly by the compare unit, never written back), one fewer architectural register live across the load (helpful for register pressure in tight loops), and the smaller prefetch-buffer footprint. On loops that are fetch-bound or register-pressured — which sentinel scans and table walks often are — this is a real throughput win, not just a code-size one.

  • Block memory instructions are intended as DMA-like primitives. A minimal implementation iterates byte-by-byte (slow but correct); a serious implementation uses wider transfers when alignment permits and the operands don't overlap unsafely. A well-tuned BMCPY should approach one DRAM-burst per cycle of throughput on aligned operands, dwarfing any libc inline-expanded loop.

12.2 Trap Restart for Block Operations

A block-memory instruction trapped mid-execution must:

  1. Update rd, rs1, rs2 to reflect progress (bytes completed).
  2. Leave mepc pointing at the block instruction (not past it).
  3. Discard any uncommitted internal transfer state.

The trap handler does nothing special — mret returns to the same mepc, the instruction re-executes with the partially-advanced register state, and the byte-loop resumes from where it stopped. A buggy implementation that fails to update the registers atomically with the memory write will cause silent data corruption; this is the single most important invariant for block-memory correctness.
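
The restart invariant can be illustrated with a small software model. This is a sketch under stated assumptions: bmcpy_regs and bmcpy_step are hypothetical names, the three fields stand for the d, s and n operands of BMCPY, and the pointers are assumed to advance while the count decrements as bytes complete. The budget parameter simulates a trap arriving after a given number of bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural state named by BMCPY d, s, n.
 * Assumption: d and s advance and n counts down as bytes complete. */
typedef struct {
    uint8_t       *d;   /* destination pointer */
    const uint8_t *s;   /* source pointer */
    size_t         n;   /* bytes remaining */
} bmcpy_regs;

/* Run until done, or until `budget` bytes have moved and a trap is taken.
 * Returns 1 when the instruction retires, 0 when it must be re-executed
 * (mepc still points at it). Progress lives only in the registers. */
static int bmcpy_step(bmcpy_regs *r, size_t budget)
{
    while (r->n != 0) {
        if (budget == 0)
            return 0;       /* trap taken mid-copy */
        *r->d = *r->s;      /* memory write ... */
        r->d++;             /* ... and register progress, kept in step */
        r->s++;
        r->n--;
        budget--;
    }
    return 1;
}
```

Calling bmcpy_step again with the same register structure models mret followed by re-execution: the copy resumes at the first byte not yet written, with no state beyond the three registers.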

12.3 CSR Allocation

CSR      Address            Type       Description
mxcrisp  0xFC1 (suggested)  M-mode RO  Xcrisp version & feature bits

Bit layout of mxcrisp (proposed):

Bits     Field            Meaning
[0]      PRESENT          1 if Xcrisp implemented
[7:1]    VERSION          Xcrisp version (1 = v0.1)
[8]      HAS_BMOPS        1 if synchronous block memory ops (BMCPY/BMSET) implemented
[9]      HAS_MMOPS        1 if load-op-store ops implemented
[10]     HAS_PIC          1 if PC-relative PIC instructions implemented (wide-mode only)
[11]     HAS_DMA          1 if asynchronous DMA ops (DMACPY/DMASET) implemented
[15:12]  DMA_QUEUE_DEPTH  log₂ of DMA queue depth; 0 if no DMA (0010 = 4 entries, 0011 = 8 entries)
[16]     HAS_INDEXED      1 if X-type indexed addressing implemented (wide-mode only)
[63:17]  reserved

A reduced FireStorm variant lacking block-memory hardware may set HAS_BMOPS = 0 while keeping the rest of Xcrisp. Variants lacking load-op-store, PIC, DMA, or indexed-load support clear the corresponding bits similarly. A variant implementing synchronous block ops but no DMA engine sets HAS_BMOPS = 1, HAS_DMA = 0, and DMA_QUEUE_DEPTH = 0.
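
A runtime library could unpack the proposed layout along the following lines. The structure and function names (xcrisp_features, bits, mxcrisp_decode) are hypothetical, and the sketch follows the proposed bit positions in the table above, which may change before the layout is finalised.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the proposed mxcrisp layout. */
typedef struct {
    unsigned present, version;
    unsigned has_bmops, has_mmops, has_pic, has_dma, has_indexed;
    unsigned dma_queue_entries;   /* 0 when no DMA engine is present */
} xcrisp_features;

/* Extract bits [hi:lo] of v; the fields used here are at most 7 bits wide. */
static unsigned bits(uint64_t v, unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi - lo < 32);
    return (unsigned)((v >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

static xcrisp_features mxcrisp_decode(uint64_t csr)
{
    xcrisp_features f;
    f.present     = bits(csr, 0, 0);
    f.version     = bits(csr, 7, 1);
    f.has_bmops   = bits(csr, 8, 8);
    f.has_mmops   = bits(csr, 9, 9);
    f.has_pic     = bits(csr, 10, 10);
    f.has_dma     = bits(csr, 11, 11);
    f.has_indexed = bits(csr, 16, 16);
    /* DMA_QUEUE_DEPTH is log2 of the depth and only meaningful with HAS_DMA. */
    f.dma_queue_entries = f.has_dma ? 1u << bits(csr, 15, 12) : 0;
    return f;
}
```

Gating dma_queue_entries on HAS_DMA resolves the ambiguity of a zero DMA_QUEUE_DEPTH field, which could otherwise be read as a one-entry queue.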


13. Encoding Summary

13.1 At-a-Glance Opcode Map

Mnemonic                        opcode  funct3               funct7 / sub                 Format                 Mode
LBPI..LWUPI                     0x0B    000–110              (none)                       I-type                 both
LBPD..LWUPD                     0x0B    111                  imm[11:9]=width              I-type sub             both
SBPI..SDPD                      0x2B    000–111              (none)                       S-type                 both
LWADD..LWUSLTU                  0x5B    000                  width+aluop                  R-type                 both
LDADD..LDSLTU                   0x5B    000                  width+aluop                  R-type                 both
ADDSW..SRASD                    0x5B    001                  width+aluop                  R-type (rd=rs3)        both
BMCPY, BMSET                    0x5B    010                  0000000, 0000001             R-type                 both
DMACPY, DMASET                  0x5B    010                  0000010, 0000011             R-type                 both
BSRCH.B/H/W/D                   0x5B    010                  0010xxx (width selector)     R-type                 both
BSCAN.B/H/W/D                   0x5B    010                  0011xxx (width selector)     R-type                 both
BSHIFTR/L.B/H/W/D               0x5B    010                  0100xxd (width + direction)  R-type                 both
MMWADD..MMWUSLTU                0x5B    011                  width+aluop                  R-type (rd=dest-addr)  both
MMDADD..MMDSLTU                 0x5B    011                  width+aluop                  R-type (rd=dest-addr)  both
BEQM..BGEUM                     0x7B    000–111              (none)                       B-type                 both
LDPC, LWPC, LWUPC, LAPC, JALPC  0x7F    000–100              bit[35]=0                    W-type                 wide only
JALXPC, JMPXPC                  0x7F    101                  bit[35]=0                    W-type                 wide only
LBX..LDX                        0x7F    110                  bit[35]=0, scale+w+s         X-type (sub of W)      wide only
CALLM, JMPM                     0x7F    (funct4 at [16:13])  bit[35]=1                    WI-type                wide only
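
The first-level decode implied by the map can be sketched as a classifier over a 32-bit instruction word. The enum and function names are hypothetical, and the field positions (opcode at [6:0], funct3 at [14:12]) are assumed to be the standard RISC-V ones, in line with the stated goal of reusing the standard extract logic. The wide-mode 0x7F forms use a longer encoding and are not modelled, nor is the reserved imm[11:9] = 111 slot under 0x0B.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    XC_NONE,            /* not one of the four 32-bit Xcrisp opcodes */
    XC_AUTOINC_LOAD,    /* 0x0B */
    XC_AUTOINC_STORE,   /* 0x2B */
    XC_LOAD_OP,         /* 0x5B, funct3 000 */
    XC_OP_STORE,        /* 0x5B, funct3 001 */
    XC_BLOCK,           /* 0x5B, funct3 010 (BM, DMA, BSRCH/BSCAN/BSHIFT) */
    XC_LOAD_OP_STORE,   /* 0x5B, funct3 011 */
    XC_CMP_MEM_BRANCH,  /* 0x7B */
    XC_RESERVED         /* 0x5B, funct3 100-111 */
} xc_family;

static xc_family xc_classify(uint32_t insn)
{
    unsigned opcode = insn & 0x7F;          /* bits [6:0] */
    unsigned funct3 = (insn >> 12) & 0x7;   /* bits [14:12] */

    switch (opcode) {
    case 0x0B: return XC_AUTOINC_LOAD;
    case 0x2B: return XC_AUTOINC_STORE;
    case 0x7B: return XC_CMP_MEM_BRANCH;
    case 0x5B:
        switch (funct3) {
        case 0:  return XC_LOAD_OP;
        case 1:  return XC_OP_STORE;
        case 2:  return XC_BLOCK;
        case 3:  return XC_LOAD_OP_STORE;
        default: return XC_RESERVED;
        }
    default:
        return XC_NONE;
    }
}
```

Only the 0x5B opcode needs funct3 to pick a family; under the other three opcodes funct3 selects a width or condition within a single family.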

13.2 Reserved Spaces

The following encoding spaces are reserved for future Xcrisp expansion and must not be used by any other FireStorm extension or vendor implementation:

  • 0x0B with funct3 = 111 and imm[11:9] = 111
  • 0x2B: no reserved space in v0.1 (all funct3 slots in use)
  • 0x5B funct3 = 100–111 (reserved sub-families)
  • 0x5B funct3 = 000/001/011 with width = 11 or aluop[4:0] ≥ 01010
  • 0x5B funct3 = 010 with funct7 ≥ 0101000 (B-tree expansion space; BMCMP at 0000100 is reserved for v0.2; BSRCH/BSCAN/BSHIFT occupy 0010xxx–0100xxx)
  • 0x7B: no reserved space in v0.1 (all funct3 slots in use; dword ordered comparisons require new opcode in v0.2)
  • 0x7F W-type with funct3 = 111 (reserved PC-relative variant)
  • 0x7F W-type funct3 = 110 (X-type) with w+s = 111 (reserved indexed-load width/sign)
  • 0x7F W-type funct3 = 110 (X-type) with bit [16] = 1 (reserved sub-encoding bit)
  • 0x7F WI-type with funct4 = 0010–1111 (reserved register-indirect variants)

The 0x7F opcode itself (the wide-mode escape mechanism, §8.2) is the general-purpose lane for any wide-mode-only extension. Future FireStorm extensions may allocate sub-encodings within the 29-bit payload by coordinating with the format bit at [35] and choosing dispatch sub-fields that don't conflict with allocated W-type funct3 or WI-type funct4 values.


14. Open Items

  1. mxcrisp CSR address. Suggested 0xFC1; needs to align with the wide-dirty CSR allocation (open item §16 of the parent doc) and any other FireStorm-specific CSRs.
  2. Dword ordered compare-mem-branch. Requires either an additional opcode or a width-modifier bit somewhere; deferred to v0.2.
  3. 16-bit memory width for load-op and op-store. The width = 11 slot is reserved; semantics TBD.
  4. BMCMP semantics. Result encoding (remaining-count, flag, or both) TBD.
  5. Overlapping BMCPY. Implementation-defined in v0.1; may be tightened to memmove-compatible in v0.2 if cost is acceptable.
  6. Alignment hints for block memory. Future width[1:0] use in BM-family funct7.
  7. Sub-word load-op forms. Byte and halfword load-op (e.g. LBADD, LHUADD) are not in v0.1; if added later they'd consume width = 10/11 in load-op funct7 with redefined semantics.
  8. PIC and Zcmp interaction. Zcmp's cm.popret is a tail-return; combining with JMPM for full tail-call sequences is an optimisation worth exploring.
  9. PIC linker relaxation pass. Specification of the relaxation algorithm and the new relocation types needed to mark AUIPC + LD/JALR/ADDI pairs as relaxable.
  10. Indexed stores (SBX..SDX). v0.1 omits indexed stores; the S-X-type encoding repurposing rd as the stored-value source is sketched in §9.8 but not finalised. v0.2 to revisit based on workload data.
  11. LAX (indexed address materialise). An "X-type but result is the computed address, not the loaded value" form would help 2D access and address-of-array-element patterns. Encoding space exists (X-type w+s = 111 is reserved). Deferred to v0.2.
  12. Index sign-extension option. Current X-type zero-extends rs2 (the index). Some patterns (e.g., signed offsets from a midpoint) want sign-extension. A scale-field bit could select, at the cost of halving the scale set. Open for v0.2.
  13. JALXPC reach extension. ±32 KiB is sufficient for most switch tables but not for very large generated dispatch tables (e.g., character-class tables in regex engines). A wider-immediate variant would consume a second funct3 slot in W-type. Open for v0.2.
  14. DMA fault reporting. The mechanism for surfacing DMA-engine memory faults (unmapped page, write to RO, bus error) is sketched in §5.5.2 but not finalised. Candidate design: a status CSR mxdma_status holding the faulting DMA's tag-register name, fault address, and fault cause; M-mode trap with cause = DMA_FAULT. Final encoding TBD.
  15. DMA priority. Should the DMA engine support priority levels so audio-critical transfers preempt bulk ones? A 2-bit priority field could be carried in the spare funct7 bits of DMACPY/DMASET. Deferred to v0.2 pending workload data.
  16. DMA-to-MMIO ordering. §5.5.2 suggests MMIO DMAs are strictly ordered relative to CPU MMIO stores; the exact memory-ordering model (RVWMO interaction, fences required) needs formalisation.
  17. Multiple-consumer DMA tag table. Currently one tag-table entry per outstanding DMA. If two DMAs name the same rs2, the second stalls until the first completes. An alternative (allocate fresh tag entry on each issue, with a "scoreboard" mapping register names to multiple entries) is more flexible but costlier. v0.1 takes the simpler one-tag-per-register approach.

15. Glossary

Term                Meaning
Auto-increment      Load or store that updates its base register as a side effect (post-inc or pre-dec).
Load-op fusion      Combined load and ALU operation: rd = mem[rs1] OP rs2.
Op-store fusion     Combined ALU operation and store: mem[rs1] = rs2 OP rs3. No register destination.
Compare-mem-branch  Branch on the result of comparing a register with a memory value.
Block memory        Variable-length memory transfer instruction (BMCPY, BMSET); interruptible with register-held restart state.
rs3 (in op-store)   The R-type rd field [11:7] repurposed as a third source register; no architectural register is written.
mxcrisp             M-mode CSR exposing Xcrisp implementation and feature bits.

End of document. See also: FireStorm CPU ISA — base architecture and wide-mode encoding.

Important: The Ant64 family of home computers is at an early design/prototype stage; everything you see here is subject to change.