FireStorm Xlate Extension — Memory Translator Specification
Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companions: FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension
1. Overview
The Xlate extension adds per-register data transformations to FireStorm's load and store path. Each general-purpose register has two software-configurable translator slots — one for loads ("read translator") and one for stores ("write translator") — that automatically transform the data flowing between memory and the register. The available transformations cover the common bit/byte swizzles: endian conversion (byte-swap 16/32/64), bit reversal (per-byte, 16/32/64), nibble swap, and halfword/word reorder.
The extension adds no new instructions. Existing loads and stores (standard RV64 LW/SW etc., Xcrisp auto-inc LWPI/SWPI, Xcrisp indexed LWX, and the Xcrisp load-op and op-store forms such as LWADD/ADDSW) benefit transparently once the translator slots are configured for the relevant registers. The untranslated exceptions (load-op-store such as MMWADD, block and DMA operations, and atomics) are listed in §5.5–§5.7. Configuration is via standard CSR ops; runtime overhead on load/store is at most one pipeline stage (typically absorbed into the existing memory pipeline).
1.1 Wins
Without Xlate, every byte-swap or bit-shuffle needs an explicit instruction sequence after each load or before each store. With Zbb, byte reversal costs 1 instruction (rev8), plus a shift on RV64 when the field is narrower than 64 bits (§8.1); with Zbkb, bit reversal within each byte costs 1 instruction (brev8). Without either, the cost can be 3–6 instructions of shift/mask/or. In hot loops over packed data, this is multiplicative:
- Endian conversion in a parser reading 1000 big-endian 32-bit fields with rev8: 3000 instructions (load + rev8 + srli each, as in §8.1). With Xlate (read translator = byteswap32 on the loaded register): 1000 instructions, a two-thirds reduction in the load step.
- SPI bit-order conversion when sending bytes to a hardware peripheral that wants LSB-first while software stores MSB-first: 1 instruction per byte (the store), versus 2 (brev8 + sb) with Zbkb or 5+ without. This halves the store step of a hot transmit loop.
- Mixed-endian PDP-style word ordering (32-bit halves of a 64-bit value swapped relative to host order): a load or store plus two shifts and an OR, versus a single load or store with translator slot 8 (word-swap-64). Saves 3 instructions per access (§8.3).
- BCD digit reordering (display formatting, retro emulators): a nibble swap per byte is a shift/mask/or sequence of about 5 instructions (§8.4). With Xlate slot 1: zero overhead per access.
The win compounds in code that does many memory ops with the same translation. The one-time run-time cost of configuring the translators is amortised across all subsequent loads and stores.
1.2 Non-Goals
- Not floating-point. v0.1 covers integer GPRs only. FPR translation (e.g., byte-swap on FLW for network-format float exchange) is open for v0.2.
- Not DMA. DMACPY and DMASET (Xcrisp §5.5.2) move bytes between memory locations without passing through GPRs; translators do not apply.
- Not block memory. BMCPY and BMSET (Xcrisp §5.5.1) similarly bypass the per-register translator path.
- Not per-instruction override. Translation is per-register, not per-instruction. A load instruction always uses the destination register's read translator; there is no "untranslated load" opcode in v0.1. Disable translation by setting the slot to 0 (identity).
- Not arbitrary programmable transforms. v0.1 provides 12 fixed translator slots covering common swizzles. Programmable transforms (custom bit permutations specified by additional CSRs) are open for v0.2.
- Not pixel format conversion. RGB565 ↔ RGB888 and similar packed-pixel translations are useful but require non-trivial bit-field manipulation. v0.2 candidate.
2. Relationship to Standard RISC-V
Xlate does not add new opcodes. It modifies the semantics of existing load and store instructions when the relevant register has a non-identity translator configured.
In standard RV64, the load pipeline is:
mem → byte-extract → sign/zero-extend → register
With Xlate, the pipeline becomes:
mem → byte-extract → read-translator → sign/zero-extend → register
The translator is applied to the bytes read from memory before the load instruction's sign or zero extension. The instruction's width (LB/LH/LW/LD and signed/unsigned variants) determines the final register value as usual.
For stores, the pipeline is symmetric:
register → truncate-to-width → write-translator → mem
A register with both translators set to slot 0 (identity) behaves exactly as in standard RV64. A register with non-identity translators behaves as if there were an extra ALU op in the load/store path — invisible to the programmer except in the observed memory values.
The mxlate CSR (§12) advertises Xlate's presence and supported slot set. A reduced FireStorm variant without Xlate hardware returns zero from mxlate; the translator CSRs xlate_rd_* and xlate_wr_* either do not exist (CSR access traps) or are tied to zero (writes are silently ignored).
2.1 Coexistence with Zbb / Zbkb
The standard Zbb extension provides rev8 (full-register byte reverse) and Zbkb provides brev8 (bit reverse within each byte). These are still useful — they apply explicitly to a register's contents without involving memory. Xlate is memory-side only: it transforms data as it moves between register and memory. The two are complementary:
- Use Zbb rev8 to byte-reverse a value already in a register.
- Use Xlate read translator slot 4 (byteswap32) to load a value byte-reversed in one instruction.
For a single one-off conversion, Zbb is fine. For a hot loop loading many big-endian values, Xlate is faster and denser.
3. Translator Slots
Xlate v0.1 defines 16 translator slots, numbered 0–15. Slots 0–11 are fixed transformations; slots 12–15 are reserved for future allocation (programmable slots, pixel format conversion, etc.).
| Slot (binary) | Mnemonic | Width | Operation |
|---|---|---|---|
| 0000 | IDENT | any | Identity (no transformation) |
| 0001 | NSWAP8 | per-byte | Nibble swap within each byte: b[7:4] ↔ b[3:0] |
| 0010 | BREV8 | per-byte | Bit reverse within each byte: b[i] ↔ b[7-i] |
| 0011 | BSWAP16 | 2 bytes | Byte swap 16-bit: AB → BA |
| 0100 | BSWAP32 | 4 bytes | Byte swap 32-bit: ABCD → DCBA |
| 0101 | BSWAP64 | 8 bytes | Byte swap 64-bit: ABCDEFGH → HGFEDCBA |
| 0110 | HSWAP32 | 4 bytes | Halfword swap within 32-bit: AB CD → CD AB |
| 0111 | HSWAP64 | 8 bytes | Halfword swap across 64-bit: AB CD EF GH → GH EF CD AB |
| 1000 | WSWAP64 | 8 bytes | Word swap within 64-bit: ABCD EFGH → EFGH ABCD |
| 1001 | BREV16 | 2 bytes | Bit reverse 16-bit |
| 1010 | BREV32 | 4 bytes | Bit reverse 32-bit |
| 1011 | BREV64 | 8 bytes | Bit reverse 64-bit |
| 1100–1111 | reserved | n/a | Reserved for v0.2 (programmable, pixel format, etc.) |
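The slot definitions above can be read as plain integer functions. The following is a minimal reference sketch of five of them, not the normative definition: the function names are hypothetical, and it assumes the letters in the table denote bytes written most-significant first (the transforms shown are symmetric, so the opposite reading gives the same result).

```c
#include <stdint.h>

/* NSWAP8 (slot 1): swap the two nibbles of every byte. */
uint64_t xlate_nswap8(uint64_t v) {
    const uint64_t lo = 0x0F0F0F0F0F0F0F0FULL;
    return ((v & lo) << 4) | ((v >> 4) & lo);
}

/* BREV8 (slot 2): reverse the bit order inside every byte. */
uint64_t xlate_brev8(uint64_t v) {
    v = ((v & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((v >> 4) & 0x0F0F0F0F0F0F0F0FULL);
    v = ((v & 0x3333333333333333ULL) << 2) | ((v >> 2) & 0x3333333333333333ULL);
    v = ((v & 0x5555555555555555ULL) << 1) | ((v >> 1) & 0x5555555555555555ULL);
    return v;
}

/* BSWAP32 (slot 4): ABCD -> DCBA. */
uint32_t xlate_bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* HSWAP64 (slot 7): AB CD EF GH -> GH EF CD AB (halfword order reversed). */
uint64_t xlate_hswap64(uint64_t v) {
    return (v >> 48) | ((v >> 16) & 0x00000000FFFF0000ULL) |
           ((v << 16) & 0x0000FFFF00000000ULL) | (v << 48);
}

/* WSWAP64 (slot 8): ABCD EFGH -> EFGH ABCD. */
uint64_t xlate_wswap64(uint64_t v) {
    return (v >> 32) | (v << 32);
}
```

For instance, `xlate_bswap32(0x11223344)` yields `0x44332211`, and `xlate_hswap64(0x1122334455667788)` yields `0x7788556633441122`.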
3.1 Per-Byte vs Width-Specific Slots
Slots 0, 1, 2 are per-byte translators. They operate on each byte of the loaded data independently and work with any load width (LB, LBU, LH, LHU, LW, LWU, LD). A nibble-swap of an LB result swaps the nibbles of the one loaded byte; a nibble-swap of an LD result swaps the nibbles of each of the eight loaded bytes.
Slots 3–11 are width-specific translators. Each slot's transformation is defined for one specific access width. A load or store using such a slot must match the slot's width or the instruction traps with cause XLATE_WIDTH_MISMATCH (§7).
| Slot | Required load/store width |
|---|---|
| BSWAP16, BREV16 | halfword (LH, LHU, SH) |
| BSWAP32, HSWAP32, BREV32 | word (LW, LWU, SW) |
| BSWAP64, HSWAP64, WSWAP64, BREV64 | dword (LD, SD) |
For loads with sign extension (LH, LW), the translator applies to the loaded bytes before sign extension (§2). For stores, the translator applies after the register value is truncated to the store width.
3.2 Involutory Property
All v0.1 fixed slots are involutory: applying the translation twice yields the original value. This means a register configured with the same translator slot on both read and write (e.g., read = BSWAP32 and write = BSWAP32) acts as a "private host-order view" of foreign-format memory: loads convert in, stores convert back out, and the register always sees host-order values.
The involutory property is convenient but not architecturally required. v0.2 may introduce non-involutory slots (pixel pack/unpack, etc.).
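A quick way to sanity-check the involutory claim for a candidate transform is to apply it twice to a handful of values. The sketch below does this for a C model of BSWAP64 and, for contrast, for a one-byte rotate that is not an Xlate slot and is not involutory. All names here (`bswap64`, `rotl_byte`, `SAMPLES`, `is_involutory`) are illustrative assumptions, and a sample-based check is evidence rather than proof.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of BSWAP64 (slot 5): ABCDEFGH -> HGFEDCBA. */
uint64_t bswap64(uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

/* Counter-example (NOT an Xlate slot): rotate left by one byte. */
uint64_t rotl_byte(uint64_t v) {
    return (v << 8) | (v >> 56);
}

/* A few probe values; any set that includes asymmetric patterns will do. */
const uint64_t SAMPLES[] = { 0, 1, 0x0123456789ABCDEFULL, 0x8000000000000001ULL };
const size_t N_SAMPLES = sizeof SAMPLES / sizeof SAMPLES[0];

/* Returns 1 if f(f(x)) == x for every sample, 0 otherwise. */
int is_involutory(uint64_t (*f)(uint64_t), const uint64_t *samples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (f(f(samples[i])) != samples[i]) {
            return 0;
        }
    }
    return 1;
}
```

`is_involutory(bswap64, SAMPLES, N_SAMPLES)` returns 1, while the rotate fails on the sample value 1. Any future non-involutory slot would fail in the same way, and could then not be paired with itself to form a host-order view.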
4. Per-Register Configuration
Each GPR has two 4-bit translator selectors stored in CSRs: read translator (used when this register is the destination of a load) and write translator (used when this register is the source of a store). The selectors are organised into 8 CSRs, each holding the translator state for 16 registers.
4.1 Configuration CSRs
| CSR | Address (suggested) | Type | Covers | Description |
|---|---|---|---|---|
| xlate_rd_0 | 0x800 | URW | x0–x15 | Read translators for x0..x15 |
| xlate_rd_1 | 0x801 | URW | x16–x31 | Read translators for x16..x31 |
| xlate_rd_2 | 0x802 | URW | x32–x47 | Read translators for x32..x47 (wide only) |
| xlate_rd_3 | 0x803 | URW | x48–x63 | Read translators for x48..x63 (wide only) |
| xlate_wr_0 | 0x804 | URW | x0–x15 | Write translators for x0..x15 |
| xlate_wr_1 | 0x805 | URW | x16–x31 | Write translators for x16..x31 |
| xlate_wr_2 | 0x806 | URW | x32–x47 | Write translators for x32..x47 (wide only) |
| xlate_wr_3 | 0x807 | URW | x48–x63 | Write translators for x48..x63 (wide only) |
Each CSR is 64 bits, holding 16 × 4-bit fields. Field i (bits [4i+3 : 4i]) gives the translator slot selector for register base + i, where base is 0, 16, 32, or 48 for groups 0–3 respectively.
For example, the read translator for x5 is in xlate_rd_0 bits [23:20]. The write translator for x42 is in xlate_wr_2 bits [43:40].
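The field arithmetic can be written out as a small C sketch. The helper names are hypothetical; the only facts taken from the text are the 16-registers-per-CSR grouping and the 4-bit field at bits [4i+3 : 4i].

```c
#include <assert.h>
#include <stdint.h>

/* Index g (0..3) of the group CSR, i.e. xlate_rd_<g> or xlate_wr_<g>. */
unsigned xlate_group(unsigned reg) {
    assert(reg < 64);
    return reg / 16;
}

/* Least-significant bit position of the register's 4-bit field. */
unsigned xlate_field_lsb(unsigned reg) {
    assert(reg < 64);
    return 4 * (reg % 16);
}

/* Extract the slot selector for `reg` from the value of its group CSR. */
unsigned xlate_get_field(uint64_t csr_value, unsigned reg) {
    return (unsigned)((csr_value >> xlate_field_lsb(reg)) & 0xF);
}
```

For x5 this gives group 0 and LSB 20 (bits [23:20]); for x42 it gives group 2 and LSB 40 (bits [43:40]), matching the two examples above.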
4.2 Reset State
All eight xlate_* CSRs reset to zero. This means every register has translator slot 0 (IDENT) for both read and write, so Xlate-aware FireStorm boots with all memory operations behaving as standard RV64.
4.3 x0 Special Case
Register x0 is the architectural zero register. Loads to x0 are no-ops (discarded); stores from x0 always write the value zero. The bits [3:0] of xlate_rd_0 and xlate_wr_0 (the x0 fields) are writable but observably ignored by the hardware — writes succeed, reads return whatever was written, but the values do not affect memory operation behaviour. This is the standard RV64 treatment of x0-related state and avoids special-casing the CSR write logic.
4.4 Configuration Idioms
Set the read translator for x10 to BSWAP32 (slot 4):
li t0, 4
slli t0, t0, (10 * 4) ; shift into x10's field
li t1, 0xF ; mask for the field
slli t1, t1, (10 * 4)
csrrc x0, xlate_rd_0, t1 ; clear the field
csrrs x0, xlate_rd_0, t0 ; set the new value
6 instructions. Acceptable for setup code but verbose. The assembler provides a pseudo:
XLATE_RD x10, BSWAP32 ; expands to the sequence above
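The effect of the six-instruction sequence on the CSR value can be modelled in C as a clear followed by a set. This is only a sketch of the data transformation; `xlate_set_field` is a hypothetical name, and the real sequence operates on the CSR through csrrc/csrrs rather than on a local variable.

```c
#include <assert.h>
#include <stdint.h>

/* Return the group-CSR value after setting `reg`'s field to `slot`. */
uint64_t xlate_set_field(uint64_t csr_value, unsigned reg, unsigned slot) {
    assert(reg < 64 && slot < 16);
    unsigned shift = 4 * (reg % 16);           /* x10 -> 40                  */
    uint64_t mask  = 0xFULL << shift;          /* li t1, 0xF ; slli t1, ...  */
    uint64_t value = (uint64_t)slot << shift;  /* li t0, 4   ; slli t0, ...  */
    csr_value &= ~mask;                        /* csrrc: clear the field     */
    csr_value |= value;                        /* csrrs: set the new value   */
    return csr_value;
}
```

Starting from the reset value 0, `xlate_set_field(0, 10, 4)` yields `0x40000000000`, i.e. slot 4 in bits [43:40]. Fields belonging to other registers are left untouched, which is why the sequence clears before it sets.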
Configure x10 to do round-trip byteswap (load and store both byteswap32):
XLATE_RD x10, BSWAP32
XLATE_WR x10, BSWAP32
Disable translation on x10:
XLATE_RD x10, IDENT ; equivalently, XLATE_OFF_RD x10
XLATE_WR x10, IDENT
Snapshot and restore translator state (for context switch or function-frame save):
save:
csrr s0, xlate_rd_0
csrr s1, xlate_rd_1
csrr s2, xlate_wr_0
csrr s3, xlate_wr_1
; ... save s0–s3 to stack
restore:
; ... reload s0–s3 from stack
csrw xlate_rd_0, s0
csrw xlate_rd_1, s1
csrw xlate_wr_0, s2
csrw xlate_wr_1, s3
For wide mode add the x32–x63 group CSRs (xlate_rd_2/3, xlate_wr_2/3).
5. Memory Operation Semantics
This section gives the precise architectural semantics of memory operations under Xlate.
5.1 Load Pipeline
A load instruction with destination register rd and width W proceeds as:
1. Compute the effective address per the instruction format (rs1 + imm, indexed, PC-relative, etc.).
2. Read W bytes from memory at the effective address.
3. Look up rd's read translator slot in xlate_rd_*.
4. If the slot is width-specific and the load width W does not match the slot's required width, trap XLATE_WIDTH_MISMATCH.
5. Apply the translator to the W bytes.
6. Sign- or zero-extend the result to 64 bits per the instruction (signed for LB/LH/LW, unsigned for LBU/LHU/LWU, no extension for LD).
7. Write the result to rd (with extension nibble bits applied in wide mode).
Steps 3–5 are inserted before step 6 (sign extension). For width-specific slots, the trap in step 4 fires before any architectural state change. Because the check depends only on the instruction width and the CSR contents, it can be evaluated before the memory read of step 2 is issued, which is what §7 relies on when it states that a trapping access is not performed.
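The load steps can be sketched as a small C model. This is an illustration under stated assumptions, not the hardware definition: it models only IDENT and the three BSWAP slots, treats any other slot number as identity, assumes little-endian memory, and performs the width check before the memory read. The names `model_load`, `load_result`, and `DEMO_MEM` are hypothetical.

```c
#include <stdint.h>

enum { SLOT_IDENT = 0, SLOT_BSWAP16 = 3, SLOT_BSWAP32 = 4, SLOT_BSWAP64 = 5 };

struct load_result {
    int      trapped;  /* 1 = XLATE_WIDTH_MISMATCH, rd is not written */
    uint64_t rd;       /* value written to rd when trapped == 0       */
};

/* Required width in bytes for the modelled slots; 0 means "any width". */
unsigned required_width(unsigned slot) {
    switch (slot) {
    case SLOT_BSWAP16: return 2;
    case SLOT_BSWAP32: return 4;
    case SLOT_BSWAP64: return 8;
    default:           return 0;  /* unmodelled slots behave as identity here */
    }
}

struct load_result model_load(const uint8_t *mem, unsigned width,
                              int is_signed, unsigned slot) {
    struct load_result out = { 0, 0 };

    /* Steps 3-4: slot lookup (passed in) and width check. */
    unsigned need = required_width(slot);
    if (need != 0 && need != width) {
        out.trapped = 1;
        return out;
    }

    /* Step 2: read `width` bytes, little-endian. */
    uint64_t v = 0;
    for (unsigned i = 0; i < width; i++) {
        v |= (uint64_t)mem[i] << (8 * i);
    }

    /* Step 5: apply the translator (byte reversal for the BSWAP slots). */
    if (need != 0) {
        uint64_t r = 0;
        for (unsigned i = 0; i < width; i++) {
            r |= ((v >> (8 * i)) & 0xFF) << (8 * (width - 1 - i));
        }
        v = r;
    }

    /* Step 6: sign extension happens after translation. */
    if (is_signed && width < 8 && ((v >> (8 * width - 1)) & 1)) {
        v |= ~0ULL << (8 * width);
    }

    /* Step 7: the result that would be written to rd. */
    out.rd = v;
    return out;
}

/* Example memory image: bytes 80 01 02 03 at increasing addresses. */
const uint8_t DEMO_MEM[8] = { 0x80, 0x01, 0x02, 0x03, 0, 0, 0, 0 };
```

With `DEMO_MEM`, a signed halfword load through BSWAP16 returns `0xFFFFFFFFFFFF8001`: the swap moves 0x80 into the top byte first, and the sign extension then sees bit 15 set. The same load with a word width reports a trap instead.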
5.2 Store Pipeline
A store instruction with source register rs2 and width W proceeds as:
1. Compute effective address.
2. Read 64 bits from rs2.
3. Truncate to the low W bytes.
4. Look up rs2's write translator slot in xlate_wr_*.
5. If the slot is width-specific and the store width W does not match, trap XLATE_WIDTH_MISMATCH.
6. Apply the translator to the W bytes.
7. Write the resulting W bytes to memory.
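A matching sketch for the store side is given below. It is a simplified model with hypothetical names (`model_store`, `store_result`): it covers only IDENT, BREV8, and BSWAP32, assumes little-endian memory, and returns the bytes that would be written instead of touching real memory.

```c
#include <stdint.h>

enum { SLOT_IDENT = 0, SLOT_BREV8 = 2, SLOT_BSWAP32 = 4 };

struct store_result {
    int     trapped;   /* 1 = XLATE_WIDTH_MISMATCH, memory unchanged       */
    uint8_t bytes[8];  /* bytes written at increasing addresses, if not    */
};

/* Reverse the bit order of one byte. */
static uint8_t brev_byte(uint8_t b) {
    b = (uint8_t)(((b & 0x0F) << 4) | (b >> 4));
    b = (uint8_t)(((b & 0x33) << 2) | ((b >> 2) & 0x33));
    b = (uint8_t)(((b & 0x55) << 1) | ((b >> 1) & 0x55));
    return b;
}

struct store_result model_store(uint64_t rs2, unsigned width, unsigned slot) {
    struct store_result out = { 0, { 0 } };

    /* Steps 2-3: read rs2 and truncate to the low `width` bytes. */
    uint64_t v = (width >= 8) ? rs2 : (rs2 & ((1ULL << (8 * width)) - 1));

    /* Steps 4-5: width check for the one width-specific slot modelled. */
    if (slot == SLOT_BSWAP32 && width != 4) {
        out.trapped = 1;
        return out;
    }

    /* Steps 6-7: translate, then lay the bytes out little-endian. */
    for (unsigned i = 0; i < width; i++) {
        uint8_t b = (uint8_t)(v >> (8 * i));
        if (slot == SLOT_BREV8) {
            b = brev_byte(b);                      /* per-byte translator */
        }
        unsigned pos = (slot == SLOT_BSWAP32) ? (width - 1 - i) : i;
        out.bytes[pos] = b;                        /* byte order reversed */
    }
    return out;
}
```

Storing `0x11223344` as a word through BSWAP32 produces the bytes 11 22 33 44 at increasing addresses, i.e. a big-endian image; the upper half of rs2 is discarded before the translator runs.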
5.3 Special Case: Load to x0
A load with rd = x0 is architecturally a no-op (the loaded value is discarded). The translator lookup is performed but the result is discarded. The memory access still occurs (and may trigger prefetch-buffer fills if the access targets code memory); width-mismatch trap behaviour is preserved.
5.4 Special Case: Auto-Inc and Indexed Loads/Stores
Xcrisp auto-increment loads (LBPI, LHPI, LWPI, LDPI, LBUPI, etc.; §3 of ee_xcrisp) and indexed loads (LBX, LHX, LWX, LDX; §8 of ee_xcrisp) all participate in Xlate translation per the standard semantics. The address-update side effects of auto-inc are unaffected.
Auto-increment stores (SBPI, SWPI, etc.) likewise apply the write translator to the value before writing memory.
5.5 Special Case: Memory-Fused Arithmetic
The Xcrisp memory-fused arithmetic instructions (§5 of ee_xcrisp) have three sub-families with distinct interactions:
- Load-op (LWADD, LDADD, etc., funct3=000): the memory side is a load into rd; the read translator of rd applies normally.
- Op-store (ADDSW, etc., funct3=001): the memory side is a store of the computed result; the write translator of rd applies. Under the op-store convention the rd field is interpreted as rs3: it is read as the third source operand and, after the ALU op, supplies the value that is stored.
- Load-op-store (MMWADD, MMDADD, etc., funct3=011): both a load and a store occur. The load reads from mem[rs1]; the read translator of rs1 does not apply, because the value goes directly into the ALU and is never written to a register. The store writes to mem[rd]; the write translator of rd does not apply for the same reason. Load-op-store is therefore untranslated in v0.1. (This is an open item: see §14 for the future-work plan to add load-op-store translator integration.)
5.6 Block Memory and DMA
Block memory operations (BMCPY, BMSET, DMACPY, DMASET; §5.5 of ee_xcrisp) transfer bytes between memory regions without passing them through GPRs. Translators do not apply. The bytes are copied or set verbatim.
If translation of bulk-copied data is needed, software must either:
- Use a loop of single-element loads and stores (slow but translator-aware), or
- Pre/post-process the data with explicit translator-aware code, or
- Set up the destination buffer in the translated format and use raw block ops.
A future Xcrisp v0.2 may add a translator-aware block copy (BMCPYT) if the use case justifies it.
5.7 Atomic Operations
Atomic memory operations (LR, SC, AMO*) are untranslated in v0.1. Atomics typically operate on lock-discipline data structures where bit-level transformation would defeat their purpose; the case for translator interaction is weak. This may be revisited in v0.2 if needed.
6. Width Compatibility Table
The following table summarises which translator slots are compatible with which load/store widths:
| Slot | LB/SB (8) | LH/SH (16) | LW/SW (32) | LD/SD (64) |
|---|---|---|---|---|
| IDENT (0) | ✓ | ✓ | ✓ | ✓ |
| NSWAP8 (1) | ✓ | ✓ | ✓ | ✓ |
| BREV8 (2) | ✓ | ✓ | ✓ | ✓ |
| BSWAP16 (3) | ✗ | ✓ | ✗ | ✗ |
| BSWAP32 (4) | ✗ | ✗ | ✓ | ✗ |
| BSWAP64 (5) | ✗ | ✗ | ✗ | ✓ |
| HSWAP32 (6) | ✗ | ✗ | ✓ | ✗ |
| HSWAP64 (7) | ✗ | ✗ | ✗ | ✓ |
| WSWAP64 (8) | ✗ | ✗ | ✗ | ✓ |
| BREV16 (9) | ✗ | ✓ | ✗ | ✗ |
| BREV32 (10) | ✗ | ✗ | ✓ | ✗ |
| BREV64 (11) | ✗ | ✗ | ✗ | ✓ |
A ✗ entry means the instruction traps with cause XLATE_WIDTH_MISMATCH. Software can probe support by attempting a load with a specific width and slot and catching the trap.
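The table reduces to a per-slot required width, which a simulator or assembler lint pass could encode as a lookup. The sketch below is one such encoding; the function names are hypothetical, and reporting reserved slots as "not allowed" is a simplification (the specification gives them their own trap cause, §7).

```c
#include <stdint.h>

/* Required access width in bytes for a slot: 0 = any width, -1 = reserved. */
int xlate_required_width(unsigned slot) {
    static const int8_t width_of[12] = {
        0, 0, 0,   /* IDENT, NSWAP8, BREV8      */
        2, 4, 8,   /* BSWAP16, BSWAP32, BSWAP64 */
        4, 8, 8,   /* HSWAP32, HSWAP64, WSWAP64 */
        2, 4, 8    /* BREV16, BREV32, BREV64    */
    };
    return slot < 12 ? width_of[slot] : -1;
}

/* 1 if the access is allowed (a check mark in the table), 0 otherwise. */
int xlate_access_ok(unsigned slot, unsigned width_bytes) {
    int need = xlate_required_width(slot);
    if (need < 0) {
        return 0;  /* reserved slot: never a valid access */
    }
    return need == 0 || (unsigned)need == width_bytes;
}
```

For example, `xlate_access_ok(3, 2)` is 1 (BSWAP16 with LH/SH) and `xlate_access_ok(3, 4)` is 0 (BSWAP16 with LW/SW would trap).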
6.1 Sign vs Unsigned Loads
LB and LBU both load one byte; they differ only in the final sign/zero extension to 64 bits. Both work identically with per-byte translators (slots 0–2) — the translator runs on the byte, then sign/zero extension applies. The translator does not distinguish LB from LBU.
For width-specific translators (BSWAP16 etc.), the same rule: the translator runs on the loaded bytes, then sign/zero extension. So LH (signed halfword) with BSWAP16 produces a byteswapped halfword that is then sign-extended to 64 bits. LHU produces the same byteswap but zero-extended.
7. Trap Causes
| Cause | Mnemonic | Trigger |
|---|---|---|
| 32 | XLATE_WIDTH_MISMATCH | Load/store width does not match the configured width-specific translator slot |
| 33 | XLATE_RESERVED_SLOT | Attempted use of a reserved translator slot (12–15) |
| 34 | XLATE_PRIVILEGE | (Reserved for future privilege-checked translator features) |
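Cause numbers are suggested; final assignment requires coordination with the other FireStorm trap-cause allocations. The RISC-V privileged specification designates exception codes 24–31 and 48–63 for custom use and marks 32–47 as reserved, so the final numbers may need to move into one of the custom ranges.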
On a translator trap, the architectural state holds:
- PC pointing at the trapping instruction
- The destination register (loads) is not written
- The memory access is not performed (stores leave memory unchanged; loads do not commit a value to rd)
The trap handler may inspect mtval (or stval/utval per privilege) to recover the offending memory address; the offending translator slot can be recovered by reading the appropriate xlate_* CSR.
8. Examples
All examples have been hand-verified against the slot definitions in §3 and the semantic rules in §5.
8.1 Network-Protocol Parser
Reading a stream of big-endian 32-bit fields from a network buffer into host-order values.
Standard RV64GC (no Xlate, with Zbb):
parse_loop:
lw t0, 0(a0) ; load big-endian
rev8 t0, t0 ; Zbb: byte reverse 64-bit
srli t0, t0, 32 ; right-justify the reversed 32 bits
; ... process t0 ...
addi a0, a0, 4
bne a0, a1, parse_loop
Inner loop: 5 instructions including the loop branch, 3 of them (lw, rev8, srli) for the load step.
With Xlate (one-time setup outside the loop):
XLATE_RD t0, BSWAP32 ; configure once
parse_loop:
lw t0, 0(a0) ; loaded value auto-byteswapped
; ... process t0 ...
addi a0, a0, 4
bne a0, a1, parse_loop
Inner loop: 3 instructions including the branch. Saves 2 instructions per iteration: the rev8 and the srli that follows it (rev8 reverses all 64 bits, leaving the wanted bytes in the upper half, so a shift is needed to right-justify them).
For a 100-field parse: ~200 fewer instructions executed.
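The two loops can be compared with a small C model of each load step. This is a sketch under assumptions: `raw` stands for the 32-bit word as a little-endian hart reads it without translation, and the helper names are hypothetical. It also surfaces a detail worth knowing when porting code: the Zbb path ends in srli and so zero-fills the upper 32 bits, while an LW through BSWAP32 sign-extends the swapped word, as §2 specifies.

```c
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits, as LW does. */
static uint64_t sext32(uint32_t v) {
    return (v & 0x80000000u) ? (0xFFFFFFFF00000000ULL | v) : (uint64_t)v;
}

/* Zbb rev8 on RV64: reverse all eight bytes. */
static uint64_t rev8(uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

/* BSWAP32 translator: ABCD -> DCBA. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* lw t0 ; rev8 t0 ; srli t0, 32 */
uint64_t zbb_path(uint32_t raw) {
    uint64_t t0 = sext32(raw);  /* lw sign-extends the untranslated word */
    t0 = rev8(t0);              /* wanted bytes now sit in bits 63:32    */
    return t0 >> 32;            /* srli: right-justify, zero-fill        */
}

/* lw t0 with read translator BSWAP32: translate, then sign-extend. */
uint64_t xlate_path(uint32_t raw) {
    return sext32(bswap32(raw));
}
```

For the big-endian field 0x00000102 (raw word `0x02010000`) both paths return `0x102`. For the field 0x80000001 (raw word `0x01000080`) the low 32 bits agree, but `zbb_path` returns `0x0000000080000001` and `xlate_path` returns `0xFFFFFFFF80000001`; using LWU in the Xlate loop would give the zero-extended form.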
8.2 SPI Byte Transmit with Bit Reversal
A hardware SPI controller expects LSB-first byte transmission; the data in memory is stored MSB-first.
Standard RV64GC (with Zbkb):
spi_send_loop:
lb t0, 0(a0)
brev8 t0, t0 ; Zbkb: bit reverse each byte
sb t0, 0(a2) ; a2 holds the SPI data register address
addi a0, a0, 1
bne a0, a1, spi_send_loop
Inner loop: 4 instructions plus branch.
With Xlate (configure write translator on t0 to BREV8):
XLATE_WR t0, BREV8
spi_send_loop:
lb t0, 0(a0) ; load normal MSB-first byte
sb t0, 0(a2) ; a2 holds the SPI data register address; auto-bit-reversed on store
addi a0, a0, 1
bne a0, a1, spi_send_loop
Inner loop: 3 instructions plus branch. Saves 1 instruction per byte, cutting the loop from 5 instructions to 4 (a 20% reduction).
Without Zbkb, the standard form is much worse (5–6 instructions for the bit-reverse shift/mask sequence per byte), and the Xlate win is correspondingly larger.
8.3 Mixed-Endian DSP Data Block
Some legacy DSP file formats store 64-bit values as two 32-bit halves in reversed order — the high word first, low word second (PDP-style mid-endian within dword).
Standard RV64GC:
ld t0, 0(a0) ; load 8 bytes as one dword
slli t1, t0, 32 ; shift low → high
srli t0, t0, 32 ; shift high → low
or t0, t0, t1 ; combine swapped halves
4 instructions.
With Xlate (read translator on t0 = WSWAP64):
ld t0, 0(a0) ; auto-wordswapped on load
1 instruction. Saves 3 instructions per access.
8.4 BCD-to-Display Digit Reorder
BCD-encoded values often need nibble reordering for display (the high nibble is the high digit; some displays want them swapped). With Xlate slot 1 (NSWAP8), this is a per-byte free operation:
Standard RV64GC:
lb t0, 0(a0)
srli t1, t0, 4
andi t1, t1, 0x0F
slli t0, t0, 4
andi t0, t0, 0xF0
or t0, t0, t1
6 instructions (no standard single-instruction nibble swap).
With Xlate:
XLATE_RD t0, NSWAP8 ; configure once
lb t0, 0(a0) ; auto-nibble-swapped
1 instruction (after one-time setup). Saves 5 instructions per byte.
8.5 Round-Trip Translation (Foreign-Format Read-Modify-Write)
A memory region holds big-endian 32-bit values that need to be modified. Without Xlate, every read needs byteswap-in and every write needs byteswap-out.
Standard RV64GC (Zbb):
lw t0, 0(a0) ; load big-endian
rev8 t0, t0 ; in to host
srli t0, t0, 32 ; right-justify
addi t0, t0, 1 ; modify
slli t0, t0, 32 ; left-justify
rev8 t0, t0 ; back to big-endian
sw t0, 0(a0) ; store
7 instructions for one read-modify-write.
With Xlate (configure t0 for round-trip BSWAP32):
XLATE_RD t0, BSWAP32 ; configure once
XLATE_WR t0, BSWAP32
loop:
lw t0, 0(a0) ; auto-byteswap in
addi t0, t0, 1 ; modify in host order
sw t0, 0(a0) ; auto-byteswap out
3 instructions in the loop. Saves 4 instructions per RMW. The involutory property (§3.2) makes the round-trip transparent.
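The round trip can be modelled in C on a single memory word. In this sketch `raw` is the 32-bit word as a little-endian hart would see it without translation, and `rmw_increment_raw` is a hypothetical name; the three statements correspond to the three instructions of the Xlate loop.

```c
#include <stdint.h>

/* BSWAP32 translator: ABCD -> DCBA. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Increment a big-endian 32-bit field given its untranslated memory word,
   returning the untranslated word that ends up back in memory. */
uint32_t rmw_increment_raw(uint32_t raw) {
    uint32_t t0 = bswap32(raw);  /* lw   t0, 0(a0)  read translator BSWAP32  */
    t0 += 1;                     /* addi t0, t0, 1  arithmetic in host order */
    return bswap32(t0);          /* sw   t0, 0(a0)  write translator BSWAP32 */
}
```

A field holding big-endian 0x000000FF is the raw word `0xFF000000`; after the increment the raw word is `0x00010000`, which is big-endian 0x00000100. The carry propagated across bytes in host order, which is exactly what the paired translators buy.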
9. ABI and Function-Call Interaction
Translator state is treated symmetrically with the register-value preservation rules of the standard lp64d ABI:
- Translator state for caller-saved registers (ra, t0–t6, a0–a7) is caller-saved. A function that needs a translator on these registers configures it inside the function's own scope; the callee assumes nothing about their translator state on entry, and the caller must treat that state as undefined on return.
- Translator state for callee-saved registers (sp, gp, tp, s0–s11) is callee-saved. A function that overwrites these translators must save and restore them, just as it would save and restore the register values themselves. Floating-point registers (ft0–ft11, fs0–fs11) carry no translator state in v0.1 (§1.2).
- Translator state for x0 is meaningless (§4.3) and need not be saved.
The standard Xstack PUSH/POP and Zcmp cm.push/cm.pop instructions save/restore register values, not translator state. Code that uses non-identity translators on callee-saved registers must explicitly save the relevant xlate_* CSRs in addition to the register values. For example:
function_entry:
XLATE_SAVE s0, s1 ; pseudo: save s0/s1's translator slots to a temporary
XLATE_RD s0, BSWAP32
XLATE_RD s1, BSWAP32
; ... function body using s0 and s1 with byteswap32 translation ...
function_exit:
XLATE_RESTORE s0, s1 ; pseudo: restore translator slots
ret
The XLATE_SAVE and XLATE_RESTORE pseudos expand into CSR read/write sequences. The save/restore work scales linearly with the number of registers being adjusted; for most functions only one or two translator changes are needed.
9.1 Standard Library Convention
Standard library functions (memcpy, strlen, printf, etc.) are compiled assuming all registers have identity translators on entry. Code that calls libc must therefore reset any non-identity translator to IDENT before the call. This applies to the argument-passing registers (a0–a7) and equally to any other register the callee may load into or store from, since libc functions assume standard load/store behaviour.
A future ABI revision may relax this by requiring libc to be translator-aware, but v0.1 keeps the boundary clean.
10. Interaction with Other Extensions
10.1 Xwide
Wide-mode access to x32–x63 requires the wide-mode CSRs xlate_rd_2/3 and xlate_wr_2/3 (§4.1). In narrow mode these CSRs may not exist (CSR access traps) or may be present but unused (the upper-register fields have no effect since narrow-mode code cannot reference x32–x63).
Translator state for x32–x63 is caller-saved (matching the parent doc's treatment of those registers as caller-saved scratch).
10.2 Xcrisp
All Xcrisp memory operations interact with Xlate per §5.4 and §5.5. Summary:
| Xcrisp instruction family | Translator applies? |
|---|---|
| Auto-inc loads/stores (LWPI, SWPI, etc.) | Yes |
| Indexed loads (LWX, LDX, etc.) | Yes |
| Load-op (LWADD, etc.) — load side | Yes (on rd's read translator) |
| Op-store (ADDSW, etc.) — store side | Yes (on rd's write translator) |
| Load-op-store (MMWADD, etc.) | No (open item §14) |
| Block memory (BMCPY, BMSET) | No |
| DMA (DMACPY, DMASET) | No |
| Compare-mem-branch (BEQM, etc.) | No — comparison reads bypass register pipeline |
| PIC loads (LDPC, LWPC, etc.) | Yes (on rd's read translator) |
| JALXPC indirect jump table lookup | No (the loaded target is a control transfer, not a register write) |
10.3 Xcond
Predicated R-type instructions (§5 of ee_xcond) do not access memory and are unaffected by Xlate. If a predicated instruction has a false predicate, no register write occurs and no translator lookup is needed.
10.4 Xstack
Xstack push/pop instructions (§4 of ee_xstack) move register values to/from the hardware BSRAM stack. These are conceptually register-to-register transfers (the BSRAM is a fast-path backing store, not generic memory), and translators do not apply. A function using translators saves and restores them via the standard CSR mechanism, not via PUSH/POP.
10.5 Xmath
Xmath instructions (see ee_xmath) are register-to-register and do not access memory, so translators do not apply to the operation itself — the same as Xcond predicated ALU ops (§10.3). Translators apply only to the loads and stores that move Xmath operands in and out of registers.
This composes usefully with Xmath's G12 multi-precision group (ADDC/SUBC/ROLC/RORC): a bignum stored big-endian in memory can be loaded through a byteswap read-translator so its limbs arrive in host order for the carry-chain arithmetic, then written back through the matching write-translator — the involutory round-trip of §3.2, applied to multi-word integers.
Xmath's xcarry CSR (the G12 carry/borrow bit, ee_xmath §14.1) is allocated at 0x808, immediately after this extension's translator-config block at 0x800–0x807. It is ordinary CSR state accessed with standard CSR instructions, not a memory operand, so translators never touch a carry read, set, or clear.
11. Compiler and Toolchain Integration
11.1 Target Flag
The +xlate target feature enables Xlate emission. The full FireStorm feature set is +xfirestorm = +xwide,+xcrisp,+xstack,+xcond,+xlate.
11.2 Intrinsics
Compiler intrinsics for translator configuration:
#include <riscv_xlate.h>
void __xlate_rd(int reg, enum xlate_slot slot); /* set read translator */
void __xlate_wr(int reg, enum xlate_slot slot); /* set write translator */
int __xlate_get_rd(int reg); /* read current read translator */
int __xlate_get_wr(int reg); /* read current write translator */
The enum xlate_slot provides named constants matching §3: XLATE_IDENT, XLATE_BSWAP32, etc.
11.3 Attribute-Based Translator Hints
For variables consistently loaded/stored from foreign-format memory, the compiler accepts an attribute:
__attribute__((xlate_load("bswap32"), xlate_store("bswap32")))
int network_value;
The compiler arranges for the variable's register-resident form to use the specified translators throughout its live range. This is a hint; the compiler may decline (e.g., if the variable's storage class doesn't fit translator semantics).
11.4 Inline Assembly
GCC/Clang inline-assembly constraints:
- The constraint xr<N> requests a register with read translator slot N.
- The constraint xw<N> requests a register with write translator slot N.
- The compiler arranges the necessary CSR setup before the asm block and restoration after.
12. Detection
| CSR | Address (suggested) | Privilege | Description |
|---|---|---|---|
| mxlate | 0xFC4 | MRO | Xlate version and feature bits |
Bit layout of mxlate:
| Bits | Field | Meaning |
|---|---|---|
| [0] | PRESENT | 1 if Xlate implemented |
| [7:1] | VERSION | Xlate version (1 = v0.1) |
| [19:8] | SLOTS_IMPLEMENTED | One bit per slot 1–11: field bit i-1 (mxlate bit 7+i) is 1 if slot i is implemented. The twelfth field bit (mxlate bit 19) has no slot assigned in v0.1 |
| [20] | HAS_WIDE_CSRS | 1 if xlate_*_2/3 (wide-register CSRs) are implemented |
| [63:21] | reserved | n/a |
A minimum-conforming implementation may implement only slot 0 (mandatory) and a subset of slots 1–11; software probes SLOTS_IMPLEMENTED to discover availability. A reduced FireStorm variant may implement Xlate only for x0–x31 (narrow-mode register space) and clear HAS_WIDE_CSRS.
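Decoding the fields is plain shift-and-mask work. The sketch below operates on a value already read from mxlate (the read itself needs machine privilege, since the CSR is MRO); the accessor names are hypothetical.

```c
#include <stdint.h>

/* PRESENT, bit [0]. */
int mxlate_present(uint64_t m) {
    return (int)(m & 1);
}

/* VERSION, bits [7:1]. */
unsigned mxlate_version(uint64_t m) {
    return (unsigned)((m >> 1) & 0x7F);
}

/* HAS_WIDE_CSRS, bit [20]. */
int mxlate_has_wide_csrs(uint64_t m) {
    return (int)((m >> 20) & 1);
}

/* SLOTS_IMPLEMENTED, bits [19:8]: slot i (1..11) is field bit i-1. */
int mxlate_slot_implemented(uint64_t m, unsigned slot) {
    if (slot == 0) {
        return mxlate_present(m);  /* slot 0 is mandatory when Xlate exists */
    }
    if (slot > 11) {
        return 0;                  /* reserved slots are never advertised   */
    }
    return (int)((m >> (8 + (slot - 1))) & 1);
}
```

As an example, the value `0x101F03` decodes as present, version 1, slots 1–5 implemented, and wide CSRs available; `mxlate_slot_implemented(0x101F03, 6)` is 0, so HSWAP32 would not be usable on that part.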
13. Encoding Summary
Xlate adds no new instructions and no new opcodes. Its only architectural footprint is:
- 8 user-writable CSRs at addresses 0x800–0x807 holding per-register translator slot selectors.
- 1 machine-read-only CSR at 0xFC4 (mxlate) for detection.
- 3 new trap cause codes (32–34) for width mismatch, reserved slot, and privilege violation.
The eight config CSRs are read and written with the standard csrrw/csrrs/csrrc instructions (and their immediate forms) — Xlate defines no CSR-access instructions of its own. Note that the next user CSR, 0x808, is allocated to Xmath's xcarry (G12 carry bit, ee_xmath §14.1); the translator block should not be extended past 0x807 without re-coordinating that allocation.
The translator logic is implemented inside the existing load/store pipeline. Decoders treat all loads and stores as Xlate-aware; the translator slot is looked up at issue (for stores) or at writeback (for loads), and the data path applies the transformation.
14. Open Items
- CSR addresses. All suggested values are placeholders; final assignment requires coordination with mxcrisp (0xFC1), mxstack (0xFC2), mxcond (0xFC3), and the wide-dirty CSR.
- Programmable translator slots. Slots 12–15 are reserved; one or more could be a software-defined bit permutation specified by an additional descriptor CSR. The hardware cost is non-trivial (general bit-permutation network) but the flexibility is high.
- Pixel format conversion. RGB565 ↔ RGB888 expand/pack, ARGB ↔ RGBA channel rotation, alpha blending pre/post-multiply — all useful for retro and embedded graphics. Candidate for slots 12–13.
- Saturation translators. Saturating signed 32 → 16 (audio sample clamping) and 64 → 32 are useful DSP primitives. Currently must be done with explicit Zbb min/max.
- FPR translators. Float load/store with byte-swap is useful for network-format float exchange. Adds another 8 CSRs (for f0–f63 in wide mode). v0.2 candidate.
- Load-op-store translator integration. The MMW family (load-op-store, §5.5 of this document) currently bypasses translators. Adding translation to the memory side of MMW would let the in-place increment of a foreign-format buffer be a single instruction.
- Per-instruction override. Sometimes a register that's normally translated needs a one-off untranslated access. v0.1 requires reconfiguring the slot; v0.2 could add an "ignore translator" instruction-level prefix or a separate untranslated load/store opcode.
- Atomic operation translators. v0.1 does not translate atomics (§5.7). If a use case emerges (e.g., a lock-free queue where data items are big-endian on disk and host-order in memory), this could be added.
- Width-mismatch policy. v0.1 traps on width mismatch (§6). An alternative — apply identity translation silently — is more forgiving but masks bugs. A per-implementation or per-register strict/permissive mode bit may be considered.
- Block-copy translator-aware variant (BMCPYT). A translator-aware block copy that applies a translator slot to every byte/halfword/word/dword as it copies. Useful for bulk endian conversion of large buffers. v0.2 candidate.
15. Glossary
| Term | Meaning |
|---|---|
| Translator | A fixed bit/byte transformation applied to data flowing between memory and a GPR. |
| Translator slot | A 4-bit selector identifying one of 16 predefined (or future programmable) translator operations. |
| Read translator | The translator applied when a register is the destination of a load. |
| Write translator | The translator applied when a register is the source of a store. |
| Identity (IDENT) | Slot 0; no transformation. Default state on reset. |
| Per-byte translator | A translator that operates independently on each byte (slots 0–2); width-agnostic. |
| Width-specific translator | A translator that requires a specific load/store width (slots 3–11); traps on mismatch. |
| Involutory | A property of v0.1 slots: applying the translation twice yields the original. Lets paired read/write translators give a "host-order view" of foreign-format memory. |
End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension.