FireStorm Xlate Extension — Memory Translator Specification

Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companions: FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension

1. Overview

The Xlate extension adds per-register data transformations to FireStorm's load and store path. Each general-purpose register has two software-configurable translator slots — one for loads ("read translator") and one for stores ("write translator") — that automatically transform the data flowing between memory and the register. The available transformations cover the common bit/byte swizzles: endian conversion (byte-swap 16/32/64), bit reversal (per-byte, 16/32/64), nibble swap, and halfword/word reorder.

The extension adds no new instructions. Existing loads and stores benefit transparently once the translator slots are configured for the relevant registers: standard RV64 (LW/SW etc.), Xcrisp auto-inc (LWPI/SWPI), Xcrisp indexed (LWX), and Xcrisp load-op and op-store memory-fused arithmetic (LWADD/ADDSW). The exceptions in v0.1 are load-op-store (MMWADD etc., §5.5), block memory and DMA operations (§5.6), and atomics (§5.7), all of which are untranslated. Configuration is via standard CSR ops; runtime overhead on load/store is at most one pipeline stage (typically absorbed into the existing memory pipeline).

1.1 Wins

Without Xlate, every byte-swap or bit-shuffle needs an explicit instruction sequence after each load or before each store. With Zbb, byte reversal costs 1 instruction (rev8); with Zbkb, bit reversal costs 1 instruction (brev8). Without either, the cost can be 3–6 instructions of shift/mask/or. In hot loops over packed data, this is multiplicative:

Endian conversion in a parser reading 1000 big-endian 32-bit fields with rev8: 3000 instructions on RV64 (load + rev8 + srli each, because rev8 reverses the full 64-bit register; see §8.1). With Xlate (read translator = byteswap32 on the loaded register): 1000 instructions. A two-thirds reduction in the load step.
SPI bit-order conversion when sending bytes to a hardware peripheral that wants LSB-first while software stores MSB-first: 1 instruction per byte (the store), versus 2 (brev8 + sb) with Zbkb or 5+ without. Halves the store step of a hot transmit loop (see §8.2 for the whole-loop count).
Mixed-endian PDP-style word ordering (32-bit halves of a 64-bit value swapped relative to host order): a load/store pair plus shifts and ORs, versus a single load/store with translator slot 8 (word-swap-64). Saves 3–4 instructions per access.
BCD digit reordering (display formatting, retro emulators): nibble-swap per byte is a shift/mask/or sequence (about 5 instructions after the load; see §8.4). With Xlate slot 1: zero overhead per access.

The win compounds in code that does many memory ops with the same translation. The one-time cost of setting up translators (a few CSR instructions per register, §4.4) is amortised across all subsequent loads and stores.

1.2 Non-Goals

Not floating-point. v0.1 covers integer GPRs only. FPR translation (e.g., byte-swap on FLW for network-format float exchange) is open for v0.2.
Not DMA. DMACPY and DMASET (Xcrisp §5.5.2) move bytes between memory locations without passing through GPRs; translators do not apply.
Not block memory. BMCPY and BMSET (Xcrisp §5.5.1) similarly bypass the per-register translator path.
Not per-instruction override. Translation is per-register, not per-instruction. A load instruction always uses the destination register's read translator; there is no "untranslated load" opcode in v0.1. Disable translation by setting the slot to 0 (identity).
Not arbitrary programmable transforms. v0.1 provides 12 fixed translator slots covering common swizzles. Programmable transforms (custom bit permutations specified by additional CSRs) are open for v0.2.
Not pixel format conversion. RGB565 ↔ RGB888 and similar packed-pixel translations are useful but require non-trivial bit-field manipulation. v0.2 candidate.

2. Relationship to Standard RISC-V

Xlate does not add new opcodes. It modifies the semantics of existing load and store instructions when the relevant register has a non-identity translator configured.

In standard RV64, the load pipeline is:

mem → byte-extract → sign/zero-extend → register

With Xlate, the pipeline becomes:

mem → byte-extract → read-translator → sign/zero-extend → register

The translator is applied to the bytes read from memory before the load instruction's sign or zero extension. The instruction's width (LB/LH/LW/LD and signed/unsigned variants) determines the final register value as usual.

For stores, the pipeline is symmetric:

register → truncate-to-width → write-translator → mem

A register with both translators set to slot 0 (identity) behaves exactly as in standard RV64. A register with non-identity translators behaves as if there were an extra ALU op in the load/store path — invisible to the programmer except in the observed memory values.

The mxlate CSR (§12) advertises Xlate's presence and supported slot set. A reduced FireStorm variant without Xlate hardware returns zero from mxlate; the translator CSRs xlate_rd_* and xlate_wr_* either do not exist (CSR access traps) or are tied to zero (writes are silently ignored).

2.1 Coexistence with Zbb / Zbkb

The standard Zbb extension provides rev8 (full-register byte reverse) and Zbkb provides brev8 (bit reverse within each byte). These are still useful — they apply explicitly to a register's contents without involving memory. Xlate is memory-side only: it transforms data as it moves between register and memory. The two are complementary:

Use Zbb rev8 to byte-reverse a value already in a register.
Use Xlate read translator slot 4 (byteswap32) to load a value byte-reversed in one instruction.

For a single one-off conversion, Zbb is fine. For a hot loop loading many big-endian values, Xlate is faster and denser.

3. Translator Slots

Xlate v0.1 defines 16 translator slots, numbered 0–15. Slots 0–11 are fixed transformations; slots 12–15 are reserved for future allocation (programmable slots, pixel format conversion, etc.).

Slot	Mnemonic	Width	Operation
`0000`	IDENT	any	Identity (no transformation)
`0001`	NSWAP8	per-byte	Nibble swap within each byte: `b[7:4] ↔ b[3:0]`
`0010`	BREV8	per-byte	Bit reverse within each byte: `b[i] ↔ b[7-i]`
`0011`	BSWAP16	2 bytes	Byte swap 16-bit: `AB → BA`
`0100`	BSWAP32	4 bytes	Byte swap 32-bit: `ABCD → DCBA`
`0101`	BSWAP64	8 bytes	Byte swap 64-bit: `ABCDEFGH → HGFEDCBA`
`0110`	HSWAP32	4 bytes	Halfword swap within 32-bit: `AB CD → CD AB`
`0111`	HSWAP64	8 bytes	Halfword swap across 64-bit: `AB CD EF GH → GH EF CD AB`
`1000`	WSWAP64	8 bytes	Word swap within 64-bit: `ABCD EFGH → EFGH ABCD`
`1001`	BREV16	2 bytes	Bit reverse 16-bit
`1010`	BREV32	4 bytes	Bit reverse 32-bit
`1011`	BREV64	8 bytes	Bit reverse 64-bit
`1100`–`1111`	reserved	—	Reserved for v0.2 (programmable, pixel format, etc.)

3.1 Per-Byte vs Width-Specific Slots

Slots 0, 1, 2 are per-byte translators. They operate on each byte of the accessed data independently and work with any load or store width (LB, LBU, LH, LHU, LW, LWU, LD, SB, SH, SW, SD). A nibble-swap of an LB result swaps the nibbles of the one loaded byte; a nibble-swap of an LD result swaps the nibbles of each of the eight loaded bytes.

Slots 3–11 are width-specific translators. Each slot's transformation is defined for one specific access width. A load or store using such a slot must match the slot's width or the instruction traps with cause XLATE_WIDTH_MISMATCH (§7).

Slot	Required load/store width
BSWAP16, BREV16	halfword (LH, LHU, SH)
BSWAP32, HSWAP32, BREV32	word (LW, LWU, SW)
BSWAP64, HSWAP64, WSWAP64, BREV64	dword (LD, SD)

For loads with sign extension (LH, LW), the translator applies to the loaded bytes before sign extension (§2). For stores, the translator applies after the register value is truncated to the store width.
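
The slot operations in the §3 table can be restated as a small reference model. The sketch below is illustrative, not normative: the names reverse_groups and xlate_apply are invented here, the value is assumed to hold the accessed bytes in its low bits with the lowest-addressed byte least significant, and the access width is assumed to have already passed the width check described above.

```c
#include <assert.h>
#include <stdint.h>

/* Slot numbers from the table in section 3. */
enum { IDENT, NSWAP8, BREV8, BSWAP16, BSWAP32, BSWAP64,
       HSWAP32, HSWAP64, WSWAP64, BREV16, BREV32, BREV64 };

/* Reverse the order of n groups of g bits in the low n*g bits of v. */
static uint64_t reverse_groups(uint64_t v, unsigned g, unsigned n)
{
    uint64_t mask = (1ULL << g) - 1;   /* g is at most 32 here */
    uint64_t out = 0;
    for (unsigned i = 0; i < n; i++)
        out |= ((v >> (i * g)) & mask) << ((n - 1 - i) * g);
    return out;
}

/* Apply `slot` to the low `width` bytes of v (hypothetical helper). */
uint64_t xlate_apply(unsigned slot, unsigned width, uint64_t v)
{
    assert(slot <= BREV64);
    assert(width == 1 || width == 2 || width == 4 || width == 8);
    if (width < 8)
        v &= (1ULL << (8 * width)) - 1;   /* keep only the accessed bytes */

    uint64_t out = 0;
    switch (slot) {
    case IDENT:
        return v;
    case NSWAP8:   /* swap the two nibbles inside every byte */
        for (unsigned i = 0; i < width; i++)
            out |= reverse_groups((v >> (8 * i)) & 0xFF, 4, 2) << (8 * i);
        return out;
    case BREV8:    /* reverse the eight bits inside every byte */
        for (unsigned i = 0; i < width; i++)
            out |= reverse_groups((v >> (8 * i)) & 0xFF, 1, 8) << (8 * i);
        return out;
    case BSWAP16: case BSWAP32: case BSWAP64:
        return reverse_groups(v, 8, width);        /* reverse byte order */
    case HSWAP32: case HSWAP64:
        return reverse_groups(v, 16, width / 2);   /* reverse halfword order */
    case WSWAP64:
        return reverse_groups(v, 32, 2);           /* exchange the two words */
    default:       /* BREV16/32/64: reverse every bit of the access */
        return reverse_groups(v, 1, 8 * width);
    }
}
```

For instance, xlate_apply(BSWAP32, 4, 0x11223344) yields 0x44332211, matching the ABCD → DCBA row of the table.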

3.2 Involutory Property

All v0.1 fixed slots are involutory: applying the translation twice yields the original value. This means a register configured with the same translator slot on both read and write (e.g., read = BSWAP32 and write = BSWAP32) acts as a "private host-order view" of foreign-format memory: loads convert in, stores convert back out, and the register always sees host-order values.

The involutory property is convenient but not architecturally required. v0.2 may introduce non-involutory slots (pixel pack/unpack, etc.).
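
The involutory property is easy to spot-check in software. The following sketch does so for two representative slots (BSWAP32 and BREV8) over a handful of sample values; the function names and the sample set are chosen here for illustration, and a passing check on samples is of course not a proof.

```c
#include <assert.h>
#include <stdint.h>

/* BSWAP32 (slot 4): reverse the four bytes of a 32-bit value. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* BREV8 (slot 2) on one byte: reverse its eight bits. */
uint8_t brev8(uint8_t b)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            out |= (uint8_t)(1u << (7 - i));
    return out;
}

/* Self-check: applying either transform twice restores every sample. */
void check_involutory(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x11223344u, 0x80000001u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t s = samples[i];
        assert(bswap32(bswap32(s)) == s);
        assert(brev8(brev8((uint8_t)s)) == (uint8_t)s);
    }
}
```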

4. Per-Register Configuration

Each GPR has two 4-bit translator selectors stored in CSRs: read translator (used when this register is the destination of a load) and write translator (used when this register is the source of a store). The selectors are organised into 8 CSRs, each holding the translator state for 16 registers.

4.1 Configuration CSRs

CSR	Address (suggested)	Type	Covers	Description
`xlate_rd_0`	`0x800`	URW	x0–x15	Read translators for x0..x15
`xlate_rd_1`	`0x801`	URW	x16–x31	Read translators for x16..x31
`xlate_rd_2`	`0x802`	URW	x32–x47	Read translators for x32..x47 (wide only)
`xlate_rd_3`	`0x803`	URW	x48–x63	Read translators for x48..x63 (wide only)
`xlate_wr_0`	`0x804`	URW	x0–x15	Write translators for x0..x15
`xlate_wr_1`	`0x805`	URW	x16–x31	Write translators for x16..x31
`xlate_wr_2`	`0x806`	URW	x32–x47	Write translators for x32..x47 (wide only)
`xlate_wr_3`	`0x807`	URW	x48–x63	Write translators for x48..x63 (wide only)

Each CSR is 64 bits, holding 16 × 4-bit fields. Field i (bits [4i+3 : 4i]) gives the translator slot selector for register base + i, where base is 0, 16, 32, or 48 for groups 0–3 respectively.

For example, the read translator for x5 is in xlate_rd_0 bits [23:20]. The write translator for x42 is in xlate_wr_2 bits [43:40].
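
The field arithmetic can be written out as a few helpers. This is a host-side sketch that works on a plain 64-bit copy of a CSR value; the helper names are hypothetical and it does not access any CSR itself.

```c
#include <assert.h>
#include <stdint.h>

/* CSR group (0-3) holding the selector for register x<reg>. */
unsigned xlate_group(unsigned reg)
{
    assert(reg < 64);
    return reg / 16;
}

/* Position of the selector's low bit within its CSR: 4 * (reg - base). */
unsigned xlate_shift(unsigned reg)
{
    assert(reg < 64);
    return 4 * (reg % 16);
}

/* Extract the 4-bit selector for x<reg> from a CSR value. */
unsigned xlate_get_field(uint64_t csr, unsigned reg)
{
    return (unsigned)((csr >> xlate_shift(reg)) & 0xF);
}

/* Return csr with the selector for x<reg> replaced by slot. */
uint64_t xlate_set_field(uint64_t csr, unsigned reg, unsigned slot)
{
    assert(slot < 16);
    uint64_t mask = 0xFULL << xlate_shift(reg);
    return (csr & ~mask) | ((uint64_t)slot << xlate_shift(reg));
}
```

With these, x5 maps to group 0 and shift 20, and x42 maps to group 2 and shift 40, agreeing with the two examples above.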

4.2 Reset State

All eight xlate_* CSRs reset to zero. This means every register has translator slot 0 (IDENT) for both read and write, so Xlate-aware FireStorm boots with all memory operations behaving as standard RV64.

4.3 x0 Special Case

Register x0 is the architectural zero register. Loads to x0 are no-ops (discarded); stores from x0 always write the value zero. The bits [3:0] of xlate_rd_0 and xlate_wr_0 (the x0 fields) are writable but observably ignored by the hardware — writes succeed, reads return whatever was written, but the values do not affect memory operation behaviour. This is the standard RV64 treatment of x0-related state and avoids special-casing the CSR write logic.

4.4 Configuration Idioms

Set the read translator for x10 to BSWAP32 (slot 4):

        li      t0, 4
        slli    t0, t0, (10 * 4)         ; shift into x10's field
        li      t1, 0xF                  ; mask for the field
        slli    t1, t1, (10 * 4)
        csrrc   x0, xlate_rd_0, t1       ; clear the field
        csrrs   x0, xlate_rd_0, t0       ; set the new value

6 instructions. Acceptable for setup code but verbose. The assembler provides a pseudo:

        XLATE_RD x10, BSWAP32                ; expands to the sequence above

Configure x10 to do round-trip byteswap (load and store both byteswap32):

        XLATE_RD x10, BSWAP32
        XLATE_WR x10, BSWAP32

Disable translation on x10:

        XLATE_RD x10, IDENT                  ; equivalently, XLATE_OFF_RD x10
        XLATE_WR x10, IDENT

Snapshot and restore translator state (for context switch or function-frame save):

save:
        csrr    s0, xlate_rd_0
        csrr    s1, xlate_rd_1
        csrr    s2, xlate_wr_0
        csrr    s3, xlate_wr_1
        ; ... save s0–s3 to stack

restore:
        ; ... reload s0–s3 from stack
        csrw    xlate_rd_0, s0
        csrw    xlate_rd_1, s1
        csrw    xlate_wr_0, s2
        csrw    xlate_wr_1, s3

For wide mode add the x32–x63 group CSRs (xlate_rd_2/3, xlate_wr_2/3).

5. Memory Operation Semantics

This section gives the precise architectural semantics of memory operations under Xlate.

5.1 Load Pipeline

A load instruction with destination register rd and width W proceeds as:

1. Compute effective address per the instruction format (rs1 + imm, indexed, PC-relative, etc.).
2. Read W bytes from memory at the effective address.
3. Look up rd's read translator slot in xlate_rd_*.
4. If the slot is width-specific and the load width W does not match the slot's required width, trap XLATE_WIDTH_MISMATCH.
5. Apply the translator to the W bytes.
6. Sign- or zero-extend the result to 64 bits per the instruction (signed for LB/LH/LW, unsigned for LBU/LHU/LWU, no extension for LD).
7. Write the result to rd (with extension nibble bits applied in wide mode).

Steps 3–5 are inserted before step 6 (sign extension). For width-specific slots, the trap in step 4 fires before any architectural state change.
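
The load steps can be traced with a small software model. The sketch below is an assumption-laden illustration rather than a description of the hardware: it assumes little-endian memory, starts from an already computed effective address (step 1), covers only IDENT and the three BSWAP slots, and uses invented names (model_load, required_width). The cause value 32 is the suggested number from §7.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOT_IDENT = 0, SLOT_BSWAP16 = 3, SLOT_BSWAP32 = 4, SLOT_BSWAP64 = 5 };
enum { XLATE_OK = 0, XLATE_WIDTH_MISMATCH = 32 };

/* Width in bytes required by a slot; 0 means any width. Subset only. */
static unsigned required_width(unsigned slot)
{
    switch (slot) {
    case SLOT_BSWAP16: return 2;
    case SLOT_BSWAP32: return 4;
    case SLOT_BSWAP64: return 8;
    default:           return 0;
    }
}

/* Returns a trap cause, or XLATE_OK after writing the result to *rd. */
int model_load(const uint8_t *mem, unsigned width, int is_signed,
               unsigned slot, uint64_t *rd)
{
    assert(width == 1 || width == 2 || width == 4 || width == 8);
    assert(slot == SLOT_IDENT || required_width(slot) != 0);

    /* Step 2: read W bytes, lowest address into the low byte. */
    uint64_t v = 0;
    for (unsigned i = 0; i < width; i++)
        v |= (uint64_t)mem[i] << (8 * i);

    /* Steps 3-4: slot lookup and width check; rd is untouched on a trap. */
    unsigned need = required_width(slot);
    if (need != 0 && need != width)
        return XLATE_WIDTH_MISMATCH;

    /* Step 5: apply the translator (byte reversal for the BSWAP slots). */
    if (need != 0) {
        uint64_t r = 0;
        for (unsigned i = 0; i < width; i++)
            r |= ((v >> (8 * i)) & 0xFF) << (8 * (width - 1 - i));
        v = r;
    }

    /* Step 6: extend from the translated value, not the raw bytes. */
    if (is_signed && width < 8 && ((v >> (8 * width - 1)) & 1))
        v |= ~0ULL << (8 * width);

    *rd = v;   /* Step 7 */
    return XLATE_OK;
}
```

Loading the bytes 80 00 00 01 with LW and BSWAP32 gives 0xFFFFFFFF80000001 in this model, because the sign bit is taken after the swap; LWU gives 0x80000001.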

5.2 Store Pipeline

A store instruction with source register rs2 and width W proceeds as:

1. Compute effective address.
2. Read 64 bits from rs2.
3. Truncate to the low W bytes.
4. Look up rs2's write translator slot in xlate_wr_*.
5. If the slot is width-specific and the store width W does not match, trap XLATE_WIDTH_MISMATCH.
6. Apply the translator to the W bytes.
7. Write the resulting W bytes to memory.
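
A matching software model of the store steps is sketched below. As with any such model it rests on assumptions: little-endian memory, an already computed effective address, support for only IDENT, NSWAP8 and BSWAP32, and an invented function name (model_store). The cause value 32 is the suggested number from §7.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOT_IDENT = 0, SLOT_NSWAP8 = 1, SLOT_BSWAP32 = 4 };
enum { XLATE_OK = 0, XLATE_WIDTH_MISMATCH = 32 };

/* Returns a trap cause, or XLATE_OK after writing `width` bytes to mem. */
int model_store(uint8_t *mem, unsigned width, unsigned slot, uint64_t rs2)
{
    assert(width == 1 || width == 2 || width == 4 || width == 8);
    assert(slot == SLOT_IDENT || slot == SLOT_NSWAP8 || slot == SLOT_BSWAP32);

    /* Read rs2 and keep only its low W bytes. */
    uint8_t bytes[8];
    for (unsigned i = 0; i < width; i++)
        bytes[i] = (uint8_t)(rs2 >> (8 * i));

    /* Slot lookup and width check: memory is untouched on a trap. */
    if (slot == SLOT_BSWAP32 && width != 4)
        return XLATE_WIDTH_MISMATCH;

    /* Apply the write translator to the W bytes. */
    if (slot == SLOT_BSWAP32) {
        uint8_t t;
        t = bytes[0]; bytes[0] = bytes[3]; bytes[3] = t;
        t = bytes[1]; bytes[1] = bytes[2]; bytes[2] = t;
    } else if (slot == SLOT_NSWAP8) {
        for (unsigned i = 0; i < width; i++)
            bytes[i] = (uint8_t)((bytes[i] << 4) | (bytes[i] >> 4));
    }

    /* Write the W bytes to memory, lowest address first. */
    for (unsigned i = 0; i < width; i++)
        mem[i] = bytes[i];
    return XLATE_OK;
}
```

Storing 0x11223344 with SW and BSWAP32 leaves the bytes 11 22 33 44 in memory in this model, which is the big-endian layout of the register value.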

5.3 Special Case: Load to x0

A load with rd = x0 is architecturally a no-op (the loaded value is discarded). The translator lookup is performed but the result is discarded. The memory access still occurs (and may trigger prefetch-buffer fills if the access targets code memory); width-mismatch trap behaviour is preserved.

5.4 Special Case: Auto-Inc and Indexed Loads/Stores

Xcrisp auto-increment loads (LBPI, LHPI, LWPI, LDPI, LBUPI, etc.; §3 of ee_xcrisp) and indexed loads (LBX, LHX, LWX, LDX; §8 of ee_xcrisp) all participate in Xlate translation per the standard semantics. The address-update side effects of auto-inc are unaffected.

Auto-increment stores (SBPI, SWPI, etc.) likewise apply the write translator to the value before writing memory.

5.5 Special Case: Memory-Fused Arithmetic

The Xcrisp memory-fused arithmetic instructions (§5 of ee_xcrisp) have three sub-families with distinct interactions:

Load-op (LWADD, LDADD, etc., funct3=000): the memory side is a load into rd; the read translator of rd applies normally.
Op-store (ADDSW, etc., funct3=001): the memory side is a store of the computed result; the write translator of rd applies. Under the op-store convention the rd field is interpreted as rs3: the register is read as the third source operand, and the value stored after the ALU op is translated with that same register's write translator.
Load-op-store (MMWADD, MMDADD, etc., funct3=011): both a load and a store occur. The load reads from mem[rs1]; the read translator of rs1 does not apply (the value is not written to a register; it goes directly into the ALU). The store writes to mem[rd]; the write translator of rd does not apply for the same reason. Load-op-store is therefore untranslated in v0.1. (This is an open item: see §14 for the future-work plan to add load-op-store translator integration.)

5.6 Block Memory and DMA

Block memory operations (BMCPY, BMSET, DMACPY, DMASET; §5.5 of ee_xcrisp) transfer bytes between memory regions without passing them through GPRs. Translators do not apply. The bytes are copied or set verbatim.

If translation of bulk-copied data is needed, software must do one of the following:

Use a loop of single-element loads and stores (slow but translator-aware).
Pre/post-process the data with explicit translator-aware code.
Set up the destination buffer in the translated format and use raw block ops.

A future Xcrisp v0.2 may add a translator-aware block copy (BMCPYT) if the use case justifies it.

5.7 Atomic Operations

Atomic memory operations (LR, SC, AMO*) are untranslated in v0.1. Atomics typically operate on lock-discipline data structures where bit-level transformation would defeat their purpose; the case for translator interaction is weak. This may be revisited in v0.2 if needed.

6. Width Compatibility Table

The following table summarises which translator slots are compatible with which load/store widths:

Slot	LB/SB (8)	LH/SH (16)	LW/SW (32)	LD/SD (64)
IDENT (0)	✓	✓	✓	✓
NSWAP8 (1)	✓	✓	✓	✓
BREV8 (2)	✓	✓	✓	✓
BSWAP16 (3)	✗	✓	✗	✗
BSWAP32 (4)	✗	✗	✓	✗
BSWAP64 (5)	✗	✗	✗	✓
HSWAP32 (6)	✗	✗	✓	✗
HSWAP64 (7)	✗	✗	✗	✓
WSWAP64 (8)	✗	✗	✗	✓
BREV16 (9)	✗	✓	✗	✗
BREV32 (10)	✗	✗	✓	✗
BREV64 (11)	✗	✗	✗	✓

A ✗ entry means the instruction traps with cause XLATE_WIDTH_MISMATCH. Software can probe support by attempting a load with a specific width and slot and catching the trap.
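
The table reduces to one small lookup per slot. The sketch below encodes each row as a bitmask of permitted widths; the table name, the function name and the bitmask encoding are choices made here for illustration and are not part of the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Permitted access widths per slot: bit 0 = 1 byte, bit 1 = 2 bytes,
 * bit 2 = 4 bytes, bit 3 = 8 bytes. The array index is the slot number. */
static const uint8_t xlate_width_mask[12] = {
    0xF, 0xF, 0xF,   /* IDENT, NSWAP8, BREV8: any width */
    0x2, 0x4, 0x8,   /* BSWAP16, BSWAP32, BSWAP64       */
    0x4, 0x8, 0x8,   /* HSWAP32, HSWAP64, WSWAP64       */
    0x2, 0x4, 0x8    /* BREV16, BREV32, BREV64          */
};

/* Returns 1 if the slot/width pair is permitted, 0 if it would trap. */
int xlate_width_ok(unsigned slot, unsigned width_bytes)
{
    assert(slot < 12);
    switch (width_bytes) {
    case 1: return (xlate_width_mask[slot] >> 0) & 1;
    case 2: return (xlate_width_mask[slot] >> 1) & 1;
    case 4: return (xlate_width_mask[slot] >> 2) & 1;
    case 8: return (xlate_width_mask[slot] >> 3) & 1;
    default:
        assert(0 && "width must be 1, 2, 4 or 8 bytes");
        return 0;
    }
}
```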

6.1 Signed vs Unsigned Loads

LB and LBU both load one byte; they differ only in the final sign/zero extension to 64 bits. Both work identically with per-byte translators (slots 0–2) — the translator runs on the byte, then sign/zero extension applies. The translator does not distinguish LB from LBU.

For width-specific translators (BSWAP16 etc.), the same rule: the translator runs on the loaded bytes, then sign/zero extension. So LH (signed halfword) with BSWAP16 produces a byteswapped halfword that is then sign-extended to 64 bits. LHU produces the same byteswap but zero-extended.
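
A worked example makes the ordering concrete. In the sketch below the argument is the 16-bit value as read from memory (lowest-addressed byte in the low bits, an assumption of little-endian layout), and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* BSWAP16 on the two loaded bytes. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* LH with BSWAP16: translate first, then sign-extend to 64 bits. */
uint64_t lh_bswap16(uint16_t loaded)
{
    uint64_t v = bswap16(loaded);
    if (v & 0x8000)
        v |= ~0xFFFFULL;
    return v;
}

/* LHU with BSWAP16: same translation, then zero-extend. */
uint64_t lhu_bswap16(uint16_t loaded)
{
    return bswap16(loaded);
}

/* Worked example: the sign comes from the translated halfword. */
void sign_extension_example(void)
{
    /* Memory bytes 80 01 read as 0x0180; swapped they become 0x8001. */
    assert(lh_bswap16(0x0180)  == 0xFFFFFFFFFFFF8001ULL);
    assert(lhu_bswap16(0x0180) == 0x0000000000008001ULL);
    /* Memory bytes 01 80 read as 0x8001; swapped they become 0x0180,
     * so LH yields a positive value even though the raw halfword has
     * its top bit set. */
    assert(lh_bswap16(0x8001)  == 0x0000000000000180ULL);
}
```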

7. Trap Causes

Cause	Mnemonic	Trigger
`32`	XLATE_WIDTH_MISMATCH	Load/store width does not match the configured width-specific translator slot
`33`	XLATE_RESERVED_SLOT	Attempted use of a reserved translator slot (12–15)
`34`	XLATE_PRIVILEGE	(Reserved for future privilege-checked translator features)

Cause numbers are suggested; final assignment requires coordination with the other FireStorm trap-cause allocations.

On a translator trap, the architectural state holds:

PC pointing at the trapping instruction
The destination register (loads) is not written
The memory access is not performed (stores leave memory unchanged; loads do not commit a value to rd)

The trap handler may inspect mtval (or stval/utval per privilege) to recover the offending memory address; the offending translator slot can be recovered by reading the appropriate xlate_* CSR.

8. Examples

All examples have been hand-verified against the slot definitions in §3 and the semantic rules in §5.

8.1 Network-Protocol Parser

Reading a stream of big-endian 32-bit fields from a network buffer into host-order values.

Standard RV64GC (no Xlate, with Zbb):

parse_loop:
        lw      t0, 0(a0)                ; load big-endian
        rev8    t0, t0                   ; Zbb: byte reverse 64-bit
        srli    t0, t0, 32               ; right-justify the reversed 32 bits
        ; ... process t0 ...
        addi    a0, a0, 4
        bne     a0, a1, parse_loop

Inner loop: 5 instructions including the loop branch; 3 of them (lw, rev8, srli) make up the load step.

With Xlate (one-time setup outside the loop):

        XLATE_RD t0, BSWAP32                 ; configure once
parse_loop:
        lw      t0, 0(a0)                    ; loaded value auto-byteswapped
        ; ... process t0 ...
        addi    a0, a0, 4
        bne     a0, a1, parse_loop

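Inner loop: 3 instructions. Saves 2 instructions per iteration: the rev8 and the srli that follows it (rev8 reverses all 64 bits, so the shift is needed to bring the reversed word back down into the low 32 bits). Note that the Zbb sequence leaves the value zero-extended, whereas lw through the translator sign-extends it (§5.1); use lwu where the zero-extended form is required.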

For a 100-field parse: ~200 fewer instructions executed.
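
The two load sequences can be compared in a host-side model. The sketch below is only a model with invented function names: the argument w is the 32-bit word as lw would read it from little-endian memory, and the comparison is on the low 32 bits because the two sequences extend the result differently.

```c
#include <assert.h>
#include <stdint.h>

/* Zbb rev8 on RV64: reverse all eight bytes of the register. */
static uint64_t rev8(uint64_t v)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++)
        out |= ((v >> (8 * i)) & 0xFF) << (8 * (7 - i));
    return out;
}

/* Sign-extend a 32-bit value to 64 bits, as lw does. */
static uint64_t sext32(uint32_t w)
{
    return (w & 0x80000000u) ? (0xFFFFFFFF00000000ULL | w) : w;
}

/* Model of: lw ; rev8 ; srli 32 */
uint64_t zbb_sequence(uint32_t w)
{
    return rev8(sext32(w)) >> 32;
}

/* Model of: lw with read translator BSWAP32 (swap, then sign-extend). */
uint64_t xlate_lw_bswap32(uint32_t w)
{
    uint32_t s = (w >> 24) | ((w >> 8) & 0xFF00u) |
                 ((w << 8) & 0xFF0000u) | (w << 24);
    return sext32(s);
}

/* The low 32 bits of the two results always agree. */
void check_low_words_agree(uint32_t w)
{
    assert((uint32_t)zbb_sequence(w) == (uint32_t)xlate_lw_bswap32(w));
}
```

For the memory bytes 12 34 56 78 (w = 0x78563412) both models return 0x12345678. For the bytes 80 00 00 01 the Zbb model returns 0x80000001 while the translated lw returns 0xFFFFFFFF80000001.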

8.2 SPI Byte Transmit with Bit Reversal

A hardware SPI controller expects LSB-first byte transmission; the data in memory is stored MSB-first.

Standard RV64GC (with Zbkb); a2 holds the address of the SPI data register:

spi_send_loop:
        lb      t0, 0(a0)
        brev8   t0, t0                   ; Zbkb: bit reverse each byte
        sb      t0, 0(a2)                ; write to the SPI data register
        addi    a0, a0, 1
        bne     a0, a1, spi_send_loop

Inner loop: 4 instructions plus branch.

With Xlate (configure write translator on t0 to BREV8):

        XLATE_WR t0, BREV8
spi_send_loop:
        lb      t0, 0(a0)                ; load normal MSB-first byte
        sb      t0, 0(a2)                ; auto-bit-reversed on store
        addi    a0, a0, 1
        bne     a0, a1, spi_send_loop

Inner loop: 3 instructions plus branch. Saves 1 instruction per byte: 25% of the loop body, or 20% counting the branch.

Without Zbkb, the standard form is much worse (5–6 instructions for the bit-reverse shift/mask sequence per byte), and the Xlate win is correspondingly larger.

8.3 Mixed-Endian DSP Data Block

Some legacy DSP file formats store 64-bit values as two 32-bit halves in reversed order — the high word first, low word second (PDP-style mid-endian within dword).

Standard RV64GC:

        ld      t0, 0(a0)               ; load 8 bytes as one dword
        slli    t1, t0, 32              ; shift low → high
        srli    t0, t0, 32              ; shift high → low
        or      t0, t0, t1              ; combine swapped halves

4 instructions.

With Xlate (read translator on t0 = WSWAP64):

        ld      t0, 0(a0)               ; auto-wordswapped on load

1 instruction. Saves 3 instructions per access.

8.4 BCD-to-Display Digit Reorder

BCD-encoded values often need nibble reordering for display (the high nibble is the high digit; some displays want them swapped). With Xlate slot 1 (NSWAP8), this is a per-byte free operation:

Standard RV64GC:

        lb      t0, 0(a0)
        srli    t1, t0, 4
        andi    t1, t1, 0x0F
        slli    t0, t0, 4
        andi    t0, t0, 0xF0
        or      t0, t0, t1

6 instructions (no standard single-instruction nibble swap).

With Xlate:

        XLATE_RD t0, NSWAP8                  ; configure once
        lb      t0, 0(a0)                    ; auto-nibble-swapped

1 instruction (after one-time setup). Saves 5 instructions per byte.

8.5 Round-Trip Translation (Foreign-Format Read-Modify-Write)

A memory region holds big-endian 32-bit values that need to be modified. Without Xlate, every read needs byteswap-in and every write needs byteswap-out.

Standard RV64GC (Zbb):

        lw      t0, 0(a0)               ; load big-endian
        rev8    t0, t0                  ; in to host
        srli    t0, t0, 32              ; right-justify
        addi    t0, t0, 1               ; modify
        slli    t0, t0, 32              ; left-justify
        rev8    t0, t0                  ; back to big-endian
        sw      t0, 0(a0)               ; store

7 instructions for one read-modify-write.

With Xlate (configure t0 for round-trip BSWAP32):

        XLATE_RD t0, BSWAP32                 ; configure once
        XLATE_WR t0, BSWAP32
loop:
        lw      t0, 0(a0)                    ; auto-byteswap in
        addi    t0, t0, 1                    ; modify in host order
        sw      t0, 0(a0)                    ; auto-byteswap out

3 instructions in the loop. Saves 4 instructions per RMW. The involutory property (§3.2) makes the round-trip transparent.
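
The round trip can be followed byte by byte in a host-side model. The sketch below assumes little-endian host memory and uses invented helper names; it models only the three-instruction loop body for a register configured with BSWAP32 on both read and write.

```c
#include <assert.h>
#include <stdint.h>

/* BSWAP32 (slot 4): reverse the four bytes of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xFF00u) |
           ((v << 8) & 0xFF0000u) | (v << 24);
}

/* lw through a BSWAP32 read translator. */
static uint32_t lw_xlate(const uint8_t *p)
{
    uint32_t raw = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return bswap32(raw);
}

/* sw through a BSWAP32 write translator. */
static void sw_xlate(uint8_t *p, uint32_t v)
{
    uint32_t raw = bswap32(v);
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(raw >> (8 * i));
}

/* Model of the loop body: lw ; addi 1 ; sw. */
void increment_big_endian_field(uint8_t *field)
{
    assert(field != 0);
    sw_xlate(field, lw_xlate(field) + 1u);
}
```

Starting from the bytes 00 00 00 FF (big-endian 255), one call leaves 00 00 01 00 (big-endian 256) in memory, so the carry propagates in host order even though the stored format is big-endian.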

9. ABI and Function-Call Interaction

Translator state is treated symmetrically with the register-value preservation rules of the standard lp64d ABI:

Translator state for caller-saved integer registers (t0–t6, a0–a7, ra) is caller-saved. A function that needs a translator on these registers configures it inside the function's own scope; a function assumes nothing about this state on entry, and the state is undefined on return. Floating-point registers carry no translator state in v0.1 (§1.2).
Translator state for callee-saved integer registers (s0–s11, sp) and for gp and tp is callee-saved. A function that overwrites these translators must save and restore them, just as it would save and restore the register values themselves.
Translator state for x0 is meaningless (§4.3) and need not be saved.

The standard Xstack PUSH/POP and Zcmp cm.push/cm.pop instructions save/restore register values, not translator state. Code that uses non-identity translators on callee-saved registers must explicitly save the relevant xlate_* CSRs in addition to the register values. For example:

function_entry:
        XLATE_SAVE s0, s1               ; pseudo: save s0/s1's translator slots to a temporary
        XLATE_RD s0, BSWAP32
        XLATE_RD s1, BSWAP32
        ; ... function body using s0 and s1 with byteswap32 translation ...
function_exit:
        XLATE_RESTORE s0, s1            ; pseudo: restore translator slots
        ret

The XLATE_SAVE and XLATE_RESTORE pseudos expand into CSR read/write sequences. The save/restore work scales linearly with the number of registers being adjusted; for most functions only one or two translator changes are needed.

9.1 Standard Library Convention

Standard library functions (memcpy, strlen, printf, etc.) are compiled assuming all registers have identity translators on entry. Code that calls libc with non-identity translators on argument-passing registers (a0–a7) must reset those translators to IDENT before the call, since libc functions assume standard load/store behaviour.

A future ABI revision may relax this by requiring libc to be translator-aware, but v0.1 keeps the boundary clean.

10. Interaction with Other Extensions

10.1 Xwide

Wide-mode access to x32–x63 requires the wide-mode CSRs xlate_rd_2/3 and xlate_wr_2/3 (§4.1). In narrow mode these CSRs may not exist (CSR access traps) or may be present but unused (the upper-register fields have no effect since narrow-mode code cannot reference x32–x63).

Translator state for x32–x63 is caller-saved (matching the parent doc's treatment of those registers as caller-saved scratch).

10.2 Xcrisp

All Xcrisp memory operations interact with Xlate per §5.4 and §5.5. Summary:

Xcrisp instruction family	Translator applies?
Auto-inc loads/stores (LWPI, SWPI, etc.)	Yes
Indexed loads (LWX, LDX, etc.)	Yes
Load-op (LWADD, etc.) — load side	Yes (on rd's read translator)
Op-store (ADDSW, etc.) — store side	Yes (on rd's write translator)
Load-op-store (MMWADD, etc.)	No (open item §14)
Block memory (BMCPY, BMSET)	No
DMA (DMACPY, DMASET)	No
Compare-mem-branch (BEQM, etc.)	No — comparison reads bypass register pipeline
PIC loads (LDPC, LWPC, etc.)	Yes (on rd's read translator)
JALXPC indirect jump table lookup	No (the loaded target is a control transfer, not a register write)

10.3 Xcond

Predicated R-type instructions (§5 of ee_xcond) do not access memory and are unaffected by Xlate. If a predicated instruction has a false predicate, no register write occurs and no translator lookup is needed.

10.4 Xstack

Xstack push/pop instructions (§4 of ee_xstack) move register values to/from the hardware BSRAM stack. These are conceptually register-to-register transfers (the BSRAM is a fast-path backing store, not generic memory), and translators do not apply. A function using translators saves and restores them via the standard CSR mechanism, not via PUSH/POP.

10.5 Xmath

Xmath instructions (see ee_xmath) are register-to-register and do not access memory, so translators do not apply to the operation itself — the same as Xcond predicated ALU ops (§10.3). Translators apply only to the loads and stores that move Xmath operands in and out of registers.

This composes usefully with Xmath's G12 multi-precision group (ADDC/SUBC/ROLC/RORC): a bignum stored big-endian in memory can be loaded through a byteswap read-translator so its limbs arrive in host order for the carry-chain arithmetic, then written back through the matching write-translator — the involutory round-trip of §3.2, applied to multi-word integers.

Xmath's xcarry CSR (the G12 carry/borrow bit, ee_xmath §14.1) is allocated at 0x808, immediately after this extension's translator-config block at 0x800–0x807. It is ordinary CSR state accessed with standard CSR instructions, not a memory operand, so translators never touch a carry read, set, or clear.

11. Compiler and Toolchain Integration

11.1 Target Flag

The +xlate target feature enables Xlate emission. The full FireStorm feature set is +xfirestorm = +xwide,+xcrisp,+xstack,+xcond,+xlate.

11.2 Intrinsics

Compiler intrinsics for translator configuration:

#include <riscv_xlate.h>

void __xlate_rd(int reg, enum xlate_slot slot);   /* set read translator */
void __xlate_wr(int reg, enum xlate_slot slot);   /* set write translator */
int  __xlate_get_rd(int reg);                     /* read current read translator */
int  __xlate_get_wr(int reg);                     /* read current write translator */

The enum xlate_slot provides named constants matching §3: XLATE_IDENT, XLATE_BSWAP32, etc.
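
One plausible shape for that enum is sketched below. Only XLATE_IDENT and XLATE_BSWAP32 are named in this document; the remaining constant names are extrapolated from the §3 mnemonics and are assumptions, as is the xlate_slot_name helper, which is not part of riscv_xlate.h as described here.

```c
#include <assert.h>

/* Hypothetical enum xlate_slot; the values are the section 3 slot numbers. */
enum xlate_slot {
    XLATE_IDENT   = 0,
    XLATE_NSWAP8  = 1,
    XLATE_BREV8   = 2,
    XLATE_BSWAP16 = 3,
    XLATE_BSWAP32 = 4,
    XLATE_BSWAP64 = 5,
    XLATE_HSWAP32 = 6,
    XLATE_HSWAP64 = 7,
    XLATE_WSWAP64 = 8,
    XLATE_BREV16  = 9,
    XLATE_BREV32  = 10,
    XLATE_BREV64  = 11
};

/* Mnemonic for a 4-bit selector, for diagnostics (illustrative helper). */
const char *xlate_slot_name(unsigned slot)
{
    static const char *const names[12] = {
        "IDENT", "NSWAP8", "BREV8", "BSWAP16", "BSWAP32", "BSWAP64",
        "HSWAP32", "HSWAP64", "WSWAP64", "BREV16", "BREV32", "BREV64"
    };
    assert(slot < 16);
    return slot < 12 ? names[slot] : "reserved";
}
```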

11.3 Attribute-Based Translator Hints

For variables consistently loaded/stored from foreign-format memory, the compiler accepts an attribute:

__attribute__((xlate_load("bswap32"), xlate_store("bswap32")))
int network_value;

The compiler arranges for the variable's register-resident form to use the specified translators throughout its live range. This is a hint; the compiler may decline (e.g., if the variable's storage class doesn't fit translator semantics).

11.4 Inline Assembly

GCC/Clang inline-assembly constraints:

The constraint xr<N> requests a register with read translator slot N.
The constraint xw<N> requests a register with write translator slot N.
The compiler arranges the necessary CSR setup before the asm block and restoration after.

12. Detection

CSR	Address (suggested)	Privilege	Description
`mxlate`	`0xFC4`	MRO	Xlate version and feature bits

Bit layout of mxlate:

Bits	Field	Meaning
`[0]`	PRESENT	1 if Xlate implemented
`[7:1]`	VERSION	Xlate version (1 = v0.1)
`[19:8]`	SLOTS_IMPLEMENTED	One bit per slot 1–11: field bit `i-1` (mxlate bit `7+i`) is 1 if slot `i` is implemented. The top field bit (mxlate bit 19) has no slot assigned in v0.1
`[20]`	HAS_WIDE_CSRS	1 if `xlate_*_2/3` (wide-register CSRs) are implemented
`[63:21]`	reserved	—

A minimum-conforming implementation may implement only slot 0 (mandatory) and a subset of slots 1–11; software probes SLOTS_IMPLEMENTED to discover availability. A reduced FireStorm variant may implement Xlate only for x0–x31 (narrow-mode register space) and clear HAS_WIDE_CSRS.
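
Decoding the mxlate value is a matter of a few shifts and masks. The sketch below operates on a value that has already been read from the CSR (reading it requires machine mode); the function names are invented for illustration, and the bit positions follow the suggested layout above.

```c
#include <assert.h>
#include <stdint.h>

/* PRESENT, bit 0. */
int mxlate_present(uint64_t m)
{
    return (int)(m & 1);
}

/* VERSION, bits [7:1]; 1 means v0.1. */
unsigned mxlate_version(uint64_t m)
{
    return (unsigned)((m >> 1) & 0x7F);
}

/* HAS_WIDE_CSRS, bit 20. */
int mxlate_has_wide(uint64_t m)
{
    return (int)((m >> 20) & 1);
}

/* 1 if `slot` is usable. Slot 0 is mandatory whenever Xlate is present;
 * slots 1-11 are advertised in SLOTS_IMPLEMENTED, where slot i lives at
 * field bit i-1, which is mxlate bit 7+i. Slots 12-15 are reserved. */
int mxlate_slot_implemented(uint64_t m, unsigned slot)
{
    assert(slot < 16);
    if (!mxlate_present(m))
        return 0;
    if (slot == 0)
        return 1;
    if (slot > 11)
        return 0;
    return (int)((m >> (7 + slot)) & 1);
}
```

A variant without Xlate hardware returns zero from mxlate (§2), so every query above reports 0 for it.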

13. Encoding Summary

Xlate adds no new instructions and no new opcodes. Its only architectural footprint is:

8 user-writable CSRs at addresses 0x800–0x807 holding per-register translator slot selectors.
1 machine-read-only CSR at 0xFC4 (mxlate) for detection.
3 new trap cause codes (32–34) for width mismatch, reserved slot, and privilege violation.

The eight config CSRs are read and written with the standard csrrw/csrrs/csrrc instructions (and their immediate forms) — Xlate defines no CSR-access instructions of its own. Note that the next user CSR, 0x808, is allocated to Xmath's xcarry (G12 carry bit, ee_xmath §14.1); the translator block should not be extended past 0x807 without re-coordinating that allocation.

The translator logic is implemented inside the existing load/store pipeline. Decoders treat all loads and stores as Xlate-aware; the translator slot is looked up at issue (for stores) or at writeback (for loads), and the data path applies the transformation.

14. Open Items

CSR addresses. All suggested values are placeholders; final assignment requires coordination with mxcrisp (0xFC1), mxstack (0xFC2), mxcond (0xFC3), and the wide-dirty CSR.
Programmable translator slots. Slots 12–15 are reserved; one or more could be a software-defined bit permutation specified by an additional descriptor CSR. The hardware cost is non-trivial (general bit-permutation network) but the flexibility is high.
Pixel format conversion. RGB565 ↔ RGB888 expand/pack, ARGB ↔ RGBA channel rotation, alpha blending pre/post-multiply — all useful for retro and embedded graphics. Candidate for slots 12–13.
Saturation translators. Saturating signed 32 → 16 (audio sample clamping) and 64 → 32 are useful DSP primitives. Currently must be done with explicit Zbb min/max.
FPR translators. Float load/store with byte-swap is useful for network-format float exchange. Adds another 8 CSRs (for f0–f63 in wide mode). v0.2 candidate.
Load-op-store translator integration. The MMW family (§5.5; §5 of ee_xcrisp) currently bypasses translators. Adding translation to the memory-side of MMW would let the in-place increment of a foreign-format buffer be a single instruction.
Per-instruction override. Sometimes a register that's normally translated needs a one-off untranslated access. v0.1 requires reconfiguring the slot; v0.2 could add an "ignore translator" instruction-level prefix or a separate untranslated load/store opcode.
Atomic operation translators. v0.1 does not translate atomics (§5.7). If a use case emerges (e.g., a lock-free queue where data items are big-endian on disk and host-order in memory), this could be added.
Width-mismatch policy. v0.1 traps on width mismatch (§6). An alternative — apply identity translation silently — is more forgiving but masks bugs. A per-implementation or per-register strict/permissive mode bit may be considered.
Block-copy translator-aware variant (BMCPYT). A translator-aware block copy that applies a translator slot to every byte/halfword/word/dword as it copies. Useful for bulk endian conversion of large buffers. v0.2 candidate.

15. Glossary

Term	Meaning
Translator	A fixed bit/byte transformation applied to data flowing between memory and a GPR.
Translator slot	A 4-bit selector identifying one of 16 predefined (or future programmable) translator operations.
Read translator	The translator applied when a register is the destination of a load.
Write translator	The translator applied when a register is the source of a store.
Identity (IDENT)	Slot 0; no transformation. Default state on reset.
Per-byte translator	A translator that operates independently on each byte (slots 0–2); width-agnostic.
Width-specific translator	A translator that requires a specific load/store width (slots 3–11); traps on mismatch.
Involutory	A property of v0.1 slots: applying the translation twice yields the original. Lets paired read/write translators give a "host-order view" of foreign-format memory.

End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension.