FireStorm Xstack Extension — Hardware Stack Specification
Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companion: FireStorm Xcrisp Extension
See also: FireStorm Performance Examples for worked comparisons
1. Overview
The Xstack extension adds three hardware-accelerated stacks to FireStorm, one per privilege level: User Stack (U-mode and above), Supervisor Stack (S-mode and above), and Machine Stack (M-mode only). Each stack is backed by a dedicated FPGA BSRAM block with a wide port for multi-register push/pop in a single cycle, addressed by its own hardware stack pointer CSR with paired base/limit registers for overflow trapping.
The extension is opt-in: standard RV64 stack code using x2 (sp) continues to use DDR3 exactly as before. Code that wants the speedup uses the new Xstack instructions, which target the hardware stacks via their own pointer CSRs (usp, ssp, msp) and do not touch x2. There is no transparent redirection — the two stack mechanisms coexist, and the compiler chooses per-function which to use.
The extension is encoded entirely in 32-bit instructions within the custom-2 opcode space (0x5B, funct3 = 100–110) and is available in both narrow and wide modes (§3 of the parent doc).
1.1 Wins
A push/pop of 8 registers through standard memory takes 8 instructions and at least 8 cycles, plus cache pollution from the stack churn. The Xstack equivalent is one instruction that completes in one cycle on a 576-bit BSRAM port (8 dwords/cycle), or in two cycles on the baseline 288-bit port (4 dwords/cycle); see §3.2. The win compounds for typical use:
- Function prologue/epilogue. A function saving ra + s0–s7 and adjusting the frame pointer collapses to a single PUSH on entry and a single POPRET on exit: 2 instructions total for the function-frame overhead, versus ~18 in standard RV64 code.
- Trap entry/exit. Saving the live caller-saved set on interrupt entry becomes a single instruction. Total interrupt entry overhead falls from a typical 25–40 cycles to under 10.
- Coroutine yield / green-thread schedule. A whole context switch can be a stack-pointer swap (one instruction) if the coroutine's state is entirely on the hardware stack.
- No D-cache pressure. Hot loops with deep call graphs do not pass save/restore traffic through the small 8 KB D-cache when entering/exiting helper functions — the entire frame traffic stays in the dedicated Xstack BSRAM. On FireStorm's tiny-cache memory subsystem, this matters disproportionately: the D-cache is small enough that frame churn from deep call graphs would quickly evict useful data.
- Hardware bounds trap. Stack overflow is caught at the cycle it occurs, with no software checks. Useful for security-sensitive code and for catching deep-recursion bugs in development.
1.2 Non-Goals
- Not a replacement for x2. The standard sp and the conventional DDR3-resident stack remain the default for compatibility, large stack frames, and code that takes the address of a stack variable (&local).
- Not a heap. The hardware stacks are LIFO-only at the architectural level (random access via load/store is permitted but discouraged; see §4.3.1).
- Not unlimited. Each hardware stack is small (16–64 KiB). Programs needing more stack must fall back to the DDR3 stack.
2. Relationship to Standard RISC-V
The Xstack extension introduces no changes to standard RV64 calling conventions or ABI. The standard sp (x2) is unaffected; argument passing in a0–a7, return values, and frame-pointer conventions remain unchanged.
A function compiled with +xstack may emit Xstack push/pop instructions in its prologue and epilogue in addition to or instead of standard frame management. Mixed-mode is permitted within a single function: a function may PUSH callee-saved registers to the hardware stack and allocate large locals via x2/DDR3 in the same prologue.
The mxstack CSR (§6) advertises Xstack presence, stack capacities, and port width.
2.1 Coexistence with Zcmp
The standard Zcmp extension (cm.push/cm.pop/cm.popret/cm.popretz/cm.mvsa01/cm.mva01s) compresses standard-stack prologue/epilogue sequences. Zcmp and Xstack are independent and complementary:
- Zcmp targets the standard x2-relative stack in DDR3 with 16-bit compressed instructions.
- Xstack targets the hardware stack via usp/ssp/msp with 32-bit instructions.
A function may use Zcmp for its outer frame (large locals, address-taken variables) and Xstack for fast callee-saved spill, though typical code chooses one or the other per function based on the compiler's analysis.
3. Hardware Stack Architecture
3.1 Three Stacks, One per Privilege Level
| Stack | Pointer CSR | Base CSR | Limit CSR | Backing BSRAM | Accessible from |
|---|---|---|---|---|---|
| User | usp | usb | usl | U-Stack BSRAM block | U, S, M modes |
| Supervisor | ssp | ssb | ssl | S-Stack BSRAM block | S, M modes only |
| Machine | msp | msb | msl | M-Stack BSRAM block | M mode only |
Attempting to access a higher-privilege stack from a lower-privilege mode raises an illegal-instruction exception. The user stack is accessible from all modes (a privileged trap handler may legitimately need to inspect or unwind the interrupted user code's stack).
3.2 Sizing and Implementation
Each stack is backed by a dedicated FPGA BSRAM block. Per-variant sizing (suggested, not architectural):
| FireStorm variant | U-Stack | S-Stack | M-Stack | Total BSRAM cost |
|---|---|---|---|---|
| GW5AST-138 | 64 KiB | 32 KiB | 16 KiB | 112 KiB |
Actual sizes are reported in implementation-specific fields of the mxstack CSR (§6); software must not assume any particular capacity beyond what the CSR reports.
The BSRAM port width determines the maximum number of registers transferred per cycle by a single PUSH/POP instruction:
| Port width | Dwords/cycle | Typical fit |
|---|---|---|
| 72-bit | 1 | minimal — push/pop is one register per cycle |
| 144-bit | 2 | lower-cost option |
| 288-bit | 4 | baseline |
| 576-bit | 8 | aggressive variant |
A PUSH or POP of N registers takes ⌈N / dwords-per-cycle⌉ cycles. The instruction encoding is unchanged across implementations; only throughput varies. The actual port width is reported, as log₂ of the dwords-per-cycle figure, in mxstack[PORT_WIDTH_LOG2] (§6.2).
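The cycle formula can be checked with a small model. This Python sketch is illustrative only: the function name `transfer_cycles` and the lookup table are assumptions made for the example, with the dwords-per-cycle values copied from the port-width table above.

```python
import math

# Dwords per cycle for each port width listed in the table above.
DWORDS_PER_CYCLE = {72: 1, 144: 2, 288: 4, 576: 8}

def transfer_cycles(n_regs: int, port_bits: int) -> int:
    """Cycles for one PUSH/POP of n_regs registers: ceil(N / dwords-per-cycle)."""
    if n_regs < 0:
        raise ValueError("register count must be non-negative")
    return math.ceil(n_regs / DWORDS_PER_CYCLE[port_bits])

print(transfer_cycles(8, 288))   # 2
print(transfer_cycles(13, 288))  # 4
print(transfer_cycles(16, 72))   # 16
```

Fixed per-instruction overhead (decode, bounds check) is not modelled here.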
3.3 Layout and Growth Direction
Each stack grows downward (toward lower addresses), matching standard RISC-V convention. The pointer CSR *sp holds the address of the most recently pushed dword; the next push targets *sp - 8.
- *sb (base) holds the initial value of *sp when the stack is empty, which is one dword past the highest valid slot (§3.5).
- *sl (limit) holds the lowest valid address; the overflow trap fires when a push would result in *sp < *sl.
The BSRAM is addressed by the low bits of *sp modulo the stack's capacity; the high bits are checked against base/limit but not used for indexing. This means each hardware stack occupies a fixed logical address range from *sl to *sb, which software may treat as an ordinary memory region for direct load/store via *sp as a base register (see §3.4 and §4.3.1).
3.4 Memory Mapping
Each hardware stack is mapped into the physical address space at a fixed range, allowing arbitrary load/store via the stack pointer or a base register. The mappings (suggested):
| Stack | Physical range |
|---|---|
| U-Stack | 0xE000_0000 – 0xE000_FFFF (64 KiB max) |
| S-Stack | 0xE001_0000 – 0xE001_7FFF (32 KiB max) |
| M-Stack | 0xE001_8000 – 0xE001_BFFF (16 KiB max) |
Access to a stack region from an insufficient privilege level traps as a load/store access fault. Reads and writes via *sp use these physical addresses transparently.
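As a rough illustration of the mapping and privilege rule, the sketch below classifies a physical address against the suggested ranges. The names `locate` and `access_allowed` are hypothetical, and the ranges are the suggested (non-architectural) ones from the table above.

```python
# Suggested physical ranges from the table above (inclusive bounds).
STACK_RANGES = {
    "U": (0xE000_0000, 0xE000_FFFF),
    "S": (0xE001_0000, 0xE001_7FFF),
    "M": (0xE001_8000, 0xE001_BFFF),
}
# Privilege ordering: U < S < M.
PRIV_RANK = {"U": 0, "S": 1, "M": 2}

def locate(addr):
    """Return (stack name, byte offset into that stack's range), or None."""
    for name, (low, high) in STACK_RANGES.items():
        if low <= addr <= high:
            return name, addr - low
    return None

def access_allowed(addr, mode):
    """True if a load/store from `mode` may touch the stack region at addr."""
    hit = locate(addr)
    if hit is None:
        raise ValueError("address is outside the Xstack ranges")
    stack, _ = hit
    return PRIV_RANK[mode] >= PRIV_RANK[stack]

print(locate(0xE001_7FF8))               # ('S', 32760)
print(access_allowed(0xE001_8000, "S"))  # False
```

A `False` result corresponds to the load/store access fault described above.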
3.5 Empty / Full Convention
The pointer convention is full-descending: *sp points at the most recently pushed dword (which is therefore valid data). An empty stack has *sp = *sb; the first push decrements *sp by 8 and stores at the new *sp. Overflow occurs if a push would result in *sp < *sl. Underflow occurs if a pop is attempted when *sp >= *sb.
This matches standard ARM and many other "full-descending" conventions and is the easiest to reason about for compiler emit and for hand-written assembly.
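The full-descending rules can be exercised with a minimal software model. This is a sketch under stated assumptions: the class `HwStack` and the exception `StackBoundsError` are invented for illustration and model a single-dword push/pop only.

```python
class StackBoundsError(Exception):
    """Stands in for the stack-bounds trap; the name is illustrative."""

class HwStack:
    def __init__(self, limit, base):
        self.sl = limit   # *sl: lowest valid address
        self.sb = base    # *sb: value of *sp when the stack is empty
        self.sp = base    # empty stack: *sp == *sb
        self.mem = {}     # address -> 64-bit value

    def push(self, value):
        new_sp = self.sp - 8
        if new_sp < self.sl:       # overflow: push would give *sp < *sl
            raise StackBoundsError("overflow")
        self.mem[new_sp] = value & (2**64 - 1)
        self.sp = new_sp           # *sp now points at valid data

    def pop(self):
        if self.sp >= self.sb:     # underflow: no valid data at *sp
            raise StackBoundsError("underflow")
        value = self.mem[self.sp]
        self.sp += 8
        return value

s = HwStack(limit=0x1000, base=0x1010)  # room for two dwords
s.push(1)
s.push(2)
print(hex(s.sp), s.pop())  # 0x1000 2
```

Note that the pointer is only updated after the bounds check passes, so a failed push leaves `sp` unchanged.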
4. Instruction Set
The Xstack extension uses three funct3 slots in custom-2 (0x5B):
| funct3 | Sub-family | Privilege | Section |
|---|---|---|---|
| 100 | User stack operations | U+ | §4.1 |
| 101 | Supervisor stack operations | S+ | §4.2 (mirror of §4.1) |
| 110 | Stack management (peek, poke, switch, CSR access) | per-instruction | §4.3 |
| 111 | Not used by Xstack (reassigned to Xctx) | - | §4.5 |
All instructions are 32 bits, encoded as a variant of the R-type format:
31 25 24 20 19 15 14 12 11 7 6 0
+-----------+--------+--------+-------+--------+-------------+
| funct7 | rs2 | rs1 | funct3| rd | 1011011 |
+-----------+--------+--------+-------+--------+-------------+
Operand interpretation varies by sub-family; details below.
4.1 User Stack Operations (funct3 = 100)
Within this sub-family, funct7 selects the specific instruction. The 5-bit rs2, rs1, and rd fields encode operation-specific operands as described per instruction.
4.1.1 PUSH and POP — Register List Save/Restore
The PUSH and POP instructions save or restore a list of registers to/from the user stack in a single instruction. The register list is encoded in funct7[6:2] as a 5-bit selector matching one of 32 predefined patterns (§4.4); funct7[1:0] selects PUSH vs POP vs POPRET vs POPRETZ. The other instruction fields encode an optional frame adjustment.
 31      27 26 25 24        19 18    15 14    12 11     7 6          0
+----------+-----+------------+--------+--------+--------+------------+
| rlist    | op2 | spimm[5:0] |  0000  |  100   | 00000  |  1011011   |
+----------+-----+------------+--------+--------+--------+------------+
Field interpretation for PUSH/POP/POPRET/POPRETZ:
| Field | Bits | Meaning |
|---|---|---|
| rlist[4:0] | [31:27] | Register list selector (§4.4) |
| op2[1:0] | [26:25] | 00=PUSH, 01=POP, 10=POPRET, 11=POPRETZ |
| spimm[5:0] | [24:19] | Unsigned frame adjustment, scaled ×16 (range 0–1008 bytes) |
| funct3 | [14:12] | 100 (user stack sub-family) |
| opcode | [6:0] | 1011011 (custom-2, 0x5B) |
| (remaining) | [18:15], [11:7] | Must be zero; reserved |
Operations:
- PUSH (op2 = 00): For each register in rlist, decrement usp by 8 and store the register at mem64[usp]. After all registers are pushed, additionally decrement usp by spimm × 16 (allocating frame slack for non-register locals on the user stack).
- POP (op2 = 01): First increment usp by spimm × 16. Then for each register in rlist (in reverse order of PUSH), load mem64[usp] into the register and increment usp by 8.
- POPRET (op2 = 10): As POP, then execute jalr x0, 0(ra) (return). The pop and return are a single architectural instruction.
- POPRETZ (op2 = 11): As POPRET, but first set a0 to zero. Useful for functions that return zero, as with Zcmp's cm.popretz.
The register-file write port arrangement determines whether multiple registers can be loaded per cycle on POP, just as the BSRAM port width determines push throughput.
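The field table for this sub-family translates directly into an encoder. The sketch below is an illustration, not a reference assembler: the function names are hypothetical, and it assumes the bit positions given in the field table (rlist at [31:27], op2 at [26:25], spimm at [24:19], funct3 = 100, opcode 0x5B).

```python
OPCODE_CUSTOM2 = 0x5B
FUNCT3_USER = 0b100
OP2 = {"PUSH": 0b00, "POP": 0b01, "POPRET": 0b10, "POPRETZ": 0b11}

def encode_user_rlist_op(mnemonic: str, rlist: int, spimm: int) -> int:
    """Assemble a user-stack PUSH/POP/POPRET/POPRETZ word."""
    if not 0 <= rlist < 32:
        raise ValueError("rlist is a 5-bit selector")
    if not 0 <= spimm < 64:
        raise ValueError("spimm is a 6-bit unsigned field")
    return ((rlist << 27) | (OP2[mnemonic] << 25) | (spimm << 19)
            | (FUNCT3_USER << 12) | OPCODE_CUSTOM2)

def frame_bytes(spimm: int) -> int:
    """Frame adjustment in bytes: spimm scaled by 16."""
    return spimm * 16

print(hex(encode_user_rlist_op("PUSH", 0b00100, 4)))    # 0x2020405b
print(hex(encode_user_rlist_op("POPRET", 0b00100, 4)))  # 0x2420405b
print(frame_bytes(63))                                  # 1008
```

The encoder leaves bits [18:15] and [11:7] at zero, as the field table requires.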
4.1.2 ENTER and LEAVE — Raw Frame Allocation
For functions that need stack space but no register save:
| funct7 | Mnemonic | Operation |
|---|---|---|
| 1000000 | ENTER | usp -= zext(imm) × 16 (allocate frame; imm in spimm field) |
| 1000001 | LEAVE | usp += zext(imm) × 16 (deallocate frame) |

The field layout is the same as for PUSH/POP, but rlist is ignored.
4.1.3 Bounds Behaviour
Any PUSH that would result in usp < usl, or any POP that would result in usp > usb, raises a stack-bounds exception (new cause code, see §7) at the cycle of the violation. The architectural state at the time of the trap reflects the value of usp before the bounds-violating decrement/increment, so the trap handler observes the operation as not-yet-committed. Re-entry to the trapping instruction is not the expected recovery path (the cause is typically program error); the trap handler unwinds or terminates.
4.2 Supervisor Stack Operations (funct3 = 101)
Bit-for-bit identical to §4.1, but targeting ssp/ssb/ssl. Instructions: PUSHS, POPS, POPSRET, POPSRETZ, ENTERS, LEAVES.
These instructions are privileged: executing them in U-mode raises an illegal-instruction exception. In S-mode and M-mode they execute as in §4.1.
The supervisor stack is intended for OS-internal code paths (trap handlers, scheduler, syscall dispatch). It enables a kernel to maintain its own fast stack without interference from user-mode stack manipulation, and without paying the cost of user-stack overflow-checking on every kernel entry.
4.3 Stack Management (funct3 = 110)
This sub-family contains random-access, switching, and CSR-helper instructions. funct7 selects the specific operation.
4.3.1 PEEK and POKE — Random Access by Slot
For looking at or modifying stack slots without LIFO discipline:
| funct7 | Mnemonic | Operation |
|---|---|---|
| 0000000 | UPEEK | rd = mem64[usp + sext(imm) × 8] (imm in spimm field, signed) |
| 0000001 | UPOKE | mem64[usp + sext(imm) × 8] = rs2 |
| 0000010 | SPEEK | rd = mem64[ssp + sext(imm) × 8] |
| 0000011 | SPOKE | mem64[ssp + sext(imm) × 8] = rs2 |
These instructions are 64-bit (mem64) accesses through the stack's memory mapping (§3.4); they are functionally equivalent to ld rd, off(usp) / sd rs2, off(usp) but use a 6-bit signed offset scaled ×8 (−256 to +248 bytes) and do not consume a separate usp-base addressing path. They are intended for accessing local variables that have been spilled to the hardware stack but need non-LIFO access.
The supervisor variants follow the same privilege rules as §4.2.
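The reach of the scaled offset follows from sign-extending a 6-bit field (−32 to 31) and multiplying by 8. A short sketch, with hypothetical helper names `sext6` and `slot_address`:

```python
def sext6(field: int) -> int:
    """Sign-extend a 6-bit two's-complement field."""
    field &= 0x3F
    return field - 0x40 if field & 0x20 else field

def slot_address(sp: int, imm_field: int) -> int:
    """Effective address of a PEEK/POKE: sp + sext(imm) * 8."""
    return sp + sext6(imm_field) * 8

print(sext6(0b011111) * 8)                       # 248
print(sext6(0b100000) * 8)                       # -256
print(hex(slot_address(0xE000_1000, 0b111111)))  # 0xe0000ff8
```

Positive offsets address dwords already on the stack (at or above the pointer); negative offsets address slots below the current top.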
4.3.2 USWAP and SSWAP — Atomic Stack-Pointer Swap
For coroutine yields, green-thread context switches, and any code that needs to replace the active stack atomically:
| funct7 | Mnemonic | Operation |
|---|---|---|
| 0010000 | USWAP | tmp = usp; usp = rs1; rd = tmp |
| 0010001 | SSWAP | tmp = ssp; ssp = rs1; rd = tmp (privileged) |
A coroutine implementation parks the entire user-stack context behind a single GPR (the held usp value) and resumes another coroutine by USWAP'ing that GPR into usp. The full register-save / register-restore cycle is then a sequence of three instructions: PUSH (save current coroutine's registers), USWAP (switch stacks), POP (restore the resumed coroutine's registers). Total cycle count on a 4-dword/cycle implementation: roughly 6–10 cycles for a 12-register context switch.
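The swap semantics and the cycle estimate can be sketched as follows. This is an illustrative model: `uswap` and `context_switch_cycles` are hypothetical names, the addresses are made up, and the one-cycle cost for USWAP is an assumption rather than something the specification states.

```python
import math

def uswap(csr: dict, rs1_value: int) -> int:
    """USWAP semantics: tmp = usp; usp = rs1; rd = tmp. Returns the rd value."""
    previous = csr["usp"]
    csr["usp"] = rs1_value
    return previous

def context_switch_cycles(n_regs: int, dwords_per_cycle: int,
                          uswap_cycles: int = 1) -> int:
    """PUSH + USWAP + POP; the one-cycle USWAP is an assumption."""
    per_transfer = math.ceil(n_regs / dwords_per_cycle)
    return per_transfer + uswap_cycles + per_transfer

csr = {"usp": 0xE000_8000}           # coroutine A is running
held_b = 0xE000_4000                 # coroutine B's parked usp
held_a = uswap(csr, held_b)          # switch to B, keep A's pointer
print(hex(csr["usp"]), hex(held_a))  # 0xe0004000 0xe0008000
print(context_switch_cycles(12, 4))  # 7
```

With 12 registers on a 4-dword/cycle port the model gives 7 cycles, inside the 6–10 cycle range quoted above.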
4.3.3 GETUSP / SETUSP / GETSSP / SETSSP — Pointer CSR Helpers
Although the stack pointers are exposed as standard CSRs and may be read/written via csrr/csrw, dedicated shorter forms are provided for the most frequent operations:
| funct7 | Mnemonic | Operation |
|---|---|---|
| 0100000 | GETUSP | rd = usp |
| 0100001 | SETUSP | usp = rs1 |
| 0100010 | GETSSP | rd = ssp (privileged read; ssp is M-mode and S-mode visible) |
| 0100011 | SETSSP | ssp = rs1 (privileged) |
| 0100100 | GETUSB | rd = usb |
| 0100101 | GETUSL | rd = usl |
| 0100110 | GETSSB | rd = ssb (privileged) |
| 0100111 | GETSSL | rd = ssl (privileged) |
Setting the base or limit CSRs is done via standard csrw to enforce the privilege check; it is not common enough to warrant short-form instructions.
4.3.4 PUSHR and POPR — Single Register
For pushing or popping a single register without paying the rlist-decode cost:
| funct7 | Mnemonic | Operation |
|---|---|---|
| 0110000 | UPUSHR | usp -= 8; mem64[usp] = rs1 |
| 0110001 | UPOPR | rd = mem64[usp]; usp += 8 |
| 0110010 | SPUSHR | ssp -= 8; mem64[ssp] = rs1 (privileged) |
| 0110011 | SPOPR | rd = mem64[ssp]; ssp += 8 (privileged) |
These are useful in irregular code paths (early returns, error handling, computed pushes) where the register list isn't fixed at compile time.
4.4 Register-List Patterns
The 5-bit rlist field in PUSH/POP encodes 32 predefined register patterns. The patterns are organised into three groups: prologue-style (callee-saved), trap-style (caller-saved), and special. Mnemonic suffix {r,t,a} selects the group at the assembler level.
4.4.1 Prologue Patterns (callee-saved, rlist[4] = 0)
These mirror Zcmp's rlist for compatibility with familiar compiler patterns. The pattern always includes ra (the return address) at the bottom of the saved range:
| rlist | Registers saved (in push order) | Count |
|---|---|---|
| 00000 | ra | 1 |
| 00001 | ra, s0 | 2 |
| 00010 | ra, s0, s1 | 3 |
| 00011 | ra, s0–s2 | 4 |
| 00100 | ra, s0–s3 | 5 |
| 00101 | ra, s0–s4 | 6 |
| 00110 | ra, s0–s5 | 7 |
| 00111 | ra, s0–s6 | 8 |
| 01000 | ra, s0–s7 | 9 |
| 01001 | ra, s0–s8 | 10 |
| 01010 | ra, s0–s9 | 11 |
| 01011 | ra, s0–s10 | 12 |
| 01100 | ra, s0–s11 | 13 |
| 01101–01111 | reserved | - |

(Registers are pushed in the order listed: PUSH stores ra first, at the highest address of the saved range, then s0 in the dword below it, and so on.)
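Because the prologue patterns are regular (ra plus the first rlist s-registers), the table can be generated rather than stored. A minimal sketch, where `prologue_regs` is a hypothetical helper name:

```python
def prologue_regs(rlist: int) -> list:
    """Expand a prologue-style rlist (00000..01100) into register names."""
    if not 0b00000 <= rlist <= 0b01100:
        raise ValueError(f"rlist {rlist:05b} is not a prologue pattern")
    return ["ra"] + [f"s{i}" for i in range(rlist)]

regs = prologue_regs(0b00100)
print(regs, len(regs) * 8)  # ['ra', 's0', 's1', 's2', 's3'] 40
```

The second printed value is the number of bytes the register save occupies on the stack, excluding any spimm frame slack.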
4.4.2 Trap Patterns (caller-saved, rlist[4] = 1, rlist[3] = 0)
For interrupt and trap handler entry/exit. Saves the caller-saved set that user code may have live at the moment of interruption:
| rlist | Registers saved | Count |
|---|---|---|
| 10000 | ra, t0–t6 | 8 |
| 10001 | ra, t0–t6, a0–a3 | 12 |
| 10010 | ra, t0–t6, a0–a7 | 16 |
| 10011 | t0–t6, a0–a7 | 15 (no ra; used inside handler) |
| 10100 | t0–t6, a0–a7, fa0–fa7 | 23 (integer + FP arg saves) |
| 10101–10111 | reserved | - |
A trap handler entered with the full caller-saved set already on the user stack can call into C-language helpers without further spill, and the C compiler need not know the special handler context.
4.4.3 Special Patterns (rlist[4] = 1, rlist[3] = 1)
For wide-mode handlers and other specialised needs:
| rlist | Registers saved | Count |
|---|---|---|
| 11000 | x32–x47 (lower extended caller-saved) | 16 |
| 11001 | x48–x63 (upper extended caller-saved) | 16 |
| 11010 | x32–x63 (all extended) | 32 |
| 11011 | f32–f47 (lower extended FPRs) | 16 |
| 11100 | f48–f63 (upper extended FPRs) | 16 |
| 11101 | f32–f63 (all extended FPRs) | 32 |
| 11110 | f0–f31 (all standard FPRs) | 32 |
| 11111 | reserved | - |
The wide-extended patterns (11000–11010, 11011–11101) are valid only when the wide-dirty bit is set or when the handler explicitly wants to checkpoint the extended file. In narrow-mode-only implementations these patterns trap as illegal-instruction.
The full-FPR pattern (11110) is useful for context switches that need to preserve the entire FP file.
4.4.4 Combining Patterns
Multiple PUSH/POP instructions may be combined to save unusual register sets. For example, a wide-mode trap handler saving both the standard caller-saved set and the extended scratch set issues two PUSHes:
PUSH rlist=10010, spimm=0 ; save ra, t0–t6, a0–a7 (standard caller-saved)
PUSH rlist=11010, spimm=0 ; save x32–x63 (extended)
The two pushes complete in roughly (16 + 32) / dwords-per-cycle cycles on a single-port implementation; an implementation with multiple BSRAM ports may overlap them.
4.5 Note: funct3 = 111 Reassigned to Xctx
The Xstack v0.1 design originally reserved custom-2 funct3 = 111 for a future machine-stack push/pop family. That slot has been reassigned to the Xctx extension (hardware context switching).
If a future v0.2 of Xstack adds dedicated machine-stack push/pop instructions, they will live in the management subfamily (funct3 = 110) with funct7 in the 1xxxxxx range, which remains reserved within that subfamily and provides ample room. The machine stack pointer CSRs (msp, msb, msl) remain in place and continue to be accessible via standard CSR ops.
5. Examples
5.1 Function Prologue and Epilogue
A C function saving ra plus s0–s3 and allocating 64 bytes of stack frame, then restoring and returning:
foo:
PUSH rlist=00100, spimm=4 ; save ra, s0–s3 (5 regs × 8 = 40 bytes); 4×16 = 64 bytes of frame
; ... function body ...
POPRET rlist=00100, spimm=4 ; restore ra, s0–s3, free 64 bytes of frame, return
Two instructions total for full frame management, versus typically 11–13 instructions in standard RV64.
5.2 Interrupt Handler
A trap handler saving the standard caller-saved set on entry, calling a C helper, and restoring on exit:
trap_handler:
PUSH rlist=10010, spimm=0 ; save ra, t0–t6, a0–a7 (no frame slack)
mv a0, sp ; pass trap-frame pointer to handler
GETUSP a1 ; pass user stack pointer (for unwinding)
jal ra, c_handler ; call into C
POP rlist=10010, spimm=0 ; restore ra, t0–t6, a0–a7
mret ; return from trap (sret for an S-mode handler)
The handler entry and exit each pay approximately 4 cycles on a 288-bit-port implementation (16 registers / 4 dwords-per-cycle = 4 cycles for the PUSH or POP, plus the small fixed instruction-execute overhead). Compared to typical RV64 trap entry of 30+ cycles, this is roughly a 5×–10× win on handler dispatch.
5.3 Coroutine Yield
A coroutine yields its execution back to a scheduler that resumes a different coroutine. Both coroutines hold their full user-stack state in BSRAM; the yield is a stack-pointer swap:
; coroutine A, yielding to scheduler
yield_to_scheduler:
PUSH rlist=01100, spimm=0 ; save ra, s0–s11 (full callee-saved)
USWAP t0, t1 ; usp = t1 (scheduler's held usp);
; t0 = previous usp (coroutine A's, for the scheduler to keep)
POP rlist=01100, spimm=0 ; restore scheduler's saved state
ret ; resume scheduler
; scheduler resumes coroutine B:
resume_coroutine:
PUSH rlist=01100, spimm=0 ; save scheduler's callee-saved
USWAP t0, t1 ; usp = t1 (coroutine B's held usp); t0 = scheduler's previous usp
POP rlist=01100, spimm=0 ; restore coroutine B's saved state
ret ; resume coroutine B
A full coroutine context switch in roughly 6–10 cycles, plus the call/return overhead — comparable to a function call.
5.4 Encoded Instructions
PUSH rlist=00100, spimm=4 (save ra, s0–s3, allocate 64-byte frame):
rlist[4:0] = 00100
op2[1:0] = 00 (PUSH)
spimm[5:0] = 000100
funct3 = 100
rd = 00000 (unused)
funct7 = 00100_00 = 0010000
spimm occupies bits [24:19], so the rs2 position holds spimm[5:1] = 00010 and the rs1 position holds spimm[0] followed by four zero bits = 00000
Encoding: 0010000 00010 00000 100 00000 1011011 (0x2020405B)
USWAP t0, t1 (swap usp with t1, return previous usp in t0):
funct7 = 0010000
rs2 = 00000 (unused)
rs1 = 6 (t1, 0b00110)
funct3 = 110
rd = 5 (t0, 0b00101)
Encoding: 0010000 00000 00110 110 00101 1011011 (0x200362DB)
POPRET rlist=00100, spimm=4 (pop ra, s0–s3, free 64-byte frame, return):
op2 = 10 (POPRET)
rlist = 00100
spimm = 4
funct7 = 00100_10 = 0010010
Encoding: 0010010 00010 00000 100 00000 1011011 (0x2420405B)
6. CSR Allocation
6.1 Stack Pointer CSRs
| CSR | Address (suggested) | Privilege | Description |
|---|---|---|---|
| usp | 0x800 | URW | User stack pointer |
| usb | 0x801 | URW | User stack base (highest valid address + 8) |
| usl | 0x802 | URW | User stack limit (lowest valid address) |
| ssp | 0x900 | SRW | Supervisor stack pointer |
| ssb | 0x901 | SRW | Supervisor stack base |
| ssl | 0x902 | SRW | Supervisor stack limit |
| msp | 0xBC0 | MRW | Machine stack pointer |
| msb | 0xBC1 | MRW | Machine stack base |
| msl | 0xBC2 | MRW | Machine stack limit |
The user-mode stack CSRs are URW (writeable from U-mode), allowing user code to inspect and adjust its own hardware stack. The base and limit CSRs are user-visible but writes are typically issued by the OS or runtime, not by user code; the architecture does not enforce this beyond standard CSR access rules.
Supervisor and Machine CSRs are accessible only from their respective privilege levels or higher.
6.2 Feature CSR
| CSR | Address (suggested) | Privilege | Description |
|---|---|---|---|
| mxstack | 0xFC2 | MRO | Xstack version, capacities, and port width |
Bit layout of mxstack:
| Bits | Field | Meaning |
|---|---|---|
| [0] | PRESENT | 1 if Xstack implemented |
| [7:1] | VERSION | Xstack version (1 = v0.1) |
| [11:8] | PORT_WIDTH_LOG2 | log₂(dwords-per-cycle), 0 = 1 dword/cycle, 3 = 8 dwords/cycle |
| [19:12] | U_STACK_SIZE_KIB | User-stack size in KiB |
| [27:20] | S_STACK_SIZE_KIB | Supervisor-stack size in KiB |
| [33:28] | M_STACK_SIZE_KIB | Machine-stack size in KiB (6 bits, max 63) |
| [34] | HAS_WIDE_RLIST | 1 if extended-register rlist patterns supported |
| [63:35] | reserved | - |
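Software that probes the feature CSR needs to unpack these fields. The following sketch shows one way to do it. The function name `decode_mxstack`, the derived `DWORDS_PER_CYCLE` key, and the sample value are assumptions for illustration, and reading the CSR itself (M-mode only) is outside the scope of the example.

```python
# (least-significant bit, width) for each field in the layout above.
MXSTACK_FIELDS = {
    "PRESENT": (0, 1),
    "VERSION": (1, 7),
    "PORT_WIDTH_LOG2": (8, 4),
    "U_STACK_SIZE_KIB": (12, 8),
    "S_STACK_SIZE_KIB": (20, 8),
    "M_STACK_SIZE_KIB": (28, 6),
    "HAS_WIDE_RLIST": (34, 1),
}

def decode_mxstack(value: int) -> dict:
    """Split a raw 64-bit mxstack value into named fields."""
    out = {name: (value >> lsb) & ((1 << width) - 1)
           for name, (lsb, width) in MXSTACK_FIELDS.items()}
    out["DWORDS_PER_CYCLE"] = 1 << out["PORT_WIDTH_LOG2"]  # derived, not a CSR field
    return out

# Hypothetical value: v0.1, 4 dwords/cycle, 64/32/16 KiB stacks, wide rlist.
raw = 1 | (1 << 1) | (2 << 8) | (64 << 12) | (32 << 20) | (16 << 28) | (1 << 34)
info = decode_mxstack(raw)
print(info["U_STACK_SIZE_KIB"], info["DWORDS_PER_CYCLE"])  # 64 4
```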
7. Trap Causes
The Xstack extension defines new trap cause codes, allocated from the implementation-defined range:
| Cause | Mnemonic | Trigger |
|---|---|---|
| 24 | U_STACK_OVERFLOW | PUSH to user stack would set usp < usl |
| 25 | U_STACK_UNDERFLOW | POP from user stack would set usp > usb |
| 26 | S_STACK_OVERFLOW | PUSH to supervisor stack would violate ssl |
| 27 | S_STACK_UNDERFLOW | POP from supervisor stack would violate ssb |
| 28 | M_STACK_OVERFLOW | (Reserved for v0.2 machine-stack ops) |
| 29 | M_STACK_UNDERFLOW | (Reserved for v0.2 machine-stack ops) |
| 30 | XSTACK_PRIVILEGE | Privileged Xstack instruction attempted from insufficient privilege level |
| 31 | XSTACK_RESERVED_ENCODING | Reserved encoding in Xstack opcode space |
Cause numbers are subject to revision pending coordination with other FireStorm extensions that allocate from the same range.
On a stack-bounds trap, mtval (or stval/utval per privilege) contains the post-update stack pointer value — the value that would have resulted had the operation been allowed to complete. Software inspecting the trap may compare against the corresponding base/limit CSR to confirm the violation type.
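A handler can confirm the violation type from the reported pointer value, along the lines of this sketch. The helper name `classify_bounds_trap` is hypothetical; the comparison directions follow the trigger conditions in the table above.

```python
def classify_bounds_trap(tval: int, base: int, limit: int) -> str:
    """Compare the post-update pointer from mtval/stval against base/limit."""
    if tval < limit:
        return "overflow"    # a push would have moved the pointer below *sl
    if tval > base:
        return "underflow"   # a pop would have moved the pointer above *sb
    return "in-bounds"       # not a bounds violation for this stack

base, limit = 0xE001_0000, 0xE000_0000
print(classify_bounds_trap(0xDFFF_FFF8, base, limit))  # overflow
print(classify_bounds_trap(0xE001_0008, base, limit))  # underflow
```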
8. Compiler and Toolchain Integration
8.1 Target Flags
The +xstack target feature enables Xstack emission. With +xstack enabled, the compiler may emit hardware-stack PUSH/POP instructions for callee-saved register saves in function prologues, subject to per-function analysis (see §8.3).
The full FireStorm feature set is +xfirestorm = +xwide,+xcrisp,+xstack.
Per-function annotation: __attribute__((target("xstack"))) enables Xstack for one function, __attribute__((target("no-xstack"))) disables it.
8.2 ABI Compatibility
The Xstack extension does not alter the calling convention. A function compiled with +xstack is fully ABI-compatible with one compiled without; both observe lp64d argument passing, return values, and the standard callee-saved/caller-saved register categorisation.
Callee-saved registers (s0–s11, fs0–fs11) preserved on the hardware stack instead of the DDR3 stack are still preserved across the call; where they are preserved is an implementation detail. The caller sees no difference.
8.3 When to Use Xstack
The compiler should prefer Xstack push/pop for a function's prologue/epilogue when all of the following hold:
- The function does not take the address of any local variable that would otherwise live in a register (no &local for register-allocated locals).
- The function's stack-resident locals fit in the spimm allocation budget (≤1008 bytes per PUSH).
- The current call depth is unlikely to overflow the user stack. The compiler may emit a conservative bounds check before entering a deep recursion or fall back to DDR3 stack for functions in tail-recursive contexts.
If any condition fails, the function uses standard x2/DDR3 stack management (optionally with Zcmp compression).
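The three conditions amount to a simple predicate over per-function facts. The sketch below is only an illustration of that decision. The `FunctionInfo` record, its field names, and both helper functions are invented for this example and do not describe any existing compiler interface.

```python
from dataclasses import dataclass

MAX_SPIMM_BYTES = 63 * 16  # 1008 bytes: largest frame one PUSH can allocate

@dataclass
class FunctionInfo:
    takes_address_of_register_local: bool
    stack_local_bytes: int
    may_recurse_deeply: bool

def spimm_for(stack_local_bytes: int) -> int:
    """Round the local frame up to a multiple of 16 bytes, in spimm units."""
    return (stack_local_bytes + 15) // 16

def prefer_xstack(fn: FunctionInfo) -> bool:
    """True only when all three conditions listed above hold."""
    return (not fn.takes_address_of_register_local
            and fn.stack_local_bytes <= MAX_SPIMM_BYTES
            and not fn.may_recurse_deeply)

leaf = FunctionInfo(False, 64, False)
print(prefer_xstack(leaf), spimm_for(leaf.stack_local_bytes))  # True 4
```

How a real compiler would estimate call depth is left open here; the boolean field simply stands in for that analysis.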
Trap handlers and interrupt service routines should default to using Xstack (typically ssp or msp depending on privilege) for both performance and isolation from interrupted user code.
8.4 Inline Assembly
GCC/Clang inline-assembly constraints:
- The new constraints xu, xs, xm reference the user/supervisor/machine stack pointers as operand registers (analogous to existing constraints for sp).
- The asm template may emit Xstack push/pop directly; the compiler treats them as clobbering memory at the mxstack-reported stack region.
9. Implementation Guidance
9.1 BSRAM Port Configuration
Each hardware stack is backed by one BSRAM block (or a small set of blocks ganged for width). The BSRAM is configured as a single-read, single-write port at the chosen width:
| Port width | BSRAM blocks per stack (GoWin GW5AST, 18-Kib BSRAM) | Capacity at width |
|---|---|---|
| 72-bit | 4 (ganged for 72-bit data + parity/tag) | 8 KiB |
| 144-bit | 8 | 16 KiB |
| 288-bit | 16 | 32 KiB |
The exact configuration depends on the GoWin BSRAM block geometry and the desired stack capacity. The architectural guarantee is only that the implementation reports its actual port width in mxstack[PORT_WIDTH_LOG2] and that the decode of PUSH/POP correctly handles the multi-cycle case when N > dwords-per-cycle.
9.2 Multi-Cycle PUSH/POP Sequencing
A PUSH of N registers on a port with W dwords-per-cycle issues ⌈N / W⌉ BSRAM write transactions. The instruction is atomic from the architectural perspective: a trap mid-sequence either does not advance usp (if no writes have committed) or commits all writes and advances usp to the final value (no partial state).
Implementations typically handle this by:
- Pre-computing the final usp value at decode time.
- Checking the final value against usl before any write commits (overflow detected up-front).
- Issuing the write transactions in sequence.
- Committing usp to the final value only after the last write retires.
A trap during the sequence (e.g., asynchronous timer interrupt) is held off until the instruction retires; the BSRAM transactions are not interruptible mid-instruction.
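The four steps above can be modelled in software to show why no partial state is visible. This is a behavioural sketch, not RTL: `push_atomic` and `StackBoundsTrap` are hypothetical names, memory is a plain dictionary, and the spimm frame adjustment is left out for brevity.

```python
class StackBoundsTrap(Exception):
    """Illustrative stand-in for the stack-bounds exception."""

def push_atomic(usp, usl, values, dwords_per_cycle, mem):
    """Model a multi-cycle PUSH; returns (committed usp, write cycles)."""
    final_usp = usp - 8 * len(values)   # pre-compute the final usp
    if final_usp < usl:                 # check before any write commits
        raise StackBoundsTrap(final_usp)
    addr, cycles = usp, 0
    for i in range(0, len(values), dwords_per_cycle):
        for value in values[i:i + dwords_per_cycle]:  # one port-wide write
            addr -= 8
            mem[addr] = value
        cycles += 1
    return final_usp, cycles            # usp commits after the last write

mem = {}
usp, cycles = push_atomic(0xE000_0100, 0xE000_0000, list(range(13)), 4, mem)
print(hex(usp), cycles)  # 0xe0000098 4
```

When the check fails, the exception is raised before the loop runs, so `mem` and the caller's `usp` are untouched.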
9.3 Forwarding into Subsequent Instructions
If a POP is followed immediately by an instruction that uses one of the popped registers, the implementation should forward the BSRAM read result directly into the dependent instruction's operand input, similar to standard load-use forwarding. Without this, the multi-cycle POP becomes a multi-cycle stall for the following instruction.
9.4 Bounds Check Path
The bounds-check comparison (post-update usp vs usl for overflow, post-update usp vs usb for underflow) is a 64-bit signed-compare on the critical path. On aggressive implementations, this should run in parallel with the BSRAM access, with the trap asserted retroactively if the bounds check fails. A simpler in-order implementation may serialise the check, costing one cycle per PUSH/POP.
9.5 Privilege Routing
The decoder must check the current privilege level on every Xstack instruction:
- Funct3 = 100 instructions are always permitted (user stack accessible from all modes).
- Funct3 = 101 instructions trap as illegal in U-mode.
- Funct3 = 110 instructions check per-instruction (some are privileged, some are not).
- Funct3 = 111 is not an Xstack slot in v0.1 (reassigned to Xctx; see §4.5), so the Xstack decoder never accepts it.
The check is on the privilege level at instruction issue; no later re-checking is needed.
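The routing rules can be summarised as a lookup on funct3, with a per-instruction set for the management sub-family. In the sketch below, the function name `xstack_permitted` is hypothetical, the privileged funct7 values are taken from the tables in §4.3, and reserved funct7 encodings are deliberately not checked.

```python
PRIV_RANK = {"U": 0, "S": 1, "M": 2}

# funct7 values in funct3 = 110 that require S-mode or above (§4.3):
# SPEEK, SPOKE, SSWAP, GETSSP, SETSSP, GETSSB, GETSSL, SPUSHR, SPOPR.
S_ONLY_MGMT = {
    0b0000010, 0b0000011, 0b0010001,
    0b0100010, 0b0100011, 0b0100110, 0b0100111,
    0b0110010, 0b0110011,
}

def xstack_permitted(funct3: int, funct7: int, mode: str) -> bool:
    """Privilege check at issue; reserved funct7 values are not examined."""
    if funct3 == 0b100:                  # user stack: all modes
        return True
    if funct3 == 0b101:                  # supervisor stack: S and M only
        return PRIV_RANK[mode] >= PRIV_RANK["S"]
    if funct3 == 0b110:                  # management: per instruction
        if funct7 in S_ONLY_MGMT:
            return PRIV_RANK[mode] >= PRIV_RANK["S"]
        return True
    return False                         # not an Xstack funct3 slot

print(xstack_permitted(0b110, 0b0010001, "U"))  # False
print(xstack_permitted(0b110, 0b0010000, "U"))  # True
```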
10. Interaction with Other Extensions
10.1 Wide Mode (Xwide)
In wide mode (§3 of the parent doc), Xstack instructions occupy a full 32-bit slot within a 36-bit SRAM fetch word. The extension nibble (bits [35:32]) is mostly unused by Xstack since the architectural register operands are encoded compactly: rs1, rs2, and rd reach only x0–x31 in the base PUSH/POP encoding.
The wide-extended rlist patterns (§4.4.3) reference x32–x63 and f32–f63 directly without requiring extension-nibble bits — the pattern selector implies the register range. These patterns are valid only when the implementation supports wide mode (HAS_WIDE_RLIST bit in mxstack).
For management instructions (§4.3) where rs1 and rd are general operands, the standard extension-nibble scheme of §5.1 of the parent doc applies normally.
10.2 Xcrisp PIC
Xstack and Xcrisp PIC are independent. A wide-mode function may freely mix PIC instructions (for global access) with Xstack instructions (for fast frame management). A common pattern:
hot_function: ; placed in .text.wide
PUSH rlist=01000, spimm=0 ; save ra, s0–s7
LAPC s0, global_lookup_table ; PIC: get table base in one instruction
LDPC s1, current_state ; PIC: get state pointer
; ... function body using s0, s1 ...
POPRET rlist=01000, spimm=0
10.3 Zcmp
Zcmp targets the DDR3 stack via x2. Xstack targets the hardware stack via usp/ssp/msp. They do not interact and may be used independently. A function may compile to either depending on the compiler's analysis; the choice is per-function, not per-instruction.
11. Encoding Summary
11.1 At-a-Glance Opcode Map
| Mnemonic | opcode | funct3 | funct7 / op2 | Format | Privilege |
|---|---|---|---|---|---|
| PUSH, POP, POPRET, POPRETZ | 0x5B | 100 | rlist+op2 | rlist+spimm | U+ |
| ENTER, LEAVE | 0x5B | 100 | 1000000/01 | spimm | U+ |
| PUSHS, POPS, POPSRET, POPSRETZ | 0x5B | 101 | rlist+op2 | rlist+spimm | S+ |
| ENTERS, LEAVES | 0x5B | 101 | 1000000/01 | spimm | S+ |
| UPEEK, UPOKE | 0x5B | 110 | 0000000/01 | spimm-scaled | U+ |
| SPEEK, SPOKE | 0x5B | 110 | 0000010/11 | spimm-scaled | S+ |
| USWAP | 0x5B | 110 | 0010000 | R-type | U+ |
| SSWAP | 0x5B | 110 | 0010001 | R-type | S+ |
| GETUSP, SETUSP, GETUSB, GETUSL | 0x5B | 110 | 0100000, 0100001, 0100100, 0100101 | R-type | U+ |
| GETSSP, SETSSP, GETSSB, GETSSL | 0x5B | 110 | 0100010, 0100011, 0100110, 0100111 | R-type | S+ |
| UPUSHR, UPOPR | 0x5B | 110 | 0110000/01 | R-type | U+ |
| SPUSHR, SPOPR | 0x5B | 110 | 0110010/11 | R-type | S+ |
11.2 Reserved Spaces
- 0x5B funct3 = 100 rlist = 01101–01111 (reserved prologue patterns)
- 0x5B funct3 = 100 rlist = 10101–10111 (reserved trap patterns)
- 0x5B funct3 = 100 rlist = 11111 (reserved special pattern)
- 0x5B funct3 = 100 funct7 = 1000010–1111111 (reserved non-rlist operations)
- 0x5B funct3 = 101 mirrors the funct3 = 100 reservations
- 0x5B funct3 = 110 funct7 = 0000100–0001111, 0010010–0011111, 0101000–0101111, 0110100–1111111 (reserved management instructions)
- 0x5B funct3 = 111 (reassigned to Xctx; see §4.5)
12. Open Items
- Machine-stack instruction set. If added in v0.2, machine-stack push/pop/management instructions will live in the management subfamily (funct3 = 110) with funct7 in the 1xxxxxx range. The machine-stack CSRs (msp/msb/msl) are accessible via standard CSR ops in v0.1.
- CSR addresses. All allocations are suggested; final assignment requires coordination with the wide-dirty bit (parent doc open items §14.2) and the mxcrisp CSR.
- Trap cause numbers. Suggested 24–31; final assignment depends on coordination across FireStorm extensions.
- rlist FP-only patterns. Currently only special-pattern 11110 saves the full FP file; intermediate FP-only patterns (fa0–fa7, fs0–fs7) may be useful but consume rlist slots.
- Per-function bounds adjustment. Should there be a fast instruction to atomically grow/shrink usl for sub-stack allocation (e.g., alloca-style)? Currently requires multiple csrrw operations.
- Cross-privilege stack inspection. S-mode and M-mode may need to peek into U-stack memory for debugger and trap-frame access. The memory mapping (§3.4) permits this via standard load instructions; whether dedicated cross-privilege peek instructions are warranted is TBD.
- Reset state. What is the architectural state of usp/ssp/msp after reset? Probably all set equal to their respective *sb (empty stack), but firmware may want to override before first use.
- Stack-overflow recovery. Should the architecture mandate that bounds-trap occurs before any register file is corrupted, or is post-trap rollback acceptable? Current spec (§9.2) assumes pre-commit detection.
13. Glossary
| Term | Meaning |
|---|---|
| Hardware stack | A LIFO data structure backed by FPGA BSRAM, addressed by a dedicated CSR stack pointer, with hardware-enforced bounds. |
| U-Stack / S-Stack / M-Stack | The hardware stacks at User / Supervisor / Machine privilege level. |
| rlist | A 5-bit selector encoding one of 32 predefined register-list patterns for PUSH/POP. |
| spimm | A 6-bit immediate, scaled ×16, that allocates or deallocates frame slack alongside a PUSH/POP. |
| Full-descending | A stack convention in which the pointer points at the most recent valid entry, and pushes decrement the pointer before writing. |
| Port width | The number of dwords (64-bit words) the hardware stack BSRAM can read or write in a single cycle. |
| USWAP / SSWAP | Atomic stack-pointer swap, the foundation of fast coroutine and green-thread context switching. |
End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension.