FireStorm Xctx Extension — Hardware Context Switching Specification

Document version: 0.1 (draft)
Status: Initial design capture
Parent document: FireStorm CPU ISA
Companions: FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension

1. Overview

The Xctx extension adds hardware-managed context switching to FireStorm. The CPU maintains a small pool of hardware-resident execution contexts, each holding the full register file and key per-task CSRs. Software interacts with contexts via a small set of instructions: a context can yield voluntarily, halt indefinitely, free its slot, or spawn a new context. A hardware slice timer optionally preempts a running context when its time allotment expires, automatically switching to the next ready context.

Context state lives in dedicated FPGA BSRAM, separate from main memory. A context switch is therefore a single hardware operation costing roughly 10–30 cycles on a wide BSRAM port (depending on variant) — versus hundreds of cycles for a software save/restore through DDR3. The cost is bounded, predictable, and far below the granularity at which switching becomes useful.

The architecture is designed so that a future multi-core FireStorm variant can share the context pool across cores transparently: a context preempted on one core can be picked up by any other core whose pipeline is idle, with no instruction-set changes (see §7). v0.1 specifies single-core behaviour; multi-core is a v0.2 extension.

1.1 Wins

Fast user-level fibers. A cooperative fiber library can YIELD between fibers in a single instruction. Production fiber implementations on standard RV64 cost 30–60 instructions per yield (save callee-saved regs, switch stack, restore callee-saved regs); Xctx YIELD is one instruction.
Cheap interrupt-driven I/O. A task that initiates I/O can HALT itself in a single instruction; the I/O completion ISR issues RESUME and the task is back on the ready queue. Compare with standard practice of inserting a context-switch into the ISR exit path, or polling.
Predictable preemption latency. With a hardware slice timer, preemption happens at the cycle the slice expires — no software-timer poll latency, no missed quanta.
Trivially-parallelisable workloads. A program with many ready contexts and multiple cores (v0.2) sees automatic load balancing as cores pull from the shared queue. No software dispatcher needed.
Lower memory pressure. Context state lives in dedicated BSRAM rather than the DRAM stack, so save/restore traffic does not pass through the small 8 KB D-cache or consume DRAM bandwidth. On a cache-rich architecture this would manifest as reduced D-cache eviction; on FireStorm's tiny-cache memory subsystem, it's a more critical win — the cache is small enough that fiber-state churn would quickly thrash it.

1.2 Non-Goals

Not an OS. Xctx is a primitive for building schedulers, not a scheduler itself. Policy decisions (priority, fairness, real-time constraints) live in software on top of the Xctx primitives.
Not virtual memory. All contexts share the same memory address space (no per-context page table). Process isolation requires software (or future M-mode privilege) above Xctx.
Not full thread-local storage. Per-context per-register translator state and xstack pointers are saved/restored; arbitrary thread-local data is not. Use a per-context register (e.g., tp per the standard ABI) as a TLS pointer.
Not unbounded. The context pool has a finite size (8–32 contexts per FireStorm variant). Programs needing more concurrency must multiplex software fibers within hardware contexts.

2. Relationship to Standard RISC-V

Xctx adds new instructions in custom-2 (0x5B) funct3 = 111, an opcode slot otherwise unused in v0.1 of the FireStorm extensions. Standard RV64GC code is unaffected; a standard RV64 implementation receiving an Xctx instruction traps as illegal-instruction.

The mxctx CSR (§8) advertises Xctx presence and pool size. Programs that need to be portable across Xctx-aware and non-Xctx FireStorm variants probe the CSR and fall back to software fibers when Xctx is absent.

2.1 Privilege Model

Most Xctx instructions are unprivileged: any context may manage itself and other contexts using YIELD, HALT, FREE, NEW, RESUME, CTXID, CTXSTATE, and CTXCOUNT. This is intentional: hardware fiber systems need to be usable without a kernel mediating every operation. The operations that do need privilege:

Setting the slice timer (CSR mxctxslice write) is M-mode only. The kernel sets the preemption policy; tasks cannot extend their own slice.
Inspecting and modifying the context pool directly (e.g., reading another context's saved state from BSRAM) is M-mode only, via CSR-mediated access.
Resetting / killing all contexts (CSR mxctx_reset) is M-mode only.

The denial-of-service risk inherent in unprivileged NEW is bounded by the context pool size: a malicious or buggy task can at worst exhaust the pool, at which point NEW fails (returns -1). Because FREE only releases the calling context, the kernel reclaims the offending task's slots through the M-mode inspection CSRs (§8.2) or mxctx_reset.

2.2 Interaction with Standard Traps and Interrupts

A trap or interrupt taken on the currently-running context follows standard RV64 semantics: the trap handler executes in the same context as the trapping instruction. The slice timer pauses during M-mode (and S-mode, where implemented) execution — kernel and trap-handler code is not subject to preemption.

If the slice timer is paused but a trap completes returning to a context whose slice has expired, the hardware honours the expiry on the first instruction of resumed user code, switching to the next ready context.

3. Context State

A "context" is the architectural state required to resume an interrupted task. The Xctx context comprises:

State element	Size (wide)	Size (narrow)	Notes
GPRs x0–x63	512 bytes	256 bytes (x0–x31)	x0 always zero; storage is contiguous
FPRs f0–f63	512 bytes	256 bytes (f0–f31)	Excluded from context if F/D not implemented
PC	8 bytes	8 bytes	64-bit instruction address
FCSR	4 bytes	4 bytes	Floating-point control/status
xlate state	32 bytes	16 bytes	4 CSRs in wide mode (`xlate_rd_0..3`, `xlate_wr_0..3`); 2 in narrow
xstack state	24 bytes	24 bytes	`usp`, `usb`, `usl`
Mode flag	1 byte	1 byte	Narrow / wide
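Reserved	(pad)	(pad)	Pads the total up to the allocation size
Total	1093 B	565 B	Sum of the rows above, before padding

Implementations may round the per-context storage allocation up to the nearest convenient size (256 B, 512 B, 1 KB, 2 KB) for BSRAM addressing convenience. The mxctx CSR reports the actual per-context storage size (§8). As listed, the full wide state (1093 B) does not fit in 1 KB and the full narrow state (565 B) does not fit in 512 B; the 1 KB allocation used in §3.2 therefore covers the narrow state, or the wide state without FPRs (512 + 8 + 32 + 24 + 1 = 577 B), while a full wide context with FPRs rounds up to 2 KB.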
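The table arithmetic can be checked mechanically. The following is a minimal sketch, not part of the specification: the function names are hypothetical, and it assumes that FCSR is dropped together with the FPRs when F/D is absent (the table only states this for the FPRs).

```c
#include <assert.h>
#include <stddef.h>

/* Field sizes in bytes, copied from the context-state table. */
enum {
    XCTX_PC_BYTES     = 8,
    XCTX_FCSR_BYTES   = 4,
    XCTX_XSTACK_BYTES = 24,  /* usp, usb, usl */
    XCTX_MODE_BYTES   = 1
};

/* Sum of the architectural state for one context.
 * wide != 0 selects the 64-register layout; has_fpr != 0 adds FPRs and FCSR
 * (dropping FCSR with the FPRs is an assumption of this sketch). */
static size_t xctx_state_bytes(int wide, int has_fpr)
{
    size_t regs  = wide ? 64u * 8u : 32u * 8u;  /* GPR block; FPR block has the same size */
    size_t xlate = wide ? 32u : 16u;
    size_t total = regs + XCTX_PC_BYTES + xlate + XCTX_XSTACK_BYTES + XCTX_MODE_BYTES;
    if (has_fpr)
        total += regs + XCTX_FCSR_BYTES;
    return total;
}

/* Smallest power-of-two allocation (in bytes) that holds n bytes. */
static size_t xctx_alloc_bytes(size_t n)
{
    size_t a = 1;
    assert(n > 0);
    while (a < n)
        a <<= 1;
    return a;
}
```

Calling xctx_alloc_bytes(xctx_state_bytes(wide, has_fpr)) gives the allocation a variant would need for a given combination of mode and floating-point support.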

3.1 What Is Not In a Context

The following state is per-CPU, not per-context — it is not saved/restored on context switch:

mstatus and other privilege/trap CSRs (managed by traps, not tasks).
mxcrisp, mxlate, mxstack, mxcond, mxctx feature CSRs (read-only).
mxctxslice (slice duration is a kernel policy, not per-task).
Cache and TLB state (microarchitectural, transparent to tasks).
Note: the xstack pointers (usp, usb, usl) are the exception among the Xstack CSRs. They are per-context (§3), because each context has its own user stack region in BSRAM (see §13.4 for the interaction); the region itself is allocated once, at context creation time.

3.2 Context Storage

Context state lives in a dedicated FPGA BSRAM bank distinct from the Xstack BSRAM and from main memory. The bank is sized per variant:

FireStorm variant	Contexts	Per-context	Bank size
GW5AST-138	32	1 KB	32 KB

The bank is addressed by context ID (0..N−1). Each context's region is at a fixed offset; the hardware maps context-state accesses transparently. Software does not directly address the context BSRAM in user mode; M-mode code may read/write context state via dedicated CSRs (§8) for debugging and migration.

The BSRAM port width determines the cycles needed to save or load a context. Suggested port widths:

Variant	Port width	Save/restore cycles (1 KB context)
GW5AST-138	576-bit (72 B/cycle)	~15 cycles

A full context switch (save current + load next) costs roughly 2× the single-direction cost, plus a few cycles for queue manipulation: ~30 cycles. This is the worst case — implementations with parallel save/restore paths can overlap the two and approach single-direction latency.
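The cycle figures above follow from a ceiling division of the context size by the port width. A small sketch of that estimate, with hypothetical function names and with the queue-manipulation overhead passed in as an assumed parameter (the text only says "a few cycles"):

```c
#include <assert.h>

/* Cycles to move ctx_bytes through a port that transfers port_bits per cycle,
 * rounded up to a whole cycle. */
static unsigned xctx_transfer_cycles(unsigned ctx_bytes, unsigned port_bits)
{
    unsigned bytes_per_cycle = port_bits / 8u;
    assert(bytes_per_cycle > 0);
    return (ctx_bytes + bytes_per_cycle - 1u) / bytes_per_cycle;
}

/* Serialised save + load through one port. queue_cycles is an assumed fixed
 * overhead for ready-queue manipulation. */
static unsigned xctx_switch_cycles(unsigned ctx_bytes, unsigned port_bits,
                                   unsigned queue_cycles)
{
    return 2u * xctx_transfer_cycles(ctx_bytes, port_bits) + queue_cycles;
}
```

For a 1 KB context and a 576-bit port, xctx_transfer_cycles returns 15, which matches the table; an implementation that overlaps save and load would sit closer to the single-direction figure.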

4. Context Lifecycle

A context is always in one of four states, tracked in the hardware state table (one entry per context, 2 bits each, matching the codes below):

State	Code	Meaning
Free	`00`	Slot is unallocated; no associated task
Running	`01`	Currently executing on a CPU core (in v0.1, the single core)
Ready	`10`	Waiting in the ready queue; will be picked up when CPU is free
Halted	`11`	Suspended by HALT; will not run until RESUMEd

State transitions:

                   NEW
            Free ────────► Ready ◄─────────────────────┐
                           │                           │
                  (pulled  │ ▲                         │ RESUME
                   by CPU) │ │ (preempted / YIELD)     │
                           ▼ │                         │
                        Running ─────► Halted ─────────┘
                           │   HALT
                           │
                           └─────► Free
                              FREE

NEW: Free → Ready (with initial state set up; see §5.4).
CPU pickup: Ready → Running (automatic, whenever the running context executes YIELD, HALT, or FREE, or its slice expires).
YIELD or slice expiry: Running → Ready.
HALT: Running → Halted.
RESUME (issued by another context, possibly an ISR): Halted → Ready.
FREE: Running → Free (current context exits).
External / debug: M-mode may force any state transition via CSR ops.
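The transition list can be written as a small software model, which is handy for testing a scheduler against the specification. This is a sketch with hypothetical type and function names; treating every non-applicable event as "state unchanged" is a modelling assumption (the specification states it explicitly only for RESUME), and M-mode forced transitions are left out.

```c
#include <assert.h>

typedef enum { CTX_FREE = 0, CTX_RUNNING = 1, CTX_READY = 2, CTX_HALTED = 3 } ctx_state;
typedef enum { EV_NEW, EV_PICKUP, EV_YIELD, EV_HALT, EV_RESUME, EV_FREE } ctx_event;

/* Returns the state of a context after event ev, given its current state s.
 * Events that do not apply in state s leave it unchanged (assumption). */
static ctx_state ctx_next_state(ctx_state s, ctx_event ev)
{
    assert((unsigned)s <= (unsigned)CTX_HALTED);
    switch (ev) {
    case EV_NEW:    return s == CTX_FREE    ? CTX_READY   : s;
    case EV_PICKUP: return s == CTX_READY   ? CTX_RUNNING : s;
    case EV_YIELD:  return s == CTX_RUNNING ? CTX_READY   : s;  /* also slice expiry */
    case EV_HALT:   return s == CTX_RUNNING ? CTX_HALTED  : s;
    case EV_RESUME: return s == CTX_HALTED  ? CTX_READY   : s;  /* no-op otherwise */
    case EV_FREE:   return s == CTX_RUNNING ? CTX_FREE    : s;
    }
    return s;
}
```

The enum values deliberately match the 2-bit state codes in the table, so the model's states can be compared directly with CTXSTATE results.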

4.1 Ready Queue

The ready queue is a hardware FIFO of context IDs in the Ready state. When the CPU becomes idle (current context yielded, halted, freed, or expired its slice), the next ID is dequeued and that context becomes Running.

Queue depth equals the number of context slots. Order is FIFO by default; M-mode may reorder via CSR ops for priority schedulers.

If the queue is empty (no Ready contexts), the CPU idles — pipeline halts until an event (interrupt, or another agent issues NEW or RESUME) makes a context Ready. This is the natural fall-through for low-power operation when all tasks are blocked.
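A software model of the ready queue is a plain ring buffer of context IDs. The sketch below assumes the 32-slot pool of §3.2; the type and function names are hypothetical, and M-mode reordering is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define XCTX_SLOTS 32u  /* assumed pool size (GW5AST-138 variant) */

typedef struct {
    uint8_t  ids[XCTX_SLOTS];
    unsigned head;   /* index of the oldest entry */
    unsigned count;  /* number of queued IDs */
} ready_queue;

/* Append a context ID. Each context is queued at most once, so a queue with
 * one entry per slot cannot overflow. */
static void rq_push(ready_queue *q, uint8_t id)
{
    assert(q->count < XCTX_SLOTS);
    q->ids[(q->head + q->count) % XCTX_SLOTS] = id;
    q->count++;
}

/* Remove and return the oldest ID, or -1 when the queue is empty
 * (the point at which the CPU would idle). */
static int rq_pop(ready_queue *q)
{
    if (q->count == 0)
        return -1;
    int id = q->ids[q->head];
    q->head = (q->head + 1u) % XCTX_SLOTS;
    q->count--;
    return id;
}
```

A zero-initialised ready_queue is an empty queue; count corresponds to the value CTXCOUNT reports.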

5. Instruction Set

All Xctx instructions are 32-bit R-type in custom-2 (0x5B), funct3 = 111. They are available in both narrow and wide mode.

 31        25 24    20 19    15 14   12 11     7 6           0
+-----------+--------+--------+-------+--------+-------------+
|  funct7   |  rs2   |  rs1   |  111  |   rd   |  1011011    |
+-----------+--------+--------+-------+--------+-------------+

funct7	Mnemonic	rd	rs1	rs2	Operation
`0000000`	YIELD	—	—	—	Save current, mark Ready, load next
`0000001`	HALT	—	—	—	Save current, mark Halted, load next
`0000010`	FREE	—	—	—	Free current slot, load next
`0000011`	NEW	result	entry PC	initial sp	Allocate ctx; rd = ID or -1
`0000100`	RESUME	—	ctx ID	—	Wake halted context
`0000101`	CTXID	result	—	—	rd = current context ID
`0000110`	CTXSTATE	result	ctx ID	—	rd = state code of context rs1
`0000111`	CTXCOUNT	result	—	—	rd = number of Ready contexts
`0001000`–`1111111`	reserved				illegal-instruction

Unused register fields encode as x0. For instructions with no register operands (YIELD, HALT, FREE), all three register fields must be x0; non-x0 encodings are reserved.
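As an illustration of the table and the x0 rule, a decoder for the Xctx slot might look like the sketch below. The enum and function names are hypothetical. Only the rule stated above is enforced: register fields must be x0 for YIELD, HALT, and FREE; unused fields of the other instructions are not checked.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    XCTX_YIELD, XCTX_HALT, XCTX_FREE, XCTX_NEW,
    XCTX_RESUME, XCTX_CTXID, XCTX_CTXSTATE, XCTX_CTXCOUNT,
    XCTX_NOT_XCTX,   /* different opcode or funct3 */
    XCTX_RESERVED    /* Xctx slot, but reserved funct7 or register encoding */
} xctx_op;

static xctx_op xctx_decode(uint32_t insn)
{
    uint32_t opcode = insn & 0x7Fu;
    uint32_t rd     = (insn >> 7)  & 0x1Fu;
    uint32_t funct3 = (insn >> 12) & 0x7u;
    uint32_t rs1    = (insn >> 15) & 0x1Fu;
    uint32_t rs2    = (insn >> 20) & 0x1Fu;
    uint32_t funct7 = insn >> 25;

    /* The cast at the end relies on the enum order matching funct7. */
    assert(XCTX_CTXCOUNT == 7);

    if (opcode != 0x5Bu || funct3 != 0x7u)
        return XCTX_NOT_XCTX;
    if (funct7 > 7u)
        return XCTX_RESERVED;
    /* YIELD, HALT and FREE take no operands: all register fields must be x0. */
    if (funct7 <= 2u && (rd | rs1 | rs2) != 0u)
        return XCTX_RESERVED;
    return (xctx_op)funct7;
}
```

Both XCTX_RESERVED and an Xctx encoding on a core without the extension would end in an illegal-instruction trap; the sketch keeps them apart from XCTX_NOT_XCTX only to make the two cases visible.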

5.1 YIELD — Voluntary Yield

YIELD                       ; encoding: 0000000 00000 00000 111 00000 1011011

Saves the current context's state to BSRAM, marks the context as Ready (pushes its ID onto the ready queue), and loads the next Ready context. Single instruction; the architectural effect is equivalent to:

Save GPRs, FPRs, PC of next instruction, FCSR, xlate state, xstack state to current context's BSRAM region.
Push current context ID onto ready queue.
Pop the next ID from the ready queue. Because the previous step has just queued the current context, the queue is never empty at this point.
Load that context's state from BSRAM into the live register file and CSRs.
Resume execution at the loaded PC.

The saved PC is the instruction following YIELD, i.e., when the yielding context is resumed, execution continues immediately after the YIELD.

If no other context is Ready when YIELD executes, the yielding context is the only queue entry and is dequeued again at once, so execution simply continues after the YIELD; an implementation may skip the save and load in this case. The idle state (pipeline parked, low-power if supported) is entered only when HALT or FREE leaves the ready queue empty.

5.2 HALT — Suspend Until Resumed

HALT                        ; encoding: 0000001 00000 00000 111 00000 1011011

Saves the current context's state and marks it Halted. The context is not pushed onto the ready queue; it will not be scheduled until another context (or an interrupt handler) issues RESUME for its ID.

After HALT, the CPU loads the next Ready context (or idles if none). The saved PC is the instruction following HALT.

Common use: a task that blocks on I/O obtains its own context ID via CTXID, registers it with the I/O subsystem, then HALTs. The completion ISR issues RESUME on the stored ID.

5.3 FREE — Destroy Current Context

FREE                        ; encoding: 0000010 00000 00000 111 00000 1011011

Marks the current context as Free. The context's BSRAM region is released; the slot becomes available for future NEW operations. No state is saved (the task is exiting). The CPU loads the next Ready context.

A context that executes FREE never resumes. This is the analogue of a thread exit() call.

5.4 NEW — Spawn New Context

NEW rd, rs1, rs2            ; encoding: 0000011 rs2 rs1 111 rd 1011011
                            ; rd = context ID (or -1 if pool exhausted)
                            ; rs1 = entry PC
                            ; rs2 = initial stack pointer (x2)

Allocates a Free context slot and initialises it with:

pc = rs1 (the entry PC; the new context begins execution here)
x2 = rs2 (the initial stack pointer)
All other GPRs = 0 (clean register file)
All FPRs = 0; FCSR = 0
Translator state = all identity (xlate CSRs zero)
xstack pointers = freshly allocated user-stack region (one per context; see §13.4)
Mode = same as the caller (narrow or wide)

The new context is pushed onto the ready queue. The instruction returns the allocated context ID in rd, or -1 (all bits set) if no Free slot is available.

The caller continues execution at the instruction following NEW; the new context will execute when the CPU schedules it (typically on the next YIELD or slice expiry).

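NEW zeroes every GPR except x2, so arguments cannot be passed in registers. To pass arguments to the new context, the caller may either (a) write the arguments to a shared memory region and store that region's address at the top of the initial stack, or (b) pass the argument block's address directly in rs2 and arrange for the new context's entry code to consume the arguments and set up its own stack.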

5.5 RESUME — Wake Halted Context

RESUME rs1                  ; encoding: 0000100 00000 rs1 111 00000 1011011
                            ; rs1 = context ID to wake

If the context named by rs1 is in the Halted state, it is moved to Ready and pushed onto the ready queue. If the context is in any other state (Free, Running, Ready), RESUME is a no-op. No trap.

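The non-trapping semantics on non-Halted contexts is deliberate: it lets an ISR issue "wake up context X" without first checking X's state, and a RESUME that arrives after X has already been woken is harmless. A RESUME that arrives before X has executed HALT is also a no-op, however, so software must make sure such an early wake-up is not lost (for example, by ensuring the waker cannot run between registration and HALT).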

The caller continues execution at the next instruction.

5.6 CTXID — Get Current Context ID

CTXID rd                    ; encoding: 0000101 00000 00000 111 rd 1011011
                            ; rd = my context ID

Returns the current context's ID in rd. Useful for self-registration with I/O subsystems, debug printing, and any code that needs to identify itself.

5.7 CTXSTATE — Query Context State

CTXSTATE rd, rs1            ; encoding: 0000110 00000 rs1 111 rd 1011011
                            ; rd = state code of context rs1
                            ;      0=Free, 1=Running, 2=Ready, 3=Halted

Returns the state of the context named by rs1. If rs1 is out of range (≥ number of context slots), returns -1.

5.8 CTXCOUNT — Count Ready Contexts

CTXCOUNT rd                 ; encoding: 0000111 00000 00000 111 rd 1011011
                            ; rd = number of contexts in Ready state (excluding self)

Returns the current ready-queue depth. Useful for adaptive scheduling decisions (e.g., a task that sees a deep ready queue might shorten its work batch and YIELD sooner).

6. Time Slicing

The slice timer is a hardware counter that decrements each cycle the current context runs in U-mode. When it reaches zero, the hardware behaves exactly as if the current context had issued YIELD.

6.1 Slice Duration CSR

CSR	Address (suggested)	Privilege	Description
`mxctxslice`	`0xBC8`	MRW	Slice duration in cycles

The slice value applies uniformly to all contexts; v0.1 does not support per-context slice durations. Suggested values:

0: slice timer disabled. The current context runs until it voluntarily YIELDs, HALTs, or FREEs itself. This is the cooperative scheduling mode.
1000 (1 µs at 1 GHz): aggressive preemption for highly-interactive workloads.
100000 (100 µs): typical preemptive multitasking quantum.
Maximum: 64-bit value, effectively unbounded.

The slice timer is shared across contexts in v0.1 — it starts counting from the slice value when a context begins running, and decrements until either zero (preemption) or the context yields voluntarily. On context switch, the timer reloads to mxctxslice for the incoming context.

6.2 Slice Pauses

The slice timer pauses in:

M-mode (kernel) execution
S-mode (supervisor) execution, where implemented
Trap handlers (until mret / sret)
BSRAM context save/restore operations themselves (the switch is not charged to either context's slice)

The timer resumes when U-mode execution resumes.
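The timer behaviour described in §6.1 and §6.2 can be modelled cycle by cycle. The sketch below uses hypothetical names; the `paused` input stands for any of the pause conditions listed above, and the model keeps reporting expiry until the next reload, which is one way to represent an expiry that is honoured later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t reload;     /* value of mxctxslice; 0 disables preemption */
    uint64_t remaining;  /* cycles left in the current slice */
} slice_timer;

/* Called on every context switch: the incoming context gets a fresh slice. */
static void slice_reload(slice_timer *t)
{
    assert(t != 0);
    t->remaining = t->reload;
}

/* Advance one clock cycle. paused is true in M-mode, S-mode, trap handlers
 * and during the context save/restore itself. Returns true when the running
 * context should be preempted (treated as an implicit YIELD). */
static bool slice_tick(slice_timer *t, bool paused)
{
    assert(t != 0);
    if (t->reload == 0 || paused)
        return false;
    if (t->remaining > 0)
        t->remaining--;
    return t->remaining == 0;
}
```

A caller would invoke slice_reload after each switch and slice_tick once per simulated cycle, performing the YIELD sequence when it returns true.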

6.3 No-Slice Mode

When mxctxslice = 0, the hardware never auto-preempts. Tasks coordinate via explicit YIELD. This is the natural mode for:

Cooperative fiber systems with well-behaved tasks
Real-time code paths where preemption would cause unacceptable jitter
Boot-time single-task execution

The cooperative mode is the default at reset.

7. Multi-Core Handoff (Future)

The architecture in this document is single-core; v0.2 of Xctx will specify multi-core behaviour. The design intent is that the instruction-set semantics do not change: software written for single-core Xctx works unchanged on multi-core Xctx, gaining parallel execution automatically as additional cores pull Ready contexts from a shared queue.

The v0.2 design points (sketched):

A shared ready queue accessible to all cores.
A shared context BSRAM bank accessible to all cores.
Per-core "current context ID" register; the rest of the per-core state is the standard register file (which now holds the active context's registers on that core).
When a core becomes idle (YIELD/HALT/FREE/slice expiry on its current context), it atomically pops the next Ready ID from the shared queue and loads that context's state. Concurrent pops are serialised through the queue's arbitration.
NEW from any core allocates from the shared free pool; the new context becomes Ready and may be picked up by any core (including the one that issued NEW, if that core's current task subsequently YIELDs).
The instruction-set additions remain the same as v0.1.

What v0.2 specifies that v0.1 does not:

Cache-coherence behaviour between cores accessing the same memory region from different contexts.
Memory-ordering semantics for context migration (a value written by context A on core 0 must be visible to context A on core 1 after migration).
Possible CSR additions for core-pinning, NUMA hinting, or priority-based queue ordering.

The Ant64 platform initially has a single FireStorm core in the GoWin FPGA; multi-core FireStorm is a future product direction. v0.1 of Xctx delivers the full benefit to the single-core case.

8. CSR Allocation

8.1 Detection and Configuration

CSR	Address (suggested)	Privilege	Description
`mxctx`	`0xFC5`	MRO	Xctx version and pool size
`mxctxslice`	`0xBC8`	MRW	Slice duration in cycles (§6.1)
`mxctxid`	`0xC80`	URO	Current context ID (alias for CTXID, faster)
`mxctx_reset`	`0xBC9`	MRW	Write any value to reset the entire context pool

Bit layout of mxctx:

Bits	Field	Meaning
`[0]`	PRESENT	1 if Xctx implemented
`[7:1]`	VERSION	Xctx version (1 = v0.1)
`[15:8]`	NUM_CONTEXTS	Number of context slots (8–32 typical)
`[23:16]`	CTX_SIZE_LOG2	log₂ of per-context storage in bytes (10 = 1 KB)
`[24]`	HAS_FPR_CONTEXT	1 if FPR state is included in context
`[25]`	HAS_XLATE_CONTEXT	1 if Xlate state is per-context
`[26]`	HAS_XSTACK_CONTEXT	1 if Xstack pointers are per-context
`[27]`	MULTI_CORE	1 if multi-core context migration supported (v0.2)
`[63:28]`	reserved	—
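A sketch of decoding the mxctx fields from a raw 64-bit value follows. The struct and function names are hypothetical, and reading the CSR itself (an M-mode csrr) is left out so the code stays portable.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned present, version, num_contexts, ctx_size_log2;
    unsigned has_fpr, has_xlate, has_xstack, multi_core;
} mxctx_info;

/* Split a raw mxctx value into its fields, following the bit layout above. */
static mxctx_info mxctx_decode(uint64_t v)
{
    mxctx_info i;
    i.present       = (unsigned)( v        & 0x1u);
    i.version       = (unsigned)((v >> 1)  & 0x7Fu);
    i.num_contexts  = (unsigned)((v >> 8)  & 0xFFu);
    i.ctx_size_log2 = (unsigned)((v >> 16) & 0xFFu);
    i.has_fpr       = (unsigned)((v >> 24) & 0x1u);
    i.has_xlate     = (unsigned)((v >> 25) & 0x1u);
    i.has_xstack    = (unsigned)((v >> 26) & 0x1u);
    i.multi_core    = (unsigned)((v >> 27) & 0x1u);
    return i;
}

/* Size of the whole context bank in bytes, as implied by the CSR fields. */
static uint64_t mxctx_bank_bytes(const mxctx_info *i)
{
    assert(i->ctx_size_log2 < 64u);
    return (uint64_t)i->num_contexts << i->ctx_size_log2;
}
```

For example, the value 0x070A2003 decodes to a v0.1 implementation with 32 contexts of 1 KB each and per-context FPR, Xlate, and Xstack state, for a 32 KB bank.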

8.2 M-Mode Context Inspection (Debug/Migration)

For debuggers and kernel-level migration support, M-mode may inspect and modify any context's saved state via:

CSR	Address (suggested)	Privilege	Description
`mxctx_sel`	`0xBCA`	MRW	Selects which context's state is exposed in the windows below
`mxctx_reg`	`0xBCB`	MRW	Indexed register window (write `mxctx_idx` first to select GPR/FPR/PC/etc.)
`mxctx_idx`	`0xBCC`	MRW	Index within the selected context's state
`mxctx_state`	`0xBCD`	MRW	Read/write the state code of `mxctx_sel`'s context

This provides a generic "select a context, then window into its state via index" API. The specific index encoding is implementation-defined within the architectural framework; suggested layout: 0–63 for GPRs, 64–127 for FPRs, 128 for PC, 129 for FCSR, 130–133 for xlate CSRs, 134–136 for xstack CSRs.
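Under the suggested index layout, a debugger could classify an mxctx_idx value as in the sketch below. The names are hypothetical, and since the layout is only a suggestion, an implementation may number its state differently.

```c
#include <assert.h>

typedef enum {
    IDX_GPR, IDX_FPR, IDX_PC, IDX_FCSR, IDX_XLATE, IDX_XSTACK, IDX_INVALID
} idx_class;

/* Classify an mxctx_idx value under the suggested layout and return the
 * element number within its class through *n (e.g. 5 for x5). */
static idx_class mxctx_idx_classify(unsigned idx, unsigned *n)
{
    assert(n != 0);
    if (idx <= 63)  { *n = idx;       return IDX_GPR; }
    if (idx <= 127) { *n = idx - 64;  return IDX_FPR; }
    if (idx == 128) { *n = 0;         return IDX_PC; }
    if (idx == 129) { *n = 0;         return IDX_FCSR; }
    if (idx <= 133) { *n = idx - 130; return IDX_XLATE; }
    if (idx <= 136) { *n = idx - 134; return IDX_XSTACK; }
    *n = 0;
    return IDX_INVALID;
}
```

Reading a register then amounts to writing mxctx_sel, writing the index to mxctx_idx, and reading mxctx_reg.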

User code does not need these CSRs — they exist for kernels, debuggers, and core migration logic.

9. Trap and Interrupt Interaction

9.1 Traps During Normal Execution

A trap (synchronous or asynchronous) on the running context is handled in the standard RV64 way: control transfers to the trap handler, which executes in the same context (sharing its register file). The trap handler may issue Xctx instructions (e.g., RESUME to wake a blocked task) before returning via mret/sret.

The current context does not change on trap entry — the trap handler runs inside the trapping context. If the kernel wants to switch contexts before returning to user, it issues YIELD or sets up a different context to be next.

9.2 Traps During Context Switch

A context switch is atomic with respect to traps: a trap raised mid-switch is held until the switch completes (the new context is fully loaded), at which point the trap is taken in the new context. This prevents the trap handler from observing half-saved or half-loaded register state.

9.3 Interrupts Targeting Halted Contexts

A hardware interrupt always targets the currently-running context. If the interrupt is logically meant for a Halted context (e.g., an I/O completion for a blocked task), the running ISR is responsible for calling RESUME on the appropriate context ID to make it Ready. The kernel's interrupt dispatch table maps device → context ID.

9.4 Asynchronous Trap on a Halted Context

If a trap source becomes pending while its destination context is Halted, the trap is queued within the trap controller. When the context is RESUMEd and starts running, the queued trap fires on the first instruction. This avoids losing traps across HALT/RESUME boundaries.

(Implementation note: this requires per-context trap-pending bits in the interrupt controller. Simpler implementations may handle only currently-running-context traps in v0.1; queued-trap support is an open item.)

9.5 New Trap Causes

Cause	Mnemonic	Trigger
`40`	XCTX_POOL_EXHAUSTED	Optional: NEW failed; can be left to software (NEW returns -1 instead of trapping)
`41`	XCTX_INVALID_ID	Reserved for v0.2; v0.1 silently no-ops on invalid IDs
`42`	XCTX_PRIVILEGE	Attempted unprivileged write to `mxctxslice` or other M-mode CSR

Cause numbers are suggested.

10. Examples

All examples assume Xctx, Xstack, and Xlate are present (+xfirestorm).

10.1 Cooperative Fiber Yield

Two fibers exchange the CPU via simple YIELDs:

void fiber_a(void) {
    for (int i = 0; i < 10; i++) {
        do_a_work(i);
        yield();        /* expands to YIELD */
    }
    ctx_free();         /* hypothetical wrapper for FREE; a fiber must not return (ra is 0 in a new context) */
}

void fiber_b(void) {
    for (int i = 0; i < 10; i++) {
        do_b_work(i);
        yield();
    }
    ctx_free();
}

int main(void) {
    new_context(fiber_a, stack_a + STACK_SIZE);
    new_context(fiber_b, stack_b + STACK_SIZE);
    /* main becomes the scheduler: it keeps yielding, also after both fibers have FREEd */
    while (1) yield();
}

In assembly:

fiber_a_body:
        ; ... do_a_work ...
        YIELD
        ; ... loop ...

fiber_b_body:
        ; ... do_b_work ...
        YIELD
        ; ... loop ...

main:
        LAPC    a1, fiber_a_body
        LAPC    a2, stack_a_top
        NEW     t0, a1, a2          ; t0 = fiber A's context ID
        LAPC    a1, fiber_b_body
        LAPC    a2, stack_b_top
        NEW     t0, a1, a2          ; t0 = fiber B's context ID
.Lschedloop:
        YIELD
        j       .Lschedloop

Each YIELD is one instruction (~30 cycles including the switch on Ant64). Counting 10 iterations × 2 fibers × 2 YIELDs (in and out) gives 40 YIELDs, or ~1200 cycles of switching overhead in total. A software-fiber equivalent on FireStorm would execute 30–60 instructions per yield (§1.1), i.e. 1200–2400 instructions for the same 40 yields, most of them loads and stores that evict useful data from the 8 KB D-cache. On instruction count alone the hardware path is therefore between break-even and roughly 2× cheaper, and the BSRAM-resident context state never touches the D-cache.

10.2 Producer-Consumer with HALT/RESUME

A producer fills a buffer; a consumer drains it. When the buffer is full, the producer HALTs; when the consumer drains a slot, it RESUMEs the producer. Symmetric for empty-buffer/consumer-halts.

int producer_ctx = -1, consumer_ctx = -1;   /* -1 = nobody waiting; ID 0 is a valid context */

void producer(void) {
    while (1) {
        int item = make_item();
        while (buffer_full()) {
            producer_ctx = ctxid();      /* register self */
            halt();                       /* sleep until consumer wakes me */
        }
        buffer_push(item);
        if (consumer_ctx >= 0) {
            int c = consumer_ctx;
            consumer_ctx = -1;
            resume(c);                    /* wake consumer if it was waiting */
        }
    }
}

void consumer(void) {
    while (1) {
        while (buffer_empty()) {
            consumer_ctx = ctxid();
            halt();
        }
        int item = buffer_pop();
        if (producer_ctx >= 0) {
            int p = producer_ctx;
            producer_ctx = -1;
            resume(p);
        }
        consume(item);
    }
}

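This implements a classic blocking producer-consumer queue using only HALT and RESUME, with no condition variables and no mutex. It assumes a single core, atomic buffer operations, and cooperative scheduling (mxctxslice = 0): if a context could be preempted between registering its ID and executing HALT, the other side's RESUME would be a no-op (§5.5) and the wake-up would be lost. Each block-and-wake is a few instructions; no kernel call.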

10.3 Preemptive Round-Robin

A simple preemptive multitasker: any number of tasks, each gets equal time slices, no explicit YIELD needed.

/* Setup */
__set_mxctxslice(10000);              /* 10000 cycles = 10 µs at 1 GHz; M-mode only */
new_context(task1, stack1_top);
new_context(task2, stack2_top);
new_context(task3, stack3_top);

/* Original context (main) joins the round-robin pool */
while (1) {
    main_work();
    /* No yield needed — hardware preempts after 10000 cycles */
}

The slice timer expires every 10000 cycles and switches contexts automatically. No software dispatcher needed; the hardware ready queue handles all the rotation.

10.4 I/O Wait via HALT

A network receive operation:

int recv(int sockfd, void *buf, size_t len) {
    if (data_available(sockfd)) {
        return immediate_recv(sockfd, buf, len);
    }
    /* No data — register self and halt until interrupt wakes us */
    register_waiter(sockfd, ctxid());
    halt();                            /* ISR will RESUME us when data arrives */
    return immediate_recv(sockfd, buf, len);
}

/* ISR for the network device */
void net_isr(void) {
    int waiter = lookup_waiter(get_sockfd_from_irq());
    if (waiter >= 0) {
        resume(waiter);
    }
}

The blocking recv costs: register_waiter (a few instructions for the table update) + halt (1 instruction, ~30 cycles for the switch). The ISR costs: lookup_waiter + resume (1 instruction). Total wake-to-running latency: well under 100 cycles. Compare to a software-thread implementation, which typically takes 10× longer due to mode switches and queue manipulation.

10.5 Task Creation

A fork-style spawn:

int spawn(void (*entry)(void *), void *arg) {
    void *stack = malloc_stack(DEFAULT_STACK_SIZE);
    if (!stack) return -1;
    /* Reserve one 16-byte slot at the top of the new stack and store arg in it.
       NEW zeroes every GPR except sp, so the entry code must load arg from 0(sp).
       Assumes the top of the stack is 16-byte aligned, as the RISC-V ABI requires for sp. */
    char *sp = (char *)stack + DEFAULT_STACK_SIZE - 16;
    *(void **)sp = arg;
    return __builtin_xctx_new(entry, sp);
}

In assembly the body is:

spawn:
        ; ... allocate stack, leaving its top address in a2 ...
        addi    a2, a2, -16             ; reserve an aligned 16-byte slot
        sd      a1, 0(a2)               ; store arg at the new sp
        NEW     a0, a0, a2              ; a0 = new ctx ID (or -1)
        ret

10.6 Worked Encoding

YIELD (no operands):

funct7 = 0000000, rs2 = x0, rs1 = x0, funct3 = 111, rd = x0, opcode = 0x5B
Encoding: 0000000 00000 00000 111 00000 1011011 = 0x0000_705B

NEW a0, a1, a2 (allocate context; a0 = result, a1 = entry, a2 = sp):

funct7 = 0000011, rs2 = 12 (a2), rs1 = 11 (a1), funct3 = 111, rd = 10 (a0)
Encoding: 0000011 01100 01011 111 01010 1011011 = 0x06C5_F55B

RESUME a0 (wake context whose ID is in a0):

funct7 = 0000100, rs2 = x0, rs1 = 10 (a0), funct3 = 111, rd = x0
Encoding: 0000100 00000 01010 111 00000 1011011 = 0x0805_705B
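The worked encodings can be reproduced with a few shifts. The helper below is a sketch with a hypothetical name; it packs the R-type fields in the order shown in §5 and fixes funct3 = 111 and opcode = 0x5B.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit Xctx instruction word: R-type, opcode custom-2 (0x5B),
 * funct3 = 111. Register arguments are register numbers (a0 = 10, etc.). */
static uint32_t xctx_encode(uint32_t funct7, uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    assert(funct7 < 128u && rd < 32u && rs1 < 32u && rs2 < 32u);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (0x7u << 12) | (rd << 7) | 0x5Bu;
}
```

With this helper, xctx_encode(0, 0, 0, 0) yields the YIELD word 0x0000705B, xctx_encode(3, 10, 11, 12) yields NEW a0, a1, a2, and xctx_encode(4, 0, 10, 0) yields RESUME a0. An assembler without Xctx support could emit the resulting words with a .word directive.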

11. ABI and OS Integration

11.1 Calling Convention

Xctx does not alter the standard lp64d calling convention. A context's register file follows the usual caller-saved / callee-saved categorisation; YIELD is treated as a "function call" with no inputs or outputs from the language perspective (no register changes architecturally guaranteed across YIELD, though in practice the same register file is restored when the context resumes).

Code calling yield() as a C function should treat it as a memory barrier (the compiler may not reorder memory accesses across it) and as a clobber of all caller-saved registers (in case the resumed context's state is different from the yielding one's — though architecturally Xctx restores exactly the same state, the C model is conservative).
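One way to obtain the memory-barrier behaviour in C is a small inline-assembly wrapper. The sketch below is an assumption-laden illustration, not a reference implementation: it relies on the GCC/Clang extended-asm syntax, emits the YIELD word from §10.6 directly because a toolchain may not know the mnemonic, and falls back to a pure compiler barrier when built for a non-RISC-V host. The name xctx_yield is hypothetical.

```c
#include <assert.h>

/* Issue YIELD on a RISC-V target; on any other target this is only a
 * compiler barrier and no context switch takes place. */
static inline void xctx_yield(void)
{
#if defined(__riscv)
    /* 0x0000705B is the YIELD encoding. The "memory" clobber stops the
     * compiler from moving memory accesses across the switch. */
    __asm__ volatile (".word 0x0000705B" ::: "memory");
#else
    __asm__ volatile ("" ::: "memory");
#endif
}
```

The wrapper lists only the "memory" clobber, since Xctx restores the register file exactly; a build that wants the fully conservative C model described above could additionally list the caller-saved registers as clobbers.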

11.2 Standard-Library Integration

Standard pthread or <threads.h> interfaces can be implemented on top of Xctx with significant performance gains:

pthread_create → NEW + stack allocation
sched_yield (or the non-standard pthread_yield) → YIELD
pthread_join → loop of CTXSTATE polling, or HALT/RESUME via shared flag
pthread_exit → FREE
mutex_lock (blocking) → CTXID + HALT pattern (§10.4)

A reference libc implementation of pthread on Xctx is a future deliverable.

11.3 Kernel Integration

A kernel using Xctx for thread scheduling sees:

No save/restore code in trap entry/exit (the hardware does it on context switch).
No software timer ISR needed for preemption (the slice timer does it).
Cheap dispatch — schedule_next() is just YIELD, and the hardware picks the next Ready context.

Kernel-only complications:

Inspecting another task's state for debugging requires the M-mode context-inspection CSRs (§8.2).
Migrating a task from one queue to another (for priority schedulers) requires M-mode access to the state table.

12. Implementation Guidance

12.1 BSRAM Configuration

The context BSRAM is a single bank with a wide port (the actual width determines switch cycle cost). A single-ported bank is sufficient: context save and load operations are serialised through the same port, so a switch (save + load) takes 2× the single-direction transfer time. A dual-ported BSRAM (save and load in parallel) would halve switch latency but double the BSRAM resource cost.

Suggested configuration:

576-bit port (72 B = 9 dwords/cycle)
32 contexts × 1 KB = 32 KB total
Switch latency: ~30 cycles (15 each direction)

12.2 Pipeline Integration

A context switch flushes the pipeline (any in-flight instructions of the outgoing context must complete or be squashed). The simplest implementation issues YIELD/HALT/FREE as a serialising instruction that drains the pipeline before saving state. Aggressive implementations can overlap the drain with the save (write-back of in-flight instructions becomes the first phase of save).

Pipelined save and load can also overlap if the BSRAM has sufficient port width; for 576-bit ports, the bottleneck is the register-file read/write ports, which dictate ~15-cycle parallel save and load.

12.3 Slice Timer Implementation

The slice timer is a single 64-bit decrementing counter, clocked by the CPU's main clock, with a pause signal that asserts in M-mode, in S-mode, and during context switches. On reaching zero, it asserts a "preempt" signal that triggers an internal YIELD at the next non-trapped U-mode instruction boundary.

The counter reloads from mxctxslice on every context switch (i.e., the incoming context starts with a fresh slice).

Total hardware cost: one 64-bit counter, a few mux/compare gates. Negligible.
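
The counter behaviour described in this section can be modelled in a few lines of C. This is a behavioural sketch only, not RTL: the struct and function names are hypothetical, the `paused` input stands for the combined M-mode / S-mode / switch-in-progress pause signal, and the sketch does not model a disabled timer (it assumes a non-zero reload value).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t count;       /* cycles remaining in the current slice */
    uint64_t mxctxslice;  /* reload value (models the mxctxslice CSR) */
    bool     preempt;     /* latched "preempt" request to the pipeline */
} slice_timer;

/* Called on every context switch: the incoming context gets a fresh slice. */
static void slice_timer_reload(slice_timer *t)
{
    assert(t->mxctxslice > 0);   /* assumption: disabled timer not modelled */
    t->count   = t->mxctxslice;
    t->preempt = false;
}

/* Advance the model by one CPU clock. */
static void slice_timer_tick(slice_timer *t, bool paused)
{
    if (paused || t->preempt)
        return;                  /* counter holds its value */
    if (--t->count == 0)
        t->preempt = true;       /* pipeline YIELDs at the next boundary */
}
```

The preempt flag stays latched until the next reload, which mirrors the text: the internal YIELD happens at the next eligible U-mode instruction boundary, and the resulting context switch reloads the counter.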

12.4 State Table Implementation

The state table (Free/Running/Ready/Halted per context) is a small RAM with 2 bits per entry: 16 entries × 2 bits = 32 bits (4 bytes) for Ant64. The hardware reads and writes it on every state transition.
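
Sixteen 2-bit entries fit in a single 32-bit word. The sketch below shows one way to pack and access them. The numeric state encodings are an assumption made for illustration (the text above does not fix them), and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { XCTX_SLOTS = 16 };          /* 16 entries, as in the text above */

/* Assumed 2-bit encodings; the specification may assign different values. */
typedef enum {
    CTX_FREE    = 0,
    CTX_RUNNING = 1,
    CTX_READY   = 2,
    CTX_HALTED  = 3
} ctx_state;

typedef struct {
    uint32_t bits;                 /* 16 slots x 2 bits = 32 bits */
} ctx_state_table;

static ctx_state ctx_state_get(const ctx_state_table *t, unsigned id)
{
    assert(id < XCTX_SLOTS);
    return (ctx_state)((t->bits >> (2u * id)) & 3u);
}

static void ctx_state_set(ctx_state_table *t, unsigned id, ctx_state s)
{
    assert(id < XCTX_SLOTS);
    unsigned shift = 2u * id;
    /* Clear the slot's two bits, then write the new state. */
    t->bits = (t->bits & ~(UINT32_C(3) << shift)) | ((uint32_t)s << shift);
}
```

With Free encoded as 0, a zeroed table starts with every slot Free, which is a convenient reset value under this assumed encoding.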

12.5 Ready Queue Implementation

A small FIFO of context IDs, ~5 bits per entry × N entries. Trivial in any FPGA.
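
A software model of such a FIFO is a plain ring buffer. The sketch below assumes N = 32 entries (matching 5-bit context IDs); the names are hypothetical, and real hardware would more likely use head/tail pointers into a small RAM or register array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { RQ_SLOTS = 32 };            /* assumed N; 5-bit IDs cover 0..31 */

typedef struct {
    uint8_t  ids[RQ_SLOTS];        /* only the low 5 bits are meaningful */
    unsigned head;                 /* index of the oldest entry */
    unsigned count;                /* number of queued context IDs */
} ready_queue;

/* Append a context that has just become Ready. Returns false if full. */
static bool rq_push(ready_queue *q, uint8_t id)
{
    assert(id < RQ_SLOTS);
    if (q->count == RQ_SLOTS)
        return false;
    q->ids[(q->head + q->count) % RQ_SLOTS] = id;
    q->count++;
    return true;
}

/* Remove the oldest Ready context. Returns false if the queue is empty,
 * which corresponds to the idle condition. */
static bool rq_pop(ready_queue *q, uint8_t *id)
{
    if (q->count == 0)
        return false;
    *id = q->ids[q->head];
    q->head = (q->head + 1u) % RQ_SLOTS;
    q->count--;
    return true;
}
```

If each context appears in the queue at most once, a depth equal to the number of context slots cannot overflow, so the full check in `rq_push` is purely defensive.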

12.6 Idle State

When no context is Ready (all are Halted or Free), the CPU enters an idle state: pipeline parked, clocks gated where possible, awaiting an interrupt. The first interrupt enters the M-mode trap handler, which can RESUME a context or otherwise unblock the system.

13. Interaction with Other Extensions

13.1 Xwide

In wide mode, a context contains all 64 GPRs (x0–x63) and 64 FPRs (f0–f63). In narrow mode, only x0–x31 and f0–f31 are saved. A context started in one mode and resumed in the same mode sees its full register file restored.

Mode crossing is a real concern. A context started in narrow mode but resumed on a wide-mode CPU core sees the upper registers (x32–x63) as zeros, since they were never saved. A wide-mode context resumed in narrow mode loses access to x32–x63 entirely, but since narrow-mode code cannot address them, this has no observable effect.

Mixing modes within a single context (e.g., a single task that runs in narrow mode in one phase and wide mode in another) is allowed; the context's stored state includes the mode flag, and the mode is restored on context load.
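
The save and load rules in this section can be illustrated with a small behavioural model. It is a sketch under stated assumptions: only GPRs are modelled (FPRs would follow the same pattern), the hard-wired x0 is ignored, the struct layout is invented for illustration, and zeroing is modelled as an explicit clear on load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { NARROW_REGS = 32, WIDE_REGS = 64 };

/* Live register state of the core. */
typedef struct {
    bool     wide;
    uint64_t x[WIDE_REGS];
} core_regs;

/* Stored context image (hypothetical layout). */
typedef struct {
    bool     wide;                 /* mode flag saved with the context */
    uint64_t x[WIDE_REGS];         /* only x[0..31] are valid when !wide */
} ctx_image;

/* Save: narrow mode writes x0..x31 only; wide mode writes x0..x63. */
static void ctx_save(ctx_image *img, const core_regs *core)
{
    assert(img != NULL && core != NULL);
    unsigned n = core->wide ? WIDE_REGS : NARROW_REGS;
    img->wide = core->wide;
    memcpy(img->x, core->x, n * sizeof core->x[0]);
}

/* Load: restore the mode flag and the saved registers; registers that
 * were never saved read as zero. */
static void ctx_load(core_regs *core, const ctx_image *img)
{
    assert(img != NULL && core != NULL);
    unsigned n = img->wide ? WIDE_REGS : NARROW_REGS;
    core->wide = img->wide;
    memcpy(core->x, img->x, n * sizeof img->x[0]);
    memset(core->x + n, 0, (WIDE_REGS - n) * sizeof core->x[0]);
}
```

In this model a narrow-mode context never observes stale upper-register values left behind by a previously running wide-mode context, which is the property the zero-fill rule provides.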

13.2 Xlate

Each context has its own xlate CSR state (xlate_rd_0..3, xlate_wr_0..3). When a context yields, its xlate state is saved alongside its register file; when resumed, the xlate state is restored. This means tasks can configure translators without interfering with each other.

This is critical: without per-context xlate state, a task that sets up a byteswap translator would leak that translator into the next-scheduled task, corrupting its memory operations.

13.3 Xcond

Predicated instructions are stateless from a context perspective — there are no per-context predicate-state CSRs. Context switches across predicated instructions are uninteresting; the predicate evaluation happens entirely within the instruction.

13.4 Xstack

Each context has its own user-stack region in the Xstack BSRAM. The usp, usb, usl CSRs are per-context; on context creation (NEW), the kernel (or runtime) allocates a fresh BSRAM region for the new context's user stack and initialises the CSRs to point at it.

Implementation: the Xstack BSRAM is partitioned into N user-stack regions, one per context. Each region's base and limit are fixed; the per-context usb/usl simply select which region is active. On context switch, the active region selector updates along with the rest of the context state.

Suggested partitioning:

FireStorm variant	Xstack total	Per-context u-stack
GW5AST-138	64 KB	64 KB / 32 contexts = 2 KB each

(Supervisor and machine stacks are not per-context; they are per-privilege-level, shared across all U-mode tasks.)
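
With fixed, equal-sized regions, a context's region follows directly from its context ID. The sketch below uses the GW5AST-138 figures from the table and assumes region 0 starts at offset 0 of the Xstack BSRAM. It returns a half-open byte range and deliberately does not say how that range maps onto usb/usl or which way the stack grows, since those details belong to the Xstack specification. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    XSTACK_BYTES  = 64 * 1024,                     /* Xstack total */
    XCTX_CONTEXTS = 32,                            /* context slots */
    USTACK_BYTES  = XSTACK_BYTES / XCTX_CONTEXTS   /* 2 KB per context */
};

/* Half-open byte range [lo, hi) within the Xstack BSRAM. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} ustack_region;

/* Fixed user-stack region owned by a given context ID. */
static ustack_region ustack_region_for(unsigned ctx_id)
{
    assert(ctx_id < XCTX_CONTEXTS);
    ustack_region r;
    r.lo = (uint32_t)ctx_id * USTACK_BYTES;
    r.hi = r.lo + USTACK_BYTES;
    return r;
}
```

Because the region is a pure function of the context ID, the "active region selector" mentioned above can simply be the ID of the running context.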

13.5 Xcrisp

Xcrisp instructions execute within whatever context is currently running. Specific interactions:

DMA operations (DMACPY, DMASET) are issued from a context but run asynchronously. If the issuing context is preempted while the DMA is in progress, the DMA continues; when the context resumes (potentially much later), it observes the DMA as either still running or complete via the tagged count register, exactly as in §5.5.2 of ee_xcrisp.
Block memory operations (BMCPY, BMSET) are synchronous and consume slice cycles; a slice expiry mid-BMCPY behaves like any other trap — the BMCPY's restart semantics (§5.5.1 of ee_xcrisp) preserve progress in the registers, and the resumed context picks up where it left off.
PIC and indexed loads are ordinary loads; no context-specific interaction.
Auto-inc and memory-fused operations are ordinary loads/stores; no context-specific interaction.
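
The restartable behaviour of a block copy under slice expiry can be pictured with a small model. This is an illustration only: it assumes progress is held as an updated destination, source, and remaining length (the actual register assignment is defined in the Xcrisp specification), and it charges one budget unit per byte as an arbitrary simplification. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural progress of an in-flight block copy (assumed layout). */
typedef struct {
    uint8_t       *dst;
    const uint8_t *src;
    size_t         remaining;
} bmcpy_regs;

/* Run the copy until it finishes or the slice budget is used up.
 * Returns 1 when the copy is complete, 0 if it must be resumed later.
 * All progress lives in *r, so a later call continues where this one
 * stopped, just as a resumed context re-executes the instruction. */
static int bmcpy_run(bmcpy_regs *r, size_t budget)
{
    assert(r != NULL);
    while (r->remaining > 0 && budget > 0) {
        *r->dst++ = *r->src++;
        r->remaining--;
        budget--;
    }
    return r->remaining == 0;
}
```

Since the progress values are part of the context's register file, they are saved and restored by the context switch like any other register, and no extra per-context state is needed for an interrupted copy.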

14. Encoding Summary

14.1 At-a-Glance Map

Mnemonic	opcode	funct3	funct7	rd	rs1	rs2	Privilege
YIELD	`0x5B`	`111`	`0000000`	x0	x0	x0	U+
HALT	`0x5B`	`111`	`0000001`	x0	x0	x0	U+
FREE	`0x5B`	`111`	`0000010`	x0	x0	x0	U+
NEW	`0x5B`	`111`	`0000011`	rd	rs1	rs2	U+
RESUME	`0x5B`	`111`	`0000100`	x0	rs1	x0	U+
CTXID	`0x5B`	`111`	`0000101`	rd	x0	x0	U+
CTXSTATE	`0x5B`	`111`	`0000110`	rd	rs1	x0	U+
CTXCOUNT	`0x5B`	`111`	`0000111`	rd	x0	x0	U+

14.2 Reserved

0x5B funct3 = 111 with funct7 ≥ 0001000: reserved for future Xctx expansion.
0x5B funct3 = 111 with non-x0 register fields in instructions defined to take no register operands (YIELD, HALT, FREE): reserved (may be repurposed in future for, e.g., YIELD-to-specific-context).
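
The encoding map and the reserved rules can be checked with a small encoder and classifier. The sketch assumes the standard RISC-V R-type field layout with 5-bit register fields (funct7 at bits 31:25, rs2 at 24:20, rs1 at 19:15, funct3 at 14:12, rd at 11:7, opcode at 6:0); how wide-mode registers x32–x63 would be encoded is not covered. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { XCTX_OPCODE = 0x5B, XCTX_FUNCT3 = 0x7 };

/* funct7 values from the at-a-glance map. */
typedef enum {
    XCTX_YIELD = 0, XCTX_HALT = 1, XCTX_FREE = 2, XCTX_NEW = 3,
    XCTX_RESUME = 4, XCTX_CTXID = 5, XCTX_CTXSTATE = 6, XCTX_CTXCOUNT = 7
} xctx_funct7;

typedef enum { XCTX_NOT_XCTX, XCTX_DEFINED, XCTX_RESERVED } xctx_class;

/* Assemble an Xctx instruction word (R-type layout assumed). */
static uint32_t xctx_encode(xctx_funct7 f7, unsigned rd, unsigned rs1, unsigned rs2)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return ((uint32_t)f7 << 25) | ((uint32_t)rs2 << 20) | ((uint32_t)rs1 << 15) |
           ((uint32_t)XCTX_FUNCT3 << 12) | ((uint32_t)rd << 7) | (uint32_t)XCTX_OPCODE;
}

/* Classify a word according to the map and the two reserved rules. */
static xctx_class xctx_classify(uint32_t insn)
{
    unsigned opcode = insn & 0x7Fu;
    unsigned rd     = (insn >> 7) & 0x1Fu;
    unsigned funct3 = (insn >> 12) & 0x7u;
    unsigned rs1    = (insn >> 15) & 0x1Fu;
    unsigned rs2    = (insn >> 20) & 0x1Fu;
    unsigned funct7 = insn >> 25;

    if (opcode != XCTX_OPCODE || funct3 != XCTX_FUNCT3)
        return XCTX_NOT_XCTX;
    if (funct7 >= 8)
        return XCTX_RESERVED;          /* future Xctx expansion */
    if (funct7 <= XCTX_FREE && (rd | rs1 | rs2) != 0)
        return XCTX_RESERVED;          /* YIELD/HALT/FREE with non-x0 fields */
    return XCTX_DEFINED;
}
```

Under these assumptions YIELD, HALT and FREE encode as 0x0000705B, 0x0200705B and 0x0400705B. The classifier only applies the two rules listed above; the treatment of non-x0 values in unused fields of the other instructions is not stated here and is left unflagged.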

14.3 Note: Xstack v0.2 Machine-Stack Slot Reassignment

The Xstack v0.1 specification (§4.5) reserved custom-2 funct3 = 111 for "future machine stack operations." Xctx claims this slot in its place; Xstack machine-stack instructions (if added in v0.2) will use the management subfamily (funct3 = 110) with funct7 values in the 1xxxxxx range, which is currently reserved within that subfamily and provides ample room.

15. Open Items

CSR addresses. mxctx at 0xFC5 and the various mxctx* CSRs at 0xBC8–0xBCD are suggested; final assignment requires coordination with other FireStorm extensions' CSR allocations.
Multi-core architecture (§7). Sketched but not specified in v0.1. Key open questions: D-cache coherence between cores (FireStorm's tiny 8 KB D-cache per core needs a coherence protocol — likely simple invalidation-based given the small size), prefetch-buffer coherence across cores for SMC scenarios, context-migration semantics, core-pinning hints, queue-arbitration policy.
Per-context slice durations. v0.1 has a single global slice value. v0.2 could add per-context slice duration for priority schedulers (real-time tasks get longer slices; background tasks shorter).
Context groups / domains. Should there be a way to group contexts so they can only interact with each other (e.g., a "process" abstraction)? This would constrain RESUME and CTXSTATE by group. Open for v0.2 if process isolation is needed.
Trap queueing on Halted contexts (§9.4). Implementation-defined in v0.1; may be tightened in v0.2.
Context migration to/from main memory. When the BSRAM context pool is exhausted, software might want to swap contexts to DRAM. This requires a CSR-mediated "save context X to address A" and "load context X from address A" mechanism. Not in v0.1.
Trap during NEW. If NEW is interrupted mid-allocation (after the slot has left the Free state but before its initial state has been loaded), what happens? v0.1 says NEW is atomic; this may need refinement if initialisation becomes large.
Per-context FP control state. FCSR is per-context; should FFLAGS and FRM (subfields of FCSR) be separately addressable as part of context inspection? Implementation choice.
Slice clock source. v0.1 uses the CPU clock. For low-power variants with clock gating, a separate always-on slice clock may be useful.
Cross-privilege context transitions. A context starting in U-mode that traps to M-mode and then returns — does the slice timer keep counting? v0.1 says paused in M-mode; the alternative is to keep counting (which would let a long M-mode trap consume an entire slice). The v0.1 choice is intentional to give tasks predictable U-mode time, but is open to revisit.

16. Glossary

Term	Meaning
Context	A snapshot of execution state (GPRs, FPRs, PC, key CSRs) sufficient to resume an interrupted task.
Context ID	A small integer (0..N−1) identifying a context slot in the hardware pool.
Context pool	The set of N context slots, sized per FireStorm variant.
Context state table	The hardware table tracking each slot's state (Free/Running/Ready/Halted).
Ready queue	The hardware FIFO of contexts in the Ready state, awaiting CPU pickup.
Slice	The number of cycles a Running context may execute before automatic preemption.
YIELD / HALT / FREE	Voluntary state transitions issued by the running context.
NEW / RESUME	State transitions issued on behalf of a different context (creating or waking it).
Idle	The CPU state when no context is Ready; pipeline parked awaiting an event.

End of document. See also: FireStorm CPU ISA, FireStorm Xcrisp Extension, FireStorm Xstack Extension, FireStorm Xcond Extension, FireStorm Xlate Extension.