FireStorm Xmath Extension
1. Overview
The Xmath extension accelerates games, audio synthesis, fixed-point DSP, and retro-style demoscene code. Its instruction set captures the operations that dominate inner loops in those workloads: fused multiply-add, saturating arithmetic, min/max/sign/abs, fast transcendental approximations, BAM-based trigonometry, fixed-shape 3D vector math, and multi-precision integer arithmetic with carry.
Xmath replaces the previously-reserved RISC-V V (vector) extension space in FireStorm. V was originally reserved for a future implementation, but the actual workload mix FireStorm targets — per-voice audio synthesis, per-vertex 3D math, per-pixel image processing with non-trivial dependencies — benefits more from fast scalar fused ops than from multi-element data parallelism. V remains a possible v0.3+ addition if a clear need emerges, but it is no longer reserved at the encoding level.
1.1 Scope
Xmath provides ~64 instructions across twelve groups:
| Group | Instructions | Purpose |
|---|---|---|
| G1 — Integer Fused MAC | MADD, MSUB, RMSUB, MADDH, MADDU, MADDW | Fixed-point math, FIR/IIR filters, integer dot products |
| G2 — Saturating Arithmetic | ADDS, SUBS, ADDSU, SUBSU, MULSAT, SAT.B, SAT.H, SAT.W | Audio mix-down, colour channel clamping |
| G3 — Min/Max/Sign/Abs | MIN, MAX, MINU, MAXU, ABS, SIGN | Clipping, bounding-box tests, branchless conditionals |
| G4 — FP Approximations | FRECIP.S/.D, FRSQRT.S/.D, FSIN.S/.D, FCOS.S/.D, FSINCOS.S/.D, FATAN2.S/.D | Reciprocals for perspective divide, normalisation, rotation, vector angle |
| G5 — BAM Trigonometry | FSINBAM.S/.D, FCOSBAM.S/.D, FSINCOSBAM.S/.D, FRAD2BAM, FBAM2RAD | Binary-angle measure trig — retro/demoscene-native, single-cycle modular reduction |
| G6 — 3D Vector Math Bundles | DOT3, DOT4, CROSS3, LENSQ3, LERP | 3D vertex math, vector normalisation, interpolation |
| G7 — Vector Componentwise Bundles | VADD3, VSUB3, VSCALE3, VMADD3, VNORM3 | Physics integration, steering, collision response, lighting normals |
| G8 — 2D Math Primitives | DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2 | Navmesh pathfinding, 2D collision, raycaster setup, top-down games |
| G9 — Game / Animation Math | CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP | UI easing, shader uniforms, procedural animation, AI parameter clamping |
| G10 — Distance Heuristics | MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2 | A* heuristics, grid-based pathfinding, broad-phase collision distance estimates |
| G11 — Quaternion Math | QMUL.S/.D, QROT.S/.D | Skeletal animation, camera orientation, character bone updates |
| G12 — Multi-Precision Integer | ADDC, SUBC, ROLC, RORC | Add-with-carry / borrow chains, 128-bit+ integer math, bignum shifts, crypto |
1.2 Mode Availability
| Group | Narrow Mode | Wide Mode |
|---|---|---|
| G1 — Integer Fused MAC | ✓ | ✓ |
| G2 — Saturating Arithmetic | ✓ | ✓ |
| G3 — Min/Max/Sign/Abs | ✓ | ✓ |
| G4 — FP Approximations | ✓ | ✓ |
| G5 — BAM Trigonometry | ✓ | ✓ |
| G6 — 3D Vector Math Bundles | ✓ (with constraints on register tuple placement) | ✓ |
| G7 — Vector Componentwise Bundles | ✓ (with constraints on tuple placement) | ✓ |
| G8 — 2D Math Primitives | ✓ (with constraints on 2-tuple placement) | ✓ |
| G9 — Game / Animation Math | ✓ | ✓ |
| G10 — Distance Heuristics | ✓ | ✓ |
| G11 — Quaternion Math | ✓ (4-tuple must fit in register file) | ✓ |
| G12 — Multi-Precision Integer | ✓ | ✓ |
Every Xmath instruction is available in both narrow and wide modes. Wide mode buys access to the extended register file (x32–x63 / f32–f63) but does not unlock any additional Xmath operations (the wide-only PRED-EN and precision-mode bits modify existing instructions rather than adding new ones). A vanilla RV64GC program recompiled for FireStorm narrow mode therefore gets the full Xmath acceleration; only the larger register file and those modifier bits are wide-mode-specific.
1.3 Detection
Xmath presence is signalled in the FireStorm-specific mxfeatures CSR (0xFC0), bit 6:
| Bit | Feature |
|---|---|
| 0 | Xcrisp |
| 1 | Xstack |
| 2 | Xcond |
| 3 | Xlate |
| 4 | Xctx |
| 5 | Xwide (always 1; mode is fetch-address driven) |
| 6 | Xmath |
Software queries mxfeatures to detect Xmath at runtime; the compiler emits +xmath at build time. The riscv64-firestorm-elf target enables Xmath by default; the vanilla riscv64-unknown-elf target does not.
The G12 multi-precision group and its xcarry CSR (§14) are part of Xmath and are present whenever mxfeatures bit 6 is set; they do not carry a separate feature bit. (If a future implementation ever ships Xmath without G12, a capability bit can be added to a dedicated mxmath CSR at that point — the detection scheme leaves room for it.)
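As a minimal sketch of the runtime check described above, the following C helper tests bit 6 of an already-read mxfeatures value. The names `mx_has_xmath` and `MXFEAT_*` are hypothetical, and the CSR read itself (a `csrr` of 0xFC0 on real hardware) is deliberately left out so the snippet compiles on any host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the mxfeatures CSR (0xFC0), copied from the table above.
 * The MXFEAT_* names are illustrative, not part of any published header. */
enum {
    MXFEAT_XCRISP = 1u << 0,
    MXFEAT_XSTACK = 1u << 1,
    MXFEAT_XCOND  = 1u << 2,
    MXFEAT_XLATE  = 1u << 3,
    MXFEAT_XCTX   = 1u << 4,
    MXFEAT_XWIDE  = 1u << 5,
    MXFEAT_XMATH  = 1u << 6
};

/* Hypothetical helper: takes a value that was already read from mxfeatures.
 * Reading the CSR needs target-specific code and is not shown here. */
static inline bool mx_has_xmath(uint64_t mxfeatures)
{
    return (mxfeatures & MXFEAT_XMATH) != 0;
}

/* Usage example with made-up register values. */
static inline void mx_detect_demo(void)
{
    assert(mx_has_xmath(0x60));   /* bits 5 and 6 set: Xwide + Xmath */
    assert(!mx_has_xmath(0x20));  /* only Xwide set */
}
```

Because G12 and xcarry share bit 6 with the rest of Xmath, the same check covers the multi-precision group.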
1.4 Composition with Other Extensions
Xmath composes cleanly with the rest of FireStorm:
- Xcrisp (memory primitives): Xmath operates on registers; Xcrisp loads vectors / matrices into registers; the two are orthogonal and chain naturally. The Xcrisp B-tree primitives (BSRCH, BSCAN, BSHIFT; see §7 of ee_xcrisp) are entirely independent of Xmath but compose well in code that combines indexed structure traversal with math.
- Xstack (BSRAM hardware stacks): Xmath has no special interaction with the stack; standard caller/callee-saved conventions apply.
- Xctx (hardware contexts): Xmath state lives entirely in standard GPRs/FPRs and is saved/restored as part of normal context switching. The G12 xcarry bit (§14.1) is the one piece of non-GPR state; it is one bit folded into the per-context flag storage Xctx already swaps.
- Xlate (memory translators): Xmath instructions are register-to-register and are not touched by translators; only the loads/stores that move operands in and out of registers are. This composes usefully with G12 multi-precision: limbs held big-endian in memory can be loaded through a byteswap read-translator so they arrive in host order (and written back through the matching write-translator, Xlate's involutory round-trip, ee_xlate §3.2). The xcarry CSR sits at 0x808, immediately after Xlate's translator-config block (0x800–0x807); it is plain CSR state accessed with standard CSR instructions, never a memory operand, so translators never apply to it. See ee_xlate §10.5.
Xcond Predication on Xmath R-Type Instructions
Wide mode only: every Xmath R-type instruction with an unused nibble bit 35 inherits Xcond's PRED-EN bit as a conditional-execution gate. When bit[35] = 1, the instruction executes conditionally based on the contents of the predicate register (mxcond_p); when bit[35] = 0, it executes unconditionally as described in the instruction's group section.
The conditional-execution semantics exactly match Xcond's R-type predication (§4 of ee_xcond): the instruction reads its operands and computes its result, but the result is written to rd only if the predicate condition holds. If the predicate is false, rd is left unchanged and any side effects (memory loads via vector bundles, flag updates) are suppressed.
The following table shows which Xmath instructions support predication in wide mode:
| Group | Instructions | Predicable? | Notes |
|---|---|---|---|
| G1 — Integer Fused MAC | MADD, MSUB, RMSUB, MADDH, MADDU, MADDW | No | R4-type uses all 4 nibble bits for register extension (rd, rs1, rs2, rs3) |
| G2 — Saturating Arithmetic | ADDS, SUBS, ADDSU, SUBSU, MULSAT, SHIFTSAT.{B,H,W} | Yes | All standard R-type; bit 35 = PRED-EN |
| G2 — Width Saturation | SAT.B, SAT.H, SAT.W | Yes | Unary; 2 spare bits, 35 = PRED-EN |
| G3 — Min/Max/Sign/Abs | MIN, MAX, MINU, MAXU, ABS, SIGN | Yes | All standard R-type |
| G4 — FP Approximations | FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2 | Yes | Unary FP; bit 34 = precision mode, bit 35 = PRED-EN |
| G5 — BAM Trigonometry | FSINBAM, FCOSBAM, FSINCOSBAM, FRAD2BAM, FBAM2RAD | Yes | Mixed integer→FP; bit 35 = PRED-EN |
| G6 — 3D Vector Bundles | DOT3, DOT4, CROSS3, LENSQ3, LERP | Yes | bit 35 = PRED-EN |
| G7 — Vector Componentwise | VADD3, VSUB3, VSCALE3, VMADD3, VNORM3 | Yes | bit 35 = PRED-EN; per-bundle gate (all 3 element writes suppressed if false) |
| G8 — 2D Math Primitives | DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2 | Yes | bit 35 = PRED-EN |
| G9 — Game / Animation Math | CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP | Yes | bit 35 = PRED-EN |
| G10 — Distance Heuristics | MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2 | Yes | bit 35 = PRED-EN |
| G11 — Quaternion Math | QMUL, QROT | Yes | bit 35 = PRED-EN; per-bundle gate |
| G12 — Multi-Precision Integer | ADDC, SUBC, ROLC, RORC | Yes | bit 35 = PRED-EN; predicate false ⇒ rd unchanged and xcarry unchanged |
Use cases for predicated Xmath:
- Conditional saturating accumulate (G2): sparse FIR filters that skip zero coefficients without branches. G1 MADD is not predicable, so the accumulate step uses a predicated ADDS:
    PCMP zero, coef[i] ; set predicate if coef != 0
    ADDS.p acc, acc, sample[i]*coef[i] ; predicated saturating add (only if nonzero)
- Conditional vector normalisation (G7): normalise only if the input has non-zero length:
    LENSQ3 len_sq, v
    PCMP fzero, len_sq ; set predicate if length != 0
    VNORM3.p normalised, v ; only normalise if non-zero
- Conditional bone update (G11): skip animation update for bones whose parent has no animation change:
    PCMP quat_zero, parent_delta
    QMUL.p world_quat, parent_quat, local_quat ; only if parent moved
- Conditional A* heuristic (G10): compute heuristic only for unvisited nodes:
    PCMP unvisited, visited_flag
    OCTILE2.p h, neighbour, goal ; skip for visited nodes
In assembly, the predicated form is written with a .p suffix (or by setting PRED-EN explicitly in the encoding). The compiler emits predicated forms automatically when it detects branch-condition patterns that match Xcond's predicate semantics.
Narrow mode: PRED-EN bit does not exist; all Xmath R-type instructions execute unconditionally. Code that requires predication must place itself in .text.wide.
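The write-gating rule above can be modelled in a few lines of C. This is only a sketch of the described semantics (result computed, then committed to rd or discarded); `xmath_pred_commit` is a hypothetical name, and the model ignores how the predicate in mxcond_p is produced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software model of the PRED-EN gate.
 * pred_en   : bit 35 of the wide-mode encoding
 * predicate : whether the condition held in the predicate register
 * old_rd    : value of rd before the instruction
 * result    : value the instruction computed
 * Returns the value rd holds after the instruction retires. */
static inline int64_t xmath_pred_commit(bool pred_en, bool predicate,
                                        int64_t old_rd, int64_t result)
{
    if (pred_en && !predicate) {
        return old_rd;   /* predicate false: rd is left unchanged */
    }
    return result;       /* unconditional form, or predicate true */
}

/* Usage example: a predicated accumulate that skips a zero coefficient. */
static inline void xmath_pred_demo(void)
{
    int64_t acc = 100;
    acc = xmath_pred_commit(true, false, acc, acc + 42);  /* skipped */
    assert(acc == 100);
    acc = xmath_pred_commit(true, true, acc, acc + 42);   /* committed */
    assert(acc == 142);
}
```

In narrow mode there is no PRED-EN bit, which corresponds to calling the model with `pred_en` fixed at false.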
2. Encoding
Xmath uses two opcodes:
| Opcode | Standard meaning | Type | Purpose in FireStorm |
|---|---|---|---|
| 0x57 | OP-V (vector), not implemented | R-type | All Xmath R-type instructions (G2–G12) |
| 0x6B | reserved (no standard claim) | R4-type | Integer fused multiply-add family (G1) |
FireStorm reallocates 0x57 because it does not implement the RISC-V Vector extension; Xmath's scalar fused operations capture most of the practical benefit V would provide for FireStorm's target workloads (games, audio synthesis, retro emulation, DSP) without V's implementation complexity. The 0x6B slot is genuinely reserved in the ISA — no standard extension claims it — and the R4 format slots in cleanly.
The standard FP opcodes are completely unchanged:
- 0x07 LOAD-FP, 0x27 STORE-FP: RV-F/D scalar floating-point loads and stores
- 0x43 FMADD, 0x47 FMSUB, 0x4B FNMSUB, 0x4F FNMADD: RV-F/D fused multiply-add (R4-type)
- 0x53 OP-FP: RV-F/D scalar FP ALU (FADD, FSUB, FMUL, FDIV, FSQRT, FMV, FCVT, FSGNJ, FCLASS, etc.)
Xmath does not touch any of these. The integer MADD at 0x6B sits alongside the FP FMA family (0x43–0x4F) at the architectural level — same R4 encoding format — but at its own opcode, so the two families decode independently with no conflict.
2.1 R4-Type Layout (0x6B)
Bit: 31 27 26 25 24 20 19 15 14 12 11 7 6 0
┌───────┬─────┬────────┬────────┬─────┬──────┬───────┐
│ rs3 │ fmt │ rs2 │ rs1 │funct3│ rd │ 0x6B │
└───────┴─────┴────────┴────────┴─────┴──────┴───────┘
| Field | Bits | Purpose |
|---|---|---|
| rs3 | [31:27] | Third source register (added/subtracted operand) |
| fmt | [26:25] | Variant selector (see §3) |
| rs2 | [24:20] | Second source register (multiplier) |
| rs1 | [19:15] | First source register (multiplicand) |
| funct3 | [14:12] | Operation selector |
| rd | [11:7] | Destination register |
| opcode | [6:0] | 0x6B |
In wide mode, the extension nibble provides bit 5 for each register field, giving 6-bit register indices.
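The field table above translates directly into shifts and masks. The sketch below packs a narrow-mode (32-bit) R4-type word; `xmath_encode_r4` is a hypothetical helper, the wide-mode extension nibble is not modelled, and the fmt/funct3 values used in the example are arbitrary placeholders rather than real instruction assignments.

```c
#include <assert.h>
#include <stdint.h>

#define XMATH_OPCODE_R4 0x6Bu  /* R4-type opcode from the table above */

/* Pack the narrow-mode R4-type fields into one 32-bit instruction word.
 * Each argument is masked to its field width before shifting. */
static inline uint32_t xmath_encode_r4(unsigned rs3, unsigned fmt,
                                       unsigned rs2, unsigned rs1,
                                       unsigned funct3, unsigned rd)
{
    return ((uint32_t)(rs3    & 0x1Fu) << 27) |  /* [31:27] */
           ((uint32_t)(fmt    & 0x03u) << 25) |  /* [26:25] */
           ((uint32_t)(rs2    & 0x1Fu) << 20) |  /* [24:20] */
           ((uint32_t)(rs1    & 0x1Fu) << 15) |  /* [19:15] */
           ((uint32_t)(funct3 & 0x07u) << 12) |  /* [14:12] */
           ((uint32_t)(rd     & 0x1Fu) <<  7) |  /* [11:7]  */
           XMATH_OPCODE_R4;                      /* [6:0]   */
}

/* Usage example: rd = x4, rs1 = x3, rs2 = x2, rs3 = x1,
 * with fmt = 0 and funct3 = 0 chosen only for illustration. */
static inline void xmath_r4_demo(void)
{
    uint32_t word = xmath_encode_r4(1, 0, 2, 3, 0, 4);
    assert(word == 0x0821826Bu);
    assert((word & 0x7Fu) == XMATH_OPCODE_R4);
}
```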
2.2 R-Type Layout (0x57)
Bit: 31 25 24 20 19 15 14 12 11 7 6 0
┌────────────┬────────┬────────┬─────┬──────┬───────┐
│ funct7 │ rs2 │ rs1 │funct3│ rd │ 0x57 │
└────────────┴────────┴────────┴─────┴──────┴───────┘
The standard R-type format with FireStorm conventions: 7-bit funct7 + 3-bit funct3 = 10 bits of operation space (1024 slots), far more than Xmath needs.
For instructions that consume an FP register, the convention follows standard RV F/D: rs1 and rs2 are FP register fields when the instruction is FP-typed, integer register fields when integer-typed. The funct7 high bits typically encode the source/destination types (00 for FP32, 01 for FP64, etc., mirroring the F/D format conventions).
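A round-trip sketch of the R-type layout follows: one function packs the fields and another unpacks them. The helper names and the struct are hypothetical, only the narrow-mode 32-bit word is modelled, and the funct7/funct3 values in the example are arbitrary rather than real Xmath assignments.

```c
#include <assert.h>
#include <stdint.h>

#define XMATH_OPCODE_R 0x57u  /* R-type opcode reused from OP-V */

typedef struct {
    unsigned funct7, rs2, rs1, funct3, rd, opcode;
} xmath_rtype;

/* Pack the R-type fields into a 32-bit instruction word. */
static inline uint32_t xmath_encode_r(unsigned funct7, unsigned rs2,
                                      unsigned rs1, unsigned funct3,
                                      unsigned rd)
{
    return ((uint32_t)(funct7 & 0x7Fu) << 25) |  /* [31:25] */
           ((uint32_t)(rs2    & 0x1Fu) << 20) |  /* [24:20] */
           ((uint32_t)(rs1    & 0x1Fu) << 15) |  /* [19:15] */
           ((uint32_t)(funct3 & 0x07u) << 12) |  /* [14:12] */
           ((uint32_t)(rd     & 0x1Fu) <<  7) |  /* [11:7]  */
           XMATH_OPCODE_R;                       /* [6:0]   */
}

/* Unpack a 32-bit instruction word into its R-type fields. */
static inline xmath_rtype xmath_decode_r(uint32_t w)
{
    xmath_rtype f;
    f.funct7 = (w >> 25) & 0x7Fu;
    f.rs2    = (w >> 20) & 0x1Fu;
    f.rs1    = (w >> 15) & 0x1Fu;
    f.funct3 = (w >> 12) & 0x07u;
    f.rd     = (w >>  7) & 0x1Fu;
    f.opcode =  w        & 0x7Fu;
    return f;
}

/* Usage example: encode, then confirm the fields survive a round trip. */
static inline void xmath_r_demo(void)
{
    xmath_rtype f = xmath_decode_r(xmath_encode_r(0x15, 7, 9, 5, 11));
    assert(f.funct7 == 0x15 && f.rs2 == 7 && f.rs1 == 9);
    assert(f.funct3 == 5 && f.rd == 11 && f.opcode == XMATH_OPCODE_R);
}
```

The 10 bits of funct7 + funct3 are what give the 1024 operation slots mentioned above.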
2.3 Encoding Allocation Summary
The 0x57 opcode has 7-bit funct7 + 3-bit funct3 = 1024 instruction slots. The funct3 field selects the operation family; funct7 distinguishes specific operations and (for FP) the format (.S/.D via funct7's low bits per RV F/D conventions).
| funct3 in 0x57 | Group | funct7 sub-allocation |
|---|---|---|
| 000 | G2 Saturating + G12 Multi-Precision | funct7[6:0] selects ADDS/SUBS/ADDSU/SUBSU/MULSAT (G2) and ADDC/SUBC/ROLC/RORC (G12); G12 occupies free funct7 codes in this lane |
| 001 | G3 Min/Max + G10 Heuristics | funct7[6:5]=00 → MIN/MAX/MINU/MAXU; funct7[6:5]=01 → MANHATTAN/CHEBYSHEV/OCTILE |
| 010 | G3 Abs/Sign + G2 SAT.x | funct7 selects ABS/SIGN/SAT.B/SAT.H/SAT.W |
| 011 | G4 FP unary + G9 Game Math | funct7 high bits select group (G4 or G9), low bits select operation and .S/.D |
| 100 | G4 FP binary + G9 Game Math binary | funct7 selects FSINCOS/FATAN2/CLAMP/SMOOTHSTEP/etc. |
| 101 | G5 BAM | funct7 selects FSINBAM/FCOSBAM/etc., .S/.D |
| 110 | G6+G7+G8+G11, single result | funct7 selects which bundle (DOT3, DOT4, DOT2, LENSQ2/3, LERP, VNORM2/3, QMUL); .S/.D via funct7 low bits |
| 111 | G6+G7+G8+G11, multi result | funct7 selects CROSS3, CROSS2, VADD/SUB/SCALE/MADD (3D and 2D), QROT |
| fmt in 0x6B (R4-type) | Group | Variant |
|---|---|---|
| 00 | G1 | MADD / MSUB (signed, low 64 bits, selected by funct3) |
| 01 | G1 | MADDH / MADDU (signed/unsigned high 64 bits, selected by funct3) |
| 10 | G1 | MADDW / MSUBW (32-bit signed, sign-extended to 64) |
| 11 | reserved | future expansion |
The total used encoding space is approximately 64 instructions out of the available 1024 slots in 0x57 plus 4 slots used in 0x6B — substantial headroom for future Xmath additions. The G12 multi-precision ops (ADDC, SUBC, ROLC, RORC) are integer R-type and take nominal funct7 codes in the funct3 = 000 lane alongside the saturating-arithmetic family; the carry bit they read and write is architectural state (the xcarry CSR, §14), not an encoding field, so it costs nothing in opcode space.
3. Group 1: Integer Fused Multiply-Add
All G1 instructions use the R4-type encoding (opcode 0x6B), parallel in format to the FP FMADD family (which lives at 0x43/0x47/0x4B/0x4F). Latency: 2–3 cycles (DSP-backed). Throughput: 1 result per cycle (fully pipelined).
3.1 MADD — Multiply-Add
MADD rd, rs1, rs2, rs3
rd = (rs1 × rs2)[63:0] + rs3
Computes the low 64 bits of rs1 × rs2, adds rs3. The standard form for fixed-point DSP inner loops:
// FIR filter inner loop
acc = acc + sample[i] * coef[i]; // → MADD acc, sample[i], coef[i], acc
Two instructions become one. On FireStorm, MADD has the same latency as MUL alone: the DSP block's accumulator absorbs the add at no extra cost. In steady state this roughly halves the per-tap cost of an integer FIR inner loop (see §3.7).
3.2 MSUB — Multiply-Subtract
MSUB rd, rs1, rs2, rs3
rd = (rs1 × rs2)[63:0] - rs3
3.3 RMSUB — Reverse Multiply-Subtract
RMSUB rd, rs1, rs2, rs3
rd = rs3 - (rs1 × rs2)[63:0]
Useful for accumulator decrements and for computing c - a*b (common in error-correction codes).
3.4 MADDH — Multiply-High Add
MADDH rd, rs1, rs2, rs3
rd = (rs1 × rs2)[127:64] + rs3
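The high 64 bits of the signed 128-bit product, plus an add. This is the Q1.63 fixed-point multiply-add primitive: multiply two Q1.63 values, take the high half, and add it to the accumulator. Strictly, the high half of a Q1.63 × Q1.63 product is in Q2.62 scaling, so either the accumulator uses that scaling or the result needs a one-bit left shift to return to Q1.63. Standard pattern for audio gain stages, normalised fixed-point integration, fractional rotation.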
3.5 MADDU — Multiply-High Add (Unsigned)
MADDU rd, rs1, rs2, rs3
rd = ((u64)rs1 × (u64)rs2)[127:64] + rs3
Unsigned variant of MADDH for hash computation, modular arithmetic, big-integer multiplies.
3.6 MADDW — 32-bit Multiply-Add (Sign-Extended)
MADDW rd, rs1, rs2, rs3
rd = sext64((rs1[31:0] × rs2[31:0])[31:0] + rs3[31:0])
The W (32-bit operand) variant. Result is sign-extended to 64 bits per standard RV64 conventions. Useful when the application is genuinely 32-bit fixed-point.
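The G1 formulas above can be cross-checked against a small C reference model. This is a sketch of the stated arithmetic only, not of the hardware: the `ref_*` names are hypothetical, the 128-bit product relies on the `__int128` extension available in GCC and Clang, and the signed results rely on the usual two's-complement conversion those compilers implement.

```c
#include <assert.h>
#include <stdint.h>

/* __int128 is a GCC/Clang extension, used here for the full product. */
typedef __int128 s128;
typedef unsigned __int128 u128;

/* MADD: low 64 bits of the product plus rs3 (wrapping arithmetic). */
static inline int64_t ref_madd(int64_t a, int64_t b, int64_t c)
{
    return (int64_t)((uint64_t)a * (uint64_t)b + (uint64_t)c);
}

/* MSUB: low 64 bits of the product minus rs3. */
static inline int64_t ref_msub(int64_t a, int64_t b, int64_t c)
{
    return (int64_t)((uint64_t)a * (uint64_t)b - (uint64_t)c);
}

/* RMSUB: rs3 minus the low 64 bits of the product. */
static inline int64_t ref_rmsub(int64_t a, int64_t b, int64_t c)
{
    return (int64_t)((uint64_t)c - (uint64_t)a * (uint64_t)b);
}

/* MADDH: bits [127:64] of the signed product, plus rs3. */
static inline int64_t ref_maddh(int64_t a, int64_t b, int64_t c)
{
    uint64_t hi = (uint64_t)((u128)((s128)a * (s128)b) >> 64);
    return (int64_t)(hi + (uint64_t)c);
}

/* MADDU: bits [127:64] of the unsigned product, plus rs3. */
static inline uint64_t ref_maddu(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)(((u128)a * (u128)b) >> 64) + c;
}

/* MADDW: 32-bit multiply-add, result sign-extended to 64 bits. */
static inline int64_t ref_maddw(int64_t a, int64_t b, int64_t c)
{
    uint32_t lo = (uint32_t)a * (uint32_t)b + (uint32_t)c;
    return (int64_t)(int32_t)lo;
}

/* Usage example: one FIR tap, acc += sample * coef. */
static inline void ref_g1_demo(void)
{
    int64_t acc = 5;
    acc = ref_madd(3, 4, acc);
    assert(acc == 17);
}
```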
3.7 Performance
On Ant64 (DSP-backed MUL):
| Operation pattern | Standard RV64 | Xmath G1 | Speedup |
|---|---|---|---|
| FIR filter inner loop (per tap) | 4 cycles (MUL + ADD + load + bounds) | 2 cycles (load + MADD) | 2× |
| Fixed-point Q31 dot product | 3 cycles (MUL + ADD + shift) | 1 cycle (MADDH) | 3× |
| Karatsuba multiply outer | 8 cycles per chunk | 5 cycles | 1.6× |
| 4-tap polynomial evaluation | 8 cycles | 4 cycles | 2× |
4. Group 2: Saturating Arithmetic
Saturating operations clamp results to the maximum or minimum representable value of the type on overflow, rather than wrapping. Critical for audio mixing: sums that exceed the int16 range must clamp to -32768/+32767 rather than wrap, because wrapping flips the sign and produces audible distortion.
All G2 instructions are R-type at opcode 0x57, funct3 000 (for add/sub/mul) and 010 (for SAT.x). 1-cycle latency. 1/cycle throughput.
4.1 ADDS — Signed Saturating Add
ADDS rd, rs1, rs2
rd = clamp(rs1 + rs2, INT64_MIN, INT64_MAX)
If rs1 + rs2 overflows the 64-bit signed range, the result is clamped to INT64_MAX (on positive overflow) or INT64_MIN (on negative overflow). The standard 2's-complement wrap behaviour is replaced with explicit saturation.
4.2 SUBS — Signed Saturating Subtract
SUBS rd, rs1, rs2
rd = clamp(rs1 - rs2, INT64_MIN, INT64_MAX)
4.3 ADDSU — Unsigned Saturating Add
ADDSU rd, rs1, rs2
rd = min((u64)rs1 + (u64)rs2, UINT64_MAX)
Caps at 0xFFFF_FFFF_FFFF_FFFF on overflow (the carry-out is converted to "stay at max").
4.4 SUBSU — Unsigned Saturating Subtract
SUBSU rd, rs1, rs2
rd = max((s128)(u64)rs1 - (s128)(u64)rs2, 0)
Caps at 0 on underflow (clamps negative results to zero).
4.5 MULSAT — Signed Saturating Multiply
MULSAT rd, rs1, rs2
rd = clamp(rs1 × rs2, INT64_MIN, INT64_MAX)
The full 128-bit product is computed; if it doesn't fit in 64 bits signed, the result clamps. Useful for fixed-point gain stages where overflow protection matters more than precision.
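The five add/sub/mul definitions above can be written out as a C reference model, which is handy for checking edge cases by hand. This is a sketch of the stated clamping rules, not of the hardware datapath; the `ref_*` names are hypothetical, and MULSAT uses the `__int128` extension from GCC/Clang for the full product.

```c
#include <assert.h>
#include <stdint.h>

/* ADDS: signed add, clamped to the int64 range. */
static inline int64_t ref_adds(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;  /* positive overflow */
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;  /* negative overflow */
    return a + b;
}

/* SUBS: signed subtract, clamped to the int64 range. */
static inline int64_t ref_subs(int64_t a, int64_t b)
{
    if (b < 0 && a > INT64_MAX + b) return INT64_MAX;
    if (b > 0 && a < INT64_MIN + b) return INT64_MIN;
    return a - b;
}

/* ADDSU: unsigned add, capped at UINT64_MAX on carry-out. */
static inline uint64_t ref_addsu(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return (s < a) ? UINT64_MAX : s;
}

/* SUBSU: unsigned subtract, capped at 0 on borrow. */
static inline uint64_t ref_subsu(uint64_t a, uint64_t b)
{
    return (a > b) ? a - b : 0;
}

/* MULSAT: full 128-bit signed product, clamped to the int64 range. */
static inline int64_t ref_mulsat(int64_t a, int64_t b)
{
    __int128 p = (__int128)a * (__int128)b;
    if (p > INT64_MAX) return INT64_MAX;
    if (p < INT64_MIN) return INT64_MIN;
    return (int64_t)p;
}

/* Usage example: overflow sticks at the limit instead of wrapping. */
static inline void ref_g2_demo(void)
{
    assert(ref_adds(INT64_MAX, 1) == INT64_MAX);
    assert(ref_subsu(1, 2) == 0);
}
```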
4.6 SAT.B — Saturate to Int8
SAT.B rd, rs1
rd = clamp(rs1, -128, +127)
Single-instruction clamp to signed 8-bit range. Used for colour channel output (clamp computed pixel value to byte range) and audio downsampling (clamp 16-bit sample to 8-bit for low-bit-depth output).
4.7 SAT.H — Saturate to Int16
SAT.H rd, rs1
rd = clamp(rs1, -32768, +32767)
The audio-output workhorse. Mix N voices into a 32-bit accumulator, then SAT.H to clamp to the 16-bit DAC range:
// Audio mixer output stage
int32 mix = 0;
for (int v = 0; v < 128; v++) mix += voices[v].sample;
int16 output = SAT.H(mix); // → 1 instruction, no branch
Standard RV64 alternative: 3-4 instructions with branches or a 5-instruction branchless if-then-else sequence. 5× speedup on the output stage of a 128-voice mixer.
4.8 SAT.W — Saturate to Int32
SAT.W rd, rs1
rd = clamp(rs1, INT32_MIN, INT32_MAX)
For applications that genuinely use 32-bit fixed-point output (less common than .H but still useful).
4.9 SHIFTSAT.B / SHIFTSAT.H / SHIFTSAT.W — Combined Arithmetic Shift + Saturate
SHIFTSAT.B rd, rs1, #imm5 # rd = clamp(rs1 >> imm5, -128, +127)
SHIFTSAT.H rd, rs1, #imm5 # rd = clamp(rs1 >> imm5, -32768, +32767)
SHIFTSAT.W rd, rs1, #imm5 # rd = clamp(rs1 >> imm5, INT32_MIN, INT32_MAX)
Performs an arithmetic right shift by a 5-bit immediate amount (0–31 bits), then saturates the result to the target width. Combines the two most common operations in fixed-point output stages into a single instruction.
The immediate field reuses the rs2 register slot of standard R-type encoding (otherwise unused for these unary operations); in wide mode the 5-bit rs2 field is available directly.
SHIFTSAT.X: opcode = 0x57, funct3 = 010, funct7[6:3] = 1010, funct7[2:1] = width selector,
rs2 field = shift amount (5-bit)
Use case: the Q15 → int16 conversion universal in audio output:
// Mix 128 voices into a Q15.16 fixed-point accumulator (int32)
// Then convert to int16 sample for DAC output
int32_t mix = ...;
int16_t sample = SHIFTSAT.H(mix, 15); // >>15 then saturate to [-32768, +32767]
Without SHIFTSAT: 3 instructions (SRA + SAT.H + handle edge case where shift overflowed). With SHIFTSAT: 1 instruction, 1 cycle.
For colour-channel processing in 8-bit-output rendering pipelines, SHIFTSAT.B combines the >>8 and clamp-to-byte step:
uint8_t r = SHIFTSAT.B(red_acc, 8); // mix accumulated in higher precision, output as 8-bit
This is the universal "downsample to N bits with saturation" primitive that appears in every audio output stage and every fixed-point shader output.
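For readers who want to check the SAT.x and SHIFTSAT.x definitions on a host machine, here is a minimal C model. The `ref_*` names are hypothetical. The model assumes `>>` on a negative signed value is an arithmetic shift, which C leaves implementation-defined but mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

static inline int64_t ref_clamp(int64_t x, int64_t lo, int64_t hi)
{
    return (x < lo) ? lo : (x > hi) ? hi : x;
}

/* SAT.B / SAT.H / SAT.W: clamp to the signed 8/16/32-bit range. */
static inline int64_t ref_sat_b(int64_t x) { return ref_clamp(x, INT8_MIN,  INT8_MAX);  }
static inline int64_t ref_sat_h(int64_t x) { return ref_clamp(x, INT16_MIN, INT16_MAX); }
static inline int64_t ref_sat_w(int64_t x) { return ref_clamp(x, INT32_MIN, INT32_MAX); }

/* SHIFTSAT.x: arithmetic right shift by a 5-bit immediate, then clamp.
 * Assumes the compiler implements >> on negative values arithmetically. */
static inline int64_t ref_shiftsat_b(int64_t x, unsigned imm5) { return ref_sat_b(x >> (imm5 & 31u)); }
static inline int64_t ref_shiftsat_h(int64_t x, unsigned imm5) { return ref_sat_h(x >> (imm5 & 31u)); }
static inline int64_t ref_shiftsat_w(int64_t x, unsigned imm5) { return ref_sat_w(x >> (imm5 & 31u)); }

/* Usage example: sum a few voice samples, then shift and saturate the
 * mix down to a 16-bit output sample. The sample values are made up. */
static inline int16_t ref_mix_demo(void)
{
    const int64_t voices[3] = { 30000LL * 32768, 20000LL * 32768, 1000LL * 32768 };
    int64_t mix = 0;
    for (int v = 0; v < 3; v++) mix += voices[v];
    return (int16_t)ref_shiftsat_h(mix, 15);  /* 51000 clamps to 32767 */
}
```

The cast to `int16_t` in the demo is safe because the model has already clamped the value into range.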
4.10 Performance Impact
For a 128-voice 48 kHz audio mixer:
- Per-sample cost without SAT.H: ~10 cycles for clamp-and-write
- Per-sample cost with SAT.H: ~2 cycles
- At 48 kHz: ~50% reduction in mixer output overhead
For colour-channel processing (e.g., real-time per-pixel shading):
- Per-channel cost without SAT.B: ~6 cycles
- Per-channel cost with SAT.B: ~1 cycle
- For RGBA processing at 1080p60: ~7% reduction in shading inner loop
5. Group 3: Min / Max / Sign / Abs
Single-cycle scalar conditional/sign operations. R-type at opcode 0x57. These overlap conceptually with the Zbb extension's MIN/MAX/MINU/MAXU; Xmath provides them whether or not Zbb is implemented, plus ABS and SIGN which Zbb does not include.
5.1 MIN / MAX (Signed)
MIN rd, rs1, rs2 # rd = (rs1 < rs2) ? rs1 : rs2 (signed)
MAX rd, rs1, rs2 # rd = (rs1 > rs2) ? rs1 : rs2 (signed)
5.2 MINU / MAXU (Unsigned)
MINU rd, rs1, rs2 # rd = (rs1 < rs2) ? rs1 : rs2 (unsigned)
MAXU rd, rs1, rs2 # rd = (rs1 > rs2) ? rs1 : rs2 (unsigned)
5.3 ABS — Absolute Value
ABS rd, rs1
rd = (rs1 < 0) ? -rs1 : rs1 # signed
Single instruction; standard RV64 needs 3 (compare + branch + negate, or the branchless SRAI/XOR/SUB sequence). Used in distance calculations, Manhattan-distance heuristics, and signal magnitude.
5.4 SIGN — Signum
SIGN rd, rs1
rd = (rs1 > 0) ? +1 : (rs1 < 0) ? -1 : 0
Single instruction; standard RV64 needs 4-5. Useful in branchless conditionals, gradient-direction code, dithering. Combined with MUL gives branchless select-on-sign patterns.
5.5 Bounding-Box Test Example
// Standard RV64:
// if (x < bbox_min) clamped_x = bbox_min;
// else if (x > bbox_max) clamped_x = bbox_max;
// else clamped_x = x;
// ... 5+ instructions with branches
// With Xmath:
clamped_x = MIN(MAX(x, bbox_min), bbox_max); // 2 cycles, no branches
Branchless = no branch-predictor pollution. For inner-loop bounding-box tests on thousands of objects per frame, this is meaningful.
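The G3 definitions map onto short C expressions, sketched below as a host-side reference. The `ref_*` names are hypothetical. The section does not say what ABS returns for INT64_MIN, so the model's wrap-to-itself behaviour there is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* MIN / MAX (signed) and MINU / MAXU (unsigned). */
static inline int64_t  ref_min(int64_t a, int64_t b)    { return (a < b) ? a : b; }
static inline int64_t  ref_max(int64_t a, int64_t b)    { return (a > b) ? a : b; }
static inline uint64_t ref_minu(uint64_t a, uint64_t b) { return (a < b) ? a : b; }
static inline uint64_t ref_maxu(uint64_t a, uint64_t b) { return (a > b) ? a : b; }

/* ABS: negate through unsigned arithmetic to avoid signed overflow.
 * Assumption: INT64_MIN maps to itself (not specified in the text). */
static inline int64_t ref_abs(int64_t x)
{
    uint64_t u = (uint64_t)x;
    return (int64_t)((x < 0) ? (0u - u) : u);
}

/* SIGN: +1, 0 or -1, computed without a branch. */
static inline int64_t ref_sign(int64_t x)
{
    return (int64_t)(x > 0) - (int64_t)(x < 0);
}

/* The bounding-box clamp from the example above: MIN(MAX(x, lo), hi). */
static inline int64_t ref_bbox_clamp(int64_t x, int64_t lo, int64_t hi)
{
    return ref_min(ref_max(x, lo), hi);
}

/* Usage example: clamp three coordinates into the box [0, 10]. */
static inline void ref_g3_demo(void)
{
    assert(ref_bbox_clamp(15, 0, 10) == 10);
    assert(ref_bbox_clamp(-3, 0, 10) == 0);
    assert(ref_bbox_clamp(4, 0, 10) == 4);
}
```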
6. Group 4: FP Approximations
Approximations for the transcendentals that game code uses frequently. These are not IEEE-754-correct; they trade precision for speed. Accuracy targets are inspired by SSE RCPSS / RSQRTSS: 0.05–0.1% relative error, which is far more than adequate for games.
All G4 instructions are R-type at opcode 0x57, funct3 011 (unary) or 100 (binary). Implementation uses table lookup + polynomial refinement (typically Newton-Raphson 1 iteration). Latency: 3 cycles. Throughput: 1/cycle.
6.1 FRECIP.S / FRECIP.D — Reciprocal
FRECIP.S fd, fs1 # fd ≈ 1 / fs1 (FP32, ~0.05% error)
FRECIP.D fd, fs1 # fd ≈ 1 / fs1 (FP64, ~0.05% error)
The perspective-divide workhorse for 3D graphics. Standard FP division is ~10–20 cycles; FRECIP is 3 cycles.
// Perspective divide
float inv_w = FRECIP.S(w); // 3 cycles
float xs = x * inv_w; // 1 cycle (MUL)
float ys = y * inv_w;
float zs = z * inv_w;
// Total: 6 cycles per vertex vs ~15-20 for FDIV-based version
If extra precision is needed, one Newton-Raphson iteration brings error to ~10⁻⁹ at 5 extra cycles; usually not needed for visual rendering.
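The Newton-Raphson step for a reciprocal is x1 = x0 × (2 − a × x0), and each step roughly squares the relative error of the estimate it starts from. The sketch below shows that behaviour in plain C. The starting estimate is a stand-in (the exact reciprocal perturbed by the 5×10⁻⁴ bound quoted for FRECIP), not the output of the hardware table, and the helper names are hypothetical.

```c
#include <assert.h>

/* One Newton-Raphson refinement of x0 as an estimate of 1/a. */
static inline double nr_recip_step(double a, double x0)
{
    return x0 * (2.0 - a * x0);
}

/* Absolute value of the relative error of approx against exact. */
static inline double rel_err(double approx, double exact)
{
    double e = (approx - exact) / exact;
    return (e < 0.0) ? -e : e;
}

/* Usage example: refine a deliberately perturbed estimate of 1/3. */
static inline void nr_recip_demo(void)
{
    const double a = 3.0;
    const double exact = 1.0 / a;
    const double x0 = exact * (1.0 + 5e-4);   /* stand-in for the estimate */
    const double x1 = nr_recip_step(a, x0);   /* one refinement step */

    assert(rel_err(x0, exact) > 4e-4);        /* starts near 5e-4 */
    assert(rel_err(x1, exact) < 1e-6);        /* error is roughly squared */
}
```

The step costs two multiplies and a subtract (or two fused multiply-adds), which is where the extra latency of a refined result comes from.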
6.2 FRSQRT.S / FRSQRT.D — Reciprocal Square Root
FRSQRT.S fd, fs1 # fd ≈ 1 / √fs1 (FP32, ~0.1% error)
FRSQRT.D fd, fs1 # fd ≈ 1 / √fs1 (FP64, ~0.1% error)
The vector-normalisation workhorse. Famously hand-optimised in Quake III (the Q_rsqrt integer hack); FireStorm gives you a hardware instruction for the same operation.
// Vector normalisation
float len_sq = x*x + y*y + z*z; // 3 cycles (with FMA)
float inv_len = FRSQRT.S(len_sq); // 3 cycles
nx = x * inv_len; ny = y * inv_len; nz = z * inv_len; // 3 cycles
// Total: 9 cycles vs ~30 for software fsqrt + fdiv
~3× speedup on lighting/normalisation inner loops — large on Phong-style or normal-mapped renderers.
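For comparison, the software technique mentioned above (the Quake III style integer estimate followed by one Newton-Raphson step) looks like this in portable C. It is a stand-in for experimenting on a host machine, not a description of how FRSQRT.S is implemented; `q_rsqrt` and `normalise3` are illustrative names, and the bit-level copy uses `memcpy` to stay within defined behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, normal x: magic-constant estimate
 * plus one Newton-Raphson step. Relative error stays below about 0.2%. */
static inline float q_rsqrt(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);      /* initial estimate */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);     /* one refinement step */
    return y;
}

/* Normalise a 3-vector in place using the approximate rsqrt. */
static inline void normalise3(float v[3])
{
    float len_sq = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    float inv_len = q_rsqrt(len_sq);
    v[0] *= inv_len;
    v[1] *= inv_len;
    v[2] *= inv_len;
}

/* Usage example: (3, 0, 4) has length 5, so it normalises to
 * approximately (0.6, 0, 0.8). */
static inline void normalise3_demo(void)
{
    float v[3] = { 3.0f, 0.0f, 4.0f };
    normalise3(v);
    assert(v[0] > 0.598f && v[0] < 0.602f);
    assert(v[1] == 0.0f);
    assert(v[2] > 0.797f && v[2] < 0.803f);
}
```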
6.3 FSIN.S / FSIN.D — Sine
FSIN.S fd, fs1 # fd ≈ sin(fs1) (input in radians, ~0.01% error)
FSIN.D fd, fs1 # fd ≈ sin(fs1) (FP64, ~0.01% error)
Range reduction is performed in hardware. Input may be any FP value; the implementation reduces modulo 2π using a high-precision constant. Output accuracy degrades for very large inputs (above ~10⁶) due to range-reduction precision loss; for game-typical angle values, accuracy is uniformly within 0.01%.
6.4 FCOS.S / FCOS.D — Cosine
FCOS.S fd, fs1 # fd ≈ cos(fs1)
FCOS.D fd, fs1 # fd ≈ cos(fs1)
6.5 FSINCOS.S / FSINCOS.D — Sine + Cosine (Pair)
FSINCOS.S fd, fs1
⇒ fd ← sin(fs1)
f(d+1) ← cos(fs1)
Returns both sine and cosine in one instruction. The destination is fd and fd+1 (the next FP register). For wide mode, fd and fd+1 may be any consecutive pair in f0–f63; in narrow mode, the pair must lie in f0–f30 (since f31 has no successor).
The single-instruction form is faster than two separate FSIN + FCOS instructions (~3 cycles total instead of 6) because the angle reduction and table lookup are shared. Essential for rotation-matrix construction:
// 2D rotation matrix
FSINCOS.S f1, theta; // f1 = sin(theta), f2 = cos(theta) in 3 cycles
// matrix is then [[f2, -f1], [f1, f2]]
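Once the sine/cosine pair is available, applying the rotation matrix [[cos, -sin], [sin, cos]] is two multiply-add pairs. The sketch below takes the pair as plain arguments so it runs on any host; on FireStorm they would come from the fd / fd+1 registers written by FSINCOS.S. The `vec2` type and `rotate2d` name are hypothetical.

```c
#include <assert.h>

typedef struct { float x, y; } vec2;

/* Rotate v counter-clockwise by the angle whose sine is s and cosine is c:
 *   x' = c*x - s*y
 *   y' = s*x + c*y */
static inline vec2 rotate2d(float s, float c, vec2 v)
{
    vec2 r;
    r.x = c * v.x - s * v.y;
    r.y = s * v.x + c * v.y;
    return r;
}

/* Usage example: a 90 degree rotation (sin = 1, cos = 0) maps
 * the unit x axis onto the unit y axis. */
static inline void rotate2d_demo(void)
{
    vec2 v = { 1.0f, 0.0f };
    vec2 r = rotate2d(1.0f, 0.0f, v);
    assert(r.x == 0.0f && r.y == 1.0f);
}
```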
6.6 FATAN2.S / FATAN2.D — Arc-tangent with Quadrant
FATAN2.S fd, fs1, fs2 # fd ≈ atan2(fs1, fs2)
FATAN2.D fd, fs1, fs2 # fd ≈ atan2(fs1, fs2)
Full-circle arc-tangent (returns angle in [-π, +π]) given y (fs1) and x (fs2). Quadrant determined by sign of both inputs. Used for vector-to-angle conversion, target tracking, joystick deadzone calculation. 4 cycles (one extra cycle for quadrant resolution).
6.7 Precision Mode (Wide-Mode Bit Extension)
Wide mode only: in wide-mode encoding, the G4 unary FP approximation instructions (FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2) have two spare nibble bits after register-extension uses bit 32 (rd) and bit 33 (rs1). Bit 35 is consumed by Xcond's PRED-EN (see §6.9 below); bit 34 is repurposed as a precision mode bit:
| bit[34] | Mode | Latency | Accuracy |
|---|---|---|---|
| 0 | Approximate (default) | 3 cycles | ~0.05–0.1% relative error |
| 1 | Refined (one Newton-Raphson iteration) | 6–7 cycles | ~10⁻⁹ relative error |
The refined mode performs a single Newton-Raphson refinement step on the approximation result, dramatically improving accuracy at the cost of roughly 2× latency. Useful when:
- Application has higher precision requirements (e.g., physics simulation needing energy conservation)
- Range-reduction precision matters (e.g., FSIN of accumulated phases over thousands of iterations)
- The approximation result will be cascaded into many further FMA operations where error accumulates
The selection is per-instruction, not per-context — code can mix approximate and refined uses freely. Standard precision-sensitive libraries can default to refined; game inner loops default to approximate.
In assembly, the refined form is written with a .R (refined) suffix on the mnemonic:
FRECIP.S fd, fs1 # approximate, 3 cycles, ~0.05% error
FRECIP.S.R fd, fs1 # refined, 6 cycles, ~10⁻⁹ error
FRSQRT.S fd, fs1 # approximate, 3 cycles
FRSQRT.S.R fd, fs1 # refined, 6 cycles
Narrow-mode behaviour: the precision bit does not exist in narrow encoding. Narrow mode always uses the approximate form. Code requiring refined precision must place itself in .text.wide, or fall back to the standard FP library sin / cos / sqrt and ordinary 1.0 / x division (which are IEEE-754 correct, but ~50–100 cycles each).
6.8 Composability with Scoreboarding
Whether approximate or refined, FP approximations issue to the FPU and release the main pipeline via scoreboarding (see §15.1 of ee_cpu). The 3- or 6-cycle latency is hidden behind any subsequent independent work. For typical game inner loops:
FRSQRT.S inv_len, lensq ; 3 cycles, marked pending
VSCALE3.S normalised, v, inv_len ; would stall waiting for inv_len
...
A peephole optimisation would schedule independent work between the FRSQRT and its consumer, hiding the entire latency. The compiler is expected to do this aggressively for the refined variants where the latency cost is higher.
6.9 Xcond Predication
All Xmath instructions with an unused nibble bit 35 inherit Xcond predication automatically. See §1.4 for the full table of which Xmath groups support predication and example use cases.
6.10 Precision Notes
| Function | Max relative error | Compares to |
|---|---|---|
| FRECIP | 5×10⁻⁴ | SSE RCPSS/RCPPS estimate: 1.5×2⁻¹² ≈ 3.7×10⁻⁴; ARM FRECPE: ~3×10⁻³ |
| FRSQRT | 10⁻³ | SSE RSQRTSS estimate: 1.5×2⁻¹² ≈ 3.7×10⁻⁴; ARM FRSQRTE: ~3×10⁻³ |
| FSIN, FCOS | 10⁻⁴ | x87 FSIN: 10⁻¹⁸; library sin: 10⁻¹⁵; SSE: not directly provided |
| FATAN2 | 10⁻³ | library atan2: 10⁻¹⁴ |
For game and audio rendering at typical 8-bit / 10-bit precision pipelines, all of these are massively more accurate than needed. For applications that need full IEEE-754 precision (physics simulation backbones, scientific computation), the standard library sin / cos / sqrt remain available — they just take ~50–100 cycles instead of 3.
7. Group 5: BAM (Binary Angle Measure) Trigonometry
Binary Angle Measure represents angles as integers where the full circle (0–2π) maps to the integer range (0 to 2^N − 1). This makes range reduction free — integer arithmetic naturally wraps modulo 2^N — and the lookup table is indexed directly by the BAM value.
BAM is the retro / demoscene / fixed-point native way of representing angles. It avoids the precision-loss problem of FP angle accumulation (where adding a small angle increment over many frames eventually drifts), gives perfect wraparound, and makes the sin/cos lookup a single table read.
7.1 BAM Representation
FireStorm BAM functions accept BAM as an integer in rs1. The format is 32-bit unsigned, mapping 0x00000000 to 0 radians and 0xFFFFFFFF to one step short of 2π. Resolution: 2π / 2^32 ≈ 1.46×10⁻⁹ radians, far finer than FP32 can resolve across most of the circle (FP32 spacing near 2π is about 4.8×10⁻⁷ radians), with perfect modular wraparound.
A 16-bit BAM is supported via the low 16 bits of rs1 (the upper 16 bits are ignored). Resolution: 2π / 2^16 ≈ 0.0055° — finer than human vision can resolve, perfect for game rotation animations.
7.2 FSINBAM.S / FSINBAM.D — Sine of BAM
FSINBAM.S fd, rs1 # fd = sin(2π × (rs1[31:0] / 2^32))
FSINBAM.D fd, rs1
The integer-to-FP boundary crossing is handled in hardware. rs1 is an integer register holding the BAM value; fd is an FP register receiving the sine result. Latency: 2 cycles (faster than FSIN.S because no FP range reduction needed). Throughput: 1/cycle.
7.3 FCOSBAM.S / FCOSBAM.D — Cosine of BAM
FCOSBAM.S fd, rs1 # fd = cos(2π × (rs1[31:0] / 2^32))
FCOSBAM.D fd, rs1
7.4 FSINCOSBAM.S / FSINCOSBAM.D — Sin + Cos Pair of BAM
FSINCOSBAM.S fd, rs1
⇒ fd ← sin(BAM)
f(d+1) ← cos(BAM)
The rotation-matrix-from-BAM workhorse. Combined with the rotation-matrix structure, this gives a complete 2D rotation in 3 cycles.
7.5 FRAD2BAM and FBAM2RAD — Conversions
FRAD2BAM rd, fs1 # rd = round(fs1 × (2^32 / 2π)) (FP → BAM int)
FBAM2RAD fd, rs1 # fd = rs1 × (2π / 2^32) (BAM int → FP)
For interoperation with FP-radian-using code (e.g., math libraries that want radians). Both 2 cycles, 1/cycle throughput.
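The two conversion formulas, plus the arithmetic for turning a rotation speed into a per-frame BAM step, can be sketched in C as follows. All function names are hypothetical, the radian input is assumed to lie in [0, 2π), and rounding is done by adding 0.5 and truncating, which may differ from the rounding mode the hardware uses.

```c
#include <assert.h>
#include <stdint.h>

#define BAM_PI   3.14159265358979323846
#define BAM_FULL 4294967296.0   /* 2^32 steps per full circle */

/* FRAD2BAM-style conversion; assumes 0 <= rad < 2*pi. */
static inline uint32_t rad_to_bam(double rad)
{
    double scaled = rad * (BAM_FULL / (2.0 * BAM_PI)) + 0.5;
    return (uint32_t)(uint64_t)scaled;   /* a value of 2^32 wraps to 0 */
}

/* FBAM2RAD-style conversion. */
static inline double bam_to_rad(uint32_t bam)
{
    return (double)bam * ((2.0 * BAM_PI) / BAM_FULL);
}

/* Per-frame BAM increment for a turn rate in degrees per second. */
static inline uint32_t bam_step(double deg_per_sec, double frames_per_sec)
{
    double step = BAM_FULL * deg_per_sec / 360.0 / frames_per_sec + 0.5;
    return (uint32_t)(uint64_t)step;
}

/* Usage example: a quarter turn is 0x40000000, and integer addition
 * wraps past the full circle with no explicit range reduction. */
static inline void bam_demo(void)
{
    assert(rad_to_bam(BAM_PI / 2.0) == 0x40000000u);
    uint32_t angle = 0xC0000000u;     /* 270 degrees */
    angle += 0x80000000u;             /* plus 180 degrees */
    assert(angle == 0x40000000u);     /* 450 degrees wraps to 90 */
}
```

For example, 90°/sec at 60 frames per second works out to 2^32 / 240 steps per frame, which truncates to 0x01111111.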
7.6 BAM-Based Rotation Example
// Game character rotation, BAM-based:
uint32_t angle_bam; // Persistent in game state
uint32_t turn_rate = 0x01111111; // 2^32 / 240: 90°/sec at 60 Hz (one full circle per 4 sec)
// Per-frame update:
angle_bam += turn_rate; // 1 cycle integer add; perfect wraparound
// Render:
float sin_a, cos_a;
FSINCOSBAM.S sin_a, angle_bam; // 3 cycles
// Build 2D rotation matrix from sin_a, cos_a (2 negations)
// Total angle handling per frame: ~5 cycles
Compared to FP-radian-based equivalent:
float angle_rad;
float turn_rate = M_PI / 2; // Same 90°/sec
// Per-frame:
angle_rad += turn_rate / 60.0f; // FDIV + FADD (the divide is slow unless precomputed)
if (angle_rad >= 2*M_PI) angle_rad -= 2*M_PI; // Branchy wraparound
// Render:
float sin_a = sin(angle_rad); // 50+ cycles (library) or 3 (FSIN.S)
float cos_a = cos(angle_rad); // 50+ or 3
// Total: 6-100+ cycles
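BAM is cheaper per frame (~5 cycles against 6 to 100+, depending on whether the FP path uses FSIN.S/FCOS.S or library trig), but the larger win is accumulated precision: an FP angle drifts over thousands of frames, while the BAM accumulator is bit-exact forever.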
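The bit-exactness claim can be checked with a few lines of C. This sketch assumes a per-frame step of 2^30 / 60 (a quarter turn per second at 60 Hz, truncated to 0x01111111); the constant and function names are illustrative.

```c
#include <stdint.h>

/* Per-frame step for 90 degrees/sec at 60 Hz: a quarter turn is 2^30 BAM
   units, so one frame is 2^30 / 60 = 17895697 (0x01111111) after truncation. */
#define TURN_RATE_BAM 0x01111111u

/* Advance the angle frame by frame; unsigned overflow is the wraparound. */
static uint32_t bam_after_frames(uint32_t start, uint32_t frames)
{
    uint32_t angle = start;
    for (uint32_t i = 0; i < frames; i++)
        angle += TURN_RATE_BAM;
    return angle;
}
```

After 240 frames (4 seconds) the accumulator holds 0xFFFFFFF0, which is 16 steps short of a full turn because of the truncated step, and that error is fixed rather than growing with rounding. After any number of frames the result equals frames × step modulo 2^32, exactly.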
7.7 Retro / Demoscene Use Cases
BAM is the native representation for:
- Rotozoomers: rotate-and-zoom raster effects (Amiga / ST demoscene classic). Use FSINCOSBAM for the rotation matrix per scanline.
- Plasma effects: combine multiple BAM-indexed sine waves for plasma rendering.
- Wavetable oscillators: phase accumulation in BAM, wavetable index = BAM, no precision drift.
- Velocity-based animation curves: BAM phase advances at constant rate, gives smooth, drift-free cyclic motion.
- Particle system rotation: per-particle BAM rotation state, all updated identically per frame.
For the kind of retro-style code Anthony's project landscape favours, BAM is the right primitive.
8. Group 6: Vector Math Bundles
Fixed-shape multi-element operations for 3D vector math. These are bundles — single instructions that execute a sequence of internal multiply-adds — not SIMD operations on multiple registers in parallel. Their value is in encoding density and register-allocation efficiency, not in parallelism per se.
All G6 instructions are R-type at opcode 0x57, funct3 110 (single result) or 111 (multi-result).
8.1 Register Tuple Convention
The vector bundles operate on consecutive FP register tuples. The rs1 and rs2 fields name the first register of each tuple; the hardware reads consecutive registers from there.
- 3-element tuples (DOT3, CROSS3, LENSQ3): the bundle reads f[rs1], f[rs1+1], f[rs1+2] (and similarly for rs2).
- 4-element tuples (DOT4): the bundle reads f[rs1], f[rs1+1], f[rs1+2], f[rs1+3].
Narrow-mode constraint: the starting register must allow the tuple to fit within f0–f31. For DOT3, rs1 ≤ f29; for DOT4, rs1 ≤ f28. In wide mode, the tuple can start anywhere in f0–f60 (DOT3) or f0–f59 (DOT4).
This constraint is rarely binding in practice — compilers naturally allocate vectors at well-aligned bases.
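An assembler or register allocator could enforce the tuple constraint with a check along these lines. This is only a sketch: the function name is hypothetical, and the register-file size is passed in rather than assumed.

```c
#include <stdbool.h>

/* Hypothetical assembler-side check: does a tuple of `width` registers
   starting at f[start] fit inside a file of `nregs` FP registers? */
static bool tuple_fits(unsigned start, unsigned width, unsigned nregs)
{
    return width <= nregs && start <= nregs - width;
}
```

For the narrow-mode file of 32 registers this reproduces the limits above: a 3-tuple may start at f29 but not f30, and a 4-tuple at f28 but not f29.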
8.2 DOT3 — 3-Element Dot Product
DOT3.S fd, rs1, rs2
fd = f[rs1] × f[rs2] + f[rs1+1] × f[rs2+1] + f[rs1+2] × f[rs2+2]
3 internal FP FMAs in sequence using the FMA chain. Latency: 6 cycles (3 FMAs × 2 cycles each, no parallelism within the bundle). Throughput: depends on FMA unit availability; on Ant64 with single FMA unit, 1 DOT3 per 6 cycles.
For batches of dot products, the compiler can pipeline by issuing the next DOT3 before the previous one's result is needed — scoreboarding handles this naturally.
FP64 variant (DOT3.D): same structure, 8-cycle latency on FP64 FMAs.
8.3 DOT4 — 4-Element Dot Product
DOT4.S fd, rs1, rs2
fd = Σ (f[rs1+k] × f[rs2+k]) for k in 0..3
The homogeneous-coordinate dot product — vertex transform stage. 4 internal FMAs. Latency: 8 cycles. Critical for the matrix-times-vector inner loop of 3D rendering.
8.4 CROSS3 — 3-Element Cross Product
CROSS3.S fd, rs1, rs2
f[fd] = f[rs1+1] × f[rs2+2] - f[rs1+2] × f[rs2+1]
f[fd+1] = f[rs1+2] × f[rs2+0] - f[rs1+0] × f[rs2+2]
f[fd+2] = f[rs1+0] × f[rs2+1] - f[rs1+1] × f[rs2+0]
Writes 3 FP registers f[fd], f[fd+1], f[fd+2]. Used for normal-vector computation in 3D graphics, angular velocity, torque, and any "perpendicular vector" operation.
Latency: 10 cycles (6 internal multiplies + 3 subtracts, pipelined). Throughput limited by the multi-port register file write needed for the 3 output writes.
Narrow-mode constraint: fd ≤ f29 (the 3-register output must fit in f0–f31).
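For reference, the DOT3 and CROSS3 definitions translate directly into C. The sketch below models only the arithmetic on plain arrays; it does not model the intermediate rounding of the hardware FMA chain, and the function names are illustrative.

```c
/* Reference semantics for DOT3.S on plain arrays. */
static float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Reference semantics for CROSS3.S. */
static void cross3(float out[3], const float a[3], const float b[3])
{
    /* Compute into temporaries so that out may alias a or b in this model. */
    float x = a[1] * b[2] - a[2] * b[1];
    float y = a[2] * b[0] - a[0] * b[2];
    float z = a[0] * b[1] - a[1] * b[0];
    out[0] = x;
    out[1] = y;
    out[2] = z;
}
```

A quick sanity check is that the cross product of the x and y unit vectors is the z unit vector, and that its dot product with either input is zero.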
8.5 LENSQ3 — Squared Length
LENSQ3.S fd, rs1
fd = f[rs1]² + f[rs1+1]² + f[rs1+2]²
The "is point A closer to B than C?" primitive — squared distance is sufficient for comparison and avoids the FSQRT. Latency: 6 cycles (3 internal FMAs).
// Nearest-point selection (N candidate points):
float min_dist_sq = FLT_MAX;
int best = -1;
for (int i = 0; i < N; i++) {
float d = LENSQ3.S(diff[i].xyz); // 6 cycles
if (d < min_dist_sq) { min_dist_sq = d; best = i; }
}
For N=8 candidates, the inner loop is ~10 cycles each (LENSQ3 + compare + conditional update). Standard RV64 with FMA: ~15 cycles each (3 FMAs + compare + update). ~1.5× speedup on neighbour-search inner loops.
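A compilable version of the nearest-point loop might look like the following. The vec3 type, lensq3 (standing in for LENSQ3.S), and nearest_point are illustrative names, and the candidate layout is an assumption.

```c
#include <float.h>
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

/* Stand-in for LENSQ3.S: squared length of a 3-vector. */
static float lensq3(vec3 v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

/* Return the index of the candidate closest to p, or -1 if n == 0.
   Squared distances are compared directly, so no square root is needed. */
static int nearest_point(vec3 p, const vec3 *cand, size_t n)
{
    float best_d = FLT_MAX;
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        vec3 diff = { cand[i].x - p.x, cand[i].y - p.y, cand[i].z - p.z };
        float d = lensq3(diff);
        if (d < best_d) {
            best_d = d;
            best = (int)i;
        }
    }
    return best;
}
```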
8.6 LERP — Linear Interpolation
LERP.S fd, rs1, rs2, rs3
fd = f[rs1] + (f[rs2] - f[rs1]) × f[rs3]
rs3 is the interpolation parameter t (typically in [0, 1]). The fundamental animation / blending primitive. 1 internal FMA + 1 internal FSUB = 3 cycles total.
Note: implemented as R4-type-style with 3 source operands but using the R-type slot (opcode 0x57, funct3 110, with rs3 placed in a reserved bit-field). This is unusual; an alternative would be to compute as fd = f[rs1]*(1 − f[rs3]) + f[rs2]*f[rs3] from a pair of FMAs (which decomposes to 2 instructions in standard RV with FMA: same total cycles, but more register pressure).
The single LERP instruction is faster in register-pressure-bound code (texture sampling inner loops, animation blending) and saves the temporary register for (1 − t).
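The two formulations discussed above can be compared side by side in C. Both functions are illustrative models, not the hardware datapath; the second makes the extra temporary for (1 − t) explicit.

```c
/* LERP.S as specified: a + (b - a) * t. */
static float lerp_sub_fma(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* The two-FMA alternative: a * (1 - t) + b * t.
   It needs a temporary register for (1 - t). */
static float lerp_two_fma(float a, float b, float t)
{
    float one_minus_t = 1.0f - t;
    return a * one_minus_t + b * t;
}
```

The two forms agree for exactly representable inputs such as lerp(2, 10, 0.25) = 4, but can differ in the last bit for general inputs because they round at different points.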
9. Group 7: Vector Componentwise Bundles
Componentwise operations on 3-element FP tuples. These are the bread-and-butter of game state math: physics integration, steering, collision response, transform composition. Where most of G6's bundles return a scalar (DOT3 → 1 result), G7's bundles return vectors (VADD3 → 3 results).
All G7 instructions use opcode 0x57, funct3 111 (multi-result). Tuple convention follows §8.1 — rs1, rs2, rd name the first register of each 3-tuple; hardware reads/writes consecutive registers.
Narrow-mode constraint: each named register must allow a 3-tuple to fit within f0–f31. Wide mode allows any starting register through f60.
9.1 VADD3 — Componentwise Vector Add
VADD3.S fd, rs1, rs2
f[fd+0] = f[rs1+0] + f[rs2+0]
f[fd+1] = f[rs1+1] + f[rs2+1]
f[fd+2] = f[rs1+2] + f[rs2+2]
Three parallel FP adds in one instruction. Latency: 3 cycles. Throughput: 1 per cycle (multi-port FP register file).
The physics-update workhorse: position += velocity is one VADD3 instead of three FADD instructions.
9.2 VSUB3 — Componentwise Vector Subtract
VSUB3.S fd, rs1, rs2
f[fd+k] = f[rs1+k] - f[rs2+k] for k in 0..2
difference = target - position patterns; collision-normal computation; relative-velocity calculation.
9.3 VSCALE3 — Vector Scale by Scalar
VSCALE3.S fd, rs1, rs2
f[fd+k] = f[rs1+k] × f[rs2] for k in 0..2
rs2 is a single FP register holding the scalar; each component of rs1's tuple is multiplied by it. Useful for unit-vector-to-velocity conversion, light-intensity application, gain scaling.
9.4 VMADD3 — Fused Vector Multiply-Add
VMADD3.S fd, rs1, rs2, rs3 (encoded as R4-type-style; rs3 in reserved bit-field as with LERP)
f[fd+k] = f[rs1+k] + f[rs2+k] × f[rs3] for k in 0..2
The physics-integration primitive: pos = pos + vel * dt in one instruction. A standard explicit Euler integration step becomes:
VMADD3.S new_pos, old_pos, vel, dt ; pos += vel * dt
VMADD3.S new_vel, old_vel, accel, dt ; vel += accel * dt
Two instructions for a full integration step. Without VMADD3, each of the two updates costs ~9 instructions (3 FMUL + 3 FADD + 3 register moves), or ~6 with FMA (3 FMADD + 3 moves).
Latency: 4 cycles (3 parallel FMAs + writeback).
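The same integration step can be sketched in C, with vmadd3 standing in for VMADD3.S. The vec3 and body types and the function names are assumptions made for this example.

```c
typedef struct { float x, y, z; } vec3;

/* Stand-in for VMADD3.S: a + b * s, componentwise. */
static vec3 vmadd3(vec3 a, vec3 b, float s)
{
    vec3 r = { a.x + b.x * s, a.y + b.y * s, a.z + b.z * s };
    return r;
}

typedef struct { vec3 pos, vel; } body;

/* One integration step in the same order as the two-instruction listing:
   position advances with the old velocity, then velocity picks up accel. */
static void integrate(body *b, vec3 accel, float dt)
{
    b->pos = vmadd3(b->pos, b->vel, dt);
    b->vel = vmadd3(b->vel, accel, dt);
}
```

Swapping the two calls would give the semi-implicit (symplectic) Euler variant, which uses the updated velocity for the position update.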
9.5 VNORM3 — Vector Normalisation
VNORM3.S fd, rs1
length_sq = LENSQ3(f[rs1+0..2])
inv_length = FRSQRT(length_sq)
f[fd+0] = f[rs1+0] × inv_length
f[fd+1] = f[rs1+1] × inv_length
f[fd+2] = f[rs1+2] × inv_length
The lighting / direction normalisation primitive fused into one instruction. Without VNORM3, this takes 3 separate steps (LENSQ3 + FRSQRT + VSCALE3) costing 12 cycles; VNORM3 does it in 8 cycles through internal forwarding.
Used in every lighting calculation (normal vectors must be unit length), every steering computation (direction vectors), every camera/AI orientation calculation.
9.6 Game-Logic Example: Steering Behaviours
A "seek" steering behaviour (chase a target):
// Compute desired velocity (toward target, at max speed)
diff = VSUB3(target, position); // 3 cycles
desired = VNORM3(diff); // 8 cycles
desired = VSCALE3(desired, max_speed); // 3 cycles
// Compute steering force
steer = VSUB3(desired, velocity); // 3 cycles
// Total: 17 cycles
Without G7 (using only G6 + standard FP):
diff_x = target_x - position_x; // 3× FSUB (3 cycles)
... similar for y, z
length_sq = LENSQ3(diff); // 6 cycles
inv_length = FRSQRT(length_sq); // 3 cycles
desired_x = diff_x * inv_length * max_speed; // 6× FMUL, two per component (6 cycles)
... similar for y, z
steer_x = desired_x - velocity_x; // 3× FSUB (3 cycles)
// Total: ~21 cycles
About 20% fewer cycles (17 vs ~21) on a single steering behaviour, repeated hundreds of times per frame for AI flocking, particle systems, projectile homing.
10. Group 8: 2D Math Primitives
2D versions of the 3D vector operations. Critical for top-down games, raycaster setup, navmesh pathfinding, 2D physics, and any "x-y but no z" math.
All G8 instructions use opcode 0x57. Tuple convention: rs1, rs2, rd name the first register of each 2-tuple.
10.1 DOT2 — 2D Dot Product
DOT2.S fd, rs1, rs2
fd = f[rs1+0] × f[rs2+0] + f[rs1+1] × f[rs2+1]
2 internal FMAs. Latency: 4 cycles. The 2D projection primitive: dot of a 2D vector against a 2D axis. Used in:
- Navmesh edge-side tests
- 2D collision (SAT separating axis projections)
- Sprite-vs-line tests
- Top-down field-of-view tests
10.2 LENSQ2 — 2D Squared Length
LENSQ2.S fd, rs1
fd = f[rs1+0]² + f[rs1+1]²
The 2D distance-compare primitive. 2 internal FMAs, 4 cycles. The "is point A within radius R of point B?" test uses LENSQ2(A - B) < R*R — one subtract bundle plus one LENSQ2 plus one compare.
10.3 CROSS2 — 2D Cross Product (Scalar Result)
CROSS2.S fd, rs1, rs2
fd = f[rs1+0] × f[rs2+1] - f[rs1+1] × f[rs2+0]
The 2D cross product produces a scalar (the z-component of the 3D cross). Single FMSUB. Latency: 2 cycles.
The funnel algorithm primitive for navmesh pathfinding: the sign of the 2D cross determines which side of a line a point is on. Also the 2D winding test for polygon orientation, the 2D segment intersection helper, and the 2D rotation direction tester.
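The side-of-line test reduces to the sign of one 2D cross product. A small C sketch follows, with cross2 standing in for CROSS2.S; the helper names and the y-up orientation convention are assumptions.

```c
typedef struct { float x, y; } vec2;

/* Stand-in for CROSS2.S: the z-component of the 3D cross product. */
static float cross2(vec2 a, vec2 b)
{
    return a.x * b.y - a.y * b.x;
}

/* +1 if p lies to the left of the directed line a->b, -1 if to the right,
   0 if on the line (left/right assume a y-up coordinate system). */
static int side_of_line(vec2 a, vec2 b, vec2 p)
{
    vec2 ab = { b.x - a.x, b.y - a.y };
    vec2 ap = { p.x - a.x, p.y - a.y };
    float c = cross2(ab, ap);
    return (c > 0.0f) - (c < 0.0f);
}
```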
10.4 VADD2 / VSUB2 / VSCALE2 — 2D Vector Componentwise
VADD2.S fd, rs1, rs2 # 2-element componentwise add (2 cycles)
VSUB2.S fd, rs1, rs2 # 2-element componentwise subtract (2 cycles)
VSCALE2.S fd, rs1, rs2 # f[fd+k] = f[rs1+k] × f[rs2] for k in 0..1
2D versions of the G7 bundles. Each is 2 parallel FP ops + writeback. 2-cycle latency.
10.5 VNORM2 — 2D Vector Normalisation
VNORM2.S fd, rs1
length_sq = LENSQ2(f[rs1+0..1])
inv_length = FRSQRT(length_sq)
f[fd+0] = f[rs1+0] × inv_length
f[fd+1] = f[rs1+1] × inv_length
The 2D direction-vector primitive: get a unit vector pointing the same way. Fused: 6 cycles (vs 9 cycles for separate LENSQ2 + FRSQRT + VSCALE2).
10.6 2D Game Logic Example: Navmesh Pathfinding
The funnel algorithm walks a path through a navmesh, maintaining left and right "apex" vertices. Each new portal vertex requires testing which side of the current funnel it falls on:
// Funnel-vertex test
side_left = CROSS2(left_edge, new_vertex - apex); // 1 VSUB2 + 1 CROSS2 = 4 cycles
side_right = CROSS2(right_edge, new_vertex - apex); // 1 VSUB2 + 1 CROSS2 = 4 cycles
// Branch on signs of side_left / side_right
A complete funnel-walk per portal is ~10 cycles with G8. Without G8, it's ~25 cycles (more individual FSUB / FMUL / FADD). 2.5× speedup on navmesh path smoothing.
10.7 Raycaster Example: Wolfenstein-Style Ray Casting
For each screen column, cast a ray through a 2D grid:
// Setup
ray_dir = VSCALE2(camera_plane, x_offset); // 2 cycles
ray_dir = VADD2(camera_dir, ray_dir); // 2 cycles
// DDA step (integer, no Xmath needed)
// ...
// Wall distance computation (per hit)
hit_dist = DOT2(diff, ray_dir); // 4 cycles
wall_height = FRECIP.S(hit_dist); // 3 cycles, then * screen_height
// Total per ray (excluding pixel work): ~12 cycles
Without G8: ray setup costs 5 cycles vs 4; distance computation costs 6 cycles vs 4 (using DOT3 with z = 0, at the 6-cycle latency from §8.2). A marginal win here: the bulk of Wolfenstein-style raycasting is the integer DDA step and per-pixel work, which the dedicated drawing hardware will handle.
11. Group 9: Game / Animation Math
Single-instruction versions of common patterns that occur in UI animation, shader uniforms (passed to the drawing hardware), AI parameter clamping, and procedural generation.
All G9 instructions use opcode 0x57. Most are simple FMA chains internally.
11.1 CLAMP.S / CLAMP.D — FP Clamp
CLAMP.S fd, rs1, rs2, rs3
fd = max(min(f[rs1], f[rs3]), f[rs2]) # clamp f[rs1] to [f[rs2], f[rs3]]
The FP equivalent of SAT.x but with user-supplied bounds rather than type bounds. Encoded as R-type-style with rs3 in a reserved bit-field (similar to LERP). Latency: 2 cycles (2 sequential compares + selects).
Used everywhere — every shader uniform that has a valid range, every AI weight that must stay positive, every camera parameter with min/max, every animation phase modulo-1.
11.2 SMOOTHSTEP.S / SMOOTHSTEP.D — Cubic Easing
SMOOTHSTEP.S fd, rs1
t = clamp(f[rs1], 0, 1)
fd = t × t × (3 - 2 × t)
The standard "ease-in/ease-out" curve. C1-continuous at 0 and 1. 3 cycles (1 internal CLAMP + 2 FMAs).
Used in: UI fade-in/out, camera lerp curves, animation transitions, shader edge softening, procedural texture gradients.
11.3 SMOOTHERSTEP.S / SMOOTHERSTEP.D — Quintic Easing
SMOOTHERSTEP.S fd, rs1
t = clamp(f[rs1], 0, 1)
fd = t³ × (t × (t × 6 - 15) + 10)
Ken Perlin's improved easing: C2-continuous at 0 and 1 (the first and second derivatives are zero at the endpoints, eliminating visible "kinks" in chained interpolations). 5 cycles (1 CLAMP + 4 FMAs).
Used in: procedural noise functions (Perlin's gradient noise relies on SMOOTHERSTEP), high-quality animation curves, smooth camera transitions where the discontinuity in SMOOTHSTEP's second derivative would be visible.
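The two easing curves, together with the clamp they share, can be modelled in a few lines of C. This is a sketch of the arithmetic only; NaN handling and the hardware's rounding behaviour are not modelled, and the function names are illustrative.

```c
/* Clamp x to [lo, hi], matching the CLAMP.S definition max(min(x, hi), lo). */
static float clampf(float x, float lo, float hi)
{
    float m = x < hi ? x : hi;
    return m > lo ? m : lo;
}

/* SMOOTHSTEP: t^2 * (3 - 2t) on the clamped input. */
static float smoothstep(float x)
{
    float t = clampf(x, 0.0f, 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

/* SMOOTHERSTEP: t^3 * (t * (6t - 15) + 10) on the clamped input. */
static float smootherstep(float x)
{
    float t = clampf(x, 0.0f, 1.0f);
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}
```

Both curves pass through (0, 0), (0.5, 0.5) and (1, 1); they differ in how flat they are near the endpoints.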
11.4 STEP.S / STEP.D — Branchless Threshold
STEP.S fd, rs1, rs2
fd = (f[rs1] < f[rs2]) ? 0.0 : 1.0
The branchless threshold primitive: returns 0 if input is below edge, 1 otherwise. Latency: 1 cycle.
Used in: shader masking, animation triggers, AI decision thresholds, particle visibility cutoffs. Combined with multiplication, gives branchless conditional values: result = STEP(x, threshold) * value is "value if x ≥ threshold else 0."
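A short C model of STEP and the multiply-gating idiom described above. The names stepf and gate are hypothetical; a C compiler may or may not emit a branch for the conditional, whereas the instruction itself is branch-free.

```c
/* Stand-in for STEP.S: 0.0 below the edge, 1.0 at or above it. */
static float stepf(float x, float edge)
{
    return x < edge ? 0.0f : 1.0f;
}

/* "value if x >= threshold, else 0", expressed as a multiply by the mask. */
static float gate(float x, float threshold, float value)
{
    return stepf(x, threshold) * value;
}
```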
11.5 Game Animation Example: UI Element Fade-In
// Per-frame UI element fade-in over 0.5 seconds
float phase = (time - start_time) * 2.0f; // 1 FMUL (1 cycle)
float alpha = SMOOTHSTEP(phase); // 3 cycles
// alpha is now smoothly 0→1 over the fade window, clamped naturally
// Pass alpha to drawing hardware
vs without G9:
float phase = (time - start_time) * 2.0f;
phase = max(0.0f, min(1.0f, phase)); // 4 cycles (manual clamp)
float alpha = phase * phase * (3 - 2 * phase); // 4 cycles (2 FMUL + 1 FMA)
// Total: 9 cycles
~2.2× faster on this pattern (4 cycles vs 9), multiplied by every UI element with smooth animation.
12. Group 10: Distance Heuristics
Integer distance metrics for pathfinding (A*, jump-point search, hierarchical pathfinding) and broad-phase collision distance estimates. All operate on integer registers (using GPRs, not FPRs), enabling use in tight integer loops without FP unit pressure.
All G10 instructions use opcode 0x57, funct3 001, funct7[6:5]=01. Latency: 1–2 cycles. Throughput: 1 per cycle.
12.1 MANHATTAN2 / MANHATTAN3 — Manhattan Distance
MANHATTAN2 rd, rs1, rs2
rd = |rs1[31:0] - rs2[31:0]| + |rs1[63:32] - rs2[63:32]|
MANHATTAN3 rd, rs1, rs2
# rs1 and rs2 each treated as 3-element packed-integer (low 21 bits per element, signed)
rd = sum of absolute differences across the 3 elements
The classic A* heuristic for 4-connected grid worlds. Standard RV64 needs 5+ instructions (load, sub, abs, sub, abs, add); MANHATTAN2 is one instruction.
For pathfinding on a 1000×1000 grid with ~5000 nodes expanded per query, heuristic computation is ~5% of total time. With MANHATTAN2, that fraction drops to ~1%. A* throughput improves by ~4%.
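A software model of MANHATTAN2 makes the packed layout concrete. The sketch assumes the two 32-bit lanes are signed, which the definition above does not state explicitly; pack_xy and absdiff32 are hypothetical helpers.

```c
#include <stdint.h>

/* Pack two coordinates the way MANHATTAN2 reads them:
   x in bits [31:0], y in bits [63:32]. */
static uint64_t pack_xy(int32_t x, int32_t y)
{
    return ((uint64_t)(uint32_t)y << 32) | (uint32_t)x;
}

/* Absolute difference of two lanes, widened to avoid overflow.
   ASSUMPTION: lanes are signed 32-bit (two's complement) values. */
static uint64_t absdiff32(uint32_t a, uint32_t b)
{
    int64_t d = (int64_t)(int32_t)a - (int64_t)(int32_t)b;
    return (uint64_t)(d < 0 ? -d : d);
}

/* Software model of MANHATTAN2 rd, rs1, rs2. */
static uint64_t manhattan2(uint64_t rs1, uint64_t rs2)
{
    return absdiff32((uint32_t)rs1, (uint32_t)rs2)
         + absdiff32((uint32_t)(rs1 >> 32), (uint32_t)(rs2 >> 32));
}
```

Keeping each grid position packed in one register is what lets the heuristic collapse to a single instruction: no unpacking is needed between node expansion and the distance computation.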
12.2 CHEBYSHEV2 / CHEBYSHEV3 — Chebyshev Distance
CHEBYSHEV2 rd, rs1, rs2
rd = max(|rs1[31:0] - rs2[31:0]|, |rs1[63:32] - rs2[63:32]|)
CHEBYSHEV3 rd, rs1, rs2
rd = max(|dx|, |dy|, |dz|) for the 3-element packed inputs
The A* heuristic for 8-connected (king-move) grid pathfinding where diagonal and straight moves cost the same. It also serves as the infinity-norm distance for broad-phase collision tests and for chess-style or RTS unit movement.
12.3 OCTILE2 — Octile Distance
OCTILE2 rd, rs1, rs2
dx = |rs1[31:0] - rs2[31:0]|
dy = |rs1[63:32] - rs2[63:32]|
rd = max(dx, dy) × 1024 + min(dx, dy) × 424 # approximating (√2 - 1) ≈ 0.4142
The optimal admissible heuristic for 8-connected grids where diagonal moves cost √2 (more expensive than 4-connected moves). Used in tile-based RPGs, RTS games, top-down shooters. The internal scale factor (1024) lets the result stay in integer arithmetic; software typically divides by 1024 if a normalised distance is needed.
3 cycles including the internal multiply. Standard RV64: ~10 cycles with branches.
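CHEBYSHEV2 and OCTILE2 can be modelled the same way. As before, signed 32-bit lanes are an assumption, and lane_absdiff is a hypothetical helper.

```c
#include <stdint.h>

/* |lane(rs1) - lane(rs2)| for the 32-bit lane starting at bit `shift`.
   ASSUMPTION: lanes are signed 32-bit (two's complement) integers. */
static uint64_t lane_absdiff(uint64_t rs1, uint64_t rs2, unsigned shift)
{
    int64_t a = (int32_t)(uint32_t)(rs1 >> shift);
    int64_t b = (int32_t)(uint32_t)(rs2 >> shift);
    return (uint64_t)(a < b ? b - a : a - b);
}

/* Software model of CHEBYSHEV2: max(|dx|, |dy|). */
static uint64_t chebyshev2(uint64_t rs1, uint64_t rs2)
{
    uint64_t dx = lane_absdiff(rs1, rs2, 0);
    uint64_t dy = lane_absdiff(rs1, rs2, 32);
    return dx > dy ? dx : dy;
}

/* Software model of OCTILE2. The result is scaled by 1024. */
static uint64_t octile2(uint64_t rs1, uint64_t rs2)
{
    uint64_t dx = lane_absdiff(rs1, rs2, 0);
    uint64_t dy = lane_absdiff(rs1, rs2, 32);
    uint64_t hi = dx > dy ? dx : dy;
    uint64_t lo = dx > dy ? dy : dx;
    return hi * 1024u + lo * 424u;
}
```

For dx = 3, dy = 4 the model gives 4 × 1024 + 3 × 424 = 5368. Because 424/1024 = 0.4140625 is slightly below √2 − 1, the scaled result never overestimates the true octile cost, so the heuristic stays admissible.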
12.4 Pathfinding Example: A* with OCTILE2
// Inner loop: expand a node, push neighbours with h = octile distance to goal
for (each neighbour) {
new_g = current_g + step_cost; // 1 cycle
h = OCTILE2(neighbour, goal); // 3 cycles
f = new_g + h; // 1 cycle
push_to_open_set(neighbour, f); // ~10 cycles for heap insert
}
// Total per neighbour: ~15 cycles
Without G10:
for (each neighbour) {
new_g = current_g + step_cost;
// octile distance, manually computed:
dx = abs(neighbour.x - goal.x); // 2 cycles
dy = abs(neighbour.y - goal.y); // 2 cycles
h = (dx > dy) ? (dx * 1024 + dy * 424) : (dy * 1024 + dx * 424); // ~6 cycles
f = new_g + h;
push_to_open_set(neighbour, f);
}
// Total per neighbour: ~22 cycles
~30% speedup on the A* inner loop, with several thousand nodes expanded per pathfind on a typical RPG map.
13. Group 11: Quaternion Math
Quaternions are the standard representation for 3D rotations in modern games. They avoid gimbal lock, interpolate smoothly (SLERP), and chain compactly. Every modern game with bone-based character animation uses quaternions in inner loops; every camera-orientation update uses them.
Quaternions are stored as 4 consecutive FP registers: (w, x, y, z) with w as the scalar part and (x, y, z) as the vector part. The tuple naming convention from §8.1 applies — rs1 and rd name the first register of a 4-tuple.
Narrow-mode constraint: starting register must allow the 4-tuple to fit in f0–f31, so rs1 ≤ f28, rd ≤ f28.
13.1 QMUL.S / QMUL.D — Quaternion Multiplication
QMUL.S fd, rs1, rs2
# q1 = (w1, x1, y1, z1) at f[rs1..rs1+3]
# q2 = (w2, x2, y2, z2) at f[rs2..rs2+3]
# result = q1 × q2 stored at f[fd..fd+3]
#
# w = w1×w2 - x1×x2 - y1×y2 - z1×z2
# x = w1×x2 + x1×w2 + y1×z2 - z1×y2
# y = w1×y2 - x1×z2 + y1×w2 + z1×x2
# z = w1×z2 + x1×y2 - y1×x2 + z1×w2
The quaternion-composition primitive. Internal cost: 16 multiplies + 12 adds. Latency: 8 cycles (pipelined through 4 internal FMA units in parallel where the DSP block budget allows).
Skeletal animation impact: a character with 50 bones needs 50 QMULs per frame to compose local-to-world transforms. With QMUL at 8 cycles each, that's 400 cycles per character; at 60 fps with 20 characters on-screen, that's 480K cycles/sec = 0.13% of CPU. Without QMUL, at ~28 cycles per software multiply (§13.3), the same work takes ~1400 cycles per character (~3.5× slower).
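The four component equations map one-to-one onto a C reference function. This is a sketch of the arithmetic (16 multiplies, 12 adds/subtracts) with an assumed quat struct in (w, x, y, z) order; it does not model the hardware's internal rounding.

```c
typedef struct { float w, x, y, z; } quat;

/* Reference semantics for QMUL.S: the Hamilton product a * b. */
static quat qmul(quat a, quat b)
{
    quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}
```

The product is not commutative: with the basis quaternions i = (0, 1, 0, 0) and j = (0, 0, 1, 0), i × j gives k = (0, 0, 0, 1) while j × i gives −k. Operand order therefore matters when composing parent and local rotations.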
13.2 QROT.S / QROT.D — Rotate Vector by Quaternion
QROT.S fd, rs1, rs2
# q = (w, x, y, z) at f[rs1..rs1+3] (assumed normalised)
# v = (vx, vy, vz) at f[rs2..rs2+2]
# result = q × v × q⁻¹ stored at f[fd..fd+2] (rotated vector)
#
# Computed via the formula:
# v' = v + 2 × cross(q.xyz, cross(q.xyz, v) + q.w × v)
The "apply this orientation to this direction" primitive. Used for transforming forward/right/up vectors, applying bone rotations to attached objects, computing facing direction after orientation changes.
Latency: 10 cycles (2 internal CROSS3 + 2 FMA chains). Internal implementation uses DSP block cascade for efficiency.
vs software: typically 25-30 FMA instructions, ~25 cycles. ~2.5× speedup.
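The two-cross-product formula can be written out in C as follows. The types and helper names are assumptions for this sketch, and q is taken to be unit length, as the instruction itself assumes.

```c
typedef struct { float x, y, z; } vec3;
typedef struct { float w, x, y, z; } quat;

static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

/* Reference semantics for QROT.S; q must be unit length.
   v' = v + 2 * cross(q.xyz, cross(q.xyz, v) + q.w * v) */
static vec3 qrot(quat q, vec3 v)
{
    vec3 u = { q.x, q.y, q.z };
    vec3 t = cross(u, v);              /* inner cross product */
    t.x += q.w * v.x;
    t.y += q.w * v.y;
    t.z += q.w * v.z;
    vec3 c = cross(u, t);              /* outer cross product */
    vec3 r = { v.x + 2.0f * c.x, v.y + 2.0f * c.y, v.z + 2.0f * c.z };
    return r;
}
```

As a check, q = (0, 0, 0, 1) is a half turn about the z axis and maps (1, 0, 0) to (−1, 0, 0), while the identity quaternion leaves any vector unchanged.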
13.3 Game Example: Character Bone Update
// Per-bone update (50 bones, 60 fps):
QMUL.S world_quat, parent_quat, local_quat; // 8 cycles
QROT.S bone_forward, world_quat, base_forward; // 10 cycles
// Compute bone position via additional VMADD3 chain
// ...
// Total per bone: ~25 cycles
// Total per character: 50 × 25 = 1250 cycles
// Total per frame at 20 characters: 25000 cycles = ~0.4% of CPU at 380 MHz, 60 fps
vs software quaternion math:
// Per-bone, software:
// QMUL via 16 FMUL + 12 FADD: ~28 cycles
// QROT via 25+ FMA: ~25 cycles
// Total per bone: ~55 cycles
// Total per character: 50 × 55 = 2750 cycles
// Total per frame at 20 characters: 55000 cycles = ~0.9% of CPU at 380 MHz, 60 fps
~2.2× speedup on bone updates. At higher character counts (action games with 100+ on-screen characters), the savings scale linearly.
14. Group 12: Multi-Precision Integer Arithmetic with Carry
Standard RV64GC has no add-with-carry: a multi-limb add is an add plus an sltu to recover the carry into a register, fed to the next limb. Xmath's G1 already provides multiply-high (MADDH/MADDU) — the partial-product half of schoolbook bignum multiply — but offers no way to accumulate those partials, because accumulation is the carry chain. G12 supplies the missing piece: a single carry bit and four instructions that consume and produce it, turning each limb of a multi-precision add, subtract, or shift into one instruction.
This is deliberately the only condition state in the EE. The full Z80/eZX flag model (Z, C, S, V, P) was kept out precisely because a register every arithmetic op writes serialises the dual-issue pipeline (see ee_xcond §1.2 Non-Goals). Carry is the exception worth making: limb N+1 genuinely cannot start until limb N's carry exists, so a single serial carry resource adds no false dependency the algorithm didn't already impose — and because only this four-instruction family touches it, it never sits in the hot ALU-pairing path the way a general flags register would.
14.1 The xcarry register
Carry/borrow lives in a one-bit CSR.
| CSR | Address (suggested) | Privilege | Description |
|---|---|---|---|
| xcarry | 0x808 | URW | Multi-precision carry/borrow bit |

| Bits | Field | Meaning |
|---|---|---|
| [0] | C | Carry-out (after add / rotate) or borrow (after subtract) |
| [63:1] | reserved | WPRI, read as 0 |
xcarry is user read/write so bignum and crypto code can manage it directly. It is per-context state: Xctx saves and restores it across YIELD and the trap entry/return path folds it into the per-context flag storage already maintained — one bit, with none of the mscratch-bits convention the eZX needs for its 5-bit ezflags.
14.2 ADDC — Add with Carry
| Instruction | Operation | Updates |
|---|---|---|
| ADDC rd, rs1, rs2 | rd = rs1 + rs2 + xcarry.C | xcarry.C ← carry-out (bit 64) of the unsigned 65-bit sum |
The workhorse. A chain of ADDC adds integers of any width, one instruction per 64-bit limb.
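A software model of ADDC and of a limb-by-limb add chain, as a sketch for testing bignum code off-target. The carry is passed through a pointer in place of the xcarry CSR; the function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of ADDC: returns rs1 + rs2 + *carry and updates *carry
   with the carry-out of the 65-bit sum. *carry must be 0 or 1 on entry. */
static uint64_t addc(uint64_t rs1, uint64_t rs2, unsigned *carry)
{
    uint64_t partial = rs1 + rs2;
    unsigned c1 = partial < rs1;        /* overflow of rs1 + rs2 */
    uint64_t sum = partial + *carry;
    unsigned c2 = sum < partial;        /* overflow from adding the carry-in */
    *carry = c1 | c2;                   /* at most one of the two can be set */
    return sum;
}

/* a += b over n little-endian 64-bit limbs; returns the final carry-out. */
static unsigned bignum_add(uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned carry = 0;                 /* models clearing xcarry first */
    for (size_t i = 0; i < n; i++)
        a[i] = addc(a[i], b[i], &carry);
    return carry;
}
```

The two-comparison carry recovery in addc is what the hardware carry bit replaces: on the EE each loop iteration is a single ADDC.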
14.3 SUBC — Subtract with Borrow
| Instruction | Operation | Updates |
|---|---|---|
| SUBC rd, rs1, rs2 | rd = rs1 - rs2 - xcarry.C | xcarry.C ← borrow (1 if rs1 < rs2 + C, unsigned) |
After a SUBC chain, xcarry.C = 1 indicates the minuend was smaller than the subtrahend — the multi-precision unsigned-less-than result, recovered without a flag-branch (§14.5).
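The multi-precision less-than described above can be sketched in C with a SUBC model whose results are discarded and whose final borrow is kept. The borrow is passed through a pointer in place of xcarry; names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of SUBC: returns rs1 - rs2 - *borrow and sets *borrow to 1
   when rs1 < rs2 + borrow-in, evaluated without wraparound. */
static uint64_t subc(uint64_t rs1, uint64_t rs2, unsigned *borrow)
{
    uint64_t partial = rs1 - rs2;
    unsigned b1 = rs1 < rs2;            /* borrow from rs1 - rs2 */
    uint64_t diff = partial - *borrow;
    unsigned b2 = partial < *borrow;    /* borrow from subtracting borrow-in */
    *borrow = b1 | b2;
    return diff;
}

/* Multi-limb unsigned a < b over n little-endian limbs:
   run the borrow chain and keep only the final borrow. */
static int bignum_less(const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned borrow = 0;
    for (size_t i = 0; i < n; i++)
        (void)subc(a[i], b[i], &borrow);
    return (int)borrow;
}
```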
14.4 ROLC / RORC — Rotate Through Carry
| Instruction | Operation |
|---|---|
| ROLC rd, rs1 | rd = (rs1 << 1) \| xcarry.C ; new xcarry.C = rs1[63] |
| RORC rd, rs1 | rd = (rs1 >> 1) \| (xcarry.C << 63) ; new xcarry.C = rs1[0] |
These are the through-carry rotates the standard Zbb rol/ror (circular, no carry) don't provide. They thread the carry bit across limbs, giving multi-precision shifts: RORC from the high limb down for a right shift, and for a left shift either ROLC from the low limb up, or — neatly — an ADDC self-add chain (§14.8), since shifting a number left by one is adding it to itself.
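A C model of the two rotates and the limb orderings they imply. As in the other G12 sketches, the carry is a pointer argument standing in for xcarry, and the function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of ROLC: carry-in enters bit 0, old bit 63 becomes the new carry. */
static uint64_t rolc(uint64_t rs1, unsigned *carry)
{
    uint64_t rd = (rs1 << 1) | (uint64_t)*carry;
    *carry = (unsigned)(rs1 >> 63);
    return rd;
}

/* Model of RORC: carry-in enters bit 63, old bit 0 becomes the new carry. */
static uint64_t rorc(uint64_t rs1, unsigned *carry)
{
    uint64_t rd = (rs1 >> 1) | ((uint64_t)*carry << 63);
    *carry = (unsigned)(rs1 & 1u);
    return rd;
}

/* Right shift by one over n little-endian limbs: high limb first. */
static void bignum_shr1(uint64_t *a, size_t n)
{
    unsigned carry = 0;
    for (size_t i = n; i-- > 0; )
        a[i] = rorc(a[i], &carry);
}

/* Left shift by one over n little-endian limbs: low limb first. */
static void bignum_shl1(uint64_t *a, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++)
        a[i] = rolc(a[i], &carry);
}
```

The direction of traversal is the important detail: each limb must be processed after the limb that supplies its incoming bit.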
14.5 Carry management — clear, set, test
All three are standard CSR operations on xcarry; G12 adds no instructions for them.
csrrci x0, xcarry, 1 ; CLEAR carry (start of an add chain)
csrrsi x0, xcarry, 1 ; SET carry (rarely needed)
csrr t0, xcarry ; TEST: carry → t0[0]
Clearing is the common case: an add chain must begin with carry known-zero. A plain ADD cannot stand in for limb 0, because only the four G12 instructions and CSR accesses write xcarry, so its carry-out would be lost. Setting is a Z80-ism (SCF) that borrow chains rarely need. Testing reads the bit into a GPR; you then branch with an ordinary register branch:
csrr t0, xcarry
bnez t0, overflow ; standard branch — there is no branch-on-carry
There is deliberately no branch-on-carry instruction. The EE has no flag-branch paradigm (that is what ee_xcond's register-relative predication replaces), and routing the test through a GPR keeps it that way. The cost is one extra instruction over a fused BC/BNC, but branches do not dual-issue in any case, so the two-instruction form is the same throughput.
14.6 Predication
ADDC, SUBC, ROLC and RORC are R-type and inherit Xcond's wide-mode PRED-EN gate (bit 35) like the rest of Xmath (§1.4). The one corner worth stating: when the predicate is false, the instruction writes neither rd nor xcarry — the carry bit is treated as a second destination and is left unchanged, so a predicated-off limb does not silently corrupt the chain.
14.7 Microarchitecture
- Single-cycle. All four ops are combinational ALU operations, 1-cycle throughput, like add/sub.
- Serial by nature. Each carry-consuming op depends on the previous carry-producer through xcarry. This is intrinsic to multi-precision arithmetic, not added serialisation.
- Dual-issue. The decoder will not co-issue two carry-touching operations in the same cycle: the four arithmetic ops plus any csr access to xcarry are all carry readers/writers forming a read-modify-write on one bit. Because carry ops are rare relative to general ALU work, the lost pairing is negligible, and crucially it is confined to this family: it does not impose the broad ALU-pairing penalty a general condition-code register would.
- Scoreboarding. xcarry is tracked as a one-bit scoreboarded resource (like a register): a reader stalls only if a prior carry-writer has not yet retired.
14.8 Examples
256-bit add (limbs in a0..a3 += b0..b3):
csrrci x0, xcarry, 1 ; carry = 0
ADDC a0, a0, b0 ; limb 0
ADDC a1, a1, b1 ; limb 1 + carry
ADDC a2, a2, b2
ADDC a3, a3, b3
csrr t0, xcarry ; final carry-out
bnez t0, overflow ; standard branch
256-bit left shift by 1 (ADDC self-add — the MSB of each limb falls into carry and enters the next):
csrrci x0, xcarry, 1
ADDC a0, a0, a0
ADDC a1, a1, a1
ADDC a2, a2, a2
ADDC a3, a3, a3
256-bit right shift by 1 (RORC from the high limb down, shifting in 0 at the top):
csrrci x0, xcarry, 1 ; 0 shifts into bit 255
RORC a3, a3
RORC a2, a2
RORC a1, a1
RORC a0, a0
14.9 eZX (Z80) lineage and assembler lowering
G12 is the EE-native landing zone for the eZX Xez carry instructions, so eZX source ports without a flag register. The assembler lowers:
| eZX / Xez | Lowers to (EE) |
|---|---|
| ADDC, SUBC | ADDC, SUBC (direct) |
| RL, RR (rotate through carry) | ROLC, RORC |
| RLC, RRC (rotate circular) | Zbb rol, ror |
| SCF / CCF / clear-carry idioms | csrrsi / csrr+toggle / csrrci on xcarry |
The eZX *F flag-setting shifts (SLAF/SRAF/SRLF) that also update Z/S/P are not reproduced: the EE has no condition-code register, so those lower to a plain shift plus, where a later test is actually needed, an explicit compare. Only the carry bit survives the port — it is the single flag whose cross-instruction dependency is intrinsic to the arithmetic rather than an ergonomic convenience.
15. Examples
15.1 Audio Synthesis: 128-Voice Mixer
Mixing 128 voices into a stereo int16 output buffer:
// Per-sample inner loop:
int32 mix_L = 0;
int32 mix_R = 0;
for (int v = 0; v < 128; v++) {
int32 sample = voices[v].sample();
mix_L += sample * voices[v].pan_left; // Q15 multiply-accumulate
mix_R += sample * voices[v].pan_right;
}
int16 out_L = SAT.H(mix_L >> 15);
int16 out_R = SAT.H(mix_R >> 15);
With Xmath:
- The MAC pair → 2× MADD instructions per voice = 256 cycles per sample
- SAT.H × 2 = 2 cycles per sample
- Voice generation (sample()) = ~10 cycles per voice = 1280 cycles per sample
- Total: ~1540 cycles per stereo sample
At 48 kHz: 1540 × 48000 = 74M cycles/sec = 19% of CPU at 380 MHz.
Without Xmath:
- MAC pair = 4 cycles per voice = 512 cycles per sample
- SAT.H = ~10 cycles each = 20 cycles per sample
- Voice generation unchanged
- Total: ~1810 cycles per stereo sample
At 48 kHz: 1810 × 48000 ≈ 87M cycles/sec = ~23% of CPU. Xmath saves ~3.5% of CPU on this workload.
For voices with FMA-heavy synthesis (FM synthesis, wavetable interpolation with cubic), the savings climb to ~10–15% of CPU at 128 voices.
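A compilable version of the mixer inner loop, as a sketch. The voice record is a hypothetical layout (one current sample plus Q15 pan gains), sat_h stands in for SAT.H, and the right shift of a negative accumulator assumes an arithmetic shift, which mainstream compilers provide.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for SAT.H: clamp to the int16 range. */
static int16_t sat_h(int64_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical voice record: current sample plus Q15 pan gains. */
typedef struct { int16_t sample, pan_left, pan_right; } voice;

/* Mix n voices into one stereo frame. Each product is the MADD in the
   listing; 64-bit accumulators leave ample headroom for 128 voices. */
static void mix_frame(const voice *v, size_t n, int16_t *out_l, int16_t *out_r)
{
    int64_t mix_l = 0, mix_r = 0;
    for (size_t i = 0; i < n; i++) {
        mix_l += (int32_t)v[i].sample * v[i].pan_left;
        mix_r += (int32_t)v[i].sample * v[i].pan_right;
    }
    /* Q15 -> integer; assumes arithmetic right shift for negative values. */
    *out_l = sat_h(mix_l >> 15);
    *out_r = sat_h(mix_r >> 15);
}
```

Two loud voices panned hard left overflow the int16 range and are clamped to 32767 instead of wrapping, which is the audible difference between saturating and modular mix-down.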
15.2 3D Vertex Transform
Transform a vertex by a 4×4 matrix (matrix-times-vector):
// out = M × v
out.x = DOT4.S(m_row0, v); // 8 cycles
out.y = DOT4.S(m_row1, v);
out.z = DOT4.S(m_row2, v);
out.w = DOT4.S(m_row3, v);
// Total: 32 cycles per vertex
Without Xmath (vanilla RV64GD with FMA):
- Each row: 4 FMADD.S = 4 × 4 = 16 cycles (no FMA parallelism in scalar)
- 4 rows: 64 cycles
Xmath = 2× speedup on per-vertex transform. For a scene of 50,000 vertices/frame: 50000 × 32 = 1.6M cycles per frame for transform. At 60 fps that is 96M cycles/sec, or ~25% of CPU at 380 MHz (against ~50% without Xmath), so the 2× saving is what leaves headroom for the rest of the frame.
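In C terms the transform is four dot products against the matrix rows. The sketch below assumes a row-major float[4][4] matrix and uses dot4 as a stand-in for DOT4.S; names and layout are illustrative.

```c
/* Stand-in for DOT4.S. */
static float dot4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* out = M * v for a row-major 4x4 matrix: one DOT4 per output component.
   out must not alias v, since v is re-read for every row. */
static void mat4_mul_vec4(float out[4], const float m[4][4], const float v[4])
{
    for (int r = 0; r < 4; r++)
        out[r] = dot4(m[r], v);
}
```

Row-major storage matters here: each row is then a contiguous 4-tuple, which matches the consecutive-register tuple convention of §8.1.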
15.3 Sprite Rotation (BAM-Based)
A classic 2D sprite rotation:
// Per-frame angle update + rotation matrix build:
angle_bam += turn_rate; // 1 cycle
FSINCOSBAM.S sin_a, angle_bam; // 3 cycles
// matrix: [[cos_a, -sin_a], [sin_a, cos_a]]
// 4 element references = 4 register copies, ~4 cycles
// Per-pixel rotation: (apply inverse rotation to source coordinate)
src_x = FMADD.S(out_x, cos_a, FMUL.S(out_y, sin_a)); // out_x*cos_a + out_y*sin_a, 4 cycles
src_y = FMSUB.S(out_y, cos_a, FMUL.S(out_x, sin_a)); // out_y*cos_a - out_x*sin_a, 4 cycles
Plus pixel sampling. For a 128×128 sprite rotating per frame at 60 Hz:
- Per-pixel cost: ~12 cycles
- 128² × 12 = 196K cycles per sprite
- At 60 Hz: 60 × 196K = 11.8M cycles/sec = ~3% of CPU per rotated sprite
Without Xmath (FMA-based, library trig):
- Per-pixel cost: ~25 cycles
- ~6% of CPU per rotated sprite
Xmath = 2× speedup on rotation effects. Aggregate over many rotating sprites becomes substantial.
15.4 Vector Normalisation in Lighting
Phong lighting requires normalising every interpolated normal per pixel:
// Per-pixel normalisation:
float lensq = LENSQ3.S(n.xyz); // 6 cycles
float inv_len = FRSQRT.S(lensq); // 3 cycles
n.x *= inv_len;
n.y *= inv_len;
n.z *= inv_len; // 3 cycles
// Total: 12 cycles per pixel
Without Xmath:
// Per-pixel:
float lensq = n.x*n.x + n.y*n.y + n.z*n.z; // ~3-4 cycles (3 FMUL + 2 FADD, or 1 FMUL + 2 FMADD)
float len = sqrt(lensq); // ~20 cycles (software sqrt)
float inv_len = 1.0f / len; // ~15 cycles (FDIV)
n.x *= inv_len; n.y *= inv_len; n.z *= inv_len; // 3 cycles
// Total: ~41 cycles per pixel
Xmath = ~3.4× speedup on per-pixel normalisation — critical for Phong-shaded or normal-mapped renderers (when run on the CPU; dedicated drawing hardware in Ant64 typically handles per-pixel work).
15.5 Collision Detection: Ray vs Triangle (Möller-Trumbore)
Testing a ray against a triangle is the fundamental primitive for picking, raycasting bullets, line-of-sight checks, and BSP traversal:
// Möller-Trumbore ray-triangle test using G6 + G7
edge1 = VSUB3(v1, v0); // 3 cycles
edge2 = VSUB3(v2, v0); // 3 cycles
h = CROSS3(ray_dir, edge2); // 10 cycles
a = DOT3(edge1, h); // 6 cycles
if (fabs(a) < EPSILON) return false; // 2 cycles
f = FRECIP.S(a); // 3 cycles
s = VSUB3(ray_origin, v0); // 3 cycles
u = f * DOT3(s, h); // 6 + 1 = 7 cycles
if (u < 0 || u > 1) return false;
q = CROSS3(s, edge1); // 10 cycles
v = f * DOT3(ray_dir, q); // 7 cycles
if (v < 0 || u + v > 1) return false;
t = f * DOT3(edge2, q); // 7 cycles
return (t > EPSILON);
// Total per ray-triangle test: ~60 cycles
vs without Xmath (using only standard RV64GD):
// Each VSUB3 → 3 FSUB (3 cycles)
// Each CROSS3 → 6 FMUL + 3 FSUB + register juggling (~15 cycles)
// Each DOT3 → 3 FMUL + 2 FADD or 2 FMA + FMUL (~6 cycles, similar)
// FRECIP → FDIV (~15 cycles)
// Total: ~110 cycles per ray-triangle test
~1.8× speedup on ray-triangle tests. A 3D game raycasting against ~500 triangles per query saves about 50 cycles per triangle, or roughly 25K cycles (~65 µs at 380 MHz) per query on FireStorm.
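For reference, the same Möller-Trumbore test can be written as a portable C sketch. The helper names (`v_sub`, `v_cross`, `v_dot`, `ray_triangle`) and the epsilon value are hypothetical choices for illustration. Each helper corresponds to one bundle in the listing above (VSUB3, CROSS3, DOT3), and the plain division stands in for FRECIP.S.

```c
#include <math.h>
#include <stdbool.h>

typedef struct { float x, y, z; } vec3;

vec3 v_sub(vec3 a, vec3 b) {                       /* role of VSUB3 */
    return (vec3){ a.x - b.x, a.y - b.y, a.z - b.z };
}
vec3 v_cross(vec3 a, vec3 b) {                     /* role of CROSS3 */
    return (vec3){ a.y * b.z - a.z * b.y,
                   a.z * b.x - a.x * b.z,
                   a.x * b.y - a.y * b.x };
}
float v_dot(vec3 a, vec3 b) {                      /* role of DOT3 */
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns true and writes the hit distance to *t_out when the ray hits. */
bool ray_triangle(vec3 origin, vec3 dir, vec3 v0, vec3 v1, vec3 v2, float *t_out) {
    const float eps = 1e-6f;                       /* assumed tolerance */
    vec3 edge1 = v_sub(v1, v0);
    vec3 edge2 = v_sub(v2, v0);
    vec3 h = v_cross(dir, edge2);
    float a = v_dot(edge1, h);
    if (fabsf(a) < eps) return false;              /* ray parallel to triangle */
    float f = 1.0f / a;                            /* stand-in for FRECIP.S */
    vec3 s = v_sub(origin, v0);
    float u = f * v_dot(s, h);
    if (u < 0.0f || u > 1.0f) return false;
    vec3 q = v_cross(s, edge1);
    float v = f * v_dot(dir, q);
    if (v < 0.0f || u + v > 1.0f) return false;
    float t = f * v_dot(edge2, q);
    if (t <= eps) return false;                    /* hit is behind the origin */
    *t_out = t;
    return true;
}
```

With the triangle (0,0,0), (1,0,0), (0,1,0) and a ray from (0.25, 0.25, -1) along +z, the function reports a hit at t = 1.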
15.6 Pathfinding: A* with OCTILE2 Heuristic
A* on a 1000×1000 grid with typical agent paths expanding ~3000 nodes:
// Inner expansion loop
while (!open_set.empty()) {
node = open_set.pop_min_f();
if (node == goal) return reconstruct_path();
for (neighbour in node.successors()) { // 8 neighbours
new_g = node.g + cost(node, neighbour); // 2 cycles
if (new_g < neighbour.g) {
neighbour.parent = node;
neighbour.g = new_g;
neighbour.h = OCTILE2(neighbour, goal); // 3 cycles
neighbour.f = new_g + neighbour.h; // 1 cycle
open_set.update_or_insert(neighbour); // ~20 cycles (heap)
}
}
}
Per-node cost with G10: ~30 cycles + heap. Per-pathfind cost (3000 nodes × 30) = 90K cycles.
Without G10 (manual heuristic): ~50 cycles + heap = 150K cycles per pathfind.
~1.7× speedup on A* throughput. A combat encounter triggering 20 AI pathfinds per second costs about 1.8M cycles/sec (~0.5% of CPU at 380 MHz) instead of about 3M cycles/sec (~0.8%).
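The "manual heuristic" in the comparison is the conventional octile distance, which can be sketched in C as below. This is an illustration under assumptions: the function name and the 10/14 fixed-point step costs are hypothetical, and the exact scaling and operand format of OCTILE2 are defined elsewhere in this specification and may differ.

```c
#include <stdlib.h>

/* Hypothetical fixed-point step costs: 10 per straight move,
 * 14 per diagonal move (about 10 * sqrt(2)). */
enum { COST_STRAIGHT = 10, COST_DIAGONAL = 14 };

/* Octile distance on an 8-connected grid: take as many diagonal
 * steps as possible, then finish with straight steps. */
int octile_distance(int x0, int y0, int x1, int y1) {
    int dx = abs(x1 - x0);
    int dy = abs(y1 - y0);
    int lo = dx < dy ? dx : dy;   /* number of diagonal steps */
    int hi = dx < dy ? dy : dx;   /* total steps along the longer axis */
    return COST_DIAGONAL * lo + COST_STRAIGHT * (hi - lo);
}
```

For example, from (0, 0) to (3, 4) the path is three diagonal steps plus one straight step, so the result is 3 × 14 + 1 × 10 = 52.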
15.7 AI Steering: Flocking Behaviour
Boids-style flocking, computing separation/alignment/cohesion forces for each agent:
// Per-agent flocking inner loop (against N nearby agents)
sep_total = VEC3(0, 0, 0);
ali_total = VEC3(0, 0, 0);
coh_total = VEC3(0, 0, 0);
for (other in nearby_agents) {
diff = VSUB3(self.pos, other.pos); // 3 cycles
dist_sq = LENSQ3(diff); // 6 cycles
if (dist_sq < separation_radius_sq) {
inv_dist = FRSQRT.S(dist_sq); // 3 cycles
force_mag = VSCALE3(diff, inv_dist); // 3 cycles
sep_total = VADD3(sep_total, force_mag); // 3 cycles
}
if (dist_sq < neighbour_radius_sq) {
ali_total = VADD3(ali_total, other.velocity); // 3 cycles
coh_total = VADD3(coh_total, other.pos); // 3 cycles
}
}
// Combine forces, integrate position with VMADD3
// Per-agent cost: ~50 cycles + per-neighbour work
Per-neighbour cost with G7: ~24 cycles (without separation), ~33 cycles (with). Without G7: ~50 cycles per neighbour (more individual FSUB / FMUL / FADD).
~1.6× speedup on flocking inner loop. A typical flock of 50 boids checking the nearest 8 neighbours each runs at ~10K cycles per frame at 60 fps — well under 1% of CPU.
15.8 Skeletal Animation: Character Bone Hierarchy
A typical humanoid character has 50–80 bones, each with a local rotation expressed as a quaternion. Per-frame update walks the hierarchy and composes local rotations with parent world rotations:
// Per-bone update (50 bones, 60 fps, 20 characters on screen)
for (bone in skeleton.bones_in_hierarchy_order) {
QMUL.S bone.world_quat, bone.parent.world_quat, bone.local_quat; // 8 cycles
QROT.S bone.world_forward, bone.world_quat, bone.local_forward; // 10 cycles
// Compute bone tip position via VMADD3 from parent + offset rotated by world_quat
VMADD3.S bone.world_pos, bone.parent.world_pos, bone.world_forward, bone.length; // 4 cycles
}
// Per-bone: ~22 cycles
// Per-character: 50 × 22 = 1100 cycles
// Per frame (20 characters): 22,000 cycles = ~0.35% of CPU at 380 MHz, 60 fps
Without G11 quaternion ops:
// QMUL via software: 16 FMUL + 12 FADD ≈ 28 cycles
// QROT via software: ~25 FMA chain ≈ 25 cycles
// Per-bone: ~57 cycles
// Per-character: 50 × 57 = 2850 cycles
// Per frame (20 characters): 57,000 cycles = ~0.9% of CPU at 380 MHz, 60 fps
~2.5× speedup on skeletal animation. At higher character counts (action games with 100+ characters), this becomes a substantial fraction of frame budget.
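To make the software cost concrete, here is a minimal C sketch of the two operations that QMUL and QROT replace. The `quat` layout (w first), the Hamilton-product convention and the function names are assumptions for illustration; the operand order and register layout of the real instructions are defined in the G11 section, not here.

```c
typedef struct { float w, x, y, z; } quat;   /* assumed layout: scalar part first */
typedef struct { float x, y, z; } vec3;

/* Hamilton product a * b: 16 multiplies and 12 additions/subtractions. */
quat q_mul(quat a, quat b) {
    quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}

/* Rotate v by a unit quaternion q using
 * t = 2 * (u x v), v' = v + w * t + u x t, where u = (q.x, q.y, q.z). */
vec3 q_rot(quat q, vec3 v) {
    float tx = 2.0f * (q.y * v.z - q.z * v.y);
    float ty = 2.0f * (q.z * v.x - q.x * v.z);
    float tz = 2.0f * (q.x * v.y - q.y * v.x);
    vec3 r;
    r.x = v.x + q.w * tx + (q.y * tz - q.z * ty);
    r.y = v.y + q.w * ty + (q.z * tx - q.x * tz);
    r.z = v.z + q.w * tz + (q.x * ty - q.y * tx);
    return r;
}
```

Rotating (1, 0, 0) by a quarter turn about the z axis (w = z ≈ 0.7071) yields approximately (0, 1, 0), and multiplying that quaternion by itself gives the half-turn quaternion (0, 0, 0, 1) up to rounding.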
15.9 Game Animation: Smooth UI Tween
A UI panel sliding in from off-screen over 0.3 seconds with eased motion:
// Per-frame
phase = (time - start_time) * 3.333; // 1 cycle
alpha = SMOOTHERSTEP.S(phase); // 5 cycles (clamps + quintic ease)
panel.x = LERP.S(start_x, end_x, alpha); // 3 cycles
// Total: 9 cycles per animated property per frame
For 50 simultaneously-animating UI elements: 450 cycles/frame at 60 fps = 27K cycles/sec = essentially free.
Without G9: ~25 cycles per element. For 50 elements: 1250 cycles/frame = 75K cycles/sec.
~2.8× speedup on UI animation. The saving is not large in absolute terms, but applied uniformly across every animated UI element it keeps the UI thread's overhead negligible.
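A software model of the tween is sketched below in C. It assumes the conventional smootherstep polynomial 6t⁵ − 15t⁴ + 10t³ on a phase clamped to [0, 1]; whether SMOOTHERSTEP.S matches this bit for bit is not stated here. The function names are hypothetical.

```c
/* Clamp t to the range [0, 1]. */
float clamp01(float t) {
    return t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
}

/* Conventional smootherstep: 6t^5 - 15t^4 + 10t^3, evaluated in Horner form. */
float smootherstep01(float t) {
    t = clamp01(t);
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

/* Linear interpolation between a and b, matching the role of LERP.S. */
float lerp_f32(float a, float b, float alpha) {
    return a + (b - a) * alpha;
}

/* Eased position of a property that moves from start to end
 * over duration_s seconds, evaluated elapsed_s seconds in. */
float tween(float start, float end, float elapsed_s, float duration_s) {
    float phase = elapsed_s / duration_s;
    return lerp_f32(start, end, smootherstep01(phase));
}
```

Halfway through the 0.3-second slide, `tween(100.0f, 200.0f, 0.15f, 0.3f)` lands at about 150, and any phase outside [0, 1] is pinned to the start or end value by the clamp.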
16. Implementation Notes
16.1 DSP Block Budget
| Group | DSP blocks | Notes |
|---|---|---|
| G1 — Integer MAC | 4 (shared with M-ext MUL) | Reuses the multiplier hardware with post-add via DSP block ALU |
| G2 — Saturating | 0 | Detection + clamp in fabric (~800 LUTs) |
| G3 — Min/Max/Sign/Abs | 0 | Combinational comparators (~300 LUTs) |
| G4 — FP Approximations | 2–4 per active unit | LUT-backed approximations, possibly with Newton-Raphson refinement |
| G5 — BAM Trigonometry | 1 (shared with G4) | Hardware modular reduction + table lookup |
| G6 — 3D Vector Bundles | 4–6 (shared with FP FMA) | Sequenced FMA chain through existing FMA unit |
| G7 — Vector Componentwise | 3–6 (shared with FP FMA) | 3 parallel FMA paths or one with serial completion |
| G8 — 2D Math Primitives | 2 (shared with G6) | Same FMA hardware as 3D bundles, one fewer lane |
| G9 — Game / Animation Math | 2 (shared with G4) | SMOOTHSTEP / SMOOTHERSTEP use the FP FMA chain; CLAMP / STEP are combinational |
| G10 — Distance Heuristics | 1 (integer multiplier) | OCTILE2's internal multiply uses one DSP block; others combinational |
| G11 — Quaternion Math | 6–8 (shared with FP FMA) | QMUL and QROT internally use the same FMA chain; need extra register-file write bandwidth |
| Total dedicated DSPs | ~10–15 | Most groups share with existing units |
Total Xmath area cost on Ant64: ~8,000–10,000 LUTs + 4 BSRAM blocks (for FP approximation tables) + ~10–15 dedicated DSP blocks beyond the existing M-extension and F/D extension hardware.
This fits comfortably on the GW5AST-138 (138K LUT, 298 DSPs), consuming a small fraction of the available fabric.
16.2 Implementation
Xmath is fully implemented on the GW5AST-138. The table below lists per-group coverage and the throughput of the bundled operations:
| Group | GW5AST-138 |
|---|---|
| G1 — Integer MAC | Full |
| G2 — Saturating | Full |
| G3 — Min/Max/Sign/Abs | Full |
| G4 — FP Approximations | Full (FP32 + FP64) |
| G5 — BAM Trigonometry | Full |
| G6 — 3D Vector Bundles | Full |
| G7 — Vector Componentwise | Full (parallel FMA paths) |
| G8 — 2D Math Primitives | Full |
| G9 — Game / Animation Math | Full |
| G10 — Distance Heuristics | Full |
| G11 — Quaternion Math | Full |
| Throughput on G6/G7 | 1 / 3 cycles per bundle |
| Throughput on G11 QMUL | 1 / 8 cycles |
All models implement all Xmath instructions with identical latencies and throughput — the fabric is the same GW5AST-138 everywhere, so there is no per-variant difference and software portability is automatic.
16.3 Power and Frequency
The FP approximation tables and modular-reduction logic are mostly static (LUTs) and have minimal switching activity — power impact is small. The DSP-block-backed integer MAC and FP FMA paths use the same DSP blocks as the M and F/D extensions, so their power impact is the marginal cost of additional operations through the same units.
All Xmath operations are clocked at the same 380 MHz BSRAM/CPU clock; no separate clock domain.
16.4 Composition with Scoreboarding
Xmath operations interact with the scoreboarding system (§15.1 of ee_cpu) the same way other multi-cycle operations do:
- An Xmath instruction issues to its functional unit and marks its destination register pending.
- Subsequent instructions continue executing on the main pipeline until one needs the pending result.
- This is especially useful for G4 (FP approximations) and G6 (vector bundles), whose 3–10 cycle latencies hide well behind subsequent independent work.
For G1 (integer MAC), the 2–3 cycle latency is already short, so scoreboarding mainly helps MAC chains in which the previous result is needed only a few instructions later.
17. Open Items
Items deferred to subsequent Xmath revisions:
- Half-precision (FP16) Xmath variants. FRECIP.H / FRSQRT.H / FSIN.H / FCOS.H / FSINCOS.H / FATAN2.H for the Zfh extension. Useful for shader-style code where FP16 precision suffices, and saves 50% on DSP block usage per FMA. Likely to land in v0.2 with FP16 support.
- Quad-precision (FP128) approximations. The same family for the Q extension, if implemented; deferred until Q lands.
- Saturating shift operations. SLLS, SRLS (saturating left/right shifts) are niche but useful for fixed-point DSP. Deferred to a future revision.
- Additional vector bundles.
  - MATMUL3x3 / MATMUL4x4 as full 3×3 / 4×4 matrix multiplies. Cost: substantial; these need many internal FMAs and may better suit a future V (vector) extension.
  - LERP3 / LERP4: vector linear interpolation (componentwise LERP on a 3- or 4-element tuple). Useful for skeletal animation blending.
  - SLERP: spherical linear interpolation for quaternions. Significant complexity; deferred.
- BAM-specific extensions.
  - FSINBAM_LU / FCOSBAM_LU: variants of the BAM trig that interpolate linearly between table entries, for cases that need bit-exact wraparound and slightly better accuracy than the 12-bit lookup table provides. v0.2 candidate.
  - BAMADD / BAMSUB: angle addition/subtraction with proper modular semantics. Trivial as integer ADD/SUB on BAM values, but a clear mnemonic might improve readability.
- Vector extension (V) revisited. If, post-implementation, a clear need emerges for SIMD-style data parallelism that Xmath does not address (e.g., bulk image filtering, mass voice mixing into wide vectors), V could be added in v0.3+. The OP-V opcode (0x57) is currently allocated to Xmath; if V is added later it will need a different opcode allocation, which is straightforward given that several custom-N spaces and one or two reserved standard slots remain free.
- FP exception handling. Xmath FP approximations do not raise IEEE-754 exception flags (they are approximations, not exact). Specifically: FRECIP of zero does not raise divide-by-zero; FRSQRT of negative does not raise invalid; FSIN/FCOS/FATAN2 of infinity return implementation-defined values. The fcsr register is unaffected by Xmath operations. This is consistent with similar approximation instructions on other architectures (SSE RCPSS, ARM FRECPE).
- Encoding finalisation. The funct7 / funct3 assignments in §2.3 are nominal; the exact bit-level encoding is pending finalisation in coordination with the rest of the FireStorm extensions and with toolchain implementation. This includes the G12 multi-precision ops (ADDC/SUBC/ROLC/RORC) in the funct3 = 000 lane.
- xcarry CSR address. 0x808 (§14.1) is suggested as the first free user-custom slot after the Xlate translator block at 0x800–0x807. Final assignment requires coordination with the other FireStorm extensions' CSR allocations, as flagged for Xstack, Xctx and Xlate. Also open: whether to widen G12 to multi-word rotate-by-N (currently rotate-by-1 only) and whether a future MULX / multiply-wide pairing with G1's MADDH would round out a complete bignum-multiply primitive set.
- Compiler intrinsics. Naming and conventions for C / Rust intrinsics for Xmath operations. Standard practice suggests names such as __builtin_firestorm_madd and __builtin_firestorm_fsincosbam, but exact names are TBD.
- Library integration. Whether libm's sin / cos / sqrt etc. should dispatch to Xmath approximations by default (faster, less accurate) or only via explicit intrinsics (preserves IEEE-754 contract for libm). Recommended: explicit intrinsics for Xmath; libm preserves strict semantics with software fallback.
18. Glossary
| Term | Meaning |
|---|---|
| BAM | Binary Angle Measure. An angle representation using a fixed-width integer (typically 16 or 32 bits) to span [0, 2π) with perfect modular wraparound. Native to fixed-point and retro-style code. |
| Fused MAC | Multiply-and-add performed in a single instruction with a single rounding step (FP) or no intermediate truncation (integer). |
| Saturating arithmetic | Arithmetic where overflow clamps to the maximum/minimum representable value rather than wrapping. Critical for audio (avoids audible distortion from wrap-around). |
| Vector bundle | A single instruction that executes a sequence of internal multiply-adds on a small fixed-size tuple of values. Not SIMD — sequential internal execution, but encoded as one instruction for density. |
| FRECIP / FRSQRT | Reciprocal and reciprocal-square-root FP approximations, typically with 0.05–0.1% relative error. Comparable to SSE RCPSS / RSQRTSS. |
| Register tuple | A consecutive group of registers (e.g., f5, f6, f7 for a DOT3 operand). Vector bundles read/write tuples; the instruction names only the starting register. |
| xcarry | The single-bit carry/borrow CSR (0x808, URW) used by the G12 multi-precision instructions (ADDC/SUBC/ROLC/RORC). The EE's only condition-code state; carried across instructions so that limb-by-limb integer arithmetic of any width is one instruction per limb. There is no branch-on-carry: the bit is read into a GPR and tested with a standard branch. |
| Approximation | An Xmath operation that produces a result correct to a stated precision target (typically 0.01% – 0.1%) but not IEEE-754-correct. Used where game-rendering precision suffices. |
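Three of the glossary entries (BAM, saturating arithmetic, and the carry chain behind xcarry) can be illustrated with a short C sketch. The function names and the 16-bit and 64-bit widths are assumptions chosen for the example, and the carry is modelled as a local variable rather than a CSR.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit BAM: 0x0000 is 0, 0x4000 is a quarter turn, 0x8000 is a half turn.
 * Addition wraps modulo 2^16, which is exactly modulo one full turn. */
uint16_t bam16_add(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}

/* Signed 16-bit saturating add: clamp instead of wrapping on overflow. */
int16_t add_sat16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Multi-limb add, least significant limb first. The local `carry`
 * plays the role that xcarry plays between successive ADDC instructions.
 * Returns the carry out of the top limb. */
unsigned add_limbs(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n) {
    unsigned carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = a[i] + carry;
        unsigned c1 = s < carry;      /* wrapped while adding the carry in */
        uint64_t t = s + b[i];
        unsigned c2 = t < s;          /* wrapped while adding the second limb */
        dst[i] = t;
        carry = c1 | c2;              /* at most one of the two can be set */
    }
    return carry;
}
```

For instance, three quarters of a turn plus half a turn (0xC000 + 0x8000) wraps to a quarter turn (0x4000), while `add_sat16(30000, 10000)` clamps to 32767 instead of wrapping negative.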