FireStorm Xmath Extension

1. Overview

The Xmath extension accelerates games, audio synthesis, fixed-point DSP, and retro-style demoscene code. Its instruction set captures the operations that dominate inner loops in those workloads: fused multiply-add, saturating arithmetic, min/max/sign/abs, fast transcendental approximations, BAM-based trigonometry, fixed-shape 3D vector math, and multi-precision integer arithmetic with carry.

Xmath replaces the previously-reserved RISC-V V (vector) extension space in FireStorm. V was originally reserved for a future implementation, but the actual workload mix FireStorm targets — per-voice audio synthesis, per-vertex 3D math, per-pixel image processing with non-trivial dependencies — benefits more from fast scalar fused ops than from multi-element data parallelism. V remains a possible v0.3+ addition if a clear need emerges, but it is no longer reserved at the encoding level.

1.1 Scope

Xmath provides ~64 instructions across twelve groups:

Group	Instructions	Purpose
G1 — Integer Fused MAC	MADD, MSUB, RMSUB, MADDH, MADDU, MADDW	Fixed-point math, FIR/IIR filters, integer dot products
G2 — Saturating Arithmetic	ADDS, SUBS, ADDSU, SUBSU, MULSAT, SAT.B, SAT.H, SAT.W	Audio mix-down, colour channel clamping
G3 — Min/Max/Sign/Abs	MIN, MAX, MINU, MAXU, ABS, SIGN	Clipping, bounding-box tests, branchless conditionals
G4 — FP Approximations	FRECIP.S/.D, FRSQRT.S/.D, FSIN.S/.D, FCOS.S/.D, FSINCOS.S/.D, FATAN2.S/.D	Reciprocals for perspective divide, normalisation, rotation, vector angle
G5 — BAM Trigonometry	FSINBAM.S/.D, FCOSBAM.S/.D, FSINCOSBAM.S/.D, FRAD2BAM, FBAM2RAD	Binary-angle measure trig — retro/demoscene-native, single-cycle modular reduction
G6 — 3D Vector Math Bundles	DOT3, DOT4, CROSS3, LENSQ3, LERP	3D vertex math, vector normalisation, interpolation
G7 — Vector Componentwise Bundles	VADD3, VSUB3, VSCALE3, VMADD3, VNORM3	Physics integration, steering, collision response, lighting normals
G8 — 2D Math Primitives	DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2	Navmesh pathfinding, 2D collision, raycaster setup, top-down games
G9 — Game / Animation Math	CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP	UI easing, shader uniforms, procedural animation, AI parameter clamping
G10 — Distance Heuristics	MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2	A* heuristics, grid-based pathfinding, broad-phase collision distance estimates
G11 — Quaternion Math	QMUL.S/.D, QROT.S/.D	Skeletal animation, camera orientation, character bone updates
G12 — Multi-Precision Integer	ADDC, SUBC, ROLC, RORC	Add-with-carry / borrow chains, 128-bit+ integer math, bignum shifts, crypto

1.2 Mode Availability

Group	Narrow Mode	Wide Mode
G1 — Integer Fused MAC	✓	✓
G2 — Saturating Arithmetic	✓	✓
G3 — Min/Max/Sign/Abs	✓	✓
G4 — FP Approximations	✓	✓
G5 — BAM Trigonometry	✓	✓
G6 — 3D Vector Math Bundles	✓ (with constraints on register tuple placement)	✓
G7 — Vector Componentwise Bundles	✓ (with constraints on tuple placement)	✓
G8 — 2D Math Primitives	✓ (with constraints on 2-tuple placement)	✓
G9 — Game / Animation Math	✓	✓
G10 — Distance Heuristics	✓	✓
G11 — Quaternion Math	✓ (4-tuple must fit in register file)	✓
G12 — Multi-Precision Integer	✓	✓

Every Xmath instruction is available in both narrow and wide modes. Wide mode buys access to the extended register file (x32–x63 / f32–f63) but does not unlock any additional Xmath operations. This means a vanilla RV64GC binary recompiled for FireStorm narrow gets the full Xmath acceleration; only access to the larger register file is wide-mode-specific.

1.3 Detection

Xmath presence is signalled in the FireStorm-specific mxfeatures CSR (0xFC0), bit 6:

Bit	Feature
0	Xcrisp
1	Xstack
2	Xcond
3	Xlate
4	Xctx
5	Xwide (always 1; mode is fetch-address driven)
6	Xmath

Software queries mxfeatures to detect Xmath at runtime; the compiler emits +xmath at build time. The riscv64-firestorm-elf target enables Xmath by default; the vanilla riscv64-unknown-elf target does not.

The G12 multi-precision group and its xcarry CSR (§14) are part of Xmath and are present whenever mxfeatures bit 6 is set; they do not carry a separate feature bit. (If a future implementation ever ships Xmath without G12, a capability bit can be added to a dedicated mxmath CSR at that point — the detection scheme leaves room for it.)
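
As a minimal sketch of the runtime check, assuming the mxfeatures value has already been read by platform code (the CSR read itself is not shown, and the helper name xm_has_xmath is hypothetical), the bit test could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in mxfeatures (CSR 0xFC0), copied from the table above. */
enum {
    MXFEATURES_XCRISP = 0,
    MXFEATURES_XSTACK = 1,
    MXFEATURES_XCOND  = 2,
    MXFEATURES_XLATE  = 3,
    MXFEATURES_XCTX   = 4,
    MXFEATURES_XWIDE  = 5,
    MXFEATURES_XMATH  = 6
};

/* Hypothetical helper: tests the Xmath bit in an already-read mxfeatures
   value. Reading the CSR (for example with a csrr from 0xFC0) is left to
   platform code and is an assumption of this sketch. */
static inline bool xm_has_xmath(uint64_t mxfeatures)
{
    return ((mxfeatures >> MXFEATURES_XMATH) & 1u) != 0;
}
```

Because G12 has no separate feature bit, the same test also covers the multi-precision group and xcarry.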

1.4 Composition with Other Extensions

Xmath composes cleanly with the rest of FireStorm:

Xcrisp (memory primitives): Xmath operates on registers; Xcrisp loads vectors / matrices into registers; the two are orthogonal and chain naturally. The Xcrisp B-tree primitives (BSRCH, BSCAN, BSHIFT — see §7 of ee_xcrisp) are entirely independent of Xmath but compose well in code that combines indexed structure traversal with math.
Xstack (BSRAM hardware stacks): Xmath has no special interaction with the stack; standard caller/callee-saved conventions apply.
Xctx (hardware contexts): Xmath state lives entirely in standard GPRs/FPRs and is saved/restored as part of normal context switching. The G12 xcarry bit (§14.1) is the one piece of non-GPR state; it is one bit folded into the per-context flag storage Xctx already swaps.
Xlate (memory translators): Xmath instructions are register-to-register and are not touched by translators — only the loads/stores that move operands in and out of registers are. This composes usefully with G12 multi-precision: limbs held big-endian in memory can be loaded through a byteswap read-translator so they arrive in host order (and written back through the matching write-translator — Xlate's involutory round-trip, ee_xlate §3.2). The xcarry CSR sits at 0x808, immediately after Xlate's translator-config block (0x800–0x807); it is plain CSR state accessed with standard CSR instructions, never a memory operand, so translators never apply to it. See ee_xlate §10.5.

Xcond Predication on Xmath R-Type Instructions

Wide mode only: every Xmath R-type instruction with an unused nibble bit 35 inherits Xcond's PRED-EN bit as a conditional-execution gate. When bit[35] = 1, the instruction executes conditionally based on the contents of the predicate register (mxcond_p); when bit[35] = 0, it executes unconditionally as described in the instruction's group section.

The conditional-execution semantics exactly match Xcond's R-type predication (§4 of ee_xcond): the instruction reads its operands and computes its result, but the result is written to rd only if the predicate condition holds. If the predicate is false, rd is left unchanged and any side effects (memory loads via vector bundles, flag updates) are suppressed.

The following table shows which Xmath instructions support predication in wide mode:

Group	Instructions	Predicable?	Notes
G1 — Integer Fused MAC	MADD, MSUB, RMSUB, MADDH, MADDU, MADDW	No	R4-type uses all 4 nibble bits for register extension (rd, rs1, rs2, rs3)
G2 — Saturating Arithmetic	ADDS, SUBS, ADDSU, SUBSU, MULSAT, SHIFTSAT.{B,H,W}	Yes	All standard R-type; bit 35 = PRED-EN
G2 — Width Saturation	SAT.B, SAT.H, SAT.W	Yes	Unary; 2 spare bits, 35 = PRED-EN
G3 — Min/Max/Sign/Abs	MIN, MAX, MINU, MAXU, ABS, SIGN	Yes	All standard R-type
G4 — FP Approximations	FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2	Yes	Unary FP; bit 34 = precision mode, bit 35 = PRED-EN
G5 — BAM Trigonometry	FSINBAM, FCOSBAM, FSINCOSBAM, FRAD2BAM, FBAM2RAD	Yes	Mixed integer→FP; bit 35 = PRED-EN
G6 — 3D Vector Bundles	DOT3, DOT4, CROSS3, LENSQ3, LERP	Yes	bit 35 = PRED-EN
G7 — Vector Componentwise	VADD3, VSUB3, VSCALE3, VMADD3, VNORM3	Yes	bit 35 = PRED-EN; per-bundle gate (all 3 element writes suppressed if false)
G8 — 2D Math Primitives	DOT2, LENSQ2, CROSS2, VADD2, VSUB2, VSCALE2, VNORM2	Yes	bit 35 = PRED-EN
G9 — Game / Animation Math	CLAMP, SMOOTHSTEP, SMOOTHERSTEP, STEP	Yes	bit 35 = PRED-EN
G10 — Distance Heuristics	MANHATTAN2/3, CHEBYSHEV2/3, OCTILE2	Yes	bit 35 = PRED-EN
G11 — Quaternion Math	QMUL, QROT	Yes	bit 35 = PRED-EN; per-bundle gate
G12 — Multi-Precision Integer	ADDC, SUBC, ROLC, RORC	Yes	bit 35 = PRED-EN; predicate false ⇒ `rd` unchanged and `xcarry` unchanged

Use cases for predicated Xmath:

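Conditional accumulate chains (G2): sparse FIR filters that skip zero coefficients without branches. G1 MADD is R4-type and not predicable, so the product is formed unconditionally and the saturating accumulate is predicated:

PCMP    zero, coef[i]               ; set predicate if coef != 0
MUL     prod, sample[i], coef[i]    ; unconditional product
ADDS.p  acc, acc, prod              ; predicated saturating accumulate (only if nonzero)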

Conditional vector normalisation (G7): normalise only if the input has non-zero length:

LENSQ3   len_sq, v
PCMP    fzero, len_sq               ; set predicate if length != 0
VNORM3.p normalised, v              ; only normalise if non-zero

Conditional bone update (G11): skip animation update for bones whose parent has no animation change:

PCMP    quat_zero, parent_delta
QMUL.p   world_quat, parent_quat, local_quat   ; only if parent moved

Conditional A* heuristic (G10): compute heuristic only for unvisited nodes:

PCMP    unvisited, visited_flag
OCTILE2.p h, neighbour, goal        ; skip for visited nodes

In assembly, the predicated form is written with a .p suffix (or by setting PRED-EN explicitly in the encoding). The compiler emits predicated forms automatically when it detects branch-condition patterns that match Xcond's predicate semantics.

Narrow mode: PRED-EN bit does not exist; all Xmath R-type instructions execute unconditionally. Code that requires predication must place itself in .text.wide.

2. Encoding

Xmath uses two opcodes:

Opcode	Standard meaning	Type	Purpose in FireStorm
`0x57`	OP-V (vector) — not implemented	R-type	All Xmath R-type instructions (G2–G12)
`0x6B`	reserved (no standard claim)	R4-type	Integer fused multiply-add family (G1)

FireStorm reallocates 0x57 because it does not implement the RISC-V Vector extension; Xmath's scalar fused operations capture most of the practical benefit V would provide for FireStorm's target workloads (games, audio synthesis, retro emulation, DSP) without V's implementation complexity. The 0x6B slot is genuinely reserved in the ISA — no standard extension claims it — and the R4 format slots in cleanly.

The standard FP opcodes are completely unchanged:

0x07 LOAD-FP, 0x27 STORE-FP — RV-F/D scalar floating-point loads and stores
0x43 FMADD, 0x47 FMSUB, 0x4B FNMSUB, 0x4F FNMADD — RV-F/D fused multiply-add (R4-type)
0x53 OP-FP — RV-F/D scalar FP ALU (FADD, FSUB, FMUL, FDIV, FSQRT, FMV, FCVT, FSGNJ, FCLASS, etc.)

Xmath does not touch any of these. The integer MADD at 0x6B sits alongside the FP FMA family (0x43–0x4F) at the architectural level — same R4 encoding format — but at its own opcode, so the two families decode independently with no conflict.

2.1 R4-Type Layout (0x6B)

Bit:    31    27 26 25 24    20 19    15 14  12 11   7 6     0
       ┌───────┬─────┬────────┬────────┬─────┬──────┬───────┐
       │  rs3  │ fmt │  rs2   │  rs1   │funct3│ rd  │ 0x6B │
       └───────┴─────┴────────┴────────┴─────┴──────┴───────┘

Field	Bits	Purpose
`rs3`	[31:27]	Third source register (added/subtracted operand)
`fmt`	[26:25]	Variant selector (see §3)
`rs2`	[24:20]	Second source register (multiplier)
`rs1`	[19:15]	First source register (multiplicand)
`funct3`	[14:12]	Operation selector
`rd`	[11:7]	Destination register
opcode	[6:0]	`0x6B`

In wide mode, the extension nibble provides bit 5 for each register field, giving 6-bit register indices.
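
The field packing above can be sketched in C. This is an illustrative encoder for the narrow-mode 32-bit word only; the function name xm_encode_r4 is hypothetical, and the funct3 and fmt values are passed in by the caller because this section does not fix their numbering.

```c
#include <stdint.h>

/* Packs the R4-type fields into a 32-bit instruction word at opcode 0x6B.
   Each field is masked to its width from the layout table:
   rs3[31:27], fmt[26:25], rs2[24:20], rs1[19:15], funct3[14:12], rd[11:7]. */
static inline uint32_t xm_encode_r4(unsigned rs3, unsigned fmt, unsigned rs2,
                                    unsigned rs1, unsigned funct3, unsigned rd)
{
    return ((uint32_t)(rs3    & 0x1Fu) << 27) |
           ((uint32_t)(fmt    & 0x03u) << 25) |
           ((uint32_t)(rs2    & 0x1Fu) << 20) |
           ((uint32_t)(rs1    & 0x1Fu) << 15) |
           ((uint32_t)(funct3 & 0x07u) << 12) |
           ((uint32_t)(rd     & 0x1Fu) <<  7) |
           0x6Bu;
}
```

The wide-mode extension nibble is deliberately left out; how bit 5 of each register index is carried is described elsewhere.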

2.2 R-Type Layout (0x57)

Bit:    31         25 24    20 19    15 14  12 11   7 6     0
       ┌────────────┬────────┬────────┬─────┬──────┬───────┐
       │   funct7   │  rs2   │  rs1   │funct3│ rd  │ 0x57 │
       └────────────┴────────┴────────┴─────┴──────┴───────┘

The standard R-type format with FireStorm conventions: 7-bit funct7 + 3-bit funct3 = 10 bits of operation space (1024 slots), far more than Xmath needs.

For instructions that consume an FP register, the convention follows standard RV F/D: rs1 and rs2 are FP register fields when the instruction is FP-typed, integer register fields when integer-typed. The funct7 low bits typically encode the source/destination format (00 for FP32, 01 for FP64, etc., mirroring the F/D fmt field in funct7[1:0]).
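
To make the 10-bit operation space concrete, here is a small decoder sketch for the narrow-mode R-type word. The type and function names (xm_rtype, xm_decode_r, xm_r_slot) are hypothetical, and the slot numbering (funct7 in the high bits, funct3 in the low bits) is just one convenient way to index the 1024 slots, not something the specification mandates.

```c
#include <stdint.h>

/* Decoded fields of a standard R-type instruction word. */
typedef struct {
    unsigned funct7, rs2, rs1, funct3, rd, opcode;
} xm_rtype;

/* Splits a 32-bit word into its R-type fields. */
static inline xm_rtype xm_decode_r(uint32_t insn)
{
    xm_rtype f;
    f.funct7 = (insn >> 25) & 0x7Fu;
    f.rs2    = (insn >> 20) & 0x1Fu;
    f.rs1    = (insn >> 15) & 0x1Fu;
    f.funct3 = (insn >> 12) & 0x07u;
    f.rd     = (insn >>  7) & 0x1Fu;
    f.opcode = insn & 0x7Fu;
    return f;
}

/* Illustrative slot index in the range 0..1023 (7 + 3 bits). */
static inline unsigned xm_r_slot(xm_rtype f)
{
    return (f.funct7 << 3) | f.funct3;
}
```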

2.3 Encoding Allocation Summary

The 0x57 opcode has 7-bit funct7 + 3-bit funct3 = 1024 instruction slots. The funct3 field selects the operation family; funct7 distinguishes specific operations and (for FP) the format (.S/.D via funct7's low bits per RV F/D conventions).

funct3 in 0x57	Group	funct7 sub-allocation
`000`	G2 Saturating + G12 Multi-Precision	funct7[6:0] selects ADDS/SUBS/ADDSU/SUBSU/MULSAT (G2) and ADDC/SUBC/ROLC/RORC (G12) — G12 occupies free funct7 codes in this lane
`001`	G3 Min/Max + G10 Heuristics	funct7[6:5]=00 → MIN/MAX/MINU/MAXU; funct7[6:5]=01 → MANHATTAN/CHEBYSHEV/OCTILE
`010`	G3 Abs/Sign + G2 SAT.x	funct7 selects ABS/SIGN/SAT.B/SAT.H/SAT.W
`011`	G4 FP unary + G9 Game Math	funct7 high bits select group (G4 or G9), low bits select operation and .S/.D
`100`	G4 FP binary + G9 Game Math binary	funct7 selects FSINCOS/FATAN2/CLAMP/SMOOTHSTEP/etc.
`101`	G5 BAM	funct7 selects FSINBAM/FCOSBAM/etc., .S/.D
`110`	G6+G7+G8+G11 — single result	funct7 selects which bundle (DOT3, DOT4, DOT2, LENSQ2/3, LERP, VNORM2/3, QMUL); .S/.D via funct7 low bits
`111`	G6+G7+G8+G11 — multi result	funct7 selects CROSS3, CROSS2, VADD/SUB/SCALE/MADD (3D and 2D), QROT

fmt in 0x6B (R4-type)	Group	Variant
`00`	G1	MADD / MSUB / RMSUB (signed, low 64 bits, selected by funct3)
`01`	G1	MADDH / MADDU (signed/unsigned high 64 bits, selected by funct3)
`10`	G1	MADDW / MSUBW (32-bit signed, sign-extended to 64)
`11`	reserved	future expansion

The total used encoding space is approximately 64 instructions out of the available 1024 slots in 0x57, plus the G1 fused multiply-add family in 0x6B, leaving substantial headroom for future Xmath additions. The G12 multi-precision ops (ADDC, SUBC, ROLC, RORC) are integer R-type and take nominal funct7 codes in the funct3 = 000 lane alongside the saturating-arithmetic family; the carry bit they read and write is architectural state (the xcarry CSR, §14), not an encoding field, so it costs nothing in opcode space.

3. Group 1: Integer Fused Multiply-Add

All G1 instructions use the R4-type encoding (opcode 0x6B), parallel in format to the FP FMADD family (which lives at 0x43/0x47/0x4B/0x4F). Latency: 2–3 cycles (DSP-backed). Throughput: 1 result per cycle (fully pipelined).

3.1 MADD — Multiply-Add

MADD rd, rs1, rs2, rs3
rd = (rs1 × rs2)[63:0] + rs3

Computes the low 64 bits of rs1 × rs2, adds rs3. The standard form for fixed-point DSP inner loops:

// FIR filter inner loop
acc = acc + sample[i] * coef[i];   // → MADD acc, sample[i], coef[i], acc

Two instructions become one. On FireStorm, MADD has the same latency as MUL alone: the DSP block accumulator absorbs the add at no extra cost. This is roughly 2× faster per tap on integer FIR inner loops in steady state (see §3.7).

3.2 MSUB — Multiply-Subtract

MSUB rd, rs1, rs2, rs3
rd = (rs1 × rs2)[63:0] - rs3

3.3 RMSUB — Reverse Multiply-Subtract

RMSUB rd, rs1, rs2, rs3
rd = rs3 - (rs1 × rs2)[63:0]

Useful for accumulator decrements and for computing c - a*b (common in error-correction codes).
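
The three low-half operations can be modelled in portable C. This is a reference sketch rather than the hardware definition: the xm_ names are hypothetical, and unsigned arithmetic is used so that the wrap to bits [63:0] is well defined.

```c
#include <stdint.h>

/* rd = (rs1 * rs2)[63:0] + rs3 */
static inline uint64_t xm_madd(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    return rs1 * rs2 + rs3;
}

/* rd = (rs1 * rs2)[63:0] - rs3 */
static inline uint64_t xm_msub(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    return rs1 * rs2 - rs3;
}

/* rd = rs3 - (rs1 * rs2)[63:0] */
static inline uint64_t xm_rmsub(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    return rs3 - rs1 * rs2;
}

/* FIR accumulation written as a chain of MADDs, one per tap.
   The low 64 bits are identical for signed and unsigned operands,
   so signed samples are routed through the unsigned model. */
static inline int64_t xm_fir(const int64_t *sample, const int64_t *coef, int taps)
{
    uint64_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc = xm_madd((uint64_t)sample[i], (uint64_t)coef[i], acc);
    return (int64_t)acc;
}
```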

3.4 MADDH — Multiply-High Add

MADDH rd, rs1, rs2, rs3
rd = (rs1 × rs2)[127:64] + rs3

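The high 64 bits of the signed 128-bit product, plus an add. This is the Q1.63 fixed-point multiply-add primitive: multiply two Q1.63 values, take the high half, and add it to the accumulator. Strictly, the high half of the Q2.126 product is a Q2.62 value (the Q1.63 product scaled by 1/2), so a one-bit left shift is needed where exact Q1.63 scaling matters. Standard pattern for audio gain stages, normalised fixed-point integration, and fractional rotation.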

3.5 MADDU — Multiply-High Add (Unsigned)

MADDU rd, rs1, rs2, rs3
rd = ((u64)rs1 × (u64)rs2)[127:64] + rs3

Unsigned variant of MADDH for hash computation, modular arithmetic, big-integer multiplies.

3.6 MADDW — 32-bit Multiply-Add (Sign-Extended)

MADDW rd, rs1, rs2, rs3
rd = sext64((rs1[31:0] × rs2[31:0])[31:0] + rs3[31:0])

The W (32-bit operand) variant. Result is sign-extended to 64 bits per standard RV64 conventions. Useful when the application is genuinely 32-bit fixed-point.
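
A reference model of the high-half and 32-bit variants, written as a sketch under two assumptions: the compiler provides the __int128 extension (GCC and Clang do on 64-bit targets), and conversions to signed types wrap in the usual two's-complement way. The xm_ names are hypothetical.

```c
#include <stdint.h>

/* MADDH: high 64 bits of the signed 128-bit product, plus rs3. */
static inline uint64_t xm_maddh(int64_t rs1, int64_t rs2, uint64_t rs3)
{
    __int128 p = (__int128)rs1 * rs2;
    return (uint64_t)((unsigned __int128)p >> 64) + rs3;
}

/* MADDU: high 64 bits of the unsigned 128-bit product, plus rs3. */
static inline uint64_t xm_maddu(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    unsigned __int128 p = (unsigned __int128)rs1 * rs2;
    return (uint64_t)(p >> 64) + rs3;
}

/* MADDW: 32-bit multiply-add on the low words, sign-extended to 64 bits. */
static inline int64_t xm_maddw(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    uint32_t r = (uint32_t)rs1 * (uint32_t)rs2 + (uint32_t)rs3;
    return (int64_t)(int32_t)r;
}
```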

3.7 Performance

On Ant64 (DSP-backed MUL):

Operation pattern	Standard RV64	Xmath G1	Speedup
FIR filter inner loop (per tap)	4 cycles (MUL + ADD + load + bounds)	2 cycles (load + MADD)	2×
Fixed-point Q31 dot product	3 cycles (MUL + ADD + shift)	1 cycle (MADDH)	3×
Karatsuba multiply outer	8 cycles per chunk	5 cycles	1.6×
4-tap polynomial evaluation	8 cycles	4 cycles	2×

4. Group 2: Saturating Arithmetic

Saturating operations clamp results to the maximum or minimum representable value of the type on overflow, rather than wrapping. This is critical for audio mixing: a sum that exceeds the int16 range must clamp to −32768 or +32767, because wrapping to the opposite sign produces audible distortion.

All G2 instructions are R-type at opcode 0x57, funct3 000 (for add/sub/mul) and 010 (for SAT.x). 1-cycle latency. 1/cycle throughput.

4.1 ADDS — Signed Saturating Add

ADDS rd, rs1, rs2
rd = clamp(rs1 + rs2, INT64_MIN, INT64_MAX)

If rs1 + rs2 overflows the 64-bit signed range, the result is clamped to INT64_MAX (on positive overflow) or INT64_MIN (on negative overflow). The standard 2's-complement wrap behaviour is replaced with explicit saturation.

4.2 SUBS — Signed Saturating Subtract

SUBS rd, rs1, rs2
rd = clamp(rs1 - rs2, INT64_MIN, INT64_MAX)

4.3 ADDSU — Unsigned Saturating Add

ADDSU rd, rs1, rs2
rd = min((u128)(u64)rs1 + (u128)(u64)rs2, UINT64_MAX)

Caps at 0xFFFF_FFFF_FFFF_FFFF on overflow (the carry-out is converted to "stay at max").

4.4 SUBSU — Unsigned Saturating Subtract

SUBSU rd, rs1, rs2
rd = max((s128)(u64)rs1 - (s128)(u64)rs2, 0)

Caps at 0 on underflow (clamps negative results to zero).

4.5 MULSAT — Signed Saturating Multiply

MULSAT rd, rs1, rs2
rd = clamp(rs1 × rs2, INT64_MIN, INT64_MAX)

The full 128-bit product is computed; if it doesn't fit in 64 bits signed, the result clamps. Useful for fixed-point gain stages where overflow protection matters more than precision.
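
The five saturating operations of §4.1 to §4.5 can be modelled as follows. This is a sketch, not the hardware definition: the xm_ names are hypothetical, the add/subtract models test for overflow before it can happen (signed overflow is undefined in C), and xm_mulsat assumes the __int128 compiler extension.

```c
#include <stdint.h>

/* ADDS: signed add, clamped to [INT64_MIN, INT64_MAX]. */
static inline int64_t xm_adds(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}

/* SUBS: signed subtract, clamped to [INT64_MIN, INT64_MAX]. */
static inline int64_t xm_subs(int64_t a, int64_t b)
{
    if (b < 0 && a > INT64_MAX + b) return INT64_MAX;
    if (b > 0 && a < INT64_MIN + b) return INT64_MIN;
    return a - b;
}

/* ADDSU: unsigned add; a carry-out becomes UINT64_MAX. */
static inline uint64_t xm_addsu(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return (s < a) ? UINT64_MAX : s;
}

/* SUBSU: unsigned subtract; a borrow becomes 0. */
static inline uint64_t xm_subsu(uint64_t a, uint64_t b)
{
    return (a > b) ? a - b : 0;
}

/* MULSAT: full 128-bit signed product, clamped to 64 bits. */
static inline int64_t xm_mulsat(int64_t a, int64_t b)
{
    __int128 p = (__int128)a * b;
    if (p > INT64_MAX) return INT64_MAX;
    if (p < INT64_MIN) return INT64_MIN;
    return (int64_t)p;
}
```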

4.6 SAT.B — Saturate to Int8

SAT.B rd, rs1
rd = clamp(rs1, -128, +127)

Single-instruction clamp to signed 8-bit range. Used for colour channel output (clamp computed pixel value to byte range) and audio downsampling (clamp 16-bit sample to 8-bit for low-bit-depth output).

4.7 SAT.H — Saturate to Int16

SAT.H rd, rs1
rd = clamp(rs1, -32768, +32767)

The audio-output workhorse. Mix N voices into a 32-bit accumulator, then SAT.H to clamp to the 16-bit DAC range:

// Audio mixer output stage
int32 mix = 0;
for (int v = 0; v < 128; v++) mix += voices[v].sample;
int16 output = SAT.H(mix);     // → 1 instruction, no branch

Standard RV64 alternative: 3-4 instructions with branches or a 5-instruction branchless if-then-else sequence. 5× speedup on the output stage of a 128-voice mixer.

4.8 SAT.W — Saturate to Int32

SAT.W rd, rs1
rd = clamp(rs1, INT32_MIN, INT32_MAX)

For applications that genuinely use 32-bit fixed-point output (less common than .H but still useful).
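
A reference sketch of the three width-saturation operations and the mixer output stage they serve. The xm_ names are hypothetical; the mixer assumes the voice count is small enough that the 32-bit accumulator cannot itself overflow.

```c
#include <stdint.h>

/* Shared clamp used by the three SAT.x models. */
static inline int64_t xm_clamp64(int64_t x, int64_t lo, int64_t hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

static inline int64_t xm_sat_b(int64_t x) { return xm_clamp64(x, INT8_MIN,  INT8_MAX);  }
static inline int64_t xm_sat_h(int64_t x) { return xm_clamp64(x, INT16_MIN, INT16_MAX); }
static inline int64_t xm_sat_w(int64_t x) { return xm_clamp64(x, INT32_MIN, INT32_MAX); }

/* Sum n 16-bit voices into a 32-bit accumulator, then clamp once (SAT.H). */
static inline int16_t xm_mix(const int16_t *voices, int n)
{
    int32_t mix = 0;
    for (int v = 0; v < n; v++)
        mix += voices[v];
    return (int16_t)xm_sat_h(mix);
}
```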

4.9 SHIFTSAT.B / SHIFTSAT.H / SHIFTSAT.W — Combined Arithmetic Shift + Saturate

SHIFTSAT.B rd, rs1, #imm5     # rd = clamp(rs1 >> imm5, -128, +127)
SHIFTSAT.H rd, rs1, #imm5     # rd = clamp(rs1 >> imm5, -32768, +32767)
SHIFTSAT.W rd, rs1, #imm5     # rd = clamp(rs1 >> imm5, INT32_MIN, INT32_MAX)

Performs an arithmetic right shift by a 5-bit immediate amount (0–31 bits), then saturates the result to the target width. Combines the two most common operations in fixed-point output stages into a single instruction.

The immediate field reuses the rs2 register slot of standard R-type encoding (otherwise unused for these unary operations); in wide mode the 5-bit rs2 field is available directly.

SHIFTSAT.X: opcode = 0x57, funct3 = 010, funct7[6:3] = 1010, funct7[2:1] = width selector,
            rs2 field = shift amount (5-bit)

Use case: the Q15 → int16 conversion universal in audio output:

// Mix 128 voices into a Q16.15 fixed-point accumulator (int32, 15 fractional bits)
// Then convert to int16 sample for DAC output
int32_t mix = ...;
int16_t sample = SHIFTSAT.H(mix, 15);    // >>15 then saturate to ±32767

Without SHIFTSAT: 3 instructions (SRA + SAT.H + handle edge case where shift overflowed). With SHIFTSAT: 1 instruction, 1 cycle.

For colour-channel processing in 8-bit-output rendering pipelines, SHIFTSAT.B combines the >>8 and clamp-to-byte step:

int8_t r = SHIFTSAT.B(red_acc, 8);   // accumulated in higher precision, output as signed 8-bit (-128..+127)

This is the universal "downsample to N bits with saturation" primitive that appears in every audio output stage and every fixed-point shader output.
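
The shift-then-clamp behaviour can be sketched in C. The xm_ names are hypothetical. Because right-shifting a negative value is implementation-defined in C, the arithmetic shift is spelled out explicitly.

```c
#include <stdint.h>

/* Arithmetic right shift that does not rely on the compiler's treatment
   of >> on negative values. */
static inline int64_t xm_sra(int64_t x, unsigned sh)
{
    return x >= 0 ? (x >> sh) : ~((~x) >> sh);
}

/* Generic SHIFTSAT model: shift by the 5-bit immediate, then clamp. */
static inline int64_t xm_shiftsat(int64_t x, unsigned imm5, int64_t lo, int64_t hi)
{
    int64_t s = xm_sra(x, imm5 & 31u);
    return s < lo ? lo : (s > hi ? hi : s);
}

static inline int8_t xm_shiftsat_b(int64_t x, unsigned imm5)
{
    return (int8_t)xm_shiftsat(x, imm5, INT8_MIN, INT8_MAX);
}

static inline int16_t xm_shiftsat_h(int64_t x, unsigned imm5)
{
    return (int16_t)xm_shiftsat(x, imm5, INT16_MIN, INT16_MAX);
}
```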

4.10 Performance Impact

For a 128-voice 48 kHz audio mixer:

Per-sample cost without SAT.H: ~10 cycles for clamp-and-write
Per-sample cost with SAT.H: ~2 cycles
At 48 kHz: ~80% reduction in mixer output overhead (10 → 2 cycles per sample)

For colour-channel processing (e.g., real-time per-pixel shading):

Per-channel cost without SAT.B: ~6 cycles
Per-channel cost with SAT.B: ~1 cycle
For RGBA processing at 1080p60: ~7% reduction in shading inner loop

5. Group 3: Min / Max / Sign / Abs

Single-cycle scalar conditional/sign operations. R-type at opcode 0x57. These overlap conceptually with the Zbb extension's MIN/MAX/MINU/MAXU; Xmath provides them whether or not Zbb is implemented, plus ABS and SIGN which Zbb does not include.

5.1 MIN / MAX (Signed)

MIN rd, rs1, rs2     # rd = (rs1 < rs2) ? rs1 : rs2  (signed)
MAX rd, rs1, rs2     # rd = (rs1 > rs2) ? rs1 : rs2  (signed)

5.2 MINU / MAXU (Unsigned)

MINU rd, rs1, rs2    # rd = (rs1 < rs2) ? rs1 : rs2  (unsigned)
MAXU rd, rs1, rs2    # rd = (rs1 > rs2) ? rs1 : rs2  (unsigned)

5.3 ABS — Absolute Value

ABS rd, rs1
rd = (rs1 < 0) ? -rs1 : rs1     # signed

Single instruction; standard RV64 needs 3 (compare + branch + negate, or branchless 2-step sequence). Distance calculations, Manhattan-distance heuristics, signal magnitude.

5.4 SIGN — Signum

SIGN rd, rs1
rd = (rs1 > 0) ? +1 : (rs1 < 0) ? -1 : 0

Single instruction; standard RV64 needs 4-5. Useful in branchless conditionals, gradient-direction code, dithering. Combined with MUL gives branchless select-on-sign patterns.
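
A reference sketch of ABS and SIGN, plus the select-on-sign pattern that SIGN combined with MUL enables. The xm_ names are hypothetical. The result of ABS for INT64_MIN is not stated in this section; the model below wraps it back to INT64_MIN, which is an assumption.

```c
#include <stdint.h>

/* ABS: negation is done in unsigned arithmetic to avoid signed-overflow
   undefined behaviour. Assumption: ABS(INT64_MIN) wraps to INT64_MIN. */
static inline int64_t xm_abs(int64_t x)
{
    return x < 0 ? (int64_t)(0 - (uint64_t)x) : x;
}

/* SIGN: +1, 0, or -1. */
static inline int64_t xm_sign(int64_t x)
{
    return (int64_t)(x > 0) - (int64_t)(x < 0);
}

/* Branchless select-on-sign: give magnitude the sign of x (0 if x is 0). */
static inline int64_t xm_apply_sign(int64_t magnitude, int64_t x)
{
    return xm_sign(x) * magnitude;
}
```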

5.5 Bounding-Box Test Example

// Standard RV64:
//   if (x < bbox_min) clamped_x = bbox_min;
//   else if (x > bbox_max) clamped_x = bbox_max;
//   else clamped_x = x;
//   ... 5+ instructions with branches

// With Xmath:
clamped_x = MIN(MAX(x, bbox_min), bbox_max);     // 2 cycles, no branches

Branchless = no branch-predictor pollution. For inner-loop bounding-box tests on thousands of objects per frame, this is meaningful.

6. Group 4: FP Approximations

Approximations for the transcendentals that game code uses frequently. These are not IEEE-754-correct; they trade precision for speed. Accuracy targets are inspired by SSE RCPSS / RSQRTSS: 0.05–0.1% relative error, which is far more than adequate for games.

All G4 instructions are R-type at opcode 0x57, funct3 011 (unary) or 100 (binary). Implementation uses table lookup + polynomial refinement (typically Newton-Raphson 1 iteration). Latency: 3 cycles. Throughput: 1/cycle.

6.1 FRECIP.S / FRECIP.D — Reciprocal

FRECIP.S fd, fs1     # fd ≈ 1 / fs1   (FP32, ~0.05% error)
FRECIP.D fd, fs1     # fd ≈ 1 / fs1   (FP64, ~0.05% error)

The perspective-divide workhorse for 3D graphics. Standard FP division is ~10–20 cycles; FRECIP is 3 cycles.

// Perspective divide
float inv_w = FRECIP.S(w);     // 3 cycles
float xs = x * inv_w;           // 1 cycle (MUL)
float ys = y * inv_w;
float zs = z * inv_w;
// Total: 6 cycles per vertex vs ~15-20 for FDIV-based version

If extra precision is needed, one Newton-Raphson iteration roughly squares the relative error (from ~5×10⁻⁴ to ~2.5×10⁻⁷) at 5 extra cycles; this is usually not needed for visual rendering.
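
The refinement step mentioned above is the standard Newton-Raphson update for a reciprocal. A minimal sketch, where x0 stands in for the FRECIP.S result and the function name is hypothetical:

```c
/* One Newton-Raphson step towards 1/a: x1 = x0 * (2 - a * x0).
   If x0 = (1/a) * (1 + e), then x1 = (1/a) * (1 - e*e), so the relative
   error is squared (before floating-point rounding). */
static inline float xm_refine_recip(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}
```

With a starting relative error of 5×10⁻⁴, one step leaves roughly 2.5×10⁻⁷, which is already close to the resolution of FP32.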

6.2 FRSQRT.S / FRSQRT.D — Reciprocal Square Root

FRSQRT.S fd, fs1     # fd ≈ 1 / √fs1  (FP32, ~0.1% error)
FRSQRT.D fd, fs1     # fd ≈ 1 / √fs1  (FP64, ~0.1% error)

The vector-normalisation workhorse. Famously hand-optimised in Quake III (the Q_rsqrt integer hack); FireStorm gives you a hardware instruction for the same operation.

// Vector normalisation
float len_sq = x*x + y*y + z*z;             // 3 cycles (with FMA)
float inv_len = FRSQRT.S(len_sq);           // 3 cycles
nx = x * inv_len; ny = y * inv_len; nz = z * inv_len;  // 3 cycles
// Total: 9 cycles vs ~30 for software fsqrt + fdiv

~3× speedup on lighting/normalisation inner loops — large on Phong-style or normal-mapped renderers.

6.3 FSIN.S / FSIN.D — Sine

FSIN.S fd, fs1     # fd ≈ sin(fs1)   (input in radians, ~0.01% error)
FSIN.D fd, fs1     # fd ≈ sin(fs1)   (FP64, ~0.01% error)

Range reduction is performed in hardware. Input may be any FP value; the implementation reduces modulo 2π using a high-precision constant. Output accuracy degrades for very large inputs (above ~10⁶) due to range-reduction precision loss; for game-typical angle values, accuracy is uniformly within 0.01%.

6.4 FCOS.S / FCOS.D — Cosine

FCOS.S fd, fs1     # fd ≈ cos(fs1)
FCOS.D fd, fs1     # fd ≈ cos(fs1)

6.5 FSINCOS.S / FSINCOS.D — Sine + Cosine (Pair)

FSINCOS.S fd, fs1
   ⇒ fd     ← sin(fs1)
     f(d+1) ← cos(fs1)

Returns both sine and cosine in one instruction. The destination is fd and fd+1 (the next FP register). For wide mode, fd and fd+1 may be any consecutive pair in f0–f63; in narrow mode, the pair must lie in f0–f30 (since f31 has no successor).

The single-instruction form is faster than two separate FSIN + FCOS instructions (~3 cycles total instead of 6) because the angle reduction and table lookup are shared. Essential for rotation-matrix construction:

// 2D rotation matrix
FSINCOS.S f1, theta;     // f1 = sin(theta), f2 = cos(theta) in 3 cycles
// matrix is then [[f2, -f1], [f1, f2]]
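
Applying that matrix to a point is then four multiplies and two adds. A small sketch, with s and c standing in for the two FSINCOS results (the type and function names are hypothetical):

```c
/* 2D point or vector. */
typedef struct { float x, y; } xm_vec2;

/* Rotates v by the angle whose sine is s and cosine is c, using the
   matrix [[c, -s], [s, c]]. */
static inline xm_vec2 xm_rotate2d(float s, float c, xm_vec2 v)
{
    xm_vec2 r = { c * v.x - s * v.y, s * v.x + c * v.y };
    return r;
}
```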

6.6 FATAN2.S / FATAN2.D — Arc-tangent with Quadrant

FATAN2.S fd, fs1, fs2     # fd ≈ atan2(fs1, fs2)
FATAN2.D fd, fs1, fs2     # fd ≈ atan2(fs1, fs2)

Full-circle arc-tangent (returns angle in [-π, +π]) given y (fs1) and x (fs2). Quadrant determined by sign of both inputs. Used for vector-to-angle conversion, target tracking, joystick deadzone calculation. 4 cycles (one extra cycle for quadrant resolution).

6.7 Precision Mode (Wide-Mode Bit Extension)

Wide mode only: in wide-mode encoding, the G4 unary FP approximation instructions (FRECIP, FRSQRT, FSIN, FCOS, FSINCOS, FATAN2) have two spare nibble bits after register-extension uses bit 32 (rd) and bit 33 (rs1). Bit 35 is consumed by Xcond's PRED-EN (see §6.9 below); bit 34 is repurposed as a precision mode bit:

bit[34]	Mode	Latency	Accuracy
0	Approximate (default)	3 cycles	~0.05–0.1% relative error
1	Refined (one Newton-Raphson iteration)	6–7 cycles	~10⁻⁷–10⁻⁶ relative error (roughly the square of the approximate error)

The refined mode performs a single Newton-Raphson refinement step on the approximation result, dramatically improving accuracy at the cost of roughly 2× latency. Useful when:

Application has higher precision requirements (e.g., physics simulation needing energy conservation)
Range-reduction precision matters (e.g., FSIN of accumulated phases over thousands of iterations)
The approximation result will be cascaded into many further FMA operations where error accumulates

The selection is per-instruction, not per-context — code can mix approximate and refined uses freely. Standard precision-sensitive libraries can default to refined; game inner loops default to approximate.

In assembly, the refined form is written with a .R (refined) suffix on the mnemonic:

FRECIP.S    fd, fs1        # approximate, 3 cycles, ~0.05% error
FRECIP.S.R  fd, fs1        # refined,     6 cycles, ~2.5×10⁻⁷ error
FRSQRT.S    fd, fs1        # approximate, 3 cycles
FRSQRT.S.R  fd, fs1        # refined,     6 cycles

Narrow-mode behaviour: the precision bit does not exist in narrow encoding. Narrow mode always uses the approximate form. Code requiring refined precision must place itself in .text.wide, or fall back to the standard FP library sin / cos / sqrt and FDIV-based 1.0 / x operations (which are IEEE-754 correct but slower: ~10–20 cycles for FDIV, ~50–100 cycles for the library functions).

6.8 Composability with Scoreboarding

Whether approximate or refined, FP approximations issue to the FPU and release the main pipeline via scoreboarding (see §15.1 of ee_cpu). The 3- or 6-cycle latency is hidden behind any subsequent independent work. For typical game inner loops:

FRSQRT.S    inv_len, lensq       ; 3 cycles, marked pending
VSCALE3.S   normalised, v, inv_len   ; would stall waiting for inv_len
...

A peephole optimisation would schedule independent work between the FRSQRT and its consumer, hiding the entire latency. The compiler is expected to do this aggressively for the refined variants where the latency cost is higher.

6.9 Xcond Predication

All Xmath instructions with an unused nibble bit 35 inherit Xcond predication automatically. See §1.4 for the full table of which Xmath groups support predication and example use cases.

6.10 Precision Notes

Function	Max relative error	Compares to
FRECIP	5×10⁻⁴	SSE RCPSS / RCPPS: 1.5×2⁻¹² ≈ 3.7×10⁻⁴; ARM FRECPE: ~3×10⁻³
FRSQRT	10⁻³	SSE RSQRTSS: 1.5×2⁻¹² ≈ 3.7×10⁻⁴; ARM FRSQRTE: ~3×10⁻³
FSIN, FCOS	10⁻⁴	x87 FSIN: 10⁻¹⁸; library sin: 10⁻¹⁵; SSE: not directly provided
FATAN2	10⁻³	library atan2: 10⁻¹⁴

For game and audio rendering at typical 8-bit / 10-bit precision pipelines, all of these are massively more accurate than needed. For applications that need full IEEE-754 precision (physics simulation backbones, scientific computation), the standard library sin / cos / sqrt remain available — they just take ~50–100 cycles instead of 3.

7. Group 5: BAM (Binary Angle Measure) Trigonometry

Binary Angle Measure represents angles as integers where the full circle (0–2π) maps to the integer range (0 to 2^N − 1). This makes range reduction free — integer arithmetic naturally wraps modulo 2^N — and the lookup table is indexed directly by the BAM value.

BAM is the retro / demoscene / fixed-point native way of representing angles. It avoids the precision-loss problem of FP angle accumulation (where adding a small angle increment over many frames eventually drifts), gives perfect wraparound, and makes the sin/cos lookup a single table read.

7.1 BAM Representation

FireStorm BAM functions accept BAM as an integer in rs1. The format is 32-bit unsigned, mapping 0x00000000 to 0 radians and 0xFFFFFFFF to just under 2π (2^32 would be exactly 2π, which wraps to 0). Resolution: 2π / 2^32 ≈ 1.46×10⁻⁹ radians, far finer than FP32 can resolve near 2π (one ulp there is ≈ 4.8×10⁻⁷), with perfect modular wraparound.

A 16-bit BAM is supported via the low 16 bits of rs1 (the upper 16 bits are ignored). Resolution: 2π / 2^16 ≈ 0.0055° — finer than human vision can resolve, perfect for game rotation animations.

7.2 FSINBAM.S / FSINBAM.D — Sine of BAM

FSINBAM.S fd, rs1     # fd = sin(2π × (rs1[31:0] / 2^32))
FSINBAM.D fd, rs1

The integer-to-FP boundary crossing is handled in hardware. rs1 is an integer register holding the BAM value; fd is an FP register receiving the sine result. Latency: 2 cycles (faster than FSIN.S because no FP range reduction needed). Throughput: 1/cycle.

7.3 FCOSBAM.S / FCOSBAM.D — Cosine of BAM

FCOSBAM.S fd, rs1     # fd = cos(2π × (rs1[31:0] / 2^32))
FCOSBAM.D fd, rs1

7.4 FSINCOSBAM.S / FSINCOSBAM.D — Sin + Cos Pair of BAM

FSINCOSBAM.S fd, rs1
   ⇒ fd     ← sin(BAM)
     f(d+1) ← cos(BAM)

The rotation-matrix-from-BAM workhorse. Combined with the rotation-matrix structure, this gives a complete 2D rotation in 3 cycles.

7.5 FRAD2BAM and FBAM2RAD — Conversions

FRAD2BAM rd, fs1     # rd = round(fs1 × (2^32 / 2π))   (FP → BAM int)
FBAM2RAD fd, rs1     # fd = rs1 × (2π / 2^32)           (BAM int → FP)

For interoperation with FP-radian-using code (e.g., math libraries that want radians). Both 2 cycles, 1/cycle throughput.
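
The two conversions can be modelled in C as a sketch. The names are hypothetical, and two points are assumptions because this section does not specify them: negative or out-of-range radian inputs are reduced modulo one full turn, and the input magnitude is small enough for the intermediate cast to int64_t.

```c
#include <stdint.h>

#define XM_TWO_PI 6.283185307179586   /* 2*pi as a double */

/* FRAD2BAM model: radians to 32-bit BAM, rounded to nearest.
   Assumption: the angle is wrapped into [0, 1) turns first. */
static inline uint32_t xm_rad2bam(double rad)
{
    double turns = rad / XM_TWO_PI;                 /* full circles            */
    double frac  = turns - (double)(int64_t)turns;  /* drop whole turns        */
    if (frac < 0.0)
        frac += 1.0;                                /* map negatives to [0, 1) */
    /* A rounded value of exactly 2^32 truncates to 0, i.e. wraps to 0 rad. */
    return (uint32_t)(uint64_t)(frac * 4294967296.0 + 0.5);
}

/* FBAM2RAD model: 32-bit BAM to radians in [0, 2*pi). */
static inline double xm_bam2rad(uint32_t bam)
{
    return (double)bam * (XM_TWO_PI / 4294967296.0);
}
```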

7.6 BAM-Based Rotation Example

// Game character rotation, BAM-based:
uint32_t angle_bam;          // Persistent in game state
uint32_t turn_rate = 0x01111111;     // 2^32 / 240: 90°/sec at 60 Hz (one full circle per 4 sec)

// Per-frame update:
angle_bam += turn_rate;       // 1 cycle integer add; perfect wraparound

// Render:
float sin_a, cos_a;
FSINCOSBAM.S sin_a, angle_bam;       // 3 cycles
// Build 2D rotation matrix from sin_a, cos_a (2 negations)
// Total angle handling per frame: ~5 cycles

Compared to FP-radian-based equivalent:

float angle_rad;
float turn_rate = M_PI / 2;     // Same 90°/sec

// Per-frame:
angle_rad += turn_rate / 60.0f;  // FADD, plus a slow FDIV unless the per-frame step is precomputed
if (angle_rad >= 2*M_PI) angle_rad -= 2*M_PI;   // Branchy wraparound

// Render:
float sin_a = sin(angle_rad);     // 50+ cycles (library) or 3 (FSIN.S)
float cos_a = cos(angle_rad);     // 50+ or 3
// Total: 6-100+ cycles

BAM wins ~3× per frame, and the bigger gain is accumulated precision: FP rotation drifts over thousands of frames, whereas BAM is bit-exact forever.
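
Choosing the per-frame BAM increment is plain integer arithmetic: one full turn is 2^32 units, so a rate in degrees per second becomes 2^32 × deg / (360 × fps). The sketch below is illustrative only; the function names are not part of Xmath.

```c
#include <stdint.h>

/* Per-frame BAM increment for a turn rate given in degrees per second.
   One full turn is 2^32 BAM units, so the step is
   2^32 * deg_per_sec / (360 * fps), truncated toward zero.
   Intended for rates below one full turn per frame. */
uint32_t bam_step_per_frame(uint32_t deg_per_sec, uint32_t fps)
{
    uint64_t full_turn = (uint64_t)1 << 32;
    return (uint32_t)(full_turn * deg_per_sec / (360u * (uint64_t)fps));
}

/* Advance an angle by `frames` steps. uint32_t arithmetic wraps modulo
   2^32, which is exactly the wrap from 2*pi back to 0, so no range
   check or branch is needed. */
uint32_t bam_advance(uint32_t angle, uint32_t step, uint32_t frames)
{
    return angle + step * frames;
}
```

For 90°/sec at 60 Hz this gives a step of 0x01111111. Because the accumulator is an integer, the only error is the fixed truncation in the step itself; nothing accumulates from rounding on each add.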

7.7 Retro / Demoscene Use Cases

BAM is the native representation for:

Rotozoomers: rotate-and-zoom raster effects (Amiga / ST demoscene classic). Use FSINCOSBAM for the rotation matrix per scanline.
Plasma effects: combine multiple BAM-indexed sine waves for plasma rendering.
Wavetable oscillators: phase accumulation in BAM, wavetable index = BAM, no precision drift.
Velocity-based animation curves: BAM phase advances at constant rate, gives smooth, drift-free cyclic motion.
Particle system rotation: per-particle BAM rotation state, all updated identically per frame.

For the kind of retro-style code Anthony's project landscape favours, BAM is the right primitive.

8. Group 6: Vector Math Bundles

Fixed-shape multi-element operations for 3D vector math. These are bundles — single instructions that execute a sequence of internal multiply-adds — not SIMD operations on multiple registers in parallel. Their value is in encoding density and register-allocation efficiency, not in parallelism per se.

All G6 instructions are R-type at opcode 0x57, funct3 110 (single result) or 111 (multi-result).

8.1 Register Tuple Convention

The vector bundles operate on consecutive FP register tuples. The rs1 and rs2 fields name the first register of each tuple; the hardware reads consecutive registers from there.

3-element tuples (DOT3, CROSS3, LENSQ3): the bundle reads f[rs1], f[rs1+1], f[rs1+2] (and similarly for rs2).
4-element tuples (DOT4): the bundle reads f[rs1], f[rs1+1], f[rs1+2], f[rs1+3].

Narrow-mode constraint: the starting register must allow the tuple to fit within f0–f31. For DOT3, rs1 ≤ f29; for DOT4, rs1 ≤ f28. In wide mode, the tuple can start anywhere in f0–f60 (DOT3) or f0–f59 (DOT4).

This constraint is rarely binding in practice — compilers naturally allocate vectors at well-aligned bases.

8.2 DOT3 — 3-Element Dot Product

DOT3.S fd, rs1, rs2
fd = f[rs1] × f[rs2] + f[rs1+1] × f[rs2+1] + f[rs1+2] × f[rs2+2]

3 internal FP FMAs in sequence using the FMA chain. Latency: 6 cycles (3 FMAs × 2 cycles each, no parallelism within the bundle). Throughput: depends on FMA unit availability; on Ant64 with single FMA unit, 1 DOT3 per 6 cycles.

For batches of dot products, the compiler can pipeline by issuing the next DOT3 before the previous one's result is needed — scoreboarding handles this naturally.

FP64 variant (DOT3.D): same structure, 8-cycle latency on FP64 FMAs.
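
A small C reference model can make the tuple convention concrete. It is a sketch under stated assumptions: the register file is modelled as a plain array, the helper names are illustrative, and the single-rounding behaviour of a hardware FMA chain is not modelled.

```c
#include <stdint.h>

#define NARROW_FP_REGS 32   /* f0..f31, per the narrow-mode constraint */

/* Returns 1 if a tuple of `len` registers starting at `start` stays
   inside the narrow-mode register file. Helper name is illustrative. */
int tuple_fits_narrow(unsigned start, unsigned len)
{
    return start + len <= NARROW_FP_REGS;
}

/* Reference model of DOT3.S over a modelled FP register file `f`.
   rs1 and rs2 name the first register of each 3-tuple. Products are
   accumulated in the order the formula lists them. */
float dot3_ref(const float f[NARROW_FP_REGS], unsigned rs1, unsigned rs2)
{
    float acc = f[rs1] * f[rs2];
    acc += f[rs1 + 1] * f[rs2 + 1];
    acc += f[rs1 + 2] * f[rs2 + 2];
    return acc;
}
```

With vectors at f4..f6 and f8..f10, dot3_ref(f, 4, 8) reads exactly the six registers the instruction would, and tuple_fits_narrow(29, 3) is the rs1 ≤ f29 rule from §8.1.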

8.3 DOT4 — 4-Element Dot Product

DOT4.S fd, rs1, rs2
fd = Σ (f[rs1+k] × f[rs2+k]) for k in 0..3

The homogeneous-coordinate dot product — vertex transform stage. 4 internal FMAs. Latency: 8 cycles. Critical for the matrix-times-vector inner loop of 3D rendering.

8.4 CROSS3 — 3-Element Cross Product

CROSS3.S fd, rs1, rs2
f[fd]   = f[rs1+1] × f[rs2+2] - f[rs1+2] × f[rs2+1]
f[fd+1] = f[rs1+2] × f[rs2+0] - f[rs1+0] × f[rs2+2]
f[fd+2] = f[rs1+0] × f[rs2+1] - f[rs1+1] × f[rs2+0]

Writes 3 FP registers f[fd], f[fd+1], f[fd+2]. Used for normal-vector computation in 3D graphics, angular velocity, torque, and any "perpendicular vector" operation.

Latency: 10 cycles (6 internal multiplies + 3 subtracts, pipelined). Throughput limited by the multi-port register file write needed for the 3 output writes.

Narrow-mode constraint: fd ≤ f29 (the 3-register output must fit in f0–f31).

8.5 LENSQ3 — Squared Length

LENSQ3.S fd, rs1
fd = f[rs1]² + f[rs1+1]² + f[rs1+2]²

The "is point A closer to B than C?" primitive — squared distance is sufficient for comparison and avoids the FSQRT. Latency: 6 cycles (3 internal FMAs).

// Nearest-point selection over N candidates:
float min_dist_sq = FLT_MAX;
int best = -1;
for (int i = 0; i < N; i++) {
    float d = LENSQ3.S(diff[i].xyz);     // 6 cycles
    if (d < min_dist_sq) { min_dist_sq = d; best = i; }
}

For N=8 candidates, the inner loop is ~10 cycles each (LENSQ3 + compare + conditional update). Standard RV64 with FMA: ~15 cycles each (3 FMAs + compare + update). ~1.5× speedup on neighbour-search inner loops.

8.6 LERP — Linear Interpolation

LERP.S fd, rs1, rs2, rs3
fd = f[rs1] + (f[rs2] - f[rs1]) × f[rs3]

rs3 is the interpolation parameter t (typically in [0, 1]). The fundamental animation / blending primitive. 1 internal FMA + 1 internal FSUB = 3 cycles total.

Note: implemented as R4-type-style with 3 source operands but using the R-type slot (opcode 0x57, funct3 110, with rs3 placed in a reserved bit-field). This is unusual; an alternative would be to compute as fd = f[rs1]*(1 − f[rs3]) + f[rs2]*f[rs3] from a pair of FMAs (which decomposes to 2 instructions in standard RV with FMA: same total cycles, but more register pressure).

The single LERP instruction is faster in register-pressure-bound code (texture sampling inner loops, animation blending) and saves the temporary register for (1 − t).
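
The two formulations discussed above can be compared directly in C. This is an illustrative sketch; the function names are not Xmath mnemonics, and a compiler may or may not fuse the multiply-adds.

```c
/* Form used by LERP.S: a + (b - a) * t. One subtract, one multiply-add,
   no temporary for (1 - t). */
float lerp_sub_fma(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Alternative: a * (1 - t) + b * t. Needs a temporary for (1 - t),
   which is the extra register pressure mentioned in the text. */
float lerp_two_fma(float a, float b, float t)
{
    float one_minus_t = 1.0f - t;
    return a * one_minus_t + b * t;
}
```

Both return the same value for exactly representable inputs such as a = 2, b = 10, t = 0.25. For general inputs they can differ in the last bit, and only the second form is guaranteed to return exactly b at t = 1.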

9. Group 7: Vector Componentwise Bundles

Componentwise operations on 3-element FP tuples. These are the bread-and-butter of game state math: physics integration, steering, collision response, transform composition. Where most of G6's bundles return scalars (DOT3 → 1 result; CROSS3 is the exception), G7's bundles return vectors (VADD3 → 3 results).

All G7 instructions use opcode 0x57, funct3 111 (multi-result). Tuple convention follows §8.1 — rs1, rs2, rd name the first register of each 3-tuple; hardware reads/writes consecutive registers.

Narrow-mode constraint: each named register must allow a 3-tuple to fit within f0–f31. Wide mode allows any starting register through f60.

9.1 VADD3 — Componentwise Vector Add

VADD3.S fd, rs1, rs2
f[fd+0] = f[rs1+0] + f[rs2+0]
f[fd+1] = f[rs1+1] + f[rs2+1]
f[fd+2] = f[rs1+2] + f[rs2+2]

Three parallel FP adds in one instruction. Latency: 3 cycles. Throughput: 1 per cycle (multi-port FP register file).

The physics-update workhorse: position += velocity is one VADD3 instead of three FADD instructions.

9.2 VSUB3 — Componentwise Vector Subtract

VSUB3.S fd, rs1, rs2
f[fd+k] = f[rs1+k] - f[rs2+k]    for k in 0..2

difference = target - position patterns; collision-normal computation; relative-velocity calculation.

9.3 VSCALE3 — Vector Scale by Scalar

VSCALE3.S fd, rs1, rs2
f[fd+k] = f[rs1+k] × f[rs2]    for k in 0..2

rs2 is a single FP register holding the scalar; each component of rs1's tuple is multiplied by it. Useful for unit-vector-to-velocity conversion, light-intensity application, gain scaling.

9.4 VMADD3 — Fused Vector Multiply-Add

VMADD3.S fd, rs1, rs2, rs3      (encoded as R4-type-style; rs3 in reserved bit-field as with LERP)
f[fd+k] = f[rs1+k] + f[rs2+k] × f[rs3]    for k in 0..2

The physics-integration primitive: pos = pos + vel * dt in one instruction. A standard explicit Euler integration step becomes:

VMADD3.S new_pos, old_pos, vel, dt              ; pos += vel * dt
VMADD3.S new_vel, old_vel, accel, dt            ; vel += accel * dt

Two instructions for a full integration step. Without VMADD3, each of the two updates takes ~9 instructions (3 FMUL + 3 FADD + 3 register moves), or ~6 with FMA (3 FMADD + 3 moves).

Latency: 4 cycles (3 parallel FMAs + writeback).
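
As a cross-check of the semantics, the two-instruction integration step can be written as a C reference model. The vec3 type and function names are illustrative assumptions, not part of the ISA.

```c
typedef struct { float x, y, z; } vec3;   /* illustrative 3-tuple type */

/* Reference model of VMADD3.S: out = a + b * s, componentwise. */
vec3 vmadd3_ref(vec3 a, vec3 b, float s)
{
    vec3 out = { a.x + b.x * s, a.y + b.y * s, a.z + b.z * s };
    return out;
}

/* One explicit Euler step, mirroring the two-instruction sequence:
   position advances with the old velocity, then velocity advances. */
void euler_step(vec3 *pos, vec3 *vel, vec3 accel, float dt)
{
    *pos = vmadd3_ref(*pos, *vel, dt);     /* pos += vel * dt   */
    *vel = vmadd3_ref(*vel, accel, dt);    /* vel += accel * dt */
}
```

Swapping the two calls would give semi-implicit Euler instead (position updated with the new velocity); the instruction count is the same either way.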

9.5 VNORM3 — Vector Normalisation

VNORM3.S fd, rs1
length_sq = LENSQ3(f[rs1+0..2])
inv_length = FRSQRT(length_sq)
f[fd+0] = f[rs1+0] × inv_length
f[fd+1] = f[rs1+1] × inv_length
f[fd+2] = f[rs1+2] × inv_length

The lighting / direction normalisation primitive fused into one instruction. Without VNORM3, this takes 3 separate steps (LENSQ3 + FRSQRT + VSCALE3) costing 12 cycles; VNORM3 does it in 8 cycles through internal forwarding.

Used in every lighting calculation (normal vectors must be unit length), every steering computation (direction vectors), every camera/AI orientation calculation.

9.6 Game-Logic Example: Steering Behaviours

A "seek" steering behaviour (chase a target):

// Compute desired velocity (toward target, at max speed)
diff = VSUB3(target, position);              // 3 cycles
desired = VNORM3(diff);                       // 8 cycles
desired = VSCALE3(desired, max_speed);        // 3 cycles
// Compute steering force
steer = VSUB3(desired, velocity);             // 3 cycles
// Total: 17 cycles

Without G7 (using only G6 + standard FP):

diff_x = target_x - position_x;               // 3× FSUB (3 cycles)
... similar for y, z
length_sq = LENSQ3(diff);                     // 6 cycles
inv_length = FRSQRT(length_sq);               // 3 cycles
desired_x = diff_x * inv_length * max_speed;  // 3× FMUL (3 cycles)
... similar for y, z
steer_x = desired_x - velocity_x;             // 3× FSUB (3 cycles)
// Total: ~24 cycles

~30% fewer cycles (about 1.4× faster) on a single steering behaviour, repeated hundreds of times per frame for AI flocking, particle systems, projectile homing.

10. Group 8: 2D Math Primitives

2D versions of the 3D vector operations. Critical for top-down games, raycaster setup, navmesh pathfinding, 2D physics, and any "x-y but no z" math.

All G8 instructions use opcode 0x57. Tuple convention: rs1, rs2, rd name the first register of each 2-tuple.

10.1 DOT2 — 2D Dot Product

DOT2.S fd, rs1, rs2
fd = f[rs1+0] × f[rs2+0] + f[rs1+1] × f[rs2+1]

2 internal FMAs. Latency: 4 cycles. The 2D projection primitive: dot of a 2D vector against a 2D axis. Used in:

Navmesh edge-side tests
2D collision (SAT separating axis projections)
Sprite-vs-line tests
Top-down field-of-view tests

10.2 LENSQ2 — 2D Squared Length

LENSQ2.S fd, rs1
fd = f[rs1+0]² + f[rs1+1]²

The 2D distance-compare primitive. 2 internal FMAs, 4 cycles. The "is point A within radius R of point B?" test uses LENSQ2(A - B) < R*R — one subtract bundle plus one LENSQ2 plus one compare.
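
The radius test described above looks like this in portable C. It is a sketch of the pattern only; lensq2_ref and within_radius are illustrative names.

```c
/* Reference model of LENSQ2.S on a 2-tuple (x, y). */
float lensq2_ref(float x, float y)
{
    return x * x + y * y;
}

/* "Is A within radius r of B?" without a square root:
   compare the squared distance against r * r. Returns 1 when inside. */
int within_radius(float ax, float ay, float bx, float by, float r)
{
    float dx = ax - bx;            /* the subtract-bundle step */
    float dy = ay - by;
    return lensq2_ref(dx, dy) < r * r;
}
```

Note the strict comparison: a point exactly on the circle counts as outside, matching the `<` in the text.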

10.3 CROSS2 — 2D Cross Product (Scalar Result)

CROSS2.S fd, rs1, rs2
fd = f[rs1+0] × f[rs2+1] - f[rs1+1] × f[rs2+0]

The 2D cross product produces a scalar (the z-component of the 3D cross). Single FMSUB. Latency: 2 cycles.

The funnel algorithm primitive for navmesh pathfinding: the sign of the 2D cross determines which side of a line a point is on. Also the 2D winding test for polygon orientation, the 2D segment intersection helper, and the 2D rotation direction tester.
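
The side-of-line test that the funnel algorithm relies on can be sketched in C as below. The function names are illustrative, and the left/right labelling assumes a y-up coordinate system.

```c
/* Reference model of CROSS2.S: z-component of the 3D cross product. */
float cross2_ref(float ax, float ay, float bx, float by)
{
    return ax * by - ay * bx;
}

/* Which side of the directed line p0 -> p1 is point q on?
   Returns +1 for left (counter-clockwise), -1 for right, 0 if collinear.
   With y pointing down (screen coordinates) the labels swap. */
int side_of_line(float p0x, float p0y, float p1x, float p1y,
                 float qx, float qy)
{
    float c = cross2_ref(p1x - p0x, p1y - p0y, qx - p0x, qy - p0y);
    return (c > 0.0f) - (c < 0.0f);
}
```

The same sign, applied to consecutive polygon edges, gives the winding test mentioned above.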

10.4 VADD2 / VSUB2 / VSCALE2 — 2D Vector Componentwise

VADD2.S fd, rs1, rs2     # 2-element componentwise add (2 cycles)
VSUB2.S fd, rs1, rs2     # 2-element componentwise subtract (2 cycles)
VSCALE2.S fd, rs1, rs2   # f[fd+k] = f[rs1+k] × f[rs2]   for k in 0..1

2D versions of the G7 bundles. Each is 2 parallel FP ops + writeback. 2-cycle latency.

10.5 VNORM2 — 2D Vector Normalisation

VNORM2.S fd, rs1
length_sq = LENSQ2(f[rs1+0..1])
inv_length = FRSQRT(length_sq)
f[fd+0] = f[rs1+0] × inv_length
f[fd+1] = f[rs1+1] × inv_length

The 2D direction-vector primitive: get a unit vector pointing the same way. Fused: 6 cycles (vs 9 cycles for separate LENSQ2 + FRSQRT + VSCALE2).

10.6 2D Game Logic Example: Navmesh Pathfinding

The funnel algorithm walks a path through a navmesh, maintaining left and right "apex" vertices. Each new portal vertex requires testing which side of the current funnel it falls on:

// Funnel-vertex test
side_left  = CROSS2(left_edge, new_vertex - apex);     // 1 VSUB2 + 1 CROSS2 = 4 cycles
side_right = CROSS2(right_edge, new_vertex - apex);    // 1 VSUB2 + 1 CROSS2 = 4 cycles
// Branch on signs of side_left / side_right

A complete funnel-walk per portal is ~10 cycles with G8. Without G8, it's ~25 cycles (more individual FSUB / FMUL / FADD). 2.5× speedup on navmesh path smoothing.

10.7 Raycaster Example: Wolfenstein-Style Ray Casting

For each screen column, cast a ray through a 2D grid:

// Setup
ray_dir = VSCALE2(camera_plane, x_offset);    // 2 cycles
ray_dir = VADD2(camera_dir, ray_dir);          // 2 cycles
// DDA step (integer, no Xmath needed)
// ...
// Wall distance computation (per hit)
hit_dist = DOT2(diff, ray_dir);                // 4 cycles
wall_height = FRECIP.S(hit_dist);              // 3 cycles, then * screen_height
// Total per ray (excluding pixel work): ~12 cycles

Without G8: ray setup costs 5 cycles vs 4; distance computation costs 6 cycles vs 4 (using DOT3 with z=0, at the §8.2 latency). This is a marginal win: the bulk of Wolfenstein-style raycasting is the integer DDA step and per-pixel work, which the dedicated drawing hardware will handle.

11. Group 9: Game / Animation Math

Single-instruction versions of common patterns that occur in UI animation, shader uniforms (passed to the drawing hardware), AI parameter clamping, and procedural generation.

All G9 instructions use opcode 0x57. Most are simple FMA chains internally.

11.1 CLAMP.S / CLAMP.D — FP Clamp

CLAMP.S fd, rs1, rs2, rs3
fd = max(min(f[rs1], f[rs3]), f[rs2])      # clamp f[rs1] to [f[rs2], f[rs3]]

The FP equivalent of SAT.x but with user-supplied bounds rather than type bounds. Encoded as R-type-style with rs3 in a reserved bit-field (similar to LERP). Latency: 2 cycles (2 sequential compares + selects).

Used everywhere: every shader uniform that has a valid range, every AI weight that must stay positive, every camera parameter with min/max, every animation phase confined to [0, 1].

11.2 SMOOTHSTEP.S / SMOOTHSTEP.D — Cubic Easing

SMOOTHSTEP.S fd, rs1
t = clamp(f[rs1], 0, 1)
fd = t × t × (3 - 2 × t)

The standard "ease-in/ease-out" curve. C1-continuous at 0 and 1. 3 cycles (1 internal CLAMP + 2 FMAs).

Used in: UI fade-in/out, camera lerp curves, animation transitions, shader edge softening, procedural texture gradients.

11.3 SMOOTHERSTEP.S / SMOOTHERSTEP.D — Quintic Easing

SMOOTHERSTEP.S fd, rs1
t = clamp(f[rs1], 0, 1)
fd = t³ × (t × (t × 6 - 15) + 10)

Ken Perlin's improved easing: C2-continuous at 0 and 1 (the first and second derivatives are zero at the endpoints, eliminating visible "kinks" in chained interpolations). 5 cycles (1 CLAMP + 4 FMAs).

Used in: procedural noise functions (Perlin's improved gradient noise relies on this quintic fade), high-quality animation curves, smooth camera transitions where SMOOTHSTEP's discontinuous second derivative would be visible.
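
Both easing curves are short enough to state as C reference models. This is a sketch: the _ref names are illustrative, and NaN handling is not modelled.

```c
/* Clamp x to [lo, hi], with the same operand roles as CLAMP.S. */
float clamp_ref(float x, float lo, float hi)
{
    float m = (x < hi) ? x : hi;   /* min(x, hi) */
    return (m > lo) ? m : lo;      /* max(m, lo) */
}

/* Cubic easing: 3t^2 - 2t^3 on the clamped input. */
float smoothstep_ref(float x)
{
    float t = clamp_ref(x, 0.0f, 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

/* Quintic easing: 6t^5 - 15t^4 + 10t^3, evaluated in Horner form. */
float smootherstep_ref(float x)
{
    float t = clamp_ref(x, 0.0f, 1.0f);
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}
```

Both curves pass through (0.5, 0.5). At t = 0.25 the quintic sits lower (0.103515625 versus 0.15625), which is the flatter start that its zero second derivative at the endpoints produces.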

11.4 STEP.S / STEP.D — Branchless Threshold

STEP.S fd, rs1, rs2
fd = (f[rs1] < f[rs2]) ? 0.0 : 1.0

The branchless threshold primitive: returns 0 if input is below edge, 1 otherwise. Latency: 1 cycle.

Used in: shader masking, animation triggers, AI decision thresholds, particle visibility cutoffs. Combined with multiplication, gives branchless conditional values: result = STEP(x, threshold) * value is "value if x ≥ threshold else 0."

11.5 Game Animation Example: UI Element Fade-In

// Per-frame UI element fade-in over 0.5 seconds
float phase = (time - start_time) * 2.0f;     // 1 FMUL (1 cycle)
float alpha = SMOOTHSTEP(phase);               // 3 cycles
// alpha is now smoothly 0→1 over the fade window, clamped naturally
// Pass alpha to drawing hardware

vs without G9:

float phase = (time - start_time) * 2.0f;
phase = max(0.0f, min(1.0f, phase));            // 4 cycles (manual clamp)
float alpha = phase * phase * (3 - 2 * phase);  // 4 cycles (3 FMUL + 1 FMA)
// Total: 9 cycles

~2.2× faster on this pattern (4 cycles vs 9), multiplied by every UI element with smooth animation.

12. Group 10: Distance Heuristics

Integer distance metrics for pathfinding (A*, jump-point search, hierarchical pathfinding) and broad-phase collision distance estimates. All operate on integer registers (using GPRs, not FPRs), enabling use in tight integer loops without FP unit pressure.

All G10 instructions use opcode 0x57, funct3 001, funct7[6:5]=01. Latency: 1–2 cycles (3 for OCTILE2, §12.3). Throughput: 1 per cycle.

12.1 MANHATTAN2 / MANHATTAN3 — Manhattan Distance

MANHATTAN2 rd, rs1, rs2
   rd = |rs1[31:0] - rs2[31:0]|   +   |rs1[63:32] - rs2[63:32]|

MANHATTAN3 rd, rs1, rs2
   # rs1 and rs2 each treated as 3-element packed-integer (low 21 bits per element, signed)
   rd = sum of absolute differences across the 3 elements

The classic A* heuristic for 4-connected grid worlds. Standard RV64 needs six or more instructions (load, sub, abs, sub, abs, add); MANHATTAN2 is one instruction.

For pathfinding on a 1000×1000 grid with ~5000 nodes expanded per query, heuristic computation is ~5% of total time. With MANHATTAN2, that fraction drops to ~1%. A* throughput improves by ~4%.

12.2 CHEBYSHEV2 / CHEBYSHEV3 — Chebyshev Distance

CHEBYSHEV2 rd, rs1, rs2
   rd = max(|rs1[31:0] - rs2[31:0]|, |rs1[63:32] - rs2[63:32]|)

CHEBYSHEV3 rd, rs1, rs2
   rd = max(|dx|, |dy|, |dz|) for the 3-element packed inputs

The A* heuristic for 8-connected (king-move) grid pathfinding when diagonal and orthogonal moves cost the same. It also serves as the infinity-norm distance for broad-phase collision tests and for chess-style or RTS unit movement.
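
The 2-element forms can be modelled in C on packed 64-bit values. This sketch assumes each 32-bit half is a signed coordinate, which the definitions above do not state explicitly; pack2 and the _ref names are illustrative.

```c
#include <stdint.h>

/* Pack two coordinates into one 64-bit value: x in bits [31:0], y in
   bits [63:32]. Treating each half as signed is an assumption. */
uint64_t pack2(int32_t x, int32_t y)
{
    return ((uint64_t)(uint32_t)y << 32) | (uint32_t)x;
}

/* |a - b| for one signed 32-bit lane, widened so it cannot overflow. */
static int64_t lane_absdiff(uint64_t p, uint64_t q, unsigned shift)
{
    int64_t a = (int32_t)(uint32_t)(p >> shift);
    int64_t b = (int32_t)(uint32_t)(q >> shift);
    int64_t d = a - b;
    return d < 0 ? -d : d;
}

/* Reference model of MANHATTAN2: |dx| + |dy|. */
uint64_t manhattan2_ref(uint64_t rs1, uint64_t rs2)
{
    return (uint64_t)(lane_absdiff(rs1, rs2, 0) + lane_absdiff(rs1, rs2, 32));
}

/* Reference model of CHEBYSHEV2: max(|dx|, |dy|). */
uint64_t chebyshev2_ref(uint64_t rs1, uint64_t rs2)
{
    int64_t dx = lane_absdiff(rs1, rs2, 0);
    int64_t dy = lane_absdiff(rs1, rs2, 32);
    return (uint64_t)(dx > dy ? dx : dy);
}
```

For the points (3, -4) and (-1, 2) the differences are 4 and 6, so the Manhattan distance is 10 and the Chebyshev distance is 6.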

12.3 OCTILE2 — Octile Distance

OCTILE2 rd, rs1, rs2
   dx = |rs1[31:0] - rs2[31:0]|
   dy = |rs1[63:32] - rs2[63:32]|
   rd = max(dx, dy) × 1024 + min(dx, dy) × 424     # approximating (√2 - 1) ≈ 0.4142

The optimal admissible heuristic for 8-connected grids where diagonal moves cost √2 (more expensive than 4-connected moves). Used in tile-based RPGs, RTS games, top-down shooters. The internal scale factor (1024) lets the result stay in integer arithmetic; software typically divides by 1024 if a normalised distance is needed.

3 cycles including the internal multiply. Standard RV64: ~10 cycles with branches.
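
A C reference model of the octile formula, useful for checking the fixed-point scale. For brevity the sketch takes unpacked coordinates rather than the packed register layout; octile2_ref is an illustrative name.

```c
#include <stdint.h>

/* Reference model of the OCTILE2 arithmetic, result scaled by 1024:
   max(dx, dy) * 1024 + min(dx, dy) * 424. */
uint64_t octile2_ref(int32_t x1, int32_t y1, int32_t x2, int32_t y2)
{
    int64_t dx = (int64_t)x1 - x2;
    int64_t dy = (int64_t)y1 - y2;
    if (dx < 0) dx = -dx;
    if (dy < 0) dy = -dy;
    int64_t hi = dx > dy ? dx : dy;
    int64_t lo = dx > dy ? dy : dx;
    return (uint64_t)(hi * 1024 + lo * 424);
}
```

Because 424/1024 ≈ 0.41406 is slightly below √2 - 1, the scaled result never exceeds the true octile cost times 1024, so the heuristic stays admissible. For a pure diagonal of 5 cells it returns 7240, against an exact 5√2 × 1024 ≈ 7240.8.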

12.4 Pathfinding Example: A* with OCTILE2

// Inner loop: expand a node, push neighbours with h = octile distance to goal
for (each neighbour) {
    new_g = current_g + step_cost;          // 1 cycle
    h = OCTILE2(neighbour, goal);            // 3 cycles
    f = new_g + h;                            // 1 cycle
    push_to_open_set(neighbour, f);           // ~10 cycles for heap insert
}
// Total per neighbour: ~15 cycles

Without G10:

for (each neighbour) {
    new_g = current_g + step_cost;
    // octile distance, manually computed:
    dx = abs(neighbour.x - goal.x);           // 2 cycles
    dy = abs(neighbour.y - goal.y);           // 2 cycles
    h = (dx > dy) ? (dx * 1024 + dy * 424) : (dy * 1024 + dx * 424);   // ~6 cycles
    f = new_g + h;
    push_to_open_set(neighbour, f);
}
// Total per neighbour: ~22 cycles

~30% fewer cycles on the A* inner loop (about 1.5× faster), with several thousand nodes expanded per pathfind on a typical RPG map.

13. Group 11: Quaternion Math

Quaternions are the standard representation for 3D rotations in modern games. They avoid gimbal lock, interpolate smoothly (SLERP), and chain compactly. Every modern game with bone-based character animation uses quaternions in inner loops; every camera-orientation update uses them.

Quaternions are stored as 4 consecutive FP registers: (w, x, y, z) with w as the scalar part and (x, y, z) as the vector part. The tuple naming convention from §8.1 applies — rs1 and rd name the first register of a 4-tuple.

Narrow-mode constraint: starting register must allow the 4-tuple to fit in f0–f31, so rs1 ≤ f28, rd ≤ f28.

13.1 QMUL.S / QMUL.D — Quaternion Multiplication

QMUL.S fd, rs1, rs2
   # q1 = (w1, x1, y1, z1) at f[rs1..rs1+3]
   # q2 = (w2, x2, y2, z2) at f[rs2..rs2+3]
   # result = q1 × q2 stored at f[fd..fd+3]
   #
   # w = w1×w2 - x1×x2 - y1×y2 - z1×z2
   # x = w1×x2 + x1×w2 + y1×z2 - z1×y2
   # y = w1×y2 - x1×z2 + y1×w2 + z1×x2
   # z = w1×z2 + x1×y2 - y1×x2 + z1×w2

The quaternion-composition primitive. Internal cost: 16 multiplies + 12 adds. Latency: 8 cycles (pipelined through 4 internal FMA units in parallel where the DSP block budget allows).

Skeletal animation impact: a character with 50 bones needs 50 QMULs per frame to compose local-to-world transforms. With QMUL at 8 cycles each, that's 400 cycles per character; at 60 fps with 20 characters on-screen, that's 480K cycles/sec = 0.13% of CPU. Without QMUL, the same work takes ~2500 cycles per character (~6× slower).
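
The four component formulas translate directly into a C reference model. The quat type and qmul_ref name are illustrative; rounding of a fused hardware implementation is not modelled.

```c
typedef struct { float w, x, y, z; } quat;   /* (w, x, y, z) order as in the text */

/* Reference model of QMUL.S: Hamilton product a * b. */
quat qmul_ref(quat a, quat b)
{
    quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}
```

A quick sanity check is the unit-quaternion identity i × j = k, with j × i = -k. The product is not commutative, so operand order matters when composing parent and local rotations.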

13.2 QROT.S / QROT.D — Rotate Vector by Quaternion

QROT.S fd, rs1, rs2
   # q = (w, x, y, z) at f[rs1..rs1+3]   (assumed normalised)
   # v = (vx, vy, vz) at f[rs2..rs2+2]
   # result = q × v × q⁻¹ stored at f[fd..fd+2]   (rotated vector)
   #
   # Computed via the formula:
   # v' = v + 2 × cross(q.xyz, cross(q.xyz, v) + q.w × v)

The "apply this orientation to this direction" primitive. Used for transforming forward/right/up vectors, applying bone rotations to attached objects, computing facing direction after orientation changes.

Latency: 10 cycles (2 internal CROSS3 + 2 FMA chains). Internal implementation uses DSP block cascade for efficiency.

vs software: typically 25-30 FMA instructions, ~25 cycles. ~2.5× speedup.
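
The rotation formula from §13.2 can be checked with a short C reference model. Type and function names are illustrative, and the quaternion is assumed to be unit length, as the instruction itself assumes.

```c
typedef struct { float x, y, z; } vec3;
typedef struct { float w, x, y, z; } quat;

static vec3 cross3(vec3 a, vec3 b)
{
    vec3 c = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return c;
}

/* Reference model of QROT.S:
   v' = v + 2 * cross(q.xyz, cross(q.xyz, v) + q.w * v). */
vec3 qrot_ref(quat q, vec3 v)
{
    vec3 u = { q.x, q.y, q.z };
    vec3 t = cross3(u, v);
    t.x += q.w * v.x;
    t.y += q.w * v.y;
    t.z += q.w * v.z;
    vec3 c = cross3(u, t);
    vec3 out = { v.x + 2.0f * c.x, v.y + 2.0f * c.y, v.z + 2.0f * c.z };
    return out;
}
```

With q = (0, 0, 0, 1), a half turn about the z axis, the vector (1, 0, 0) maps to (-1, 0, 0) while (0, 0, 1) is left unchanged.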

13.3 Game Example: Character Bone Update

// Per-bone update (50 bones, 60 fps):
QMUL.S    world_quat, parent_quat, local_quat;     // 8 cycles
QROT.S    bone_forward, world_quat, base_forward;  // 10 cycles
// Compute bone position via additional VMADD3 chain
// ...
// Total per bone: ~25 cycles
// Total per character: 50 × 25 = 1250 cycles
// Total per frame at 20 characters: 25000 cycles; at 60 fps = 1.5M cycles/sec = ~0.4% of CPU at 380 MHz

vs software quaternion math:

// Per-bone, software:
// QMUL via 16 FMUL + 12 FADD: ~28 cycles
// QROT via 25+ FMA: ~25 cycles
// Total per bone: ~55 cycles
// Total per character: 50 × 55 = 2750 cycles
// Total per frame at 20 characters: 55000 cycles; at 60 fps = 3.3M cycles/sec = ~0.9% of CPU at 380 MHz

~2.2× speedup on bone updates. At higher character counts (action games with 100+ on-screen characters), the savings scale linearly.

14. Group 12: Multi-Precision Integer Arithmetic with Carry

Standard RV64GC has no add-with-carry: a multi-limb add is an add plus an sltu to recover the carry into a register, fed to the next limb. Xmath's G1 already provides multiply-high (MADDH/MADDU) — the partial-product half of schoolbook bignum multiply — but offers no way to accumulate those partials, because accumulation is the carry chain. G12 supplies the missing piece: a single carry bit and four instructions that consume and produce it, turning each limb of a multi-precision add, subtract, or shift into one instruction.

This is deliberately the only condition state in the EE. The full Z80/eZX flag model (Z, C, S, V, P) was kept out precisely because a register every arithmetic op writes serialises the dual-issue pipeline (see ee_xcond §1.2 Non-Goals). Carry is the exception worth making: limb N+1 genuinely cannot start until limb N's carry exists, so a single serial carry resource adds no false dependency the algorithm didn't already impose — and because only this four-instruction family touches it, it never sits in the hot ALU-pairing path the way a general flags register would.

14.1 The `xcarry` register

Carry/borrow lives in a one-bit CSR.

CSR	Address (suggested)	Privilege	Description
`xcarry`	`0x808`	URW	Multi-precision carry/borrow bit

Bits	Field	Meaning
`[0]`	C	Carry-out (after add / rotate) or borrow (after subtract)
`[63:1]`	reserved	WPRI — read as 0

xcarry is user read/write so bignum and crypto code can manage it directly. It is per-context state: Xctx saves and restores it across YIELD and the trap entry/return path folds it into the per-context flag storage already maintained — one bit, with none of the mscratch-bits convention the eZX needs for its 5-bit ezflags.

14.2 `ADDC` — Add with Carry

Instruction	Operation	Updates
`ADDC rd, rs1, rs2`	`rd = rs1 + rs2 + xcarry.C`	`xcarry.C ← carry-out (bit 64) of the unsigned 65-bit sum`

The workhorse. A chain of ADDC adds integers of any width, one instruction per 64-bit limb.

14.3 `SUBC` — Subtract with Borrow

Instruction	Operation	Updates
`SUBC rd, rs1, rs2`	`rd = rs1 - rs2 - xcarry.C`	`xcarry.C ← borrow (1 if rs1 < rs2 + C, unsigned)`

After a SUBC chain, xcarry.C = 1 indicates the minuend was smaller than the subtrahend — the multi-precision unsigned-less-than result, recovered without a flag-branch (§14.5).
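
A C model of ADDC and SUBC makes the carry and borrow rules explicit. It is a sketch: a plain global stands in for the xcarry CSR, and the function names are illustrative.

```c
#include <stdint.h>

/* Modelled xcarry.C bit; a global stands in for the CSR. */
unsigned xcarry = 0;

/* rd = rs1 + rs2 + C; C becomes the carry out of the 65-bit sum. */
uint64_t addc_ref(uint64_t a, uint64_t b)
{
    uint64_t partial = a + b;
    uint64_t sum = partial + xcarry;
    /* Carry out if either addition wrapped; both cannot wrap at once. */
    xcarry = (partial < a) | (sum < partial);
    return sum;
}

/* rd = rs1 - rs2 - C; C becomes 1 when rs1 < rs2 + C (no wraparound). */
uint64_t subc_ref(uint64_t a, uint64_t b)
{
    uint64_t partial = a - b;
    uint64_t diff = partial - xcarry;
    xcarry = (a < b) | (partial < xcarry);
    return diff;
}

/* 256-bit a += b, limbs least-significant first; returns the carry-out. */
unsigned add256(uint64_t a[4], const uint64_t b[4])
{
    xcarry = 0;                       /* csrrci x0, xcarry, 1 */
    for (int i = 0; i < 4; i++)
        a[i] = addc_ref(a[i], b[i]);
    return xcarry;
}
```

Adding 1 to a value whose two low limbs are all ones ripples the carry through both and leaves 1 in limb 2, which is the chain behaviour the instruction is designed for.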

14.4 `ROLC` / `RORC` — Rotate Through Carry

Instruction	Operation
`ROLC rd, rs1`	`rd = (rs1 << 1) | xcarry.C` ; new `xcarry.C = rs1[63]`
`RORC rd, rs1`	`rd = (rs1 >> 1) | (xcarry.C << 63)` ; new `xcarry.C = rs1[0]`

These are the through-carry rotates the standard Zbb rol/ror (circular, no carry) don't provide. They thread the carry bit across limbs, giving multi-precision shifts: RORC from the high limb down for a right shift, and for a left shift either ROLC from the low limb up, or — neatly — an ADDC self-add chain (§14.8), since shifting a number left by one is adding it to itself.
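
The multi-limb shift patterns can be modelled in C as follows. As before this is a sketch with a global standing in for xcarry, and the function names are illustrative.

```c
#include <stdint.h>

/* Modelled xcarry.C bit; a global stands in for the CSR. */
unsigned xcarry = 0;

/* ROLC: shift left by one, old carry enters bit 0, bit 63 leaves. */
uint64_t rolc_ref(uint64_t v)
{
    uint64_t r = (v << 1) | xcarry;
    xcarry = (unsigned)(v >> 63);
    return r;
}

/* RORC: shift right by one, old carry enters bit 63, bit 0 leaves. */
uint64_t rorc_ref(uint64_t v)
{
    uint64_t r = (v >> 1) | ((uint64_t)xcarry << 63);
    xcarry = (unsigned)(v & 1);
    return r;
}

/* 256-bit left shift by one: ROLC from the low limb up. */
void shl256_by1(uint64_t limb[4])
{
    xcarry = 0;
    for (int i = 0; i < 4; i++)
        limb[i] = rolc_ref(limb[i]);
}

/* 256-bit logical right shift by one: RORC from the high limb down. */
void shr256_by1(uint64_t limb[4])
{
    xcarry = 0;
    for (int i = 3; i >= 0; i--)
        limb[i] = rorc_ref(limb[i]);
}
```

After either loop, xcarry holds the bit shifted out of the 256-bit value, so the caller can test it or feed it into a wider chain.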

14.5 Carry management — clear, set, test

All three are standard CSR operations on xcarry; G12 adds no instructions for them.

    csrrci  x0, xcarry, 1     ; CLEAR carry   (start of an add chain)
    csrrsi  x0, xcarry, 1     ; SET carry     (rarely needed)
    csrr    t0, xcarry        ; TEST: carry → t0[0]

Clearing is the common case: an add chain must begin with carry known-zero. A plain ADD on limb 0 is not a substitute, because only the G12 family writes xcarry and limb 1's ADDC would then consume a stale carry. Setting is a Z80-ism (SCF) that borrow chains rarely need. Testing reads the bit into a GPR; you then branch with an ordinary register branch:

    csrr    t0, xcarry
    bnez    t0, overflow      ; standard branch — there is no branch-on-carry

There is deliberately no branch-on-carry instruction. The EE has no flag-branch paradigm (that is what ee_xcond's register-relative predication replaces), and routing the test through a GPR keeps it that way. The cost is one extra instruction over a fused BC/BNC, but branches do not dual-issue in any case, so the two-instruction form is the same throughput.

14.6 Predication

ADDC, SUBC, ROLC and RORC are R-type and inherit Xcond's wide-mode PRED-EN gate (bit 35) like the rest of Xmath (§1.4). The one corner worth stating: when the predicate is false, the instruction writes neither rd nor xcarry — the carry bit is treated as a second destination and is left unchanged, so a predicated-off limb does not silently corrupt the chain.

14.7 Microarchitecture

Single-cycle. All four ops are combinational ALU operations, 1-cycle throughput, like add/sub.
Serial by nature. Each carry-consuming op depends on the previous carry-producer through xcarry. This is intrinsic to multi-precision arithmetic, not added serialisation.
Dual-issue. The decoder will not co-issue two carry-touching operations in the same cycle — the four arithmetic ops plus any csr access to xcarry are all carry readers/writers forming a read-modify-write on one bit. Because carry ops are rare relative to general ALU work, the lost pairing is negligible, and crucially it is confined to this family: it does not impose the broad ALU-pairing penalty a general condition-code register would.
Scoreboarding. xcarry is tracked as a one-bit scoreboarded resource (like a register): a reader stalls only if a prior carry-writer has not yet retired.

14.8 Examples

256-bit add (limbs in a0..a3 += b0..b3):

    csrrci  x0, xcarry, 1     ; carry = 0
    ADDC    a0, a0, b0        ; limb 0
    ADDC    a1, a1, b1        ; limb 1 + carry
    ADDC    a2, a2, b2
    ADDC    a3, a3, b3
    csrr    t0, xcarry        ; final carry-out
    bnez    t0, overflow       ; standard branch

256-bit left shift by 1 (ADDC self-add — the MSB of each limb falls into carry and enters the next):

    csrrci  x0, xcarry, 1
    ADDC    a0, a0, a0
    ADDC    a1, a1, a1
    ADDC    a2, a2, a2
    ADDC    a3, a3, a3

256-bit right shift by 1 (RORC from the high limb down, shifting in 0 at the top):

    csrrci  x0, xcarry, 1     ; 0 shifts into bit 255
    RORC    a3, a3
    RORC    a2, a2
    RORC    a1, a1
    RORC    a0, a0

14.9 eZX (Z80) lineage and assembler lowering

G12 is the EE-native landing zone for the eZX Xez carry instructions, so eZX source ports without a flag register. The assembler lowers:

eZX / Xez	Lowers to (EE)
`ADDC`, `SUBC`	`ADDC`, `SUBC` (direct)
`RL`, `RR` (rotate through carry)	`ROLC`, `RORC`
`RLC`, `RRC` (rotate circular)	Zbb `rol`, `ror`
`SCF` / `CCF` / clear-carry idioms	`csrrsi` / `csrr`+toggle / `csrrci` on `xcarry`

The eZX *F flag-setting shifts (SLAF/SRAF/SRLF) that also update Z/S/P are not reproduced: the EE has no condition-code register, so those lower to a plain shift plus, where a later test is actually needed, an explicit compare. Only the carry bit survives the port — it is the single flag whose cross-instruction dependency is intrinsic to the arithmetic rather than an ergonomic convenience.

15. Examples

15.1 Audio Synthesis: 128-Voice Mixer

Mixing 128 voices into a stereo int16 output buffer:

// Per-sample inner loop:
int32 mix_L = 0;
int32 mix_R = 0;
for (int v = 0; v < 128; v++) {
    int32 sample = voices[v].sample();
    mix_L += sample * voices[v].pan_left;     // Q15 multiply-accumulate
    mix_R += sample * voices[v].pan_right;
}
int16 out_L = SAT.H(mix_L >> 15);
int16 out_R = SAT.H(mix_R >> 15);

With Xmath:

The MAC pair → 2× MADD instructions per voice = 256 cycles per sample
SAT.H × 2 = 2 cycles per sample
Voice generation (sample()) = ~10 cycles per voice = 1280 cycles per sample
Total: ~1540 cycles per stereo sample

At 48 kHz: 1540 × 48000 = 74M cycles/sec = 19% of CPU at 380 MHz.

Without Xmath:

MAC pair = 4 cycles per voice = 512 cycles per sample
SAT.H = ~10 cycles each = 20 cycles per sample
Voice generation unchanged
Total: ~1810 cycles per stereo sample

At 48 kHz: 1810 × 48000 = 87M cycles/sec = 23% of CPU. Xmath saves ~3.4% of CPU on this workload.

For voices with FMA-heavy synthesis (FM synthesis, wavetable interpolation with cubic), the savings climb to ~10–15% of CPU at 128 voices.

15.2 3D Vertex Transform

Transform a vertex by a 4×4 matrix (matrix-times-vector):

// out = M × v
out.x = DOT4.S(m_row0, v);    // 8 cycles
out.y = DOT4.S(m_row1, v);
out.z = DOT4.S(m_row2, v);
out.w = DOT4.S(m_row3, v);
// Total: 32 cycles per vertex

Without Xmath (vanilla RV64GD with FMA):

Each row: 4 FMADD.S = 4 × 4 = 16 cycles (no FMA parallelism in scalar)
4 rows: 64 cycles

Xmath = 2× speedup on per-vertex transform. For a scene of 50,000 vertices/frame: 50000 × 32 = 1.6M cycles per frame for transform. At the 60 fps used elsewhere in this document that is 96M cycles/sec, or ~25% of CPU at 380 MHz (~50% without Xmath), with the remainder left for shading and rasterization.

15.3 Sprite Rotation (BAM-Based)

A classic 2D sprite rotation:

// Per-frame angle update + rotation matrix build:
angle_bam += turn_rate;                          // 1 cycle
FSINCOSBAM.S sin_a, angle_bam;                   // 3 cycles
// matrix: [[cos_a, -sin_a], [sin_a, cos_a]]
// 4 element references = 4 register copies, ~4 cycles

// Per-pixel rotation: (apply inverse rotation to source coordinate)
src_x = MADD.S(out_x, cos_a, MADD.S(out_y, sin_a, 0));   // 4 cycles
src_y = MSUB.S(out_y, cos_a, MUL.S(out_x, sin_a));       // 4 cycles

Plus pixel sampling. For a 128×128 sprite rotating per frame at 60 Hz:

Per-pixel cost: ~12 cycles
128² × 12 = 196K cycles per sprite
At 60 Hz: 60 × 196K = 11.8M cycles/sec = ~3% of CPU per rotated sprite

Without Xmath (FMA-based, library trig):

Per-pixel cost: ~25 cycles
~6% of CPU per rotated sprite

Xmath = 2× speedup on rotation effects. Aggregate over many rotating sprites becomes substantial.

15.4 Vector Normalisation in Lighting

Phong lighting requires normalising every interpolated normal per pixel:

// Per-pixel normalisation:
float lensq = LENSQ3.S(n.xyz);         // 6 cycles
float inv_len = FRSQRT.S(lensq);       // 3 cycles
n.x *= inv_len;
n.y *= inv_len;
n.z *= inv_len;                         // 3 cycles
// Total: 12 cycles per pixel

Without Xmath:

// Per-pixel:
float lensq = n.x*n.x + n.y*n.y + n.z*n.z;     // 4 cycles (3 MULs + 2 ADDs with FMA: 3 cycles)
float len = sqrt(lensq);                          // ~20 cycles (software sqrt)
float inv_len = 1.0f / len;                       // ~15 cycles (FDIV)
n.x *= inv_len; n.y *= inv_len; n.z *= inv_len;  // 3 cycles
// Total: ~41 cycles per pixel

Xmath = ~3.4× speedup on per-pixel normalisation — critical for Phong-shaded or normal-mapped renderers (when run on the CPU; dedicated drawing hardware in Ant64 typically handles per-pixel work).

15.5 Collision Detection: Ray vs Triangle (Möller-Trumbore)

Testing a ray against a triangle is the fundamental primitive for picking, raycasting bullets, line-of-sight checks, and BSP traversal:

// Möller-Trumbore ray-triangle test using G6 + G7
edge1 = VSUB3(v1, v0);                   // 3 cycles
edge2 = VSUB3(v2, v0);                   // 3 cycles
h = CROSS3(ray_dir, edge2);              // 10 cycles
a = DOT3(edge1, h);                       // 6 cycles
if (fabs(a) < EPSILON) return false;     // 2 cycles
f = FRECIP.S(a);                          // 3 cycles
s = VSUB3(ray_origin, v0);                // 3 cycles
u = f * DOT3(s, h);                       // 6 + 1 = 7 cycles
if (u < 0 || u > 1) return false;
q = CROSS3(s, edge1);                     // 10 cycles
v = f * DOT3(ray_dir, q);                 // 7 cycles
if (v < 0 || u + v > 1) return false;
t = f * DOT3(edge2, q);                   // 7 cycles
return (t > EPSILON);
// Total per ray-triangle test: ~60 cycles

Without Xmath (using only standard RV64GD):

// Each VSUB3 → 3 FSUB (3 cycles)
// Each CROSS3 → 6 FMUL + 3 FSUB + register juggling (~15 cycles)
// Each DOT3 → 3 FMUL + 2 FADD or 2 FMA + FMUL (~6 cycles, similar)
// FRECIP → FDIV (~15 cycles)
// Total: ~110 cycles per ray-triangle test

~1.8× speedup on ray-triangle tests. A 3D game raycasting against ~500 triangles per query saves about 25K cycles per query (500 × ~50 cycles), roughly 66 µs at 380 MHz.
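
A portable C reference for the same test is useful for validating results against the bundle-based listing. The sketch below follows the listing step for step; the vec3 type, the helper names, the EPSILON value, and the use of an exact divide in place of FRECIP.S are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

static vec3 vsub3(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static float dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static vec3 cross3(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

/* Moller-Trumbore ray-triangle test. Returns true on a hit in front of
   the ray origin and stores the ray parameter in *t_out. */
static bool ray_triangle(vec3 ray_origin, vec3 ray_dir,
                         vec3 v0, vec3 v1, vec3 v2, float *t_out)
{
    const float EPSILON = 1e-6f;             /* assumed tolerance */
    assert(t_out != NULL);

    vec3 edge1 = vsub3(v1, v0);
    vec3 edge2 = vsub3(v2, v0);
    vec3 h = cross3(ray_dir, edge2);
    float a = dot3(edge1, h);
    if (a > -EPSILON && a < EPSILON) return false;   /* ray parallel to triangle */

    float f = 1.0f / a;                      /* exact divide instead of FRECIP.S */
    vec3 s = vsub3(ray_origin, v0);
    float u = f * dot3(s, h);
    if (u < 0.0f || u > 1.0f) return false;

    vec3 q = cross3(s, edge1);
    float v = f * dot3(ray_dir, q);
    if (v < 0.0f || u + v > 1.0f) return false;

    float t = f * dot3(edge2, q);
    if (t <= EPSILON) return false;          /* intersection behind the origin */
    *t_out = t;
    return true;
}
```

Because FRECIP.S is an approximation, results near triangle edges may differ slightly between this reference and the hardware sequence.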

15.6 Pathfinding: A* with OCTILE2 Heuristic

A* on a 1000×1000 grid with typical agent paths expanding ~3000 nodes:

// Inner expansion loop
while (!open_set.empty()) {
    node = open_set.pop_min_f();
    if (node == goal) return reconstruct_path();
    for (neighbour in node.successors()) {                  // 8 neighbours
        new_g = node.g + cost(node, neighbour);             // 2 cycles
        if (new_g < neighbour.g) {
            neighbour.parent = node;
            neighbour.g = new_g;
            neighbour.h = OCTILE2(neighbour, goal);          // 3 cycles
            neighbour.f = new_g + neighbour.h;               // 1 cycle
            open_set.update_or_insert(neighbour);            // ~20 cycles (heap)
        }
    }
}

Per-node cost with G10: ~30 cycles + heap. Per-pathfind cost (3000 nodes × 30) = 90K cycles.

Without G10 (manual heuristic): ~50 cycles + heap = 150K cycles per pathfind.

~1.7× speedup on A* throughput. At the per-pathfind costs above, a combat encounter triggering 20 AI pathfinds per second uses ~0.5% of CPU instead of ~0.8% at 380 MHz.
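
The heuristic step in the loop above can be sketched in portable C. The exact OCTILE2 semantics are defined elsewhere in this document, so the code below uses the conventional octile distance, max(dx, dy) + (√2 − 1) · min(dx, dy), as an assumption. The 10-bit fixed-point scale, the constant 424 (≈ 0.4142 × 1024), and the name octile2_fixed are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Octile distance between two grid cells, scaled by 1024.
   A straight step costs 1024; a diagonal step costs 1024 + 424 = 1448,
   which approximates sqrt(2) * 1024. */
static int64_t octile2_fixed(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    int64_t dx = (int64_t)x1 - (int64_t)x0;
    int64_t dy = (int64_t)y1 - (int64_t)y0;
    if (dx < 0) dx = -dx;
    if (dy < 0) dy = -dy;

    int64_t longer  = dx > dy ? dx : dy;
    int64_t shorter = dx > dy ? dy : dx;
    assert(shorter <= longer);

    /* One multiply, matching the single internal multiply noted for
       OCTILE2 in the DSP budget. */
    return longer * 1024 + shorter * 424;
}
```

For a goal 3 cells across and 4 cells down this returns 5368, i.e. about 5.24 straight-step units.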

15.7 AI Steering: Flocking Behaviour

Boids-style flocking, computing separation/alignment/cohesion forces for each agent:

// Per-agent flocking inner loop (against N nearby agents)
sep_total = VEC3(0, 0, 0);
ali_total = VEC3(0, 0, 0);
coh_total = VEC3(0, 0, 0);
for (other in nearby_agents) {
    diff = VSUB3(self.pos, other.pos);                       // 3 cycles
    dist_sq = LENSQ3(diff);                                   // 6 cycles
    if (dist_sq < separation_radius_sq) {
        inv_dist = FRSQRT.S(dist_sq);                         // 3 cycles
        force_mag = VSCALE3(diff, inv_dist);                  // 3 cycles
        sep_total = VADD3(sep_total, force_mag);              // 3 cycles
    }
    if (dist_sq < neighbour_radius_sq) {
        ali_total = VADD3(ali_total, other.velocity);         // 3 cycles
        coh_total = VADD3(coh_total, other.pos);              // 3 cycles
    }
}
// Combine forces, integrate position with VMADD3
// Per-agent cost: ~50 cycles + per-neighbour work

Per-neighbour cost with G7: ~24 cycles (without separation), ~33 cycles (with). Without G7: ~50 cycles per neighbour (more individual FSUB / FMUL / FADD).

~1.6× speedup on flocking inner loop. A typical flock of 50 boids checking the nearest 8 neighbours each runs at ~10K cycles per frame at 60 fps — well under 1% of CPU.

15.8 Skeletal Animation: Character Bone Hierarchy

A typical humanoid character has 50–80 bones, each with a local rotation expressed as a quaternion. Per-frame update walks the hierarchy and composes local rotations with parent world rotations:

// Per-bone update (50 bones, 60 fps, 20 characters on screen)
for (bone in skeleton.bones_in_hierarchy_order) {
    QMUL.S    bone.world_quat, bone.parent.world_quat, bone.local_quat;   // 8 cycles
    QROT.S    bone.world_forward, bone.world_quat, bone.local_forward;    // 10 cycles
    // Compute bone tip position via VMADD3 from parent + offset rotated by world_quat
    VMADD3.S  bone.world_pos, bone.parent.world_pos, bone.world_forward, bone.length;  // 4 cycles
}
// Per-bone: ~22 cycles
// Per-character: 50 × 22 = 1100 cycles
// Per frame (20 characters): 22,000 cycles ≈ 0.35% of CPU at 380 MHz and 60 fps

Without G11 quaternion ops:

// QMUL via software: 16 FMUL + 12 FADD ≈ 28 cycles
// QROT via software: ~25 FMA chain ≈ 25 cycles
// Per-bone: ~57 cycles
// Per-character: 50 × 57 = 2850 cycles
// Per frame (20 characters): 57,000 cycles ≈ 0.9% of CPU at 380 MHz and 60 fps

~2.5× speedup on skeletal animation. At higher character counts (action games with 100+ characters), this becomes a substantial fraction of frame budget.
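
The software fallback counted above (16 multiplies and 12 adds for the quaternion product) corresponds to the Hamilton product. The sketch below shows that product and a unit-quaternion vector rotation in portable C; the type names, the (w, x, y, z) component order, and the function names qmul / qrot are assumptions for illustration and may not match the QMUL.S / QROT.S register layout.

```c
#include <assert.h>

typedef struct { float w, x, y, z; } quat;
typedef struct { float x, y, z; } vec3;

/* Hamilton product a * b: 16 multiplies, 12 adds/subtracts. */
static quat qmul(quat a, quat b)
{
    quat r = {
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
    };
    return r;
}

/* Rotate v by unit quaternion q using
   v' = v + w * t + cross(u, t), where u = (q.x, q.y, q.z), t = 2 * cross(u, v). */
static vec3 qrot(quat q, vec3 v)
{
    /* Precondition: q is (approximately) unit length. */
    float n = q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z;
    assert(n > 0.99f && n < 1.01f);

    vec3 t = { 2.0f * (q.y * v.z - q.z * v.y),
               2.0f * (q.z * v.x - q.x * v.z),
               2.0f * (q.x * v.y - q.y * v.x) };
    vec3 r = { v.x + q.w * t.x + (q.y * t.z - q.z * t.y),
               v.y + q.w * t.y + (q.z * t.x - q.x * t.z),
               v.z + q.w * t.z + (q.x * t.y - q.y * t.x) };
    return r;
}
```

In the bone loop, qmul(parent_world, local) composes the rotations and qrot(world, local_forward) produces the direction that is then scaled by the bone length.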

15.9 Game Animation: Smooth UI Tween

A UI panel sliding in from off-screen over 0.3 seconds with eased motion:

// Per-frame
phase = (time - start_time) * 3.333;                  // 1 cycle
alpha = SMOOTHERSTEP.S(phase);                        // 5 cycles (clamps + quintic ease)
panel.x = LERP.S(start_x, end_x, alpha);              // 3 cycles
// Total: 9 cycles per animated property per frame

For 50 simultaneously-animating UI elements: 450 cycles/frame at 60 fps = 27K cycles/sec = essentially free.

Without G9: ~25 cycles per element. For 50 elements: 1250 cycles/frame = 75K cycles/sec.

~2.8× speedup on UI animation. Not large in absolute terms, but applied uniformly across every animated UI element it keeps the UI thread's overhead negligible.
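
A portable C version of the tween is sketched below for reference. It assumes the conventional smootherstep polynomial 6t⁵ − 15t⁴ + 10t³ with the input clamped to [0, 1]; the G9 definition of SMOOTHERSTEP.S is authoritative and may differ. The function names and the tween_x helper are hypothetical.

```c
#include <assert.h>

/* Conventional smootherstep: clamp to [0, 1], then 6t^5 - 15t^4 + 10t^3. */
static float smootherstep(float t)
{
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

/* Linear interpolation between a and b. */
static float lerp_f(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Eased position of a property that moves from start_x to end_x over
   `duration` seconds, beginning at start_time. */
static float tween_x(float start_x, float end_x,
                     float start_time, float time, float duration)
{
    assert(duration > 0.0f);
    float phase = (time - start_time) / duration;   /* listing uses * 3.333 for 0.3 s */
    float alpha = smootherstep(phase);
    return lerp_f(start_x, end_x, alpha);
}
```

The clamp inside smootherstep means the caller can keep evaluating the tween after the animation ends; the property simply stays at end_x.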

16. Implementation Notes

16.1 DSP Block Budget

Group	DSP blocks	Notes
G1 — Integer MAC	4 (shared with M-ext MUL)	Reuses the multiplier hardware with post-add via DSP block ALU
G2 — Saturating	0	Detection + clamp in fabric (~800 LUTs)
G3 — Min/Max/Sign/Abs	0	Combinational comparators (~300 LUTs)
G4 — FP Approximations	2–4 per active unit	LUT-backed approximations, possibly with Newton-Raphson refinement
G5 — BAM Trigonometry	1 (shared with G4)	Hardware modular reduction + table lookup
G6 — 3D Vector Bundles	4–6 (shared with FP FMA)	Sequenced FMA chain through existing FMA unit
G7 — Vector Componentwise	3–6 (shared with FP FMA)	3 parallel FMA paths or one with serial completion
G8 — 2D Math Primitives	2 (shared with G6)	Same FMA hardware as 3D bundles, one fewer lane
G9 — Game / Animation Math	2 (shared with G4)	SMOOTHSTEP / SMOOTHERSTEP use the FP FMA chain; CLAMP / STEP are combinational
G10 — Distance Heuristics	1 (integer multiplier)	OCTILE2's internal multiply uses one DSP block; others combinational
G11 — Quaternion Math	6–8 (shared with FP FMA)	QMUL and QROT internally use the same FMA chain; need extra register-file write bandwidth
Total dedicated DSPs	~10–15	Most groups share with existing units

Total Xmath area cost on Ant64: ~8,000–10,000 LUTs + 4 BSRAM blocks (for FP approximation tables) + ~10–15 dedicated DSP blocks beyond the existing M-extension and F/D extension hardware.

This fits comfortably on the GW5AST-138 (138K LUT, 298 DSPs), consuming a small fraction of the available fabric.

16.2 Implementation

Xmath is fully implemented on the GW5AST-138. Per-group status and bundle throughput:

Group	GW5AST-138
G1 — Integer MAC	Full
G2 — Saturating	Full
G3 — Min/Max/Sign/Abs	Full
G4 — FP Approximations	Full (FP32 + FP64)
G5 — BAM Trigonometry	Full
G6 — 3D Vector Bundles	Full
G7 — Vector Componentwise	Full (parallel FMA paths)
G8 — 2D Math Primitives	Full
G9 — Game / Animation Math	Full
G10 — Distance Heuristics	Full
G11 — Quaternion Math	Full
Throughput on G6/G7	1 / 3 cycles per bundle
Throughput on G11 QMUL	1 / 8 cycles

All models implement all Xmath instructions with identical latencies and throughput — the fabric is the same GW5AST-138 everywhere, so there is no per-variant difference and software portability is automatic.

16.3 Power and Frequency

The FP approximation tables and modular-reduction logic are mostly static (LUTs) and have minimal switching activity — power impact is small. The DSP-block-backed integer MAC and FP FMA paths use the same DSP blocks as the M and F/D extensions, so their power impact is the marginal cost of additional operations through the same units.

All Xmath operations are clocked at the same 380 MHz BSRAM/CPU clock; no separate clock domain.

16.4 Composition with Scoreboarding

Xmath operations interact with the scoreboarding system (§15.1 of ee_cpu) the same way other multi-cycle operations do:

An Xmath instruction issues to its functional unit and marks its destination register pending.
Subsequent instructions continue executing on the main pipeline until one needs the pending result.
This is especially useful for G4 (FP approximations) and G6 (vector bundles), where the 3–10 cycle latencies are well hidden behind subsequent independent work.

For G1 (integer MAC), the 2–3 cycle latency is short enough that scoreboarding's main benefit is hiding back-to-back MAC chains where the previous result is needed only a few instructions later.

17. Open Items

Items deferred to subsequent Xmath revisions:

Half-precision (FP16) Xmath variants. FRECIP.H / FRSQRT.H / FSIN.H / FCOS.H / FSINCOS.H / FATAN2.H for the Zfh extension. Useful for shader-style code where FP16 precision suffices, and saves 50% on DSP block usage per FMA. Likely to land in v0.2 with FP16 support.
Quad-precision (FP128) approximations. Same family for the Q (quad-precision) extension if implemented; deferred until Q lands.
Saturating shift operations. SLLS, SRLS (saturating left/right shifts) — niche but useful for fixed-point DSP. Deferred to a future revision.
Additional vector bundles.
- MATMUL3x3 / MATMUL4x4 as full 3×3 / 4×4 matrix multiplies. Cost: substantial — these need many internal FMAs and may better suit a future V (vector) extension.
- LERP3 / LERP4: vector linear interpolation (componentwise LERP on a 3- or 4-element tuple). Useful for skeletal animation blending.
- SLERP: spherical linear interpolation for quaternions. Significant complexity; deferred.
BAM-specific extensions.
- FSINBAM_LU / FCOSBAM_LU: variants of the BAM trig that linearly interpolate between table entries, for cases where bit-exact wraparound is needed and accuracy should be slightly better than the 12-bit lookup table provides. v0.2 candidate.
- BAMADD / BAMSUB: angle addition/subtraction with proper modular semantics. Trivial as integer ADD/SUB on BAM values — but a clear mnemonic might improve readability.
Vector extension (V) revisited. If, post-implementation, a clear need emerges for SIMD-style data parallelism that Xmath does not address (e.g., bulk image filtering, mass voice mixing into wide vectors), V could be added in v0.3+. The OP-V opcode (0x57) is currently allocated to Xmath; if V is added later it will need a different opcode allocation, which is straightforward given that several custom-N spaces and one or two reserved standard slots remain free.
FP exception handling. Xmath FP approximations do not raise IEEE-754 exception flags (they are approximations, not exact). Specifically: FRECIP of zero does not raise divide-by-zero; FRSQRT of a negative input does not raise invalid; FSIN/FCOS/FATAN2 of infinity return implementation-defined values. The fcsr register is unaffected by Xmath operations. This is consistent with approximation instructions such as SSE RCPSS / RSQRTSS, which signal no floating-point exceptions.
Encoding finalisation. The funct7 / funct3 assignments in §2.3 are nominal; the exact bit-level encoding is pending finalisation in coordination with the rest of the FireStorm extensions and with toolchain implementation. This includes the G12 multi-precision ops (ADDC/SUBC/ROLC/RORC) in the funct3 = 000 lane.
xcarry CSR address. 0x808 (§14.1) is suggested — the first free user-custom slot after the Xlate translator block at 0x800–0x807. Final assignment requires coordination with the other FireStorm extensions' CSR allocations, as flagged for Xstack, Xctx and Xlate. Also open: whether to widen G12 to multi-word rotate-by-N (currently rotate-by-1 only) and whether a future MULX/multiply-wide pairing with G1's MADDH would round out a complete bignum-multiply primitive set.
Compiler intrinsics. Naming and conventions for C / Rust intrinsics for Xmath operations. Standard practice would apply (e.g., __builtin_firestorm_madd, __builtin_firestorm_fsincosbam), but exact names are TBD.
Library integration. Whether libm's sin / cos / sqrt / etc. should dispatch to Xmath approximations by default (faster, less accurate) or only via explicit intrinsics (preserves IEEE-754 contract for libm). Recommended: explicit intrinsics for Xmath; libm preserves strict semantics with software fallback.
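
To make the xcarry item above concrete, the limb-by-limb pattern that G12 targets can be modelled in portable C. This is a behavioural sketch only: the local variable carry plays the role of the xcarry bit, each loop iteration corresponds to what one ADDC would do, and the function name add_limbs and the little-endian limb order are assumptions rather than anything defined by this specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb unsigned integers (least-significant limb first).
   Returns the carry out of the most-significant limb. */
static unsigned add_limbs(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL && n > 0);
    unsigned carry = 0;                      /* models the xcarry bit */

    for (size_t i = 0; i < n; ++i) {
        uint64_t sum = a[i] + carry;
        unsigned c1 = sum < a[i];            /* overflow from adding the carry in */
        sum += b[i];
        unsigned c2 = sum < b[i];            /* overflow from adding b[i] */
        dst[i] = sum;
        carry = c1 | c2;                     /* at most one of the two can be set */
    }
    return carry;
}
```

In C each limb needs two compares to recover the carry; the point of a carry CSR is that the hardware sequence needs neither.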

18. Glossary

Term	Meaning
BAM	Binary Angle Measure. An angle representation using a fixed-width integer (typically 16 or 32 bits) to span [0, 2π) with perfect modular wraparound. Native to fixed-point and retro-style code.
Fused MAC	Multiply-and-add performed in a single instruction with a single rounding step (FP) or no intermediate truncation (integer).
Saturating arithmetic	Arithmetic where overflow clamps to the maximum/minimum representable value rather than wrapping. Critical for audio (avoids audible distortion from wrap-around).
Vector bundle	A single instruction that executes a sequence of internal multiply-adds on a small fixed-size tuple of values. Not SIMD — sequential internal execution, but encoded as one instruction for density.
FRECIP / FRSQRT	Reciprocal and reciprocal-square-root FP approximations, typically with 0.05–0.1% relative error. Comparable to SSE RCPSS / RSQRTSS.
Register tuple	A consecutive group of registers (e.g., f5, f6, f7 for a DOT3 operand). Vector bundles read/write tuples; the instruction names only the starting register.
`xcarry`	The single-bit carry/borrow CSR (`0x808`, URW) used by the G12 multi-precision instructions (ADDC/SUBC/ROLC/RORC). The EE's only condition-code state; carried across instructions so that limb-by-limb integer arithmetic of any width is one instruction per limb. There is no branch-on-carry — the bit is read into a GPR and tested with a standard branch.
Approximation	An Xmath operation that produces a result correct to a stated precision target (typically 0.01% – 0.1%) but not IEEE-754-correct. Used where game-rendering precision suffices.
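
As an illustration of the saturating-arithmetic entry above, the following C sketch mixes two 16-bit audio buffers with clamping instead of wraparound. It is a software model only; the function names and the choice of 16-bit samples are assumptions, and the G2 instructions' operand widths are defined in their own section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating 16-bit add: overflow clamps to INT16_MAX / INT16_MIN. */
static int16_t sat_add_i16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Mix two sample buffers into dst with saturation. */
static void mix_saturating(int16_t *dst, const int16_t *a,
                           const int16_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = sat_add_i16(a[i], b[i]);
    }
}
```

With wrapping arithmetic, 30000 + 10000 would come out as a large negative sample; with saturation it stays at 32767, which is heard as mild clipping rather than a loud click.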