Ant64 Display System — More information...

Memory Architecture

The Ant64 uses a federated memory architecture — BSRAM (on-chip, 380MHz), wide-mode SRAM (~4.5MB, 1M × 36-bit pipeline, FireStorm Execution Engine wide-mode instruction fetch only), FPGA flash (read-only system assets), and FireStorm DDR3 (bulk store — up to 2GB, shared between FireStorm EE, blitter, and audio DSP). External chips reach FireStorm memory through QSPI FRAM bridges (one each for Pulse, DeMon, and the optional Pi Zero 2W accelerator). DeMon's ESP32-P4 also drives a MIPI display feed into FireStorm for boot UI and supervisor overlay.

For full details — chip specifications, 36-bit format table, bus isolation rationale, FRAM programming model, inter-subsystem data paths, and the complete memory map — see the Memory Architecture reference.

1. Output Clocking & HBR2 Specification

The primary display output is 4K@60 HDMI 2.0, delivered from the GoWin GW5AST-138 (FireStorm) hardware transceivers via a Parade PS176 DP→HDMI 2.0 protocol-converter chip on the PCB. The FPGA emits DisplayPort HBR2 internally; the PS176 converts to HDMI for the back-panel connector. See Main HDMI output for the full chip, pinout, power-rail, link-training and BOM detail.

Parameter	Value
Reference oscillator	135 MHz (shared with Colony Connection)
Lane count	4
Lane rate	5.4 Gbps (135 MHz × 40)
Raw aggregate	21.6 Gbps
Effective after 8b/10b	17.28 Gbps
Max pixel clock at 24bpp	~720 MHz

The 135 MHz oscillator is shared between FireStorm's two transceiver banks. Bank 1 drives the main HDMI output path (DP signaling out → PS176 chip → HDMI connector); Bank 2 drives Colony Connection (inter-machine network) on the Ant64 and Ant64C — the Ant64S does not include Colony Connection. Both PLLs derive from the same reference, eliminating inter-subsystem clock domain issues.

Important: Standard 4K/60 with full CEA-861 blanking requires a ~594 MHz pixel clock. At 24bpp that is ~14.26 Gbps and fits, but at 30bpp (deep colour) it is ~17.82 Gbps, which slightly exceeds HBR2's 17.28 Gbps effective ceiling. The 4K/60 baseline therefore targets CVT-RB (Reduced Blanking) timing, bringing the pixel clock to ~533 MHz (~16.0 Gbps at 30bpp), comfortably within budget.
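The link-budget figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the bits-per-pixel values passed in are assumptions (the 17.82 Gbps figure quoted for 594 MHz corresponds to 30 bits per pixel).

```python
# HBR2 link budget, computed in Mbps with integers to avoid float noise.
REF_MHZ = 135
LANES = 4
LANE_MBPS = REF_MHZ * 40              # 5400 Mbps per lane
RAW_MBPS = LANE_MBPS * LANES          # 21600 Mbps aggregate
EFFECTIVE_MBPS = RAW_MBPS * 8 // 10   # 17280 Mbps after 8b/10b coding


def required_mbps(pixel_clock_mhz, bits_per_pixel):
    """Payload rate needed for a given pixel clock and colour depth."""
    return pixel_clock_mhz * bits_per_pixel


def fits_hbr2(pixel_clock_mhz, bits_per_pixel):
    """True if the mode fits within the effective HBR2 payload rate."""
    return required_mbps(pixel_clock_mhz, bits_per_pixel) <= EFFECTIVE_MBPS


if __name__ == "__main__":
    print("max pixel clock at 24bpp:", EFFECTIVE_MBPS / 24, "MHz")
    for clock, bpp in [(594, 24), (594, 30), (533.25, 30)]:
        print(clock, "MHz at", bpp, "bpp:",
              required_mbps(clock, bpp) / 1000, "Gbps,",
              "fits" if fits_hbr2(clock, bpp) else "exceeds budget")
```

Protocol overhead beyond 8b/10b coding is ignored here, so real headroom is somewhat smaller than these numbers suggest.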

2. VRR (Adaptive Sync)

VRR is implemented by stretching the vertical blanking interval — no special protocol layer is needed. This is how DisplayPort Adaptive Sync works at the protocol level. Note that the PS176 converts to HDMI 2.0, which does not natively pass VRR through, so the user-facing VRR benefit of this mechanism is currently limited to the FPGA-internal pipeline; HDMI 2.1 VRR would require a different bridge chip.

Since the Ant64's internal render target is small (see Section 4), frame completion times are highly predictable. VRR is therefore useful for:

Smooth sub-60 Hz output when the application is doing heavy work
Precise cadence locking: PAL (50 Hz), NTSC (59.94 Hz), or arbitrary rates
The FireStorm side has no minimum — blanking can be stretched indefinitely
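As a rough illustration of blanking stretch, the sketch below computes how many total lines per frame are needed to hit a target refresh rate at a fixed pixel clock. The timing constants (533.25 MHz, 4000 total pixels per line, 2222 total lines) are the commonly published CVT-RB values for 3840×2160@60 and are an assumption here, not figures taken from this document; the function names are hypothetical.

```python
# Assumed CVT-RB timing for 3840x2160 at 60 Hz (not from this document).
PIXEL_CLOCK_HZ = 533_250_000
HTOTAL = 4000
VTOTAL_60 = 2222


def refresh_hz(pixel_clock_hz, htotal, vtotal):
    """Refresh rate produced by a given total frame size."""
    return pixel_clock_hz / (htotal * vtotal)


def stretched_vtotal(pixel_clock_hz, htotal, target_hz):
    """Total lines per frame (active + blanking) for a target refresh rate."""
    return round(pixel_clock_hz / (htotal * target_hz))


if __name__ == "__main__":
    for target in (60, 50, 48):
        vtotal = stretched_vtotal(PIXEL_CLOCK_HZ, HTOTAL, target)
        extra = vtotal - VTOTAL_60
        actual = refresh_hz(PIXEL_CLOCK_HZ, HTOTAL, vtotal)
        print(f"{target} Hz: vtotal={vtotal} (+{extra} blank lines), "
              f"actual {actual:.3f} Hz")
```

The pixel clock and line length never change; only the number of blank lines after the active area grows, which is why no extra protocol layer is required on the FireStorm side.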

Maximum Refresh Rates by Resolution

Resolution	Max VRR (approx)	Notes
3840×2160 (4K)	~75 Hz	CVT-RB blanking
2560×1440 (1440p)	~144 Hz	Non-integer scale from native — see below
1920×1080 (1080p)	~240 Hz

3. Output Paths

Two external display outputs plus two internal MIPI video inputs, each driven from — or composited into — the same FireStorm pipeline independently:

Output	Connector	FPGA signal	Bridge chip	Notes
Main HDMI	HDMI Type A	4 DP SerDes lanes, HBR2 (Bank 1)	Parade PS176 DP→HDMI 2.0	4K@60 primary target; HDCP / CEC / deep colour
Retro VGA + audio	DE-15 VGA + 3.5 mm TRS	second HDMI TX on fabric-pin LVDS (4 TMDS pairs)	Algoltek AG6201 HDMI→VGA	8-bit analogue RGB + embedded stereo audio; 480p–1080p60 plus 240p superresolution for genuine 15 kHz CRTs
Supervisor MIPI #0	(internal) MIPI D-PHY	FireStorm MIPI RX #0	—	DeMon ESP32-P4 → AntOS UI / overlay (input layer)
Supervisor MIPI #1	(internal) MIPI D-PHY	FireStorm MIPI RX #1	—	Pulse ESP32-P4 → sequencer / mixer UI (input layer)

There is no FPGA-side VGA DAC and no separate user-facing secondary HDMI port. Both external outputs leave the FPGA as digital streams: the main output as DisplayPort into the PS176, the retro output as a second, internal HDMI TX (fabric-pin LVDS, not user-accessible) into the AG6201, which performs the analogue conversion. The AG6201 carries video and audio in one HDMI stream and emits analogue RGB on the DE-15 plus line-level stereo on the 3.5 mm jack — see Retro VGA output for the full bridge, connector, modeline and cable/adapter detail.

Each external output can have independent simulation modes applied (see Section 8). For example, the main HDMI output could run aperture grille + bloom while the retro VGA output runs simple scanlines or no effect (the AG6201 analogue path produces softness naturally at lower modelines). The MIPI RX paths from DeMon and Pulse are input layers — the supervisors render content (boot UI, recovery menu, system overlay, sequencer UI) that FireStorm composites into the output stream like any other layer.

MIPI Inter-Chip Video Bus (DeMon ESP32-P4 → FireStorm)

DeMon's ESP32-P4 has an integrated MIPI DSI transmitter wired to FireStorm's MIPI RX #0 hardcell. The supervisor uses this for two main purposes:

AntOS UI layer — AntOS itself runs on DeMon's HP core, and its rendered UI is fed into FireStorm as a compositable display layer. Recovery screens, debug overlays, file browsers, system menus, the shell — all live here.
Bulk data path — when the UI is idle or transparent, the link can carry arbitrary bulk data into FireStorm memory at D-PHY speeds (up to 3 Gbps total).

(There is no boot-time direct-forward of DeMon's MIPI to the display outputs in the current architecture. The FPGA's native Ant64 chipset auto-loads from its own attached flash on power-on in a fraction of a second; only then does the compositor exist. AntOS runs on DeMon throughout — and the keyboard touchscreen and the case-mounted circular display from Pulse stay visible during the brief blackout, so the user always has feedback. Personality cartridges produce a longer blackout (up to ~1 second) when they replace the native chipset; a reset always returns the FPGA to native from flash.)

Physical layer: ESP32-P4 MIPI TX → FireStorm MIPI RX #0 hardcell. 2 data lanes + clock, D-PHY signalling, 1.5 Gbps per lane = 3 Gbps total. No external bridge chip needed — the ESP32-P4 and the FPGA both speak standard D-PHY. (FireStorm has two such inputs; the second, MIPI RX #1, is fed by Pulse's identical ESP32-P4 with the sequencer / mixer / sample-browser UI.)

Return path (FireStorm → supervisor). The inbound direction needs no bridge, but the reverse does: FireStorm has no MIPI TX hardcell (the GW5AST silicon hardens MIPI on RX only), so it drives LVDS TX into an external LVDS-to-MIPI CSI bridge (Lontium LT9211) — one per supervisor — feeding a CSI input on DeMon and on Pulse. Asynchronous (not vsync-locked), ~375 MB/s each, carrying composited-video capture, chipset telemetry, mixer-audio readback, and bulk transfers out of FireStorm DDR3.

Resolution: The ESP32-P4 typically generates 720p@60 for the AntOS UI feed; the FPGA can upscale or composite it into any output resolution.
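A quick sanity check of the link numbers above, as a sketch. It assumes the UI feed is 24 bits per pixel and ignores blanking and D-PHY protocol overhead, neither of which is specified here.

```python
LANES = 2
LANE_GBPS = 1.5

link_gbps = LANES * LANE_GBPS                 # 3.0 Gbps total
link_mbytes_per_s = link_gbps * 1000 / 8      # 375 MB/s


def payload_gbps(width, height, fps, bits_per_pixel):
    """Raw active-pixel payload, with no blanking or packet overhead."""
    return width * height * fps * bits_per_pixel / 1e9


ui_gbps = payload_gbps(1280, 720, 60, 24)     # assumed 24bpp UI feed

if __name__ == "__main__":
    print("link:", link_gbps, "Gbps =", link_mbytes_per_s, "MB/s")
    print("720p60 UI payload:", round(ui_gbps, 3), "Gbps")
    print("headroom:", round(link_gbps - ui_gbps, 3), "Gbps")
```

Under these assumptions the 720p@60 feed uses well under half of the link, which is consistent with the same lanes doubling as a bulk data path.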

DeMon ESP32-P4 (AntOS UI, overlays, recovery)
    ↓ MIPI D-PHY (2-lane, 1.5 Gbps/lane, 3 Gbps total)
FireStorm MIPI RX #0 hardcell
    ↓
FireStorm compositor (one input layer alongside FireStorm-generated layers
                      and the second MIPI input from Pulse)
    ↓
Main HDMI / retro VGA outputs (with CRT simulation, scaling, etc. applied)

Use cases:

AntOS shell and system UI — file browser, settings, network, debug server frontend, all rendered on DeMon and composited over whatever FireStorm is running
Recovery / safe-mode UI — if FireStorm crashes (or even if the FPGA bitstream needs reloading), AntOS keeps running on DeMon and presents the recovery interface as soon as the FPGA comes back up
System overlay — debug info, system notifications, OTA progress, composited over the running application
Touch UI — DeMon owns the touchscreen and can render its own UI layer for system menus

Retro VGA + Audio — Special Properties

The retro output is analogue, but the conversion happens in the AG6201 bridge chip, not in the FPGA. FireStorm emits a full-precision digital HDMI stream on its second (internal) HDMI TX; the AG6201 has a real integrated 8-bit-per-channel RGB DAC (0.7 Vpp into 75 Ω, AC-coupled, back-terminated) that produces the DE-15 signal, plus an embedded stereo audio DAC that decodes the audio packets carried in the HDMI data islands to the 3.5 mm jack (~1 Vrms line level). There is no 5-bit truncation, no quantisation trick and no temporal dithering anywhere in the path: the palette RAM's RGBA32 precision is carried as full 8-bit-per-channel colour all the way to the analogue stage. See Retro VGA output for the bridge datasheet detail, connector pinouts and external cable/adapter recipes.

Because video and audio ride the same HDMI stream off the same pixel-clock domain, audio is inherently locked to video — no drift, no resync logic, no separate audio path on the host side.

The analogue signal path still gives the retro output its characteristic additional properties:

Natural bandwidth limiting from cable and monitor input stage provides free horizontal softness, partially simulating limited CRT bandwidth.
Per-row brightness modulation for scanline effects is applied digitally in the FireStorm pipeline before the HDMI TX (it no longer relies on a host-side DAC), so it still costs zero framebuffer bandwidth.
Composite colour bleed simulation via a short FIR on Cb/Cr channels (after RGB→YCbCr) replicates the colour bandwidth limitation of composite video — applied in the digital pipeline, then carried through the AG6201's analogue stage.
240p superresolution — a 1440×240p60 (or 1440×288p50) mode at ~27 MHz pixel clock leaves the DE-15 at 15.734 kHz / 59.94 Hz, a signal a real PVM/BVM locks to as 240p and which reaches SCART CRTs through a passive HDMI-to-SCART adapter.
One caveat from the commodity bridge: the AG6201 crushes RGB codes ≤ 16, mitigated by a 16–255 output-range remap in the host HDMI TX (documented in rgb_out).
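The 16–255 remap mentioned in the last point could look something like the sketch below. The linear mapping and the function name are assumptions made for illustration; the actual behaviour is whatever rgb_out documents.

```python
def remap_output_range(code, floor=16, ceiling=255):
    """Map a full-range 0..255 code linearly onto floor..ceiling.

    Assumed linear remap, so that dark codes the bridge would crush
    are lifted to the bottom of the range it passes through.
    """
    if not 0 <= code <= 255:
        raise ValueError("code must be an 8-bit value")
    return floor + round(code * (ceiling - floor) / 255)


if __name__ == "__main__":
    for code in (0, 1, 16, 128, 255):
        print(code, "->", remap_output_range(code))
```

The cost of a remap like this is a small loss of tonal resolution: 256 input codes are squeezed into 240 output codes, so a few adjacent inputs share an output value.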

Colony Connection — Ant64 and Ant64C

Colony Connection is available on the Ant64 (Power) and Ant64C (Creative) models. The Ant64S (Starter) does not include Colony Connection — it is one of the peripheral features reserved for the higher tiers, alongside the Ethernet, DIN-MIDI, and optical-audio ports. All three models share the same GW5AST-138 FPGA and both display outputs: the main HDMI output (PS176 bridge) and the retro VGA + audio output (AG6201 bridge).

The Ant64 and Ant64C share essentially the same motherboard with selective connector population. The only main-board functional difference is that the Ant64C adds Ethernet; the DIN MIDI and optical-audio ports both now live on the Audio Expansion Header daughterboard (the Studio I/O variant shipped with the Ant64C by default). Both models have the same main HDMI + retro VGA outputs, the same Colony IO ports, and the same Colony RX ports fitted.

Both Ant64 and Ant64C have:

Two full bidirectional Colony IO ports (IO and IO2) — 2 TX + 2 RX lanes each at up to 7.83 Gbps per lane
Two receive-only Colony RX ports (RX and RX2) from the spare Bank 2 RX lanes — the Ant64C has connectors fitted, the Ant64 has unpopulated pads only

The RX and RX2 ports are inherently receive-only — peripherals can inject data into the Colony network but cannot read from it, making them architecturally secure input-only nodes.

The primary use case for the RX ports is an HDMI frame grabber — a small standalone Colony peripheral with an HDMI receiver IC that captures video from any HDMI source (games console, PC, camera) and streams it into the Colony network. FireStorm receives that stream as a Colony video layer and can process it through the full layer system — Copper effects, chroma keying, blending with generated content — outputting the result on any external display output.

With two RX ports, two independent HDMI frame grabbers can feed the machine simultaneously, giving FireStorm two live video streams as separate layers:

Picture-in-picture — two HDMI sources composited on screen simultaneously
Side-by-side comparison
Chroma key one stream over the other
Copper-driven wipe or blend transitions between streams
One stream as background, FireStorm-generated sprites and tilemaps over the top, second stream as overlay

Since the frame grabbers connect via Colony, they don't need to be physically attached to the machine — they can sit anywhere in the Colony string and their video streams are forwarded over the network.

See ant64.com/colony for full Colony Connection documentation.

4. Native Resolution System

FireStorm renders internally at a low native resolution which is then pixel-replicated to the output resolution. This eliminates most framebuffer bandwidth pressure and provides the scale budget needed for CRT simulation effects.
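Pixel replication is simple enough to show in a few lines. This is a software sketch of the idea, not a description of the hardware: each output pixel maps back to a native pixel by integer division, so no filtering or interpolation is involved.

```python
def native_coord(out_x, out_y, x_scale, y_scale):
    """Which native pixel an output pixel is replicated from."""
    return out_x // x_scale, out_y // y_scale


def replicate(native, x_scale, y_scale):
    """Expand a 2D list of pixels by integer factors on each axis."""
    out = []
    for row in native:
        wide = [px for px in row for _ in range(x_scale)]
        out.extend(list(wide) for _ in range(y_scale))
    return out


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    for row in replicate(tiny, 2, 2):
        print(row)
    # Output pixel (3839, 2159) of a 4K frame at x8 scale:
    print(native_coord(3839, 2159, 8, 8))
```

Because the two scale factors are separate parameters, the same mapping also covers the non-square pixel modes described later.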

Primary Native Resolutions

The Ant64 has two primary named native resolutions, directly echoing the Amiga's low res / high res relationship:

480×270 — Standard

Closely matches the Amiga's low res pixel size at its respective output resolution (Amiga low res: 320×256 PAL)
Exactly 1/8 of 4K in each axis (480×8 = 3840, 270×8 = 2160)
Perfect ×4 integer scale to 1080p
The natural home for games, demos, and content that wants the classic retro pixel feel

960×540 — Hires

Exactly double 480×270 in both axes — the same relationship as Amiga low res to high res
×4 to 4K, ×2 to 1080p
Sharp UI, detailed backgrounds, text rendering, productivity use
The natural home for applications where pixel density matters more than the retro aesthetic

These two resolutions are peers, not a hierarchy. Which one is "right" depends entirely on what you're making. A game running at Standard gets the full 8×8 CRT simulation block budget on 4K output. An application running at Hires gets twice the canvas with a still-respectable 4×4 block budget.

All other valid resolutions from the H and V tables are equally available — Standard and Hires are named reference points, not constraints.

The Two Independent Axes

The horizontal and vertical scale factors are completely independent. Any horizontal width from the H table can be paired with any vertical height from the V table to form a valid native resolution.

5. Valid Horizontal Native Widths

Must divide GCD(1920, 3840) = 1920 for integer scaling to both outputs. 3840 is a special case — native 4K width, main HDMI output only (the retro VGA path tops out at 1080p so cannot accept 4K-width modes).

Width	→1080p scale	→4K scale	Notes
3840	— (HDMI only)	×1	Native 4K — tilemap/HAM24 main HDMI output only
1920	×1	×2	Full HD native
960	×2	×4	← Ant64 Hires
640	×3	×6	VGA, Amiga hires, DOS
480	×4	×8	← Ant64 Standard
384	×5	×10	Capcom CPS-1/CPS-2 arcade
320	×6	×12	Amiga low res, Mega Drive, DOS Mode 13h
240	×8	×16	Half-width Ant64, fat pixel mode
192	×10	×20	—
160	×12	×24	Game Boy, Atari Lynx
128	×15	×30	ZX81, early micros
120	×16	×32	—
96	×20	×40	—
80	×24	×48	BBC Micro MODE 0, text columns
64	×30	×60	Commodore PET
60	×32	×64	—
48	×40	×80	—
40	×48	×96	C64 and BBC MODE 1 text columns
32	×60	×120	ZX Spectrum text columns
24	×80	×160	—
16	×120	×240	Impractical

6. Valid Vertical Native Heights

Must divide GCD(1080, 2160) = 1080 for integer scaling to both outputs. 2160 is a special case — native 4K height, main HDMI output only.

Height	→1080p scale	→4K scale	Notes
2160	— (HDMI only)	×1	Native 4K — tilemap/HAM24 main HDMI output only
1080	×1	×2	Full HD native
540	×2	×4	← Ant64 Hires
360	×3	×6	—
270	×4	×8	← Ant64 Standard
216	×5	×10	—
180	×6	×12	—
135	×8	×16	—
120	×9	×18	—
108	×10	×20	—
90	×12	×24	—
72	×15	×30	—
60	×18	×36	—
54	×20	×40	—
45	×24	×48	—
40	×27	×54	—
36	×30	×60	—
30	×36	×72	—
27	×40	×80	—
24	×45	×90	—
20	×54	×108	—
18	×60	×120	—
15	×72	×144	—
12	×90	×180	—
10	×108	×216	—
9	×120	×240	Impractical

7. Notable Pixel Aspect Ratio Modes

By pairing a width from the H table with a height from the V table that uses a different scale factor, a wide range of pixel aspect ratios (PAR) can be constructed.

Fat Pixels — PAR 2:1 (pixel twice as wide as tall)

Classic multicolour modes — C64 multicolour, many arcade games, CGA multicolour.

Notable examples at 4K:

Native W	Native H	→4K (X×Y)	Feel
240	270	×16 × ×8	Half-width Ant64 — C64 multicolour
160	180	×24 × ×12	Fat sprite machines
320	270	×12 × ×8	Wide multicolour (PAR 3:2 rather than 2:1)

Tall Pixels — PAR 1:2 (pixel twice as tall as wide)

BBC Micro MODE 0, some teletext-style displays, and Tate mode vertical arcade games.

Native W	Native H	→4K (X×Y)	Feel
960	270	×4 × ×8	Cinematic widescreen, panoramic
480	135	×8 × ×16	Very tall pixels, distinctive
270	480	×8 × ×8 (per half-screen)	Tate mode — portrait arcade on landscape screen

The 270×480 mode is the natural home for Tate mode. At 4K output, two 270×480 layers with Tate simulation enabled fit side by side in a perfect 50/50 split — 1920 output pixels each. Two independent vertical arcade games on one screen, each with rotated CRT scanlines. See Section 8 (Tate Mode).

4:3 Content on 16:9 Screen — PAR 4:3

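The 4:3 native frame is stretched to fill the 16:9 screen, so each native pixel is displayed 4:3 wide. Content authored with this PAR in mind is undistorted: a shape that spans 3 native pixels horizontally for every 4 vertically appears as a circle on screen.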

Check: native 480×360 × PAR 4:3 → (480×4)/(360×3) = 1920/1080 = 16:9 ✓

Native	→4K (X×Y)	Retro equivalent
480 × 360	×8 × ×6	Classic VGA-era 4:3
320 × 240	×12 × ×9	SNES, PlayStation, DOS, CPS2 arcade
240 × 180	×16 × ×12	Low-res
160 × 120	×24 × ×18	GBA-ish
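The PAR of any native mode is just the ratio of its two scale factors. A minimal sketch, assuming a 3840×2160 output by default (the function name is made up for this example):

```python
from fractions import Fraction


def pixel_aspect(native_w, native_h, out_w=3840, out_h=2160):
    """Pixel aspect ratio (width:height) of one native pixel on screen."""
    if out_w % native_w or out_h % native_h:
        raise ValueError("native size does not integer-scale to this output")
    return Fraction(out_w // native_w, out_h // native_h)


if __name__ == "__main__":
    for w, h in [(480, 270), (240, 270), (960, 270), (480, 360), (320, 240)]:
        par = pixel_aspect(w, h)
        print(f"{w}x{h}: PAR {par.numerator}:{par.denominator}")
```

Using Fraction keeps the result exact, so a mode can be compared directly against a target ratio such as Fraction(4, 3).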

Famous Resolutions That Don't Integer Scale to 1080p/4K

These iconic resolutions have widths or heights that do not divide 1920 or 1080, and therefore cannot be integer-scaled cleanly:

Resolution	Problem	Famous uses
256 wide	1920÷256 = 7.5	NES, SNES, ZX Spectrum, Master System
512 wide	1920÷512 = 3.75	SNES hi-res modes
224 tall	1080÷224 = 4.82	NES, SNES active area
240 tall	1080÷240 = 4.5	NTSC standard, NES, SNES
200 tall	1080÷200 = 5.4	DOS/CGA/EGA, C64
192 tall	1080÷192 = 5.625	ZX Spectrum, Master System

The closest clean Ant64 equivalent to the common NES/SNES feel would be 240 wide × 270 tall (fat pixel, ×16×8 to 4K).

8. CRT Simulation Effects

Applied in the FireStorm output pipeline just before the output encoders: bloom runs on native pixels ahead of pixel replication, and the mask stages run on the replicated pixels. The two external outputs can run independent simulation modes — the main HDMI and the retro VGA path can each use a different mode simultaneously with no interaction between paths.

Scale Budget Per Output

The fidelity of CRT simulation is directly proportional to the scale factor — more output pixels per native pixel means more room to model the phosphor structure. At 8×8 the simulation can be highly authentic; at 4×4 basic effects work well; at 2×2 only crude scanline darkening is possible.

Output	Target res	X scale	Y scale	Pixel block	Simulation quality
Main HDMI	3840×2160	×8	×8	8×8	Highest — full phosphor simulation
Main HDMI	1920×1080	×4	×4	4×4	Good — scanlines + basic mask
Retro VGA	1920×1080	×4	×4	4×4	Good — AG6201 analogue path adds natural softness
Retro VGA	960×540	×2	×2	2×2	Basic — scanline only

Primary simulation target is 4K main HDMI. The retro VGA analogue signal path (after the AG6201 DAC) provides free horizontal softness that partially substitutes for the phosphor mask simulation at lower scale factors.

Pipeline Ordering

The full simulation pipeline per output path:

Native pixel (from framebuffer / tilemap / sprite composite)
    ↓
[1] Bloom pre-pass       — separable blur, models phosphor glow
    ↓
[2] Pixel replicator     — expands native pixel to H_scale × V_scale block
    ↓
[3] Row brightness mask  — scanline simulation (Y axis)
    ↓
[4] Column brightness mask — pixel boundary darkening (X axis)
    ↓
[5] Phosphor mask        — RGB aperture pattern (X+Y combined)
    ↓
[6] Final multiply/blend
    ↓
Output encoders (DP→PS176 for main HDMI / HDMI TX→AG6201 for retro VGA)

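Bloom is applied before replication because it operates on native-resolution pixels: after replication each native pixel's neighbours are H_scale or V_scale output pixels away, so the same short kernel would mostly sample the pixel's own block and the glow would barely reach the next native pixel. All other stages operate on the replicated output pixels.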

Scanline Simulation (Y axis) — Stage 3

A brightness multiplier applied per row within each logical pixel block, simulating the dark gaps between CRT electron beam scan passes. At 8× Y scale, 8 output rows represent one native pixel height — the mask profile determines how many are lit and at what brightness.

8× Y scale profiles (one value per output row within the block):

Profile	Row brightnesses [0..7]	Fill %	Reference
TV Thick	100,100,100,100,0,0,0,0	50%	C64 / Spectrum on domestic SCART TV
TV Soft	100,100,100,100,100,60,20,60	80%	BBC Micro on domestic TV
Monitor Sharp	100,100,100,100,100,100,30,0	~79%	Amiga on Philips CM8833
Arcade	100,100,100,100,100,100,100,20	90%	JAMMA arcade CRT
PVM	100,100,100,100,100,100,80,40	90%	Sony PVM broadcast monitor
Off	100,100,100,100,100,100,100,100	100%	Flat LCD — no simulation

4× Y scale profiles (4 rows per native pixel):

Profile	Row brightnesses [0..3]	Fill %
TV	100,100,0,0	50%
Monitor	100,100,100,30	~83%
Arcade	100,100,100,80	~95%
Off	100,100,100,100	100%

The profile is stored as an 8-entry LUT (one byte per row) in BSRAM — trivially small. The active profile selects which LUT is used. At 4× scale only the first 4 entries are used.
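The row-mask lookup can be sketched as follows. For readability the LUT entries are kept as the percentages from the tables rather than as bytes, which is a simplification of this example; the function names are hypothetical.

```python
# Row brightness profiles (percent), copied from the 8x table.
PROFILES_8X = {
    "TV Thick":      [100, 100, 100, 100, 0, 0, 0, 0],
    "Monitor Sharp": [100, 100, 100, 100, 100, 100, 30, 0],
    "Off":           [100] * 8,
}


def fill_percent(profile):
    """Average brightness of a profile, i.e. the Fill % column."""
    return sum(profile) / len(profile)


def apply_scanline(rgb, profile, out_row, y_scale=8):
    """Scale one output pixel by the LUT entry for its row within the block."""
    gain = profile[out_row % y_scale]
    return tuple(channel * gain // 100 for channel in rgb)


if __name__ == "__main__":
    for name, profile in PROFILES_8X.items():
        print(f"{name}: fill {fill_percent(profile):.2f}%")
    sharp = PROFILES_8X["Monitor Sharp"]
    for row in range(8):
        print(row, apply_scanline((200, 100, 50), sharp, row))
```

At 4× scale the same function works with y_scale=4, reading only the first four LUT entries as described above.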

Phosphor Mask Simulation (X+Y) — Stage 5

Real CRT phosphors are shaped apertures arranged in repeating patterns — not square pixels. Three mask types are supported, each representing a different physical display technology.

Shadow Mask — used in most consumer TVs and many monitors. Phosphor triads arranged in a triangular dot pattern. The repeat unit tiles at approximately 3 columns × 2 rows. At 8× X scale, roughly two full RGB triads fit across one native pixel width:

Row 0: R . G . B . R .    (dots at positions 0,2,4,6)
Row 1: . G . B . R . G    (dots offset by 1)
Row 2: B . R . G . B .    (row 0 shifted right by 2)
Row 3: . R . G . B . R    (row 1 shifted)

Each dot position is full brightness for that channel; off-dot positions blend neighbouring colours at reduced brightness (~30–40%) to simulate phosphor bleed. The result is a warm, slightly soft look — the default "TV" aesthetic.

Aperture Grille — Sony Trinitron / Mitsubishi Diamondtron. Vertical phosphor stripes with no horizontal structure except thin damper wires. At 8× X scale with 24bpp content:

Cols 0,1: Red channel full, G+B reduced (~20%)
Cols 2,3: Green channel full, R+B reduced (~20%)
Cols 4,5: Blue channel full, R+G reduced (~20%)
Cols 6,7: Red channel full (repeat)

Damper wires appear as faint (~85% brightness) horizontal bands at approximately 1/3 and 2/3 of the screen height, which is around output rows 720 and 1440 at 4K. These are applied as a global Y-position modulation separate from the per-block row mask.

The aperture grille look is sharper and more saturated than shadow mask — the "Trinitron look" strongly associated with high-quality retro computing displays (Amiga, Mac, workstations).

Slot Mask — common in 90s PC monitors. Rectangular apertures in a brick pattern — a hybrid between shadow mask and aperture grille. Slightly more structured than shadow mask, slightly warmer than aperture grille:

Row 0: RR GG BB RR GG BB RR GG    (2-wide slots)
Row 1: RR GG BB RR GG BB RR GG    (same)
Row 2: BB RR GG BB RR GG BB RR    (offset by 2 — the "brick" shift)
Row 3: BB RR GG BB RR GG BB RR

Each mask is stored as a small LUT indexed by (output_col mod pattern_width, output_row mod pattern_height) — typically 6–8 bytes for the repeat unit. The entire phosphor mask LUT for all three types fits in well under 1KB.
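A sketch of the modulo-indexed lookup, using the aperture grille pattern described earlier (two columns per colour, ~20% for the other channels). The data layout, with one (R, G, B) gain tuple per LUT cell, is an assumption of this example rather than the hardware format.

```python
FULL, DIM = 1.0, 0.2

# One repeat unit of the aperture grille: 6 columns wide, 1 row high.
APERTURE_GRILLE = [[
    (FULL, DIM, DIM), (FULL, DIM, DIM),    # cols 0,1: red stripe
    (DIM, FULL, DIM), (DIM, FULL, DIM),    # cols 2,3: green stripe
    (DIM, DIM, FULL), (DIM, DIM, FULL),    # cols 4,5: blue stripe
]]


def mask_gain(mask, out_col, out_row):
    """Per-channel gain at an output pixel, tiling the repeat unit."""
    row = mask[out_row % len(mask)]
    return row[out_col % len(row)]


def apply_mask(rgb, mask, out_col, out_row):
    """Multiply one output pixel by the mask gains at its position."""
    gains = mask_gain(mask, out_col, out_row)
    return tuple(round(channel * gain) for channel, gain in zip(rgb, gains))


if __name__ == "__main__":
    for col in range(8):
        print(col, apply_mask((255, 255, 255), APERTURE_GRILLE, col, 0))
```

Because the lookup uses the output column modulo the pattern width, a 6-column pattern simply carries on across 8-pixel block boundaries; columns 6 and 7 repeat the red stripe as in the listing above.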

Pixel Boundary Darkening (X axis) — Stage 4

Simulates the dark gap between adjacent pixels even on flat-emissive displays, giving a sense of discrete pixel structure reminiscent of PVM monitors. Applied as a column brightness profile within each logical pixel block:

8× X scale profile:

Col 0: 70%   (rising edge)
Col 1: 100%
Col 2: 100%
Col 3: 100%
Col 4: 100%
Col 5: 100%
Col 6: 100%
Col 7: 70%   (falling edge)

The soft edge on cols 0 and 7 rather than a hard cutoff avoids an overly mechanical look. The exact rolloff values are stored in the column mask LUT alongside the row mask.

Bloom / Phosphor Glow — Stage 1

Bright pixels on a real CRT bleed light into neighbouring pixels: the phosphor glows beyond its aperture when driven hard. Implemented as a separable 5-tap horizontal + 5-tap vertical blur applied at native resolution before pixel replication:

Horizontal: [0.05, 0.15, 1.0, 0.15, 0.05]  (centre + 2 neighbours each side)
Vertical:   [0.05, 0.15, 1.0, 0.15, 0.05]

Requires a small line buffer of native-resolution rows in BSRAM: the 5-tap vertical pass needs four buffered rows, at ~480 pixels × 32bpp = ~1.9KB per row at Standard resolution (under 8KB in total, still negligible). The result is a faint coloured halo around bright pixels on dark backgrounds, most visible on white text on black, bright sprites against dark backgrounds, and raster bar effects.

Bloom strength is a configurable register value (0 = off, 255 = maximum glow). At moderate settings (~64–96) it adds authenticity without being visually distracting.
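A single-channel sketch of the separable blur, using the kernel weights listed above. How the strength register scales the glow is not specified here, so this example assumes the neighbour taps are multiplied by strength/255 and that pixels beyond the edge count as black; both are assumptions.

```python
NEAR_TAP = 0.15   # immediate neighbours
FAR_TAP = 0.05    # neighbours two pixels away


def bloom_1d(line, strength):
    """One pass of the glow along a line of 0..255 channel values."""
    scale = strength / 255          # assumed mapping of the register value
    n = len(line)

    def at(i):
        return line[i] if 0 <= i < n else 0   # off-edge pixels treated as black

    out = []
    for i in range(n):
        glow = NEAR_TAP * (at(i - 1) + at(i + 1)) + FAR_TAP * (at(i - 2) + at(i + 2))
        out.append(min(255, round(line[i] + scale * glow)))
    return out


def bloom_2d(image, strength):
    """Separable blur: horizontal pass on rows, then vertical pass on columns."""
    rows = [bloom_1d(row, strength) for row in image]
    cols = [bloom_1d(list(col), strength) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    print(bloom_1d([0, 0, 255, 0, 0], 255))
    print(bloom_1d([0, 0, 255, 0, 0], 0))
```

The centre tap of 1.0 means the kernel adds light rather than redistributing it, so the result has to be clamped; bright areas get slightly brighter as well as wider.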

Tate Mode — 90° Rotated Scanline Simulation

Tate mode (from 縦, tate — Japanese for "vertical") is a single bit in the simulation mode register that rotates the entire CRT simulation pipeline 90°. It is designed for vertical arcade games (shoot-em-ups, platformers, Donkey Kong-style) displayed on a horizontal screen.

Normally the scanline brightness profile runs horizontally — dark gaps between rows. In Tate mode the profile is transposed: the dark gaps run vertically instead, simulating how a real CRT would look if the monitor were rotated 90° to display the game in portrait orientation. The phosphor mask patterns are similarly transposed — the column index and row index are swapped in the LUT lookup.

Implementation cost: The LUTs are already indexed by (col_within_block, row_within_block). Tate mode simply swaps those indices to (row_within_block, col_within_block). That is a handful of 2:1 multiplexers on the index bits plus a register bit. Essentially free.
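The index swap is easy to demonstrate in software. In this sketch the 2D LUT is built from the Monitor Sharp row profile so that it varies only by row; the table layout and function name are assumptions for illustration.

```python
MONITOR_SHARP = [100, 100, 100, 100, 100, 100, 30, 0]

# 8x8 block LUT, indexed [row][col]; every column in a row shares one value.
BLOCK_LUT = [[value] * 8 for value in MONITOR_SHARP]


def lut_lookup(lut, col_in_block, row_in_block, tate=False):
    """Look up a block gain, swapping the two indices when Tate mode is set."""
    if tate:
        col_in_block, row_in_block = row_in_block, col_in_block
    return lut[row_in_block][col_in_block]


if __name__ == "__main__":
    for tate in (False, True):
        print("tate =", tate)
        for row in range(8):
            print(" ".join(f"{lut_lookup(BLOCK_LUT, col, row, tate):3}"
                           for col in range(8)))
```

With tate=False the dim values form the bottom two rows of the block; with tate=True the same values form the two right-hand columns, which is the transposed profile described above.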

The result is a portrait arcade game on a landscape screen that reads authentically as "rotated CRT" rather than "pillarboxed LCD." The vertical dark gaps between columns give the same visual weight and texture as scanlines do on a normal horizontal game.

Dual Tate — Two Vertical Games Side by Side

At 4K output, two portrait games fit side by side in a clean 50/50 split using the 270×480 native resolution — tall pixels (PAR 1:2), each half exactly 1920 output pixels wide:

Layer	Native	PAR	X_offset	Output width	Notes
Game A	270×480	1:2 (tall pixels)	0	1920	Left half — Tate mode
Game B	270×480	1:2 (tall pixels)	1920	1920	Right half — Tate mode

270×480 is valid with the axes swapped: 270 divides 1080 (×4 to 1080p, ×8 to 4K) and 480 divides 1920 (×4 to 1080p, ×8 to 4K, the horizontal factor used for the height). Each game gets its own layer, its own sprite layer, its own palette, its own rotated CRT simulation. The layers simply have different X_offsets, and the layer system handles this with no special casing.

A narrow separator layer between them (a 2-pixel wide border or a thin HAM24 marquee strip) can be added as another layer at X_offset=1919, costing nothing.

Both games run simultaneously and completely independently, composited in hardware.

A preset register selects a named combination rather than requiring individual effect configuration. Presets are starting points — individual effects can be overridden:

Preset	Scanline	Phosphor mask	Boundary	Bloom	Tate	Reference
Off	Off	Off	Off	Off	Off	Flat LCD, no simulation
TV	TV Thick	Shadow Mask	Off	Low	Off	C64 / Spectrum on domestic TV
TV Soft	TV Soft	Shadow Mask	Off	Medium	Off	BBC Micro on domestic TV
RGB Monitor	Monitor Sharp	Aperture Grille	On	Low	Off	Amiga on CM8833 / Trinitron
PVM	PVM	Aperture Grille	On	Off	Off	Sony PVM broadcast monitor
Arcade	Arcade	Slot Mask	Off	Medium	Off	JAMMA arcade cabinet
PC Monitor	Monitor Sharp	Slot Mask	On	Off	Off	90s PC VGA monitor
Tate Arcade	Arcade	Slot Mask	Off	Medium	On	Vertical arcade cabinet — rotated CRT
Tate PVM	PVM	Aperture Grille	On	Off	On	Vertical game on rotated broadcast monitor
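The "preset plus overrides" model maps naturally onto an immutable record with a copy-and-modify step. The sketch below is purely illustrative: the class, field and function names are invented for this example and do not describe the actual register layout or system call; only a few rows of the table are included.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimMode:
    scanline: str
    phosphor: str
    boundary: bool
    bloom: str
    tate: bool


# A subset of the preset table above.
PRESETS = {
    "Off":         SimMode("Off", "Off", False, "Off", False),
    "TV":          SimMode("TV Thick", "Shadow Mask", False, "Low", False),
    "PVM":         SimMode("PVM", "Aperture Grille", True, "Off", False),
    "Tate Arcade": SimMode("Arcade", "Slot Mask", False, "Medium", True),
}


def resolve(preset_name, **overrides):
    """Start from a named preset and override individual effects."""
    return replace(PRESETS[preset_name], **overrides)


if __name__ == "__main__":
    print(resolve("PVM"))
    print(resolve("PVM", bloom="Low"))
    print(resolve("TV", tate=True))
```

Keeping the presets frozen means an override never alters the shared table, which mirrors the idea that presets are starting points rather than mutable state.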

Named Monitor Profiles

Beyond the generic presets, specific iconic displays had highly distinctive looks that are worth documenting as named profiles. Each is a precise combination of scanline profile, phosphor mask, boundary darkening, and bloom settings that together reproduce the character of that specific hardware.

Commodore 1084S / 1084SD

The standard monitor for the Amiga and C64. A medium-quality shadow mask consumer monitor with a relatively thick beam and warm colour temperature.

Setting	Value
Scanline	TV Thick (4 lit, 4 dark)
Phosphor	Shadow Mask — warm, wide dot pitch
Boundary darkening	Off
Bloom	Medium (~80)
Character	Warm, slightly soft, strong scanline gaps — the definitive C64 look

Philips CM8833 / CM8833-II

The Amiga enthusiast's monitor of choice. A higher-quality shadow mask with a sharper beam than the 1084, slightly cooler colour temperature, and notably tighter dot pitch.

Setting	Value
Scanline	Monitor Sharp (6 lit, 2 dark, hard edge)
Phosphor	Shadow Mask — cooler, tighter dot pitch than 1084
Boundary darkening	Off
Bloom	Low (~32)
Character	Crisp, slightly cool, visible but not dominant scanlines — the definitive Amiga demo look

Sony Trinitron (PVM-14L2, KV series, etc.)

The Trinitron aperture grille technology was used across Sony's entire range from budget TVs to professional PVMs. The defining characteristic is vertical phosphor stripes — no shadow mask dot structure — giving higher brightness and more saturated colours than any shadow mask monitor. The thin horizontal damper wires (one or two faint lines across the screen) are the giveaway.

Setting	Value
Scanline	Monitor Sharp to Arcade depending on model
Phosphor	Aperture Grille — vertical RGB stripes
Damper wires	~85% brightness at 1/3 and 2/3 screen height
Boundary darkening	On — the stripe structure gives natural column separation
Bloom	Low-Medium (~48)
Character	Sharp, saturated, slightly clinical — the "professional" retro look

Consumer Trinitron TVs (KV series) had thicker scanlines and softer bloom. Professional PVMs had extremely tight, precise beams with minimal bloom — the "PVM look" beloved by retro gaming enthusiasts for its sharpness and accuracy.

Sony PVM / BVM (Professional/Broadcast Video Monitor)

The gold standard for retro game video quality. PVMs used Trinitron tubes with extremely precise electronics — the beam was tighter, the colour more accurate, and the scanlines sharper than any consumer display. BVMs (Broadcast Video Monitors) were even more precise.

Setting	Value
Scanline	PVM (6 lit, then 80% and 40% rolloff)
Phosphor	Aperture Grille — tightest stripe pitch of any Trinitron
Damper wires	Present but very faint (~92% brightness)
Boundary darkening	On — strong, precise column edges
Bloom	Off or minimal (~16)
Character	Razor sharp, clinically accurate, minimal glow — content looks exactly as the developer intended

The PVM look is simultaneously the most "accurate" and the most distinctive — it makes low-resolution pixel art look structured and intentional rather than fuzzy.

Mitsubishi Diamondtron (NF series)

Mitsubishi's answer to Trinitron, also an aperture grille technology. Slightly warmer colour temperature than Sony, marginally wider stripe pitch on some models, and slightly more bloom. Used in Mitsubishi's Diamond Plus and Diamond Pro monitor ranges popular with Amiga and PC users in the 90s.

Setting	Value
Scanline	Monitor Sharp
Phosphor	Aperture Grille — slightly wider pitch than Trinitron
Damper wires	Present, slightly more visible than Trinitron (~82% brightness)
Boundary darkening	On
Bloom	Low-Medium (~56)
Character	Similar to Trinitron but marginally warmer and softer — slightly more forgiving on harsh colours

Generic SCART TV (PAL domestic)

A typical mid-range European CRT television of the 80s/90s connected via RGB SCART — the primary display for the vast majority of British home computer and console users. Shadow mask, fairly thick beam, warm colour temperature, visible overscan, and strong scanlines.

Setting	Value
Scanline	TV Thick (4 lit, 4 dark)
Phosphor	Shadow Mask — wide dot pitch, warm
Boundary darkening	Off
Bloom	High (~128)
Overscan simulation	On — slight border bleed
Character	Warm, fuzzy, nostalgic — the actual look most people experienced their retro games in

This is probably the most emotionally resonant profile for European users — not the sharpest, but the most authentically "what it actually looked like in your living room in 1987."

JAMMA Arcade CRT (Wells Gardner, Electrohome, etc.)

Arcade monitors ran their CRTs harder than consumer displays — higher brightness, tighter convergence, and a very characteristic slightly-green tint from the phosphor mix. Shadow or slot mask depending on manufacturer. Scanlines were visible but thin due to the high brightness drive.

Setting	Value
Scanline	Arcade (7 lit, 1 dark)
Phosphor	Slot Mask (most common in JAMMA era)
Boundary darkening	Off
Bloom	Medium-High (~96) — phosphors driven hard
Colour tint	Very slight green bias (+4 green channel)
Character	Bright, punchy, slightly raw — the smell of a fish and chip shop optional

Commodore 64 composite (PAL)

The C64 connected to a domestic TV via RF or composite — not RGB. This is the lowest-quality signal path, adding colour smearing, luminance-chroma crosstalk, and significant softness. Many classic C64 games were designed with composite artefacts in mind — the colour bleeding was sometimes used deliberately as a feature.

Setting	Value
Scanline	TV Thick
Phosphor	Shadow Mask — wide pitch
Boundary darkening	Off
Bloom	High (~112)
H chroma blur	On — 2–3 pixel chroma bleed (composite artefact simulation)
Character	Soft, blurry, colourful — the way most C64 owners actually saw their machine

The composite chroma blur is a separate effect from phosphor bloom — it's a horizontal low-pass on the Cb/Cr channels only (the YCbCr composite simulation mentioned in Section 3), deliberately mimicking the limited chroma bandwidth of PAL composite video.
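As a rough illustration of a chroma-only horizontal low-pass, the sketch below blurs a single Cb or Cr line and leaves luma alone by simply never being called on Y. The 3-tap [1 2 1]/4 kernel, the separate per-channel line arrays, and the function name are assumptions made for this example; the text above specifies only a 2–3 pixel bleed on the chroma channels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-tap [1 2 1]/4 horizontal low-pass for one chroma line
   (Cb or Cr). Edge pixels reuse themselves as the missing neighbour. */
void chroma_blur_line(const uint8_t *src, uint8_t *dst, size_t width)
{
    assert(src != dst);   /* not in-place: every tap needs unblurred input */
    for (size_t x = 0; x < width; x++) {
        uint8_t left  = src[x > 0 ? x - 1 : x];
        uint8_t right = src[x + 1 < width ? x + 1 : x];
        dst[x] = (uint8_t)((left + 2u * src[x] + right + 2u) >> 2);
    }
}
```

A single saturated chroma sample spreads one pixel to each side at half strength, which is the kind of bleed the C64 composite profile is after.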

These profiles are selectable by name via a display.setMode() system call, with individual parameters overridable after selection. DeMon applies the profile at boot based on a stored configuration; the Copper can switch profiles mid-frame for mixed-monitor split-screen effects.

The simulation pipeline is LUT-trivial excluding the bloom pre-pass:

Stage	Resource
Row mask LUT	8-entry × 8-bit = 64 bits — a few flip-flops
Column mask LUT	8-entry × 8-bit = 64 bits
Phosphor mask LUT	~64 bytes per pattern — distributed RAM
Multiply/blend	~3 DSP blocks per output path
Bloom line buffer	~1.9KB BSRAM per output path
Total per output path	~2 BSRAM blocks + ~3 DSP blocks + handful of LUTs

Three output paths (DP, HDMI, VGA) running independent simulation modes cost ~6 BSRAM blocks and ~9 DSP blocks — rounding error on a device with 340 BSRAM blocks and 298 DSP blocks.
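To show how small the row mask stage is, here is a minimal sketch of an 8-entry × 8-bit row mask applied to one colour channel. Indexing the mask by output line modulo 8, the 255 = fully lit convention, and the two mask tables (modelled on the "4 lit, 4 dark" and "7 lit, 1 dark" scanline settings) are assumptions for illustration, not the actual LUT contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative masks: 255 = fully lit, 0 = fully dark. */
const uint8_t ROW_MASK_TV_THICK[8] = {255, 255, 255, 255, 0, 0, 0, 0};
const uint8_t ROW_MASK_ARCADE[8]   = {255, 255, 255, 255, 255, 255, 255, 0};

/* Scale one 8-bit channel by the mask entry for this output line.
   Using (m + 1) >> 8 keeps a mask value of 255 an exact pass-through. */
uint8_t apply_row_mask(uint8_t channel, const uint8_t mask[8], unsigned line)
{
    assert(mask != NULL);
    unsigned m = mask[line & 7u];
    return (uint8_t)((channel * (m + 1u)) >> 8);
}
```

The column mask would follow the same pattern with the output X coordinate in place of the line number.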

9. Display Mode Registers (Initial Concept)

Each output path has a small register bank (target: 8–16 bits per output) written by DeMon at boot or dynamically via a display.setMode() system call from the FireStorm EE.

Per-Output Control Register (proposed)

Bits	Field	Notes
[2:0]	Native H width select	Index into H resolution table
[5:3]	Native V height select	Index into V resolution table
[7:6]	Scanline profile	0=Off, 1=TV, 2=Monitor, 3=Arcade
[9:8]	Phosphor mask	0=Off, 1=Shadow, 2=Aperture Grille, 3=Slot
[10]	Pixel boundary darkening	Enable/disable
[11]	Bloom pre-pass	Enable/disable
[13:12]	PAR mode	0=Square, 1=2:1 fat, 2=1:2 tall, 3=4:3
[14]	Tate mode	Rotate simulation 90° for vertical arcade games
[15]	Reserved	—

This allows different simulation configurations on DP, HDMI, and VGA simultaneously without any CPU intervention during the frame — DeMon sets the registers at mode-set time and FireStorm handles the rest autonomously.
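The proposed bit layout can be sketched as a pack helper. The field positions follow the table above; the function and constant names are hypothetical, and since the register is still a proposal the layout may change.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the proposed per-output control register. */
enum {
    H_SEL_SHIFT = 0, V_SEL_SHIFT = 3, SCANLINE_SHIFT = 6, PHOSPHOR_SHIFT = 8,
    BOUNDARY_BIT = 10, BLOOM_BIT = 11, PAR_SHIFT = 12, TATE_BIT = 14
};

/* Pack the individual fields into one 16-bit value (bit 15 stays reserved). */
uint16_t pack_display_ctrl(unsigned h_sel, unsigned v_sel, unsigned scanline,
                           unsigned phosphor, int boundary, int bloom,
                           unsigned par, int tate)
{
    assert(h_sel < 8 && v_sel < 8 && scanline < 4 && phosphor < 4 && par < 4);
    return (uint16_t)((h_sel << H_SEL_SHIFT) | (v_sel << V_SEL_SHIFT) |
                      (scanline << SCANLINE_SHIFT) |
                      (phosphor << PHOSPHOR_SHIFT) |
                      ((boundary ? 1u : 0u) << BOUNDARY_BIT) |
                      ((bloom ? 1u : 0u) << BLOOM_BIT) |
                      (par << PAR_SHIFT) |
                      ((tate ? 1u : 0u) << TATE_BIT));
}

/* Extract the 2-bit phosphor mask field from a packed register value. */
unsigned display_ctrl_phosphor(uint16_t reg)
{
    return (reg >> PHOSPHOR_SHIFT) & 0x3u;
}
```

For example, a TV scanline profile (1) with a shadow mask (1) and the bloom pre-pass enabled packs to 0x0940 when every other field is zero.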

10. Scanline Mixer and Layer System

FireStorm composites multiple independent layers in hardware on every scanline. No CPU cost for final composition — the application writes to layer framebuffers/registers and FireStorm handles mixing autonomously.

Per-Layer Independent Resolution — A Key Capability

Every layer in FireStorm has its own native resolution, independently configured. A game background, a sprite layer, a UI overlay, and a text console can all run at different native resolutions simultaneously — and all are composited cleanly to the single output resolution by the blender. No special casing, no CPU involvement, just independent pixel replication per layer before blending.

This is unusual. Most display hardware has a single global resolution — everything on screen shares it, and mixing different resolutions means either scaling in software or accepting borders. FireStorm has no such constraint. Each layer picks the resolution that suits its content:

A background tilemap at 320×180 for a classic scrolling game feel
Sprites at 480×270 Standard resolution
A UI overlay at 960×540 Hires for sharp text and controls
A border/static layer at 240×270 for fat-pixel decorative elements
A Colony video layer at whatever resolution the incoming stream carries

All of these composite at the output pixel clock simultaneously. The blender sees every layer arriving at the same rate regardless of its native resolution — the pixel replicator for each layer handles the difference transparently.

The Amiga could mix lo-res, hi-res, and HAM bitplanes. The Ant64 can mix any valid combination from the resolution tables, per layer, with fully independent H and V scale factors. It is a direct evolution of that philosophy with no arbitrary constraints.

Layer Types (initial concept)

Layer	Type	Notes
Framebuffer 0/1	Dynamic	Primary application render target — FireStorm EE
Tilemap 0/1	Dynamic	Classic scrolling game backgrounds, per-tile palette
Sprite layer	Dynamic	Hardware sprites — user and system partitions, see Section 14
Text/console	Static/dynamic	Always-available terminal, character cell with attribute byte
ImGui / OS overlay	Dynamic	GUI overlaid over application content
Cursor	Dynamic	Mouse pointer, always highest priority
Border/static	Static	Fixed graphics, colour fills, overscan regions
Colony video (Ant64/Ant64C)	External	Video stream received via Colony network from HDMI frame grabber or other peripheral
Supervisor MIPI layer	Internal	CPU-generated content from ESP32-P4 MIPI TX → FireStorm MIPI RX hardcell

Layer Properties (per layer)

Enable/disable
Priority (Z-order)
Colour key / transparency
Palette select
H_scale — output clocks per native pixel (horizontal pixel replication)
V_scale — output lines per native row (vertical line replication)
H_scroll — native pixel offset (horizontal scroll)
V_scroll — native row offset (vertical scroll)
X_offset — output pixel position of the layer's left edge
Y_offset — output line position of the layer's top edge
Width — active width of the layer in output pixels (clips the right edge)
Height — active height of the layer in output lines (clips the bottom edge)

H_scale and V_scale are fully independent, allowing any pixel aspect ratio. All parameters are writable by the Copper for per-scanline effects.

X_offset, Y_offset, Width, and Height together define a positioned bounding rectangle for each layer in output pixel coordinates. The blender only samples a layer within its active rectangle — outside it the layer contributes nothing and the next lower priority layer shows through.

Pixel Replication Architecture

Each layer has its own pixel replicator running at the output pixel clock. A register holds the current native pixel value for H_scale output clocks before advancing; a line counter holds the current native row for V_scale output lines before fetching the next row. The blender always sees all layers arriving at the output pixel clock rate — uniform, no rate-matching logic required.

This means:

The blender has one implementation at one clock rate regardless of per-layer native resolution
Adding a new layer is instantiating the same parameterised module again
Copper writes take effect at precise output pixel boundaries with no rate-translation bookkeeping
V_scale line repetition re-reads the same BRAM row on each repeated output line — at native widths ≤480 pixels, BRAM bandwidth is nowhere near a constraint
Layers outside their X_offset/Y_offset bounding rectangle output a transparent pixel — the replicator simply doesn't run for those positions
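The replication behaviour can be modelled in software as a coordinate mapping. This is a sketch only: the hardware uses hold counters rather than division, the struct and function names are invented for the example, and scroll wrap-around is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-layer geometry, named after the layer properties listed earlier. */
struct layer_geom {
    uint16_t h_scale, v_scale;    /* output clocks / lines per native pixel */
    uint16_t h_scroll, v_scroll;  /* native pixel / row offsets */
    uint16_t x_offset, y_offset;  /* output position of the top-left corner */
    uint16_t width, height;       /* active rectangle in output pixels */
};

/* Map an output coordinate to the native pixel a replicator would hold.
   Returns false (transparent) outside the layer's active rectangle. */
bool layer_native_coord(const struct layer_geom *g,
                        unsigned out_x, unsigned out_y,
                        unsigned *native_x, unsigned *native_y)
{
    assert(g->h_scale > 0 && g->v_scale > 0);
    if (out_x < g->x_offset || out_x >= (unsigned)g->x_offset + g->width ||
        out_y < g->y_offset || out_y >= (unsigned)g->y_offset + g->height)
        return false;
    /* Integer division stands in for "hold each pixel for N clocks". */
    *native_x = (out_x - g->x_offset) / g->h_scale + g->h_scroll;
    *native_y = (out_y - g->y_offset) / g->v_scale + g->v_scroll;
    return true;
}
```

Because every layer answers the same question at the same output coordinate, the blender never needs to know a layer's native resolution.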

Layer Positioning and Overlap

Each layer is a fully positioned rectangle on the output. Layers at different Y positions simply don't contribute pixels outside their active height. Layers at overlapping positions composite normally — priority, colour key, and per-palette alpha resolve which pixels win, exactly as if the layers filled the whole screen.

This is a direct evolution of how the Amiga used the Copper to drag screens of different resolutions up and down the display. On the Amiga those regions were vertically exclusive — a lo-res region and a hi-res region could not overlap, they could only be stacked with a hard boundary between them. The Copper moved the boundary, but never allowed content from two regions to occupy the same output line.

On the Ant64 there is no such constraint. Any layer can be placed anywhere, at any native resolution, and overlap any other layer freely. The blender handles the compositing per pixel regardless of whether the layers' bounding rectangles intersect. Examples:

Game HUD over tilemap — a 960-wide Hires score panel at Y_offset=0 over a 320-wide lo-res tilemap filling the full screen. The HUD is sharp pixel-perfect text; the game is high-colour lo-res with full scroll. Neither compromises the other. The HUD only updates when the score changes — it isn't part of the game framebuffer, shares no bitplane bandwidth, costs nothing at runtime
A 320×180 lo-res background layer behind a 480×270 sprite layer, with a small 960×96 Hires status bar overlapping both at the top
Two HAM24 photograph layers at different Y offsets with a soft alpha blend between them where they overlap — a hardware crossfade without touching either layer's content
A Hires popup dialog floating over a Standard game scene mid-play, composited entirely in hardware

The Copper can write X_offset or Y_offset on every output line, allowing layers to move smoothly without any CPU frame rendering involved. This is the direct descendant of the Amiga's screen-drag mechanic, generalised to every layer independently and extended to allow full overlap.

Layer Clip Table

Each layer has a CLIP_TABLE_PTR register pointing to a memory block that defines per-scanline left and right clip boundaries. This replaces the simple Width/Height rectangle with an arbitrarily shaped clip region — any contour expressible as horizontal left/right pairs.

Clip table entry format:

struct clip_entry {
    uint16_t left_x;    // left visible boundary in output pixels
    uint16_t right_x;   // right visible boundary in output pixels
    uint16_t height;    // number of output scanlines this entry covers
};

6 bytes per entry. The display engine walks the table during output — consuming one entry per height scanlines, advancing to the next automatically. When the table is exhausted the layer reverts to fully transparent.
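The table walk can be modelled as a lookup from output line to clip span. The real engine would advance through the table incrementally rather than rescanning it; this stateless version is only a sketch, and it assumes the first entry starts at output line 0, which the description above does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct clip_entry {
    uint16_t left_x;   /* left visible boundary in output pixels */
    uint16_t right_x;  /* right visible boundary in output pixels */
    uint16_t height;   /* number of output scanlines this entry covers */
};

/* Find the clip span for output line y by consuming entries in order.
   Returns false once the table is exhausted (layer fully transparent). */
bool clip_span_for_line(const struct clip_entry *table, size_t count,
                        unsigned y, uint16_t *left, uint16_t *right)
{
    assert(table != NULL || count == 0);
    unsigned first_line = 0;   /* first output line covered by entry i */
    for (size_t i = 0; i < count; i++) {
        if (y < first_line + table[i].height) {
            *left  = table[i].left_x;
            *right = table[i].right_x;
            return true;
        }
        first_line += table[i].height;
    }
    return false;
}
```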

Table sizes:

Mode	Entries	Table size	Use case
No clip	1	6 bytes	left=0, right=output_width, height=output_height
Rectangle	1	6 bytes	Fixed rectangular clip
Diagonal split	N (one per scanline)	N×6 bytes	Straight angled split
Curved boundary	N	N×6 bytes	Any smooth curve
Per-line full precision	1080	~6.5KB	Full 1080p per-scanline
Per-line 4K	2160	~13KB	Full 4K per-scanline

All table sizes fit comfortably in BSRAM.

Key properties:

CLIP_TABLE_PTR is a Copper target — the Copper can swap the pointer mid-frame to change clip regions between screen areas
The FireStorm EE builds the clip table each frame for dynamic clips (player-tracking split screen, animated portals, etc.)
The table is consumed top-to-bottom in scanline order — the EE writes it from top to bottom, the display engine reads it the same way
A single-entry table covering the full screen is equivalent to no clip — zero overhead

Diagonal split-screen example:

For a straight diagonal split at a given angle, each entry has height=1 and the left/right values advance by a fixed delta per line. The EE computes this in a trivial loop each frame — a multiply and add per line. The Copper swaps CLIP_TABLE_PTR at the frame boundary. Two layers with complementary tables (one's right boundary = the other's left boundary) produce a seamless join with no gap and no overlap.
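A minimal sketch of that per-frame loop, building both complementary tables at once. The 16.16 fixed-point format for the split position and per-line delta, and the function name, are choices made for this example rather than anything the hardware defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct clip_entry { uint16_t left_x, right_x, height; };

/* Build complementary per-scanline clip tables for a straight diagonal
   split. split_x0_fp is the split position on line 0 and dx_fp the
   per-line step, both in 16.16 fixed point. */
void build_diagonal_split(struct clip_entry *left_tbl,
                          struct clip_entry *right_tbl,
                          size_t lines, uint16_t out_width,
                          int32_t split_x0_fp, int32_t dx_fp)
{
    assert(left_tbl != NULL && right_tbl != NULL);
    for (size_t y = 0; y < lines; y++) {
        /* One multiply and one add per line. */
        int64_t x = (split_x0_fp + (int64_t)dx_fp * (int64_t)y) / 65536;
        if (x < 0) x = 0;
        if (x > out_width) x = out_width;
        /* Shared boundary: no gap and no overlap between the two layers. */
        left_tbl[y]  = (struct clip_entry){0, (uint16_t)x, 1};
        right_tbl[y] = (struct clip_entry){(uint16_t)x, out_width, 1};
    }
}
```

With a start position of 960 and a step of half a pixel per line, the boundary moves one output pixel every two scanlines.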

Dynamic split tracking (Lego-style):

The game logic computes the split geometry each frame — midpoint between players, rotation perpendicular to the player vector — and the EE writes a new clip table. When players are close, a single full-screen entry replaces the per-scanline table and the split disappears in one frame. No transition artefact because both layers contain valid full-screen content at all times.

Bitmap Layers and the Blitter

In addition to the hardware tilemap and sprite layers, FireStorm supports bitmap layers — intermediate framebuffers (8bpp, 16bpp, or 32bpp) that the FireStorm Blitter renders into. The scanline mixer composites them alongside the hardware layers exactly as it would any other layer.

The blitter processes an EE-defined job queue with no fixed pipeline depth — one job for simple cases, many jobs for complex multi-pass rendering. The display engine is decoupled: it reads from the front buffer at the output pixel clock while the blitter writes to the back buffer. Double and triple buffering are supported. If a layer's inputs haven't changed since the last frame, the EE skips its jobs and the front buffer retains the previous frame's content — a static HUD costs nothing until it needs updating.

Texture source hierarchy: Any blitter primitive that samples pixel data — sprites, textured triangles, tilemaps, pattern fills — uses a unified texture source system. Texture data can come from:

Permanent BSRAM — frequently used assets at fixed BSRAM addresses, zero cache overhead, 380MHz
BSRAM texture cache (~128–256KB) — hot working set backed by Graphics SRAM. Hits at full BSRAM bandwidth
Graphics SRAM (4.5MB) — fast pipeline SRAM with no page-miss penalty, ideal for the non-sequential UV sampling that DDR3 handles poorly. Also enables high-colour intermediate buffers (R12G12B12) for precision bloom and compositing before conversion to output format. A typical scene's textures and intermediate buffers fit within 4.5MB with no DDR3 involvement
DDR3 — full art library, potentially megabytes. Accessed via cache for normal use; direct bypass for streaming

Textured triangles use the same system — UV coordinates are interpolated across the triangle and the sampler reads from the texture source hierarchy per pixel. Nearest-neighbour, bilinear, and perspective-correct sampling are all supported. This gives the blitter basic textured 3D rendering capability entirely within the FPGA.

Software sprite throughput: A primitive list of 500 × 16×16 4bpp masked sprites from BSRAM cache takes approximately 42 microseconds at 380MHz. Several thousand blitter sprites per frame are readily achievable, in addition to the hardware sprite layer's own budget of hundreds per scanline.

Vector CRT simulation: The blitter supports bloom lines and bloom particles — antialiased lines and point sprites with phosphor glow falloff and additive blending, designed specifically for colour vector game aesthetics. Tempest. Asteroids. Star Wars. Each primitive has its own intensity and falloff. Line crossings and particle clusters get brighter. The bitmap saturates at hot spots. It looks like a vector CRT because the maths is the same maths. Named presets for the classic arcade titles are built in. A full Tempest-style frame — web wireframe, enemy lines, shot particles, explosion particles — renders in well under 500 microseconds.
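The "crossings get brighter, hot spots saturate" behaviour falls out of a saturating additive blend. The sketch below shows one 8-bit channel; the linear falloff curve is an assumption for illustration, since the actual phosphor falloff shape is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Additive blend of one 8-bit channel with saturation: overlapping glow
   contributions accumulate and clip at full brightness. */
uint8_t add_saturate_u8(uint8_t dst, uint8_t src)
{
    unsigned sum = (unsigned)dst + src;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical linear falloff: intensity reaches zero at `radius` pixels. */
uint8_t glow_intensity(uint8_t peak, unsigned distance, unsigned radius)
{
    assert(radius > 0);
    if (distance >= radius)
        return 0;
    return (uint8_t)((peak * (radius - distance)) / radius);
}
```

Two lines of intensity 200 crossing at the same pixel therefore clip to 255 rather than wrapping around.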

Clip table at composite stage: When the blitter composites an intermediate bitmap to the output layer, the layer's clip table is applied — pixels outside the left/right boundaries per scanline are not written. This is the split-screen and portal mechanism.

For full blitter documentation — pipeline model, buffering, dirty tracking, texture source system, bloom lines, all primitive types — see the blitter reference.

Since H_scale, V_scale, X_offset, Y_offset, Width, and Height are all per-layer, layers at different native resolutions can be freely positioned and composited. Example at 1080p output:

Layer	Native W	H_scale	Native H	V_scale	Position	Feel
Background tilemap	320	×6	180	×6	Full screen	Classic game background
Sprite layer	480	×4	270	×4	Full screen	Default Ant64 resolution
UI overlay	960	×2	160	×2	Y=860, H=220	Sharp Hires status bar at bottom
Popup panel	960	×2	270	×2	X=600, Y=200	Hires overlay, partial screen width
Border	240	×8	270	×4	Full screen	Fat pixel decorative border

All arrive at the blender at the output pixel clock. Outside each layer's active rectangle, the layer is transparent. The Amiga could mix lo/hi/HAM in vertical strips with no overlap. The Ant64 can mix any resolution at any position with full overlap — the same idea with no remaining constraints.

11. Copper — Per-Scanline Register Control

The Copper executes a command list in sync with the display beam, allowing any FireStorm register to change at any horizontal or vertical position. This is the mechanism behind virtually all classic demo effects.

Key capabilities:

Raster bars — change background colour register every N lines
Split screen — switch layer enable/mode mid-frame
Horizontal scroll per line — sine wave over a tilemap (classic water effect)
Palette cycling — change colour registers on specific scanlines
Mid-frame resolution switch — different native H or V resolution above/below a line
Layer priority reorder — mid-frame compositor configuration change
VRR blanking control — stretch or compress vertical blanking from the Copper list
Colony video mix ratio — blend Colony-sourced video content against generated content, per scanline (Ant64 and Ant64C)

The Copper concept is a direct spiritual descendant of the Amiga's Copper coprocessor, but implemented as a general register-write engine rather than a fixed-function chip. Any FireStorm register is a valid Copper target.
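A software model of a beam-synchronised register-write list might look like the following. The command encoding, the 16-entry register file, and the function name are all hypothetical; the actual Copper instruction format is not described in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command: at beam position (line, x), write value to reg. */
struct copper_cmd {
    uint16_t line, x;
    uint16_t reg;
    uint32_t value;
};

#define NUM_REGS 16   /* illustrative register file size */

/* Apply every command scheduled at or before the given beam position.
   pc is the index of the next unexecuted command; the list must be
   sorted by (line, x). Returns the updated pc. */
size_t copper_step(const struct copper_cmd *list, size_t count, size_t pc,
                   unsigned line, unsigned x, uint32_t regs[NUM_REGS])
{
    while (pc < count &&
           (list[pc].line < line ||
            (list[pc].line == line && list[pc].x <= x))) {
        assert(list[pc].reg < NUM_REGS);
        regs[list[pc].reg] = list[pc].value;
        pc++;
    }
    return pc;
}
```

Raster bars are then just a list that rewrites the same background colour register every N lines.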

12. Named Display Modes (initial set)

Pulling together the most useful configurations:

Mode name	Native res	PAR	Scale to 4K	Primary character
Standard	480×270	1:1	×8×8	Amiga low res equivalent — retro pixel feel
Hires	960×540	1:1	×4×4	Amiga hires equivalent — double detail
Desktop	1920×1080	1:1	×2×2	Full 1080p native passthrough
Multicolour	240×270	2:1	×16×8	C64 multicolour / fat pixel sprites
Classic 4:3	480×360	4:3	×8×6	Undistorted 4:3 geometry on 16:9
Retro Platform	320×240	4:3	×12×9	SNES/PlayStation/DOS authentic feel
Panorama	960×270	1:2	×4×8	Tall pixel cinematic widescreen
Retro Low	160×180	2:1	×24×12	Maximum fat-pixel simulation space
Universal	640×360	1:1	×6×6	Integer scales to 4K, 1080p, and 1440p
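The "Scale to 4K" column is plain integer division of the output resolution by the native resolution. A small helper, with a name invented for this example, makes it easy to check whether a native resolution scales cleanly to a given output.

```c
#include <assert.h>
#include <stdbool.h>

/* Compute the integer H and V replication factors that map a native
   resolution onto an output resolution. Returns false if either axis
   does not divide exactly. */
bool integer_scale(unsigned native_w, unsigned native_h,
                   unsigned out_w, unsigned out_h,
                   unsigned *h_scale, unsigned *v_scale)
{
    assert(native_w > 0 && native_h > 0);
    if (out_w % native_w != 0 || out_h % native_h != 0)
        return false;
    *h_scale = out_w / native_w;
    *v_scale = out_h / native_h;
    return true;
}
```

Universal (640×360) divides evenly into 3840×2160, 1920×1080, and 2560×1440, whereas Standard (480×270) has no integer horizontal scale to 2560×1440.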

14. Sprite System

Sprite Attribute Table

A fixed-size table of sprite attribute slots, each holding:

Attribute	Notes
X position	Determines line buffer write address
Y position	Used during sort/pick to determine if sprite intersects next row
Width / height	Pixel fetch range
BRAM data pointer	Address of sprite pixel data
Palette select	Applied during pixel fetch
Priority	Blender Z-order
H flip / V flip	Applied during pixel fetch
Enable flag	Skips disabled sprites during sort

The true hardware limit is a compile-time constant baked into the HDL — the absolute ceiling no register can exceed.

Table Partitioning

The attribute table is split from both ends:

Slot 0                                              Slot N (true limit)
│◄──── user sprites (0 → USER_SPRITE_LIMIT) ───────►│◄─── system ────►│
                                                    ▲
                                               SYS_SPRITE_BASE

USER_SPRITE_LIMIT — maximum slots the user application can populate, counting from slot 0 upward. The sprite engine stops scanning user sprites at this index. Writable by the FireStorm EE via live register; takes effect at next snapshot.

SYS_SPRITE_BASE — slot index where system sprites begin, counting from the top of the table. Writable by DeMon only — the Pulse and the accelerator have no write path to this register (not a software permission check; the bus connection physically does not exist). System sprites always fetch regardless of USER_SPRITE_LIMIT.

If USER_SPRITE_LIMIT is set at or above SYS_SPRITE_BASE, the hardware clamps it silently. A status register exposes the effective limit back to software.

DeMon can adjust SYS_SPRITE_BASE dynamically — reserving more system slots during a debug session, releasing them afterward — without any FireStorm EE involvement.
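The clamp can be expressed in a few lines. This is a sketch of the value the status register might report; clamping to exactly SYS_SPRITE_BASE (so user slots run from 0 up to, but not including, the first system slot) is an assumption, as is the function name.

```c
#include <assert.h>
#include <stdint.h>

/* Effective user sprite limit: the requested limit, clamped so that user
   slots never reach into the system partition. hw_slot_count stands for
   the compile-time ceiling baked into the HDL. */
uint16_t effective_user_limit(uint16_t user_sprite_limit,
                              uint16_t sys_sprite_base,
                              uint16_t hw_slot_count)
{
    assert(sys_sprite_base <= hw_slot_count);
    return user_sprite_limit < sys_sprite_base ? user_sprite_limit
                                               : sys_sprite_base;
}
```

When DeMon lowers SYS_SPRITE_BASE during a debug session, the effective user limit shrinks with it without the application rewriting its own register.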

System Sprite Use Cases

System sprite	Notes
Mouse cursor	Always topmost priority, updated by DeMon from Sticky input
Debug overlay marker	Highlights screen regions during development
AntOS notification icon	OS-level status independent of application
Personality cartridge indicator	Hardware-level status, always visible

System sprites use a separate palette owned by DeMon, ensuring the cursor always renders correctly regardless of what the user application has done to colour registers.

Shadow Registers

All sprite attributes have a parallel shadow copy. At the start of each native scanline's V_scale window, the hardware atomically snapshots all live registers into shadow:

shadow[] ← live[]   (atomic, at native scanline boundary)

The sprite engine reads only from shadow registers for the entire V_scale window. The FireStorm EE and the Copper may freely write live registers at any time — writes are queued implicitly and take effect at the next snapshot. This eliminates all mid-scanline corruption races with no software involvement.

The Copper writes sprite X positions to live registers on every native scanline without any special timing concern — the snapshot boundary is entirely the hardware's responsibility, transparent to software.
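A software model of the live/shadow split, for illustration only: the attribute struct is cut down to three fields, the slot count is arbitrary, and in hardware the copy is a parallel register load rather than a memcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SPRITE_SLOTS 8   /* illustrative; the real ceiling is an HDL constant */

struct sprite_attr { int16_t x, y; uint8_t enable; };

struct sprite_regs {
    struct sprite_attr live[SPRITE_SLOTS];    /* written by EE / Copper */
    struct sprite_attr shadow[SPRITE_SLOTS];  /* read by the sprite engine */
};

/* Model of the snapshot taken at each native scanline boundary. */
void sprite_snapshot(struct sprite_regs *r)
{
    assert(r != NULL);
    memcpy(r->shadow, r->live, sizeof r->shadow);
}
```

A write to a live register after the snapshot stays invisible to the sprite engine until the next boundary, which is why software needs no timing discipline of its own.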

Prefetch Pipeline

The sprite engine works one native row ahead — during the V_scale window for native row N, it prepares sprites for native row N+1. This gives the full V_scale window for sort and fetch:

Native row N begins (output line 0 of V_scale window):
    shadow[] ← live[]              ← snapshot for row N+1
    Buffer A → blender             ← displaying row N (prepared last window)
    Sort: scan shadow[] for sprites active on row N+1
    Fetch: load sprite pixels into Buffer B for row N+1

Native row N+1 begins:
    shadow[] ← live[]              ← snapshot for row N+2
    Buffer B → blender             ← displaying row N+1
    Sort + fetch into Buffer A for row N+2
    (buffers ping-pong each native row)

The blender and sprite engine never touch the same buffer. The pipeline is primed during vertical blanking — row 0 sprites are prepared during the last V-blank lines so the first display row has a valid buffer ready with no special cases.

Sprite Budget

The binding constraint is not output line count but pixel fetch throughput — BRAM read bandwidth and line buffer write bandwidth.

GW5AST-138 BSRAM key specs:

340 BSRAM blocks, each up to 18Kbits (6,120Kbits / ~765KB total)
Clock frequency up to 380 MHz
Dual Port mode with independent clocks and up to 72-bit data width
Semi Dual Port mode: dedicated write port (A) and read port (B), independent clocks

At 380 MHz and 72-bit port width, a single BSRAM can deliver 72 bits per cycle = ~3.4GB/s read bandwidth. Multiple BSRAMs running in parallel multiply this further.

For sprite pixel fetching, a practical design would use a dedicated BSRAM for the sprite sheet data, running its read port at the BSRAM clock (up to 380 MHz) independently of the output pixel clock. At 380 MHz with a 32-bit read port (one 8bpp sprite row of 4 pixels per cycle), the fetch engine can pull ~1,520 million pixels per second — far more than any scanline can consume.

The practical limit then becomes the line buffer write bandwidth and the sort logic throughput rather than raw BRAM speed. With a sensible architecture:

Sort scan (check Y bounds of all attribute slots): at 380MHz, 64 sprites × ~3 cycles each ≈ ~200 cycles — under 1 output line period at any resolution
Pixel fetch: per-sprite BRAM read + line buffer write, perhaps 8–16 cycles per sprite row at 380MHz

This gives a realistic per-V_scale slot budget of hundreds of sprites, not dozens. The full V_scale window is available since the pipeline runs one row ahead:

V_scale	Total fetch slots	Conservative budget	Notes
×2	2	200–400	Sort + fetch at 380MHz
×4	4	400–800	—
×8	8	800–1,600	—
×16	16	1,600–3,200	—

The budget is more naturally expressed as pixels fetched per native scanline — a wide-sprite scene uses the same bandwidth as a narrow-sprite scene with more objects. USER_SPRITE_LIMIT acts as a ceiling so software can control worst-case fetch time; the engine fits as many sprites as it can up to that limit within each V_scale window.
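The budget figures can be approximated with simple cycle arithmetic. In this sketch the output line rate of roughly 133 kHz is an assumption (a ~533 MHz CVT-RB pixel clock divided by an assumed 4000-clock line), and the fixed per-sprite fetch cost is a simplification of the 8–16 cycle estimate above.

```c
#include <assert.h>
#include <stdint.h>

/* BSRAM clock cycles available in one native row's V_scale window. */
uint64_t window_cycles(uint64_t bsram_hz, uint64_t out_line_hz,
                       unsigned v_scale)
{
    assert(out_line_hz > 0);
    return bsram_hz / out_line_hz * v_scale;
}

/* Sprites that fit after the sort scan, at a fixed per-sprite fetch cost. */
uint64_t sprite_budget(uint64_t cycles, uint64_t sort_cycles,
                       uint64_t cycles_per_sprite)
{
    assert(cycles_per_sprite > 0);
    return cycles > sort_cycles ? (cycles - sort_cycles) / cycles_per_sprite
                                : 0;
}
```

At V_scale ×2 this gives 5,700 cycles per window; after a 200-cycle sort and 16 cycles per sprite that is 343 sprites, inside the 200–400 range quoted for that row of the table.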

14.4 Supervisor Sprite and Tilemap Layers — DeMon & Pulse (Two ESP32-P4s)

The FPGA's native sprite engine (above) is not the only source of sprites — and its native tilemap engine isn't the only tilemap source either. The Ant64 carries two ESP32-P4 supervisors — DeMon and Pulse — and each one feeds independent, hardware-accelerated sprite and tilemap layers into the FireStorm compositor over its MIPI link. They are not limited to flat UI surfaces.

Every ESP32-P4 includes a full 2D Pixel Processing Accelerator (2D-PPA): hardware rotation, scaling, mirroring, alpha blending, and colour-key on framebuffers in its own 32 MB PSRAM. Tilemap rendering on the 2D-PPA is the same primitive as sprite blitting — each tile is a fixed-size blit from a tileset in PSRAM — so the same accelerator that composes a supervisor's sprite layer also composes its tile-grid backgrounds. Each supervisor composites its sprites and tilemaps locally with the 2D-PPA, then streams the result over MIPI to FireStorm, which treats the feed as a normal display layer — it gets a priority register, colour key, and per-scanline Copper control, and appears on every output (HDMI and VGA) exactly like a native layer.

So the compositor sees three sources for sprites and tilemaps, not one:

FPGA native sprite + tilemap engines — hundreds of sprites per scanline plus the hardware tilemap layers, all pixel-clock-locked (§14 and graphics)
DeMon layer (MIPI #0) — AntOS-rendered sprites and tile-grid backgrounds: system icons, file-browser thumbnails, notification badges, animated cursors, recovery/boot splashes, file-browser tile grids, settings-panel tiled surfaces, and AntOS-side widgets composited over a running game without involving the FireStorm EE. See DeMon — Sprite and Tilemap Layer Capability.
Pulse layer (MIPI #1) — performance visuals plus sequencer-style tile grids: VU meters, scope traces, animated mixer faders and rotated jog-dial graphics, sample/pattern-browser thumbnails, the step-sequencer matrix (16–256 steps as a tile grid), drum-machine pattern matrices, and mixer channel strips composed as tiles. See Pulse.

The split-by-purpose between FPGA native and supervisor MIPI is the same on both sides: the native engines carry high-throughput, full-screen game-style sprites and tilemap backgrounds at zero CPU; the supervisor 2D-PPA carries the supervisor's own UI sprites and grids — modest in size and update rate, but with all the rotation / scaling / alpha / colour-key effects of the 2D-PPA per blit.

Because the rotation and scaling happen in the 2D-PPA, a spinning dial graphic or a scaled thumbnail costs the supervisor almost nothing — it is not software pixel-pushing.

And the 2D-PPA is the floor, not the ceiling. The ESP32-P4 is a 400 MHz dual-core RISC-V with PIE SIMD extensions — a genuinely capable processor; full software 3D of the Quake class has been demonstrated running on a single P4. The Ant64 has two of them sitting alongside the FPGA chipset, each able to render real graphics into a composited layer. The sprite-layer role described here is the conservative, always-available use; a supervisor can render considerably more than VU meters when an application asks it to, all of it landing in the same MIPI-fed display layer the compositor already understands.

15. High-Resolution Native Modes and Colour Encoding

Multi-Resolution Output — Per-Path Native Resolution

The two external output paths generate independently. A 4K native layer (tilemap or HAM24) renders once and is simultaneously mixed down to the lower-resolution retro output in the same scanline pass — no second render, no framebuffer store:

Output	Native res	Mode	Delivery
Main HDMI	3840×2160	Tilemap or HAM24	Full 4K, via PS176 DP→HDMI bridge
Retro VGA	960×540 / 640×480 / 1440×240p	N:1 mixdown from 4K render	Filtered digital HDMI stream → AG6201 8-bit analogue DAC

FireStorm generates each output from the same layer data. A game running genuine 4K tilemap on the main HDMI output simultaneously outputs a clean lower-res signal on the retro VGA output for a capture card or CRT with no extra CPU involvement.

4K Mixdown Pipeline

Since the 4K render produces pixels scanline-sequentially, the mixdown sits inline between the 4K pixel stream and the lower-resolution encoders. The blender composites all layers first, then the composite is mixed down — simpler than mixing individual layers separately and gives correct blending results:

4K render + composite blender
    │
    ├──────────────────────────────→ DP TX → PS176 → main HDMI (full 4K, no processing)
    │
    └── [H N:1 filter] → [line pair accumulator] → [CDC FIFO] → 2nd HDMI TX → AG6201 → retro VGA + audio

Horizontal filter — averages pairs (or N-tuples) of adjacent pixels across the line. At 2:1 this is a 2-tap box filter (one add + shift per pixel pair), essentially free in logic. Higher quality options use DSP blocks.

Line pair accumulator — holds one line of output-resolution pixels in BSRAM (~1920 × 32bpp ≈ 7.5KB at 1080p) and averages it with the next native line before outputting. This provides the vertical reduction.

CDC FIFO — a small clock-domain crossing FIFO bridges the 4K pixel clock (~533MHz) to the retro output's own pixel clock (~25–148 MHz depending on modeline; ~27 MHz for the 240p superresolution mode). Only a few pixels deep — absorbing the clock ratio difference, not buffering a full line. Standard synchroniser practice. The mixdown output is a full-precision digital stream; the second HDMI TX hands it to the AG6201, which performs the analogue conversion.
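The 2:1 case of the first two stages can be sketched on a single 8-bit channel as follows. The round-to-nearest bias and the function names are choices made for this example; the CDC FIFO is omitted because it changes timing, not pixel values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2:1 horizontal box filter on one 8-bit channel: one add and a shift
   per input pixel pair. out must hold in_width / 2 samples. */
void box_filter_h2(const uint8_t *in, size_t in_width, uint8_t *out)
{
    assert(in_width % 2 == 0);
    for (size_t x = 0; x < in_width / 2; x++)
        out[x] = (uint8_t)((in[2 * x] + in[2 * x + 1] + 1) >> 1);
}

/* Line pair accumulator: average the held (already H-filtered) line with
   the next one to produce a single vertically reduced output line. */
void line_pair_average(const uint8_t *held, const uint8_t *next,
                       size_t width, uint8_t *out)
{
    for (size_t x = 0; x < width; x++)
        out[x] = (uint8_t)((held[x] + next[x] + 1) >> 1);
}
```

Running the horizontal filter on two consecutive 4K lines and then averaging the results yields one output line of a 2×2 box-filtered image, with only one reduced line ever held in BSRAM.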

Mixdown Filter Quality

The filter is selectable per output via a register field:

Filter	Cost	Quality	Best for
Box 2×2	Trivial — 2 adds per pixel	Adequate, slight softness	Retro VGA (AG6201 analogue path does the rest)
Bilinear	Small multiplier per pixel	Good, smooth edges	HDMI general use
Lanczos 4-tap	A few DSP blocks	Excellent, sharp	HDMI high-quality / photographic HAM24

The GW5AST-138 has 298 DSP blocks — a Lanczos filter is easily affordable. The Copper can change the filter register mid-frame if different screen regions benefit from different filter approaches.

Retro VGA Output Stage — AG6201 8-Bit Analogue Path

The retro output stage is not an FPGA DAC. After the mixdown filter, FireStorm emits the result as a full-precision digital HDMI stream on its second (internal) HDMI TX, and the analogue conversion happens in the AG6201 bridge chip:

Mixdown output (8 bits per channel, full precision)
    ↓
[optional digital CRT/scanline/colour-bleed effects — Section 8]
    ↓
[16–255 output-range remap — AG6201 black-crush mitigation]
    ↓
2nd HDMI TX (fabric-pin LVDS, 4 TMDS pairs)
    ↓
AG6201 HDMI→VGA bridge
    ├─ integrated 8-bit linear RGB DAC → DE-15 (0.7 Vpp / 75 Ω)
    └─ embedded stereo audio DAC → 3.5 mm TRS (~1 Vrms)

The AG6201's RGB DAC is a real 8-bit-per-channel linear DAC, so the palette RAM's full RGBA32 precision survives all the way to the analogue pins — every gradient as smooth as the source. There is no 8→5 truncation, no Bayer spatial dither and no temporal half-bit trick anywhere in the path: those techniques existed only in an earlier design that drove a 5-bit resistor-ladder DAC straight from FPGA fabric, and they are obsolete now that a true 8-bit DAC sits in the bridge. The only host-side concession to the commodity bridge is a 16–255 output-range remap in the HDMI TX to work around the AG6201's black-crush on RGB codes ≤ 16; the full per-revision detail is in rgb_out.
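One plausible form of the output-range remap is a linear rescale of each 8-bit code from 0–255 into 16–255. The linear curve, the rounding, and the function name are assumptions of this sketch; the per-revision behaviour is documented in rgb_out.

```c
#include <assert.h>
#include <stdint.h>

/* Linearly remap a full-range 8-bit code (0-255) into 16-255 so that the
   lowest output code is 16. Applied per channel before the HDMI TX. */
uint8_t remap_16_255(uint8_t code)
{
    unsigned out = 16u + (code * 239u + 127u) / 255u;
    assert(out >= 16u && out <= 255u);
    return (uint8_t)out;
}
```

The remap compresses 256 input codes into 240 output codes, so a few adjacent input values share an output code.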

Audio rides the same HDMI stream and is decoded by the AG6201's embedded audio DAC, so it is inherently locked to video with no separate clock domain on the host side.

Property	Old 5-bit FPGA DAC (superseded)	AG6201 bridge (current)
DAC location	FPGA fabric resistor ladder	Inside AG6201
Channel depth	5-bit + dither tricks (~6.5 effective)	True 8-bit linear
Perceived colours	up to ~600K with both dithers	full 16.7M
Dithering needed	Yes (spatial + temporal)	None
Audio	separate host path	Embedded in same HDMI stream
FPGA pin cost	~28 (5-5-5 ladder + audio)	~11–12 (4 TMDS pairs + HPD + DDC)

HAM24 on VGA

HAM24's 8-bit channel-modify precision now reaches the AG6201 at the full 8 bits — no truncation, so the retro output sees the same channel depth as the main HDMI output, with the analogue signal path blurring transitions further. YCbCr HAM24 retains its perceptual advantage over RGB HAM24 on this output as on any other.

Memory Constraints at 4K Native

Raw framebuffers at 4K are impractical from on-chip BSRAM:

Format	Resolution	Size	Fits in BSRAM?
Raw 24bpp	3840×2160	~24.9MB	No — needs DDR3
Raw 8bpp indexed	3840×2160	~8.3MB	No — needs DDR3
HAM24 (12bpp)	3840×2160	~12.4MB	No — needs DDR3
Tilemap map data (8×8 tiles)	480×270 entries × 2B	~259KB	Yes ✓
Tilemap pixel data (256 tiles × 8×8 × 8bpp)	—	~16KB	Yes ✓
Total BSRAM available	—	~765KB	—

Tilemap mode fits entirely in BSRAM. HAM24 needs DDR3 but, at 12bpp, uses half the bandwidth of raw 24bpp.
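The sizes in the table follow from width × height × depth. The sketch below reproduces that arithmetic, assuming decimal units (1 MB = 10^6 bytes) and the ~765KB BSRAM total quoted above. The helper names are illustrative and not part of any Ant64 API.

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes for one packed frame at the given pixel depth."""
    return width * height * bits_per_pixel // 8


def tilemap_bytes(width: int, height: int, tile_w: int, tile_h: int,
                  bytes_per_entry: int) -> int:
    """Bytes for a map that exactly covers the screen with tiles."""
    return (width // tile_w) * (height // tile_h) * bytes_per_entry


BSRAM_BYTES = 765_000  # approximate on-chip total (assumed to be decimal KB)

if __name__ == "__main__":
    for name, bpp in (("raw 24bpp", 24), ("HAM24 12bpp", 12), ("indexed 8bpp", 8)):
        size = framebuffer_bytes(3840, 2160, bpp)
        print(f"{name}: {size / 1e6:.1f} MB, fits in BSRAM: {size <= BSRAM_BYTES}")
    print("8x8 map, 2 B/entry:", tilemap_bytes(3840, 2160, 8, 8, 2), "bytes")
```

The bytes-per-entry argument is left as a parameter so the same helper can be rerun if the map entry width changes.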

Tilemap Mode at 4K

With 8×8 tiles the 4K screen is 480×270 tiles — exactly the Ant64 Standard native resolution. The entire tilemap and tile pixel data lives in BSRAM with room to spare.

With 16×16 tiles the map shrinks to 240×135 entries (~64KB), allowing more BSRAM headroom for larger tile sets or deeper colour.

Per-tile properties carried in the map entry:

Field	Bits	Notes
Tile index	12	Up to 4,096 unique tiles
Palette ID	8	Which of the 256 palette descriptors to use
H flip	1	Mirror horizontally
V flip	1	Mirror vertically
Priority	2	Layer sub-ordering
Reserved	8	Future use
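The fields in the table total 32 bits. The sketch below packs and unpacks such an entry. The table gives widths only, so the bit positions chosen here (tile index in the low bits, reserved byte at the top) are an assumption for illustration.

```python
# Assumed layout: [11:0] tile, [19:12] palette, [20] H flip,
# [21] V flip, [23:22] priority, [31:24] reserved (written as zero).

def pack_tile_entry(tile: int, palette: int, hflip: bool = False,
                    vflip: bool = False, priority: int = 0) -> int:
    """Pack the per-tile fields into one 32-bit map entry."""
    if not (0 <= tile < 4096 and 0 <= palette < 256 and 0 <= priority < 4):
        raise ValueError("field out of range")
    return (tile
            | (palette << 12)
            | (int(hflip) << 20)
            | (int(vflip) << 21)
            | (priority << 22))


def unpack_tile_entry(entry: int) -> dict:
    """Split a 32-bit map entry back into its fields."""
    return {
        "tile": entry & 0xFFF,
        "palette": (entry >> 12) & 0xFF,
        "hflip": bool((entry >> 20) & 1),
        "vflip": bool((entry >> 21) & 1),
        "priority": (entry >> 22) & 0x3,
    }
```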

Tilemap Scroll System

Each tilemap layer has independent horizontal and vertical scroll, with optional per-tile-row H scroll (Copper-driven) and per-tile-column V scroll (fixed hardware register file), modelled on the Mega Drive's scroll system but with configurable granularity in both axes.

Variable Tile Size

Tile size is a per-layer register — any power-of-two in both axes independently:

Field	Bits	Values	Notes
TILE_W	3	4, 8, 16, 32, 64, 128	H tile size in native pixels
TILE_H	3	4, 8, 16, 32, 64, 128	V tile size in native pixels

Non-square tiles are free — 8×16 for character graphics, 16×8 for wide landscape strips, etc. Power-of-two sizes replace division with right-shift and modulo with bitwise AND in the renderer:

tile_col = pixel_x >> TILE_W_SHIFT
local_x  = pixel_x &  TILE_W_MASK

Larger tiles at the same native resolution mean fewer unique tile designs fit in BSRAM — a 64×64 tile at 8bpp is 4KB versus 64 bytes for an 8×8 tile. The tile size register lets each layer choose the right trade-off independently.

H_scroll — Single Register with HSCROLL_STEP

A single H_SCROLL register per layer applies the same H offset to all tiles on the current tile row. HSCROLL_STEP controls the granularity:

HSCROLL_STEP	Meaning	Copper updates per frame
0	Global — one H scroll value for entire layer	0 (set once)
1	Per tile row — Copper updates every tile row	ceil(native_height / TILE_H) + 1
2	Per 2 tile rows	ceil(native_height / (TILE_H×2)) + 1
4	Per 4 tile rows	ceil(native_height / (TILE_H×4)) + 1

The Copper fires at the output line corresponding to each tile row group boundary:

Copper at output line (group × TILE_H × HSCROLL_STEP × V_scale): H_SCROLL = value

HSCROLL_STEP=0 is purely global scroll with no Copper involvement. HSCROLL_STEP=1 gives full per-tile-row resolution — the classic sine wave water effect. Higher values reduce Copper list length for simpler parallax effects that don't need per-row precision.
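As an illustration of the per-tile-row case, the sketch below builds the list of (output line, H_SCROLL value) pairs a Copper list would need for a sine-wave effect. The function name, the `amplitude` and `period_groups` parameters, and the tuple format are hypothetical. `v_scale` is taken to mean output lines per native line.

```python
import math


def hscroll_copper_list(native_height: int, tile_h: int, hscroll_step: int,
                        v_scale: int, amplitude: int, period_groups: int):
    """Return (output_line, h_scroll) pairs, one per tile-row group."""
    if hscroll_step == 0:
        return []  # global scroll: register is set once, no Copper entries
    lines_per_group = tile_h * hscroll_step
    groups = math.ceil(native_height / lines_per_group) + 1
    entries = []
    for group in range(groups):
        line = group * lines_per_group * v_scale
        value = round(amplitude * math.sin(2 * math.pi * group / period_groups))
        entries.append((line, value))
    return entries
```

For a 270-line native layer with 8-pixel tile rows and HSCROLL_STEP=1 this yields ceil(270 / 8) + 1 = 35 entries, matching the update count in the table.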

V_scroll — Fixed Register File with VSCROLL_STEP

A fixed hardware register file of N entries (suggested initial N=64) holds signed V scroll values. VSCROLL_STEP controls how many tile columns each register covers:

VSCROLL_STEP	Meaning	Registers used
0	Global — one V scroll value for entire layer	1
1	Per tile column	ceil(columns) + 1
2	Per 2 tile columns	ceil(columns/2) + 1
4	Per 4 tile columns	ceil(columns/4) + 1
8	Per 8 tile columns	ceil(columns/8) + 1

VSCROLL_STEP=0 collapses to global V scroll using only V_scroll_file[0] — it replaces the separate global/per-column mode entirely. The feature never disables — it just becomes less granular as VSCROLL_STEP increases.

The programmer chooses VSCROLL_STEP to fit their column count within N registers (the Columns figures below already include the extra, partially visible column):

Native width	TILE_W	Columns	VSCROLL_STEP	Registers used	Granularity
480	8	61	1	61	Per tile column ✓
480	4	121	2	61	Per 2 columns
3840	64	61	1	61	Per tile column ✓
3840	32	121	2	61	Per 2 columns
3840	8	481	8	61	Per 8 columns

Even the coarsest configuration in this table (3840 native, 8×8 tiles, VSCROLL_STEP=8) still provides 60 independently scrolled strips across the screen, against the 20 two-column strips of the Mega Drive's fixed scheme. A status register flags whether the current configuration overflows N entries, so software can detect the condition and adjust VSCROLL_STEP accordingly.
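The choice of VSCROLL_STEP can be automated. The sketch below picks the smallest step whose register count fits in N, using the "ceil(columns / step) + 1" rule from the table above with columns taken as native width divided by TILE_W. The function name and the candidate step list are assumptions.

```python
import math


def pick_vscroll_step(native_width: int, tile_w: int, n_registers: int = 64,
                      steps=(1, 2, 4, 8)):
    """Return (step, registers_used) for the finest step that fits, else None."""
    visible_columns = math.ceil(native_width / tile_w)
    for step in steps:
        used = math.ceil(visible_columns / step) + 1
        if used <= n_registers:
            return step, used
    return None  # no listed step fits; software must use a coarser scheme
```

A `None` result corresponds to the overflow condition that the status register reports.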

The V_scroll lookup:

v_index    = (VSCROLL_STEP == 0) ? 0 : (tile_col >> VSCROLL_STEP_SHIFT)
v_index    = min(v_index, N-1)         // clamp — never undefined
scrolled_y = pixel_y + V_scroll_file[v_index]

Clamping to N-1 means overflow columns silently use the last register's value rather than producing undefined behaviour.

Renderer Evaluation Order

For each output pixel:
    tile_col    = pixel_x >> TILE_W_SHIFT          // pre-scroll column
    tile_row    = pixel_y >> TILE_H_SHIFT          // pre-scroll row
    h_off       = H_SCROLL                         // single register, Copper-maintained
    v_idx       = (VSCROLL_STEP==0) ? 0 : min(tile_col >> VSCROLL_STEP_SHIFT, N-1)
    v_off       = V_scroll_file[v_idx]
    scrolled_x  = pixel_x + h_off
    scrolled_y  = pixel_y + v_off
    tile_col_s  = (scrolled_x >> TILE_W_SHIFT) & map_width_mask
    tile_row_s  = (scrolled_y >> TILE_H_SHIFT) & map_height_mask
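    local_x     = scrolled_x & TILE_W_MASK         // pixel column within the tile
    local_y     = scrolled_y & TILE_H_MASK         // pixel row within the tile
    tile_index  = tilemap[tile_row_s][tile_col_s]
    pixel_data  = tile_data[tile_index][local_y][local_x]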
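The evaluation order can be checked with a small software model. This is a sketch only: the `layer` dictionary and its keys are hypothetical, the tilemap dimensions are assumed to be powers of two so that wrap-around is a bitwise AND, and flips and palette selection are left out.

```python
def fetch_pixel(px: int, py: int, layer: dict) -> int:
    """Resolve one output pixel of a tilemap layer (no flips, no palette)."""
    w_shift, h_shift = layer["tile_w_shift"], layer["tile_h_shift"]
    w_mask, h_mask = (1 << w_shift) - 1, (1 << h_shift) - 1

    # V scroll entry is chosen from the pre-scroll column and clamped to N-1.
    v_file = layer["v_scroll_file"]
    step_shift = layer["vscroll_step_shift"]  # None models VSCROLL_STEP == 0
    if step_shift is None:
        v_idx = 0
    else:
        v_idx = min((px >> w_shift) >> step_shift, len(v_file) - 1)

    sx = px + layer["h_scroll"]
    sy = py + v_file[v_idx]

    tilemap = layer["tilemap"]  # rows x cols, both powers of two
    col = (sx >> w_shift) & (len(tilemap[0]) - 1)
    row = (sy >> h_shift) & (len(tilemap) - 1)
    tile = layer["tiles"][tilemap[row][col]]
    return tile[sy & h_mask][sx & w_mask]
```

Python's `>>` and `&` on negative integers behave like two's-complement arithmetic, so negative scroll offsets wrap the same way a hardware shifter and mask would.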

Scroll Control Registers Summary

Register	Width	Notes
H_SCROLL	16-bit signed	Current H scroll offset — written by Copper
HSCROLL_STEP	3-bit	0=global, 1/2/4/8...=tile row grouping
V_scroll_file[0..N-1]	16-bit signed × N	V scroll offsets — per column group
VSCROLL_STEP	3-bit	0=global, 1/2/4/8...=tile column grouping
VSCROLL_STATUS	read-only	Flags if column count exceeds N at current VSCROLL_STEP

Comparison with Mega Drive

Feature	Mega Drive	Ant64
H scroll	Full / per-tile-row / per-line	HSCROLL_STEP: 0=global, 1=per-row, N=per-N-rows
V scroll	Full / per-2-tile-column (40-entry VSRAM)	VSCROLL_STEP: 0=global, 1=per-col, N=per-N-cols
V scroll granularity	Fixed per-2-columns	Configurable — degrades gracefully, never disables
Tile size	Fixed 8×8	Per-layer, power-of-two, non-square supported
Scroll source	VRAM DMA	Single register (H, Copper-driven) + register file (V)

HAM24 is a modern equivalent of the Amiga's Hold And Modify, designed for continuous-tone photographic content at 4K.

Mode	Bits/pixel	Direct palette	Channel precision	Fringing
OCS HAM6	6	16 entries	4-bit (16 steps)	Very visible
AGA HAM8	8	64 entries	6-bit (64 steps)	Noticeable in some content
Ant64 HAM24	12	1,024 entries	8-bit (256 steps)	Essentially invisible
Raw 24bpp	24	—	8-bit	None

Each 12-bit HAM24 pixel:

Bits	Field	Meaning
[11:10]	Mode	00 = palette lookup, 01 = modify channel A, 10 = modify channel B, 11 = modify channel C
[9:0]	Value	10-bit palette index (in palette mode) or 8-bit channel value + 2 padding bits

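At 8-bit channel-modify precision the quantisation step is 1/256th of the full channel range, so a modified channel lands exactly on its 24-bit target. The residual fringing is limited to the one or two pixels it takes to update the remaining channels at a sharp colour edge, which is essentially invisible at 4K viewing distances and imperceptible at 1080p. The 1,024-entry direct palette uses the standard palette descriptor system (see Section 16).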
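The pixel format can be exercised with a short decoder. This is a sketch under stated assumptions: the held colour starts at a caller-supplied value, and the 8-bit channel value is taken from bits [7:0] with the two padding bits above it. The source does not say where the padding sits.

```python
def decode_ham24(pixels, palette, start=(0, 0, 0)):
    """Decode 12-bit HAM24 codes into (A, B, C) channel triples.

    pixels  -- iterable of 12-bit integers
    palette -- sequence of 1,024 (A, B, C) triples for mode 00
    """
    held = list(start)
    out = []
    for code in pixels:
        mode = (code >> 10) & 0x3
        value = code & 0x3FF
        if mode == 0:
            held = list(palette[value])    # direct 10-bit palette lookup
        else:
            held[mode - 1] = value & 0xFF  # modify channel A, B or C; hold the rest
        out.append(tuple(held))
    return out
```

For example, a palette pixel followed by a "modify A" and a "modify C" pixel reaches an arbitrary colour in three pixels, with the second channel inherited from the palette entry.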

HAM24 Colour Space — RGB or YCbCr

A single colour space select bit in the layer register switches between two channel interpretations:

RGB mode (bit = 0) — channels A/B/C map to R/G/B. Natural for synthetic/generated content where the artist thinks in RGB. Simple, no conversion needed.

YCbCr mode (bit = 1) — channels A/B/C map to Y (luminance), Cb (blue chroma), Cr (red chroma). Better for photographic or video-derived content.

YCbCr is superior for photographic HAM content because human vision is far more sensitive to luminance errors than to chroma errors. JPEG, H.264 and most other image and video codecs exploit the same principle, typically storing chroma at half resolution without perceptible quality loss. In RGB HAM, modifying G causes a large perceived luminance jump because G contributes ~72% of perceived brightness under the BT.709 weights used below. In YCbCr HAM, luma is a dedicated channel, so chroma-only fringe pixels (wrong colour, correct brightness) are near-invisible.

Encoding follows BT.709 (the HD/UHD standard):

Y  =  0.2126·R + 0.7152·G + 0.0722·B
Cb = (B - Y) / 1.8556
Cr = (R - Y) / 1.5748

Cb and Cr are signed, stored as offset-binary (128 = zero chroma) matching standard video convention. The HAM decoder resolves the final YCbCr pixel value then passes it through a YCbCr→RGB converter (3 multiply-accumulates, a handful of DSP blocks) before the blender. Palette entries are always stored as RGBA — the YCbCr conversion is applied only to the resolved output pixel, not to the palette data itself.
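The encoding equations and their inverse can be sketched in floating point as below. This assumes full-range 8-bit values and simple round-and-clamp behaviour; the hardware converter's fixed-point precision and rounding are not specified here, and the function names are illustrative.

```python
def _clamp8(v: float) -> int:
    """Round to nearest and clamp to the 8-bit range."""
    return min(255, max(0, round(v)))


def rgb_to_ycbcr709(r: int, g: int, b: int):
    """8-bit RGB -> (Y, Cb, Cr) with offset-binary chroma (128 = zero)."""
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    cb = (b - y) / 1.8556 + 128
    cr = (r - y) / 1.5748 + 128
    return _clamp8(y), _clamp8(cb), _clamp8(cr)


def ycbcr709_to_rgb(y: int, cb: int, cr: int):
    """Inverse of the above: solve the same three equations for R, G, B."""
    r = y + 1.5748 * (cr - 128)
    b = y + 1.8556 * (cb - 128)
    g = (y - 0.2126 * r - 0.0722 * b) / 0.7152
    return _clamp8(r), _clamp8(g), _clamp8(b)
```

Greys map to Cb = Cr = 128, so a chroma-only HAM modify leaves the luma channel untouched.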

HAM24 is a per-layer property, switchable by the Copper mid-frame. The top half of the screen could be a HAM24 photograph; the bottom half a tilemap game area — exactly the kind of split-screen mode Amiga coders achieved via Copper, but at 4K.

16. Palette System

Architecture — Flat RAM with Descriptor Table

Palettes are implemented as two independent structures:

Flat palette RAM — a single array of RGBA32 entries. The only place colour data physically lives. Accessible via two address windows:

Base address + 0x00000:  RGBA access — reads/writes raw values directly
Base address + 0x10000:  HSVA access — hardware converts RGB↔HSV on the fly

Palette descriptor table — 256 entries, each containing a base offset into the flat RAM. The pixel lookup is:

final_colour = palette_RAM[palette_descriptor[palette_id].base + pixel_value]

The base offset is one adder in hardware — essentially free.

Flat Palette RAM Sizing

Base field	Addressable entries	RAM size at 32bpp	Status
14 bits	16,384	64KB	Current implementation
16 bits	65,536	256KB	Reserved address space

The register layout reserves address space for 16-bit base offsets. The current implementation uses 14 bits (bits [15:14] of the base field are reserved/zero). No register map changes are needed to expand to full 16-bit in a future revision.

Palette Descriptor Entry (32 bits)

Bits	Field	Notes
[13:0]	Base offset	14-bit index into flat palette RAM (current)
[15:14]	Reserved	Zero for now — expands base to 16-bit in future
[31:16]	Reserved	Future flags — wrap limit, mode bits, etc.

256 descriptors × 4 bytes = 1KB total for the descriptor table. Fits in a tiny BSRAM or distributed RAM.

Variable-Size Palettes

Because each palette is just a base offset, pixel depth determines how many entries are consumed — not any field in the descriptor:

Pixel depth	Max pixel value	Entries used from flat RAM
1bpp	1	2
2bpp	3	4
4bpp	15	16
6bpp	63	64
8bpp	255	256
HAM24 direct	1023	1,024

A 4-colour sprite occupies exactly 4 flat RAM entries. A 256-colour background occupies 256. Multiple palettes can alias — two descriptors with the same or overlapping base offsets share entries, which enables:

Shared transparency — all sprite palettes arranged so pixel value 0 always lands on the same flat RAM entry (the transparent colour)
Gradient windows — a long gradient stored once, multiple descriptors pointing at different 16-entry windows of it
Sprite recolouring — same pixel data BRAM, different palette IDs, instant team colour swaps
Palette animation — Copper writes a new base offset mid-frame, flipping which colour set is active for every object using that palette ID simultaneously
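The aliasing behaviour falls out of the lookup formula. Below is a minimal software model of the flat RAM and descriptor table. The class and method names are illustrative, and out-of-range behaviour (a base plus pixel value that runs past the end of RAM) is deliberately left unmodelled because the source does not define it.

```python
class PaletteSystem:
    """Flat palette RAM plus a table of base offsets."""

    def __init__(self, ram_entries: int = 16_384, descriptors: int = 256):
        self.ram = [0] * ram_entries    # RGBA32 values
        self.base = [0] * descriptors   # one base offset per palette ID

    def lookup(self, palette_id: int, pixel_value: int) -> int:
        # final_colour = palette_RAM[descriptor.base + pixel_value]
        return self.ram[self.base[palette_id] + pixel_value]


if __name__ == "__main__":
    pal = PaletteSystem()
    for i in range(64):                 # one long gradient stored once
        pal.ram[100 + i] = 0x01010100 * i | 0xFF
    pal.base[1] = 100                   # two 16-entry windows onto it
    pal.base[2] = 108
    print(hex(pal.lookup(1, 8)), hex(pal.lookup(2, 0)))  # same RAM entry
```

Rewriting `base[1]` alone changes the colours of every object that uses palette ID 1, which is the palette-animation case in the list above.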

HSV Dual-Access Window

The flat palette RAM presents two address windows. Writes to the HSVA window convert H/S/V/A → R/G/B/A in hardware before storing. Reads from the HSVA window read R/G/B/A and convert to H/S/V/A before returning. The RAM always stores RGBA — HSV is a view, not a storage format.

HSV encoding — all components 0–255 mapped linearly:

H: 0=0°, 255=~359° (full hue circle, 1.41°/step)
S: 0=greyscale, 255=fully saturated
V: 0=black, 255=full brightness

This makes hue rotation arithmetic natural in integer registers — add a fixed value to H across a range of palette entries to shift the entire palette around the colour wheel. The Copper can do this per-scanline for animated colour cycling effects.
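A software model of the HSVA window illustrates the hue-rotation trick. It uses Python's standard `colorsys` module as a stand-in for the conversion hardware; the hardware's exact rounding is not specified, so the byte values produced here are an approximation. Function names are illustrative.

```python
import colorsys


def hsva_write(h: int, s: int, v: int, a: int):
    """Model a write through the HSVA window: return the RGBA bytes stored."""
    # H spans the full circle in 256 steps (about 1.41 degrees each).
    r, g, b = colorsys.hsv_to_rgb(h / 256, s / 255, v / 255)
    return round(r * 255), round(g * 255), round(b * 255), a


def rotate_hue(hsva_entries, delta: int):
    """Add delta to H modulo 256, as an 8-bit register add would wrap."""
    return [((h + delta) & 0xFF, s, v, a) for (h, s, v, a) in hsva_entries]


if __name__ == "__main__":
    ramp = [(h, 255, 255, 255) for h in range(0, 256, 64)]
    for entry in rotate_hue(ramp, 16):
        print(entry, "->", hsva_write(*entry))
```

Because H wraps at 256, no range check is needed when rotating; 250 + 10 simply becomes 4.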

Conversion hardware uses DSP blocks for the multiply operations (the GW5AST-138 has 298 DSP blocks — this costs a handful). RGB→HSV read latency is 1–2 extra cycles versus raw RGBA read, which is acceptable since palette reads are not in the display critical path.

Palette Assignment

Object type	Palette ID source	Notes
Sprite	Palette ID field in sprite attribute table	Per-sprite, from descriptor table
Tilemap tile	Palette ID field in tile map entry	Per-tile
HAM24 layer	Palette ID field in layer register	Per-layer, uses up to 1,024 entries
Global/background	Layer register	Solid colour or palette index 0
System sprites	DeMon-owned palette IDs	Protected from user writes

The Copper can write any palette descriptor's base offset or any flat RAM entry on any scanline, making per-scanline palette changes a standard zero-cost operation.

18. Hardware Ray Casting and BSP Acceleration

FireStorm includes dedicated hardware units to accelerate the core inner-loop operations of ray cast and BSP-style 3D rendering — the techniques behind Wolfenstein 3D, Doom, Quake, and their descendants. These workloads are characterised by tight, repetitive fixed-point arithmetic loops that are ideal for FPGA implementation.

18.1 Ray DDA Units

Ray casting shoots one ray per screen column, stepping through a grid cell by cell using a Digital Differential Analyser (DDA) algorithm until a solid cell is hit. Since every column's ray is independent, N parallel DDA units give N× throughput: N columns are cast at once rather than one at a time on the EE.

Each DDA unit takes:

Ray origin — (x, y) in fixed-point 16.16 map coordinates
Ray direction — (dx, dy) normalised fixed-point
Map base address — pointer to the cell grid in Graphics SRAM or BSRAM

Each DDA unit outputs:

Hit distance — perpendicular wall distance in fixed-point (for column height calculation)
Hit cell coordinates — (cell_x, cell_y)
Hit face — N / S / E / W wall, for texture selection
Texture column offset — the fractional position along the hit wall face (0.0–1.0 fixed-point), directly indexing the texture column

The EE dispatches a batch of rays — one per screen column — to the DDA unit pool. Each unit processes its ray independently, stepping through the grid at full clock rate, and raises a completion flag or interrupt when done. With 8 parallel DDA units and a typical map depth of 10–20 steps per ray, all 480 columns of a Standard-resolution frame can be cast in well under a frame period.
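The per-ray work of one DDA unit can be modelled in software as below. This is a floating-point sketch rather than the 16.16 hardware datapath. The face naming assumes x increases eastward and y increases southward, the returned distance is measured along the ray (any perpendicular correction is left to the caller), and the direction must be non-zero. All names are illustrative.

```python
import math


def _axis(o: float, d: float):
    """Per-axis setup: (start cell, step, t between grid lines, t to first line)."""
    cell = math.floor(o)
    if d == 0:
        return cell, 0, math.inf, math.inf
    t_delta = abs(1.0 / d)
    t_max = ((cell + 1 - o) if d > 0 else (o - cell)) * t_delta
    return cell, (1 if d > 0 else -1), t_delta, t_max


def cast_ray(grid, ox, oy, dx, dy, max_steps=64):
    """Return (t, cell_x, cell_y, face, texture_u) for the first solid cell, or None."""
    cx, sx, dtx, tx = _axis(ox, dx)
    cy, sy, dty, ty = _axis(oy, dy)
    for _ in range(max_steps):
        if tx < ty:                       # next boundary crossed is an X grid line
            t, cx, tx = tx, cx + sx, tx + dtx
            face = "W" if sx > 0 else "E"
            u = (oy + t * dy) % 1.0       # fractional position along the wall
        else:                             # next boundary is a Y grid line
            t, cy, ty = ty, cy + sy, ty + dty
            face = "N" if sy > 0 else "S"
            u = (ox + t * dx) % 1.0
        if not (0 <= cy < len(grid) and 0 <= cx < len(grid[0])):
            return None                   # left the map without a hit
        if grid[cy][cx]:
            return t, cx, cy, face, u
    return None
```

Each loop iteration is one grid step: one comparison, one add, and one map read.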

Graphics SRAM cell map packing: At 4 bits per cell (16 cell types), a 9-wide corridor row fits in one SRAM word. At 8 bits per cell (256 types), 4 cells per word with bits to spare for door states, floor/ceiling type, or lighting zones. The DDA stepping pattern is not purely sequential but benefits from the SRAM's 1-cycle random access latency — each grid step is a new address delivered in one cycle with no page miss.

18.2 Fixed-Point Reciprocal Unit

The fundamental operation converting ray hit distance to screen column height is:

column_height = PROJECTION_PLANE_DISTANCE / hit_distance

This is a divide, or equivalently a reciprocal followed by a multiply. The EE's divide is 16–32 cycles. A dedicated fixed-point reciprocal unit computes 1/x in 2–4 cycles using a Newton-Raphson refinement stage seeded from a lookup table. One reciprocal per screen column × 480 columns = 480 reciprocals per frame. At 2–4 cycles each versus 16–32 on the EE, the column height pass is roughly 8× faster (4× to 16× across those ranges).

The same unit accelerates BSP plane tests (see below) and any other workload requiring fast division — perspective-correct texture mapping (divide U and V by W per pixel), fog intensity falloff, and lighting attenuation all use reciprocals.
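A lookup-seeded Newton-Raphson reciprocal can be sketched in integer arithmetic as follows. The 16.16 format matches the map coordinates above. The 64-entry seed table, the normalisation to [0.5, 1.0) and the default of two refinement passes are assumptions made for the sketch, not the unit's documented design.

```python
FRAC = 16                      # 16.16 fixed point
ONE = 1 << FRAC
LUT_BITS = 6                   # 64-entry seed table (assumed size)

# Seed: 1/m at the midpoint of each bin of m in [0.5, 1.0), stored as 16.16.
SEED = [round(ONE / (0.5 + (i + 0.5) / (2 << LUT_BITS)))
        for i in range(1 << LUT_BITS)]


def reciprocal_fx(x: int, iterations: int = 2) -> int:
    """Approximate 1/x for a positive 16.16 value; result is 16.16."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Normalise so that m = x * 2**-shift lies in [0.5, 1.0).
    shift = x.bit_length() - FRAC
    m = x >> shift if shift >= 0 else x << -shift
    r = SEED[(m - (ONE >> 1)) >> (FRAC - 1 - LUT_BITS)]
    for _ in range(iterations):
        # Newton-Raphson step for f(r) = 1/r - m:  r <- r * (2 - m*r)
        r = (r * (2 * ONE - ((m * r) >> FRAC))) >> FRAC
    # Undo the normalisation: 1/x = (1/m) * 2**-shift.
    return r >> shift if shift >= 0 else r << -shift
```

Each pass roughly squares the relative error, so a seed that is accurate to about 1% reaches 16-bit precision within two passes. A column height then follows as `(PROJECTION_PLANE_DISTANCE * reciprocal_fx(hit_distance)) >> FRAC`.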

18.3 Dot Product / Half-Plane Test Unit

BSP traversal requires a half-plane test at every node: which side of a partition plane is the viewpoint on? This reduces to:

side = sign( plane_normal · (viewpoint - plane_point) )

A dot product unit computes this in 2–3 cycles — one cycle per multiply-accumulate for a 2D or 3D dot product, plus the sign extraction. The result is a single bit (front or back) plus the full signed value for soft cases (on-plane tolerance).

The same unit serves:

BSP traversal — which child to visit first
Frustum culling — is a BSP node's bounding box inside the view frustum?
Polygon backface culling — dot product of face normal with view direction
Lighting — dot product of surface normal with light direction for diffuse intensity
Collision response — dot product of velocity with surface normal for reflection

18.4 BSP Traversal Engine

The BSP traversal engine walks a BSP tree autonomously given a viewpoint, outputting leaf sector or subsector IDs in front-to-back (or back-to-front) order via a small hardware FIFO. The EE reads from the FIFO to get the rendering order without implementing the recursion itself.

Operation:

EE writes viewpoint (x, y, z) and BSP tree root address to the engine registers
Engine begins traversal — at each node, uses the dot product unit to determine which side the viewpoint is on, pushes the back subtree onto an internal stack, descends into the front subtree first
Each reached leaf sector is output to the FIFO
Engine raises interrupt when traversal is complete or FIFO reaches a threshold
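The traversal steps can be modelled with an explicit stack, as the engine would use. This sketch is a software illustration only: the node types are hypothetical, and the sign convention (a non-negative 2D cross product means the viewpoint is on the "front" side) is an assumption.

```python
from collections import namedtuple

Node = namedtuple("Node", "x y dx dy front back")  # partition line + two children
Leaf = namedtuple("Leaf", "sector")


def on_front_side(node: Node, vx: float, vy: float) -> bool:
    """Half-plane test: sign of the cross product of (view - point) and the line direction."""
    return (vx - node.x) * node.dy - (vy - node.y) * node.dx >= 0


def traverse_front_to_back(root, vx: float, vy: float):
    """Return leaf sector IDs ordered nearest-first relative to the viewpoint."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            order.append(node.sector)      # in hardware: push to the output FIFO
            continue
        if on_front_side(node, vx, vy):
            near, far = node.front, node.back
        else:
            near, far = node.back, node.front
        stack.append(far)                  # deferred until the near side is done
        stack.append(near)
    return order
```

Reversing the push order of `near` and `far` gives back-to-front output instead.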

BSP node packing in Graphics SRAM: A 2D BSP node (Doom-style) requires:

Partition line: x, y, dx, dy (4 × 16-bit = 64 bits)
Right child pointer, left child pointer (2 × 16-bit = 32 bits)
Bounding boxes (optional, 4 × 16-bit per side = 128 bits)

A compact 2D BSP node (Doom-style) without bounding boxes is 96 bits and fits in 3 SRAM words (108 bits), with 12 bits to spare for flags, sector type, and lighting data. A 1024-node BSP tree therefore occupies 3K words (~13.5KB) of Graphics SRAM, leaving the vast majority available for textures and cell maps.

18.5 Combined Rendering Pipeline

Ray casting and BSP acceleration work together for a Doom-style renderer:

Frame render sequence:

1. EE writes viewpoint to BSP traversal engine
2. BSP engine traverses tree → outputs visible sector list to FIFO
   (EE free to do other work during traversal)

3. EE reads sector list from FIFO
4. For each visible sector, EE dispatches column rays to DDA units
   (multiple sectors in flight simultaneously across DDA unit pool)

5. DDA units return hit distances, faces, texture offsets
6. Reciprocal unit converts distances to column heights
7. Blitter draws textured column spans to bitmap layer
   (TEX_GFXSRAM for texture atlas lookup)

8. Scanline mixer composites bitmap layer with hardware sprite layer
   (sprites for items, enemies, pickups — hardware sprites with zero blitter cost)

The hardware accelerators handle the pure arithmetic. The EE handles the scheduling and control flow. The blitter handles the pixel fill. The scanline mixer composites everything. Each unit does its job while the others run in parallel.

18.7 Height-Field Voxel Acceleration (Comanche / Delta Force Style)

Height-field voxel rendering casts rays across a 2D height map — each ray steps forward at ground level, sampling the height value at each grid position. When the sampled height projects higher on screen than the current column's drawn horizon, a vertical span is drawn upward to the new projected height. Rays step front-to-back; the column fills upward as taller features are found.

The inner loop per step:

1. Step ray forward (DDA, same unit as wall ray caster)
2. Sample height map at current (x,y): one Graphics SRAM read
3. Project height to screen: screen_y = horizon - (height - camera_z) × scale / distance (one reciprocal)
4. Compare projected height against current column top; if higher, draw span
5. Sample colour/texture at (x,y) for the span: second Graphics SRAM read

Steps 1–4 repeat for every step along the ray, typically 200–500 steps per column at Comanche-era quality. The entire column loop runs in hardware with no EE involvement per step.
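The column loop can be modelled in a few lines. This sketch assumes a square, wrapping height map, a unit-length ray direction (so distance equals the step count), and screen rows that grow downward. The function name and span tuple format are illustrative.

```python
import math


def render_column(height_map, colour_map, ox, oy, dx, dy,
                  camera_z, horizon, scale, screen_h, max_steps=300):
    """Return spans (y_top, y_bottom, colour) for one column, front to back."""
    size = len(height_map)
    column_top = screen_h               # rows at or below this are already drawn
    spans = []
    for step in range(1, max_steps + 1):
        x = math.floor(ox + dx * step) % size
        y = math.floor(oy + dy * step) % size
        # Perspective projection of the sampled height (one reciprocal of distance).
        projected = horizon - (height_map[y][x] - camera_z) * scale / step
        screen_y = max(math.floor(projected), 0)
        if screen_y < column_top:       # rises above everything drawn so far
            spans.append((screen_y, column_top - 1, colour_map[y][x]))
            column_top = screen_y
            if column_top == 0:
                break                   # column is full; stop early
    return spans
```

Only steps that beat the current column top emit a span, so the span count is usually far smaller than the step count.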

Height Map Sampler unit

A dedicated height map sampler takes a 2D (x,y) address in fixed-point and returns the height value and colour value in one pipeline operation. The Graphics SRAM holds the height map and colour map in adjacent regions:

Graphics SRAM height map packing: Multiple height values pack per SRAM word, e.g. 4 height values at 9-bit resolution per word. A 512×512 height map then occupies 64K words (~288KB); a 512×512 colour map alongside it stays well within 4.5MB. A 1024×1024 map uses DDR3 backing for the colour data with Graphics SRAM holding the active working region.

Span renderer

A vertical span renderer draws the filled column segments generated by the height-field scan, working from the hit (x, screen_y_top, screen_y_bottom, colour) tuples into the bitmap layer. One span write per height step that beats the current horizon — typically far fewer than the DDA steps, since only the first visible feature per height zone draws.

Column state registers: Each column maintains a "current horizon" register — the highest screen Y drawn so far. The height-field DDA unit reads and writes these per column as it steps. A bank of 480 horizon registers (one per Standard-resolution column) lives in BSRAM.

Throughput example: 480-column frame at 300 steps/column

480 columns × 300 steps = 144,000 DDA steps per frame
Each step: 1 SRAM read (height), 1 reciprocal, 1 compare
At 8 DDA units × 200MHz = ~1,600 million steps/second
144,000 steps / 1,600M = ~0.09ms — well under frame budget

The height-field renderer is among the cheapest 3D techniques in terms of hardware cost — the DDA and reciprocal units from wall ray casting handle it almost entirely, with only the height map sampler and span renderer as additions.

18.8 3D Voxel Grid Acceleration (Dense Grid / Minecraft Style)

3D voxel DDA extends the 2D wall ray caster to three axes. Each step through the grid determines which axis face (X, Y, or Z plane) the ray crosses next, which means comparing three boundary distances per step rather than two.

3D DDA unit — extends the 2D DDA unit with a Z axis:

Input: ray origin (x,y,z), ray direction (dx,dy,dz), voxel grid base address
Per step: three t_max values (tMaxX, tMaxY, tMaxZ) — the ray parameter at which the next X, Y, or Z grid boundary is crossed. The smallest wins and the ray advances to that face
Output: hit voxel coordinates (cx,cy,cz), hit face (±X/±Y/±Z), distance

The EE configures the direction and sends it; the 3D DDA unit steps autonomously until a solid voxel is hit, returning the result.

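Graphics SRAM grid packing: A 64³ grid (262,144 voxels) at 4 bits per voxel packs 9 voxels per word and occupies ~29K words (~128KB); at 8 bits per voxel, 4 per word, it occupies 64K words (~288KB). A 128³ grid at 8 bits needs 512K words (~2.25MB), about half of the 4.5MB; use 4-bit types or back with DDR3 for larger grids.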

18.9 Sparse Voxel Octree (SVO) Acceleration

Sparse voxel octrees skip empty space efficiently — the tree subdivides space into octants, and empty subtrees are single null pointers rather than arrays of empty voxels. High-resolution detailed voxel scenes (millions of voxels) that would be impractical as dense grids are tractable as SVOs.

The key operation per node: slab test (ray vs AABB)

At each octree node, the ray is tested against the node's axis-aligned bounding box:

t1   = (box_min - ray_origin) / ray_dir      (per axis)
t2   = (box_max - ray_origin) / ray_dir      (per axis)
tmin = max over axes of min(t1, t2)          ← entry distance
tmax = min over axes of max(t1, t2)          ← exit distance
hit  = (tmin <= tmax) and (tmax > 0)

Each of the three axis divisions is a subtract and a reciprocal-multiply. A slab test unit performs all three axis tests in parallel, delivering tmin, tmax, and hit in 3–4 cycles.
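A software version of the slab test is sketched below. It processes the axes in a loop, whereas the hardware unit is described as evaluating all three in parallel. It also takes the per-axis minimum and maximum so that negative direction components are handled, and treats a zero component as "parallel to that slab". The function name is illustrative.

```python
import math


def slab_test(origin, direction, box_min, box_max):
    """Ray vs axis-aligned box. Returns (hit, tmin, tmax)."""
    tmin, tmax = -math.inf, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0:
            if not lo <= o <= hi:       # parallel to this slab and outside it
                return False, tmin, tmax
            continue
        inv = 1.0 / d                   # the reciprocal-unit step
        t1, t2 = (lo - o) * inv, (hi - o) * inv
        tmin = max(tmin, min(t1, t2))   # latest entry across the axes
        tmax = min(tmax, max(t1, t2))   # earliest exit across the axes
    return (tmin <= tmax and tmax > 0), tmin, tmax
```

A hit with a negative `tmin` means the ray origin is already inside the box.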

Octant ordering: Given tmin per axis, the octant entry order is determined by sorting the three axis entry distances — which axis face is crossed first, second, third. This determines the order in which child octants are visited, ensuring front-to-back traversal for early termination.

SVO Traversal Engine

Analogous to the BSP traversal engine, the SVO traversal engine walks the octree autonomously:

EE writes ray origin, direction, and SVO root address
Engine performs slab test at each node using the slab test unit
On hit: if leaf, output voxel hit (address, face, distance) to FIFO; if branch, push far children to stack, descend near child
On miss: pop next entry from stack
Raises interrupt when ray terminates (first solid hit or tree exhausted)

Multiple SVO engines running in parallel cast multiple rays simultaneously — one per screen column for full-frame rendering, or one per shadow ray for lighting.

SVO node packing: Each node contains a child presence mask (8 bits), a child pointer (20 bits), a leaf flag (1 bit), and colour/material data (7 bits), 36 bits in total, so a node fits exactly in one Graphics SRAM word with no padding and no wasted bits.

A 16K-node tree (sufficient for a detailed scene) occupies 16K words (~72KB) of Graphics SRAM. Larger trees spill to FireStorm DDR3 with Graphics SRAM serving as a node cache for the active ray front.
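The 36-bit node can be packed and unpacked as shown below. The field order within the word is an assumption, as is the child addressing rule (children stored contiguously, so a child's address is the pointer plus the number of present children in lower-numbered octants). That rule is a common SVO convention rather than something the source specifies.

```python
# Assumed layout: [7:0] child mask, [27:8] child pointer,
# [28] leaf flag, [35:29] colour/material.

def pack_svo_node(child_mask: int, child_ptr: int, leaf: bool, material: int) -> int:
    """Pack the four fields into one 36-bit word."""
    if not (0 <= child_mask < 256 and 0 <= child_ptr < (1 << 20)
            and 0 <= material < 128):
        raise ValueError("field out of range")
    return child_mask | (child_ptr << 8) | (int(leaf) << 28) | (material << 29)


def unpack_svo_node(word: int):
    """Return (child_mask, child_ptr, leaf, material)."""
    return (word & 0xFF, (word >> 8) & 0xFFFFF,
            bool((word >> 28) & 1), (word >> 29) & 0x7F)


def child_address(child_mask: int, child_ptr: int, octant: int):
    """Word address of the given octant's child, or None if that child is absent."""
    if not (child_mask >> octant) & 1:
        return None
    lower = child_mask & ((1 << octant) - 1)
    return child_ptr + bin(lower).count("1")
```

A 20-bit pointer addresses 1M words, which matches the depth of the 1M × 36-bit SRAM.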

18.10 Shared Voxel / Ray Cast Resources

Hardware unit	Wall ray cast	Height-field	3D grid	SVO
DDA unit (2D)	✓ primary	✓ primary	—	—
DDA unit (3D)	—	—	✓ primary	—
Reciprocal unit	✓ column height	✓ height projection	✓ column height	✓ slab test
Dot product unit	BSP plane test	—	—	—
Slab test unit	frustum cull	—	—	✓ primary
Height map sampler	—	✓ primary	—	—
Span renderer	—	✓ primary	—	—
BSP traversal engine	✓ sector order	—	—	—
SVO traversal engine	—	—	—	✓ primary
Graphics SRAM	cell map	height/colour map	voxel grid	node cache

All units dispatch work via EE register writes and signal completion via interrupt or FIFO. The EE schedules multiple engines simultaneously — a BSP sector walk can happen in parallel with DDA column casting, and SVO traversal for one scene region can overlap height-field sampling for another. The FPGA's task scheduler and parallel execution model make fine-grained overlap natural.

18.11 Renderer Style Capability

Renderer style	Key operations	FireStorm hardware path
Wolfenstein 3D	2D DDA per column, flat walls	DDA units, reciprocal unit, blitter column fill
Doom	BSP sector order, ray per column seg, textured walls	BSP engine, DDA units, reciprocal unit, blitter textured span fill
Quake (software)	BSP + PVS, affine texture mapping	BSP engine, dot product unit, blitter affine textured spans
Height-field voxel	2D DDA + height map sample + span draw	DDA units, height map sampler, reciprocal unit, span renderer
Dense 3D voxel	3D DDA per ray	3D DDA units, reciprocal unit, blitter face fill
Sparse voxel octree	Slab test per node, front-to-back traversal	SVO traversal engine, slab test unit, reciprocal unit
Isometric tilemap	Scanline mapper, diamond spans, depth buffer	Isometric scanline mapper, diamond span unit, dot product unit, depth buffer, blitter span fill
Isometric sprites	World→screen transform, depth test	Dot product unit, blitter masked sprite, depth buffer
Shadow maps	Depth buffer render + depth compare	Depth comparator, coord transform, dot product unit
SSAO / contact shadows / SSR	Screen-space DDA + depth samples	DDA units (reused), blitter memory unit
BVH + shadow rays	BVH traversal + depth compare	SVO engine (parameterised), slab test unit
Ray-triangle intersection	Möller–Trumbore	Cross product unit, dot product unit, reciprocal unit
Custom voxel / hybrid	Mix of above	Any combination dispatched by EE

The 480×270 Standard resolution is a natural fit for all of these renderers: it is close to the low internal resolutions that period engines actually ran at (320×200 for Wolfenstein 3D and Doom), integer-scaled to the output. At Standard resolution the DDA unit pool works through the 480 columns in parallel batches; the blitter draws textured spans into the bitmap layer; the scanline mixer composites hardware sprites on top at zero additional render cost.

19. Ray Trace Acceleration and Shadow Functions

Full path tracing is not a realistic target for FPGA hardware at this scale. What is realistic — and sufficient to produce convincingly lit scenes — is a layered set of targeted accelerators, each adding shadow and lighting quality at incremental hardware cost. The following are ordered from cheapest to most capable.

19.1 Shadow Maps — Minimal New Hardware

Shadow mapping is a two-pass technique: render the scene from the light source's point of view into a depth buffer, then during the main render compare each pixel's light-space depth against the stored value. The comparison produces a single shadow/lit bit per pixel.

New hardware required: a depth comparator and a coordinate transform (world position → light space). The coordinate transform is a 4×4 matrix multiply — four dot products per pixel, using the existing dot product unit. The depth comparator is a subtract and sign check — trivial.

Depth buffer storage: a Standard-resolution depth buffer at 16-bit precision is 480×270×2 = ~253KB, fitting comfortably in Graphics SRAM. A separate depth buffer is maintained for each active light source. The shadow map render pass is a standard blitter job targeting the depth buffer rather than a colour bitmap.

Percentage-closer filtering (PCF): instead of a single depth comparison, sample several neighbouring texels and average the results — producing soft shadow edges proportional to sample count. The blitter's memory unit performs the multiple samples as sequential reads; the EE accumulates and averages. No additional hardware needed.
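A minimal model of PCF over a square neighbourhood is sketched below. The depth convention (smaller values are nearer the light), the bias term and the edge clamping are assumptions. In the scheme described above, the reads would come from the blitter's memory unit and the averaging would run on the EE.

```python
def pcf_shadow(depth_map, u: int, v: int, pixel_depth: int,
               radius: int = 1, bias: int = 1) -> float:
    """Fraction of the (2*radius+1)^2 neighbourhood in which the pixel is lit."""
    h, w = len(depth_map), len(depth_map[0])
    lit = total = 0
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            x = min(max(u + du, 0), w - 1)   # clamp at the map edge
            y = min(max(v + dv, 0), h - 1)
            # Lit if nothing nearer to the light was recorded at this texel.
            lit += pixel_depth <= depth_map[y][x] + bias
            total += 1
    return lit / total
```

A result of 1.0 is fully lit and 0.0 fully shadowed; intermediate values form the soft edge.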

Result: proper directional shadows with controllable edge softness, using the blitter's existing render-pass model, at almost no additional silicon cost.

19.2 Screen-Space Techniques — Zero New Hardware

The DDA units designed for ray casting and voxel stepping are directly applicable to screen-space ray marching. Step a ray along the screen-space depth buffer, sample at each step, detect intersection. No new hardware — new uses of existing units.

Screen-space ambient occlusion (SSAO): sample depth values in a hemisphere around each pixel's world-space position. Count how many samples are occluded by nearby geometry — the ratio approximates how much ambient light reaches that point. Conventionally expensive; on FireStorm the depth buffer samples are burst reads from Graphics SRAM and the accumulation runs on the EE.

Contact shadows: very short ray marches (8–16 steps) along the screen-space depth buffer near geometry edges. Produces convincing shadows where surfaces meet — cracks, corners, where objects rest on floors. Extremely cheap, high visual impact. The DDA units step through screen space; the blitter composites the contact shadow mask over the main render.

Screen-space reflections (SSR): march a reflection ray along the depth buffer until it hits geometry, then sample the colour buffer at the hit point. Convincing for flat reflective surfaces (floors, wet ground, metal). Same DDA units, same depth buffer, different ray direction.

All three techniques share one camera-view depth buffer, rendered with the same blitter depth pass that the shadow map uses from the light's viewpoint, so the buffer is paid for once rather than once per technique.

19.3 BVH Traversal — Minimal New Hardware

A Bounding Volume Hierarchy is structurally identical to a sparse voxel octree for traversal purposes — both walk a tree using slab tests (ray vs AABB) at each node. The SVO traversal engine (Section 18.9) can be parameterised to handle BVH nodes with minimal additional logic — the node format changes, the traversal algorithm does not.

With BVH traversal, shadow rays for polygon geometry become feasible: trace one ray from each lit surface point toward each light source, test it against the BVH, output shadow/lit. This gives hard per-pixel ray-tested shadows for polygon scenes at a fraction of full path tracing cost.

BVH node format: partition plane or AABB bounds, left/right child pointers, leaf flag, primitive index. Fits in Graphics SRAM for scenes with up to ~16K nodes.
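
The slab test that both tree walks rely on can be written as below. This is a minimal sketch assuming the caller supplies the per-axis reciprocal of the ray direction (for example from the reciprocal unit) and that no direction component is exactly zero; the function name is illustrative.

```python
def ray_hits_aabb(origin, inv_dir, box_min, box_max):
    """Slab test: True if the ray (t >= 0) intersects the axis-aligned box."""
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        # Distances along the ray to the two planes of this slab.
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        # Keep the overlap of all three slab intervals.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near <= t_far
```

A traversal loop would run this once per node and descend only into children whose boxes return True.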

19.4 Ray-Triangle Intersection — Small New Hardware

The Möller–Trumbore algorithm: compute two edge vectors, two cross products, four dot products, a reciprocal, and bounds checks. The dot products and reciprocal reuse existing units. The one new operation is the cross product: six multiplies and three subtracts, approximately 8–10 DSP blocks.
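
For reference, the standard formulation of the algorithm in floating point is sketched here; it uses two cross products, four dot products and one reciprocal. The helper names and the epsilon value are choices made for the example, and a hardware version would use fixed-point arithmetic.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    # Six multiplies and three subtracts.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the hit distance t along the ray, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)      # two edge vectors
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                        # ray parallel to the triangle plane
    inv_det = 1.0 / det                    # the single reciprocal
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None
```

For a shadow ray only the hit/miss result matters, so the final t can be compared against the distance to the light instead of being returned.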

With a cross product unit added, the hardware can perform full ray-triangle intersection. Combined with BVH traversal (Section 19.3), this gives:

Hard shadows for polygon geometry via shadow rays
Primary ray casting for polygon scenes (not just DDA grid)
Reflection rays against polygon geometry

One intersection test per 4–6 cycles. A pool of parallel intersection units (parameterised at HDL build time) gives proportional throughput scaling.

19.5 Recommended Addition Order

Stage	New hardware	Technique enabled	Visual result
1	Depth comparator + coord transform (~5 DSP)	Shadow maps	Proper directional shadows
2	None	SSAO, contact shadows, SSR	Ambient occlusion, reflections
3	Parameterise SVO engine for BVH	BVH traversal	Hard ray-tested shadows for polygons
4	Cross product unit (~10 DSP)	Ray-triangle intersection	Full polygon shadow rays, reflection rays

Stages 1 and 2 together — shadow maps plus screen-space techniques — produce the combination most people perceive as "ray traced looking" without any ray-triangle intersection hardware. Shadow maps give the directional shadows; SSAO gives the contact darkening and ambient occlusion that makes geometry feel grounded; contact shadows fill in the fine detail at surface intersections. The visual step from this combination to full ray traced shadows is smaller than the visual step from no shadows to this combination.

Stages 3 and 4 are the path toward a proper hybrid rasterise-and-raytrace pipeline — rasterise the primary view, cast shadow and reflection rays against a BVH for secondary lighting. Not a real-time path tracer, but convincingly lit geometry within the constraints of the hardware.

19.6 EE and Blitter Role

The EE schedules all shadow and ray passes as blitter jobs, exactly as it does for the main render:

Frame render with shadows:

Job 1: Shadow map pass    — render scene depth from light POV → depth buffer
Job 2: Main render pass   — render scene colour → world bitmap
Job 3: SSAO pass          — DDA screen-space samples → AO mask
Job 4: Contact shadows    — short DDA marches → contact shadow mask
Job 5: Composite          — world bitmap × AO mask × contact mask → output

→ Jobs 1 and 2 are sequentially dependent (shadow map before main render)
→ Jobs 3 and 4 can run in parallel (both read the same depth buffer, write different masks)
→ Job 5 depends on Jobs 2, 3, 4

The depth comparator and coordinate transform run as part of Job 2 — each pixel's shadow test is a per-pixel operation during the main render, not a separate pass. The blitter performs the comparison as it writes each pixel's colour value.
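
The dependency notes in the job list can be modelled as a small graph and grouped into waves of jobs that may run together. The sketch below uses Python's standard graphlib module; the job names are shorthand for the five jobs above, and the assumption that Jobs 3 and 4 wait only on Job 1 (because they read the depth buffer) is an interpretation, not something the job list states.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it must wait for.
FRAME_JOBS = {
    "shadow_map": set(),
    "main_render": {"shadow_map"},
    "ssao": {"shadow_map"},             # assumed: needs the depth buffer only
    "contact_shadows": {"shadow_map"},  # assumed: needs the depth buffer only
    "composite": {"main_render", "ssao", "contact_shadows"},
}

def schedule_waves(deps):
    """Group jobs into waves; jobs within a wave have no mutual dependency."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Under these assumptions schedule_waves(FRAME_JOBS) yields three waves: the shadow map pass, then the main render with both screen-space passes, then the composite.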

20. Isometric Rendering System

FireStorm includes dedicated acceleration for isometric tilemaps and sprites — the rendering approach behind Populous, Syndicate, Theme Hospital, and RollerCoaster Tycoon. The system is designed to eliminate the per-frame CPU overhead that constrained those games on period hardware, and to handle depth ordering automatically via a hardware depth buffer rather than requiring software painter's algorithm sorting.

20.1 The Isometric Transform

Every object in the isometric world — tile or sprite — has a world-space position (world_x, world_y, world_z). The screen position is a fixed linear transform:

screen_x = (world_x - world_y) × tile_half_width  + scroll_x
screen_y = (world_x + world_y) × tile_half_height - world_z × height_scale + scroll_y
depth_z  =  world_x + world_y  - world_z

This is handled by the existing dot product unit — three dot products and two additions per object, 2–3 cycles. The same transform positions both tiles and sprites. The FPGA computes screen positions for an entire row of tiles in a handful of cycles.
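
Written out as a function, the transform is only a few multiplies and adds. The parameter names mirror the formulas above; the function itself is an illustrative software model, not an EE or blitter API.

```python
def iso_transform(world_x, world_y, world_z,
                  tile_half_width, tile_half_height, height_scale,
                  scroll_x=0, scroll_y=0):
    """Map a world position to (screen_x, screen_y, depth_z)."""
    screen_x = (world_x - world_y) * tile_half_width + scroll_x
    screen_y = ((world_x + world_y) * tile_half_height
                - world_z * height_scale + scroll_y)
    depth_z = world_x + world_y - world_z
    return screen_x, screen_y, depth_z
```

For example, with 32×16 tiles (half sizes 16 and 8) and height_scale 8, world position (3, 1, 0) lands at screen (32, 32) with depth 4; raising it to world_z = 2 moves it up 16 pixels and lowers its depth to 2.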

Scrolling is an increment to scroll_x and scroll_y — no repaint of unchanged tiles, no full-screen redraw. Only tiles entering or leaving the visible region require new data.

20.2 Isometric Scanline Mapper

The isometric scanline mapper is a small state machine that, given a screen Y coordinate, returns the set of (world_x, world_y) tile positions whose diamonds intersect that scanline. This is the inverse of the isometric transform — instead of tile → screen, it computes screen_y → tile list.

The mapper walks the isometric grid scanline by scanline, analogous to the DDA ray caster walking a grid cell by cell. For each output scanline it produces a compact list of (tile_col, tile_row, span_left, span_right) tuples — the tiles visible on that line and the horizontal pixel extent of each diamond.

This replaces the classic tile-by-tile painter's algorithm with a scanline-coherent render that only processes tiles actually contributing pixels to each line. No invisible tiles are touched.
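
One way to model the inverse lookup in software, for flat (world_z = 0) tiles, is shown below. It relies on the fact that under the transform in 20.1 a tile's diamond spans screen rows from (x + y) × tile_half_height to (x + y + 2) × tile_half_height, so each scanline crosses exactly two diagonals of the grid. The function name and the map-size parameters are assumptions for the sketch, and span extents are left to the span stage.

```python
def tiles_on_scanline(y, tile_half_height, map_w, map_h, scroll_y=0):
    """List (tile_col, tile_row) whose flat diamond covers screen row y."""
    band = (y - scroll_y) // tile_half_height
    tiles = []
    # Diagonal s = col + row covers rows [s * h, (s + 2) * h).
    for s in (band - 1, band):
        first = max(0, s - (map_h - 1))
        last = min(map_w - 1, s)
        for col in range(first, last + 1):
            tiles.append((col, s - col))
    return tiles
```

Tiles on a row that only touches a diamond's top vertex have zero width there and would be rejected when the span is computed.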

20.3 Diamond Span Renderer

Each tile contributes a horizontal span of pixels to each scanline it covers — the width of the diamond narrows toward the top and bottom of the tile. The diamond span unit computes left_x and right_x for a given tile at a given scanline Y from the tile's screen position and size. This is a small combinational unit — a handful of additions and shifts, no DSP blocks.
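
A software model of the span calculation could look like this. The centre-based parameterisation and the function name are assumptions; for the common 2:1 tile shape the multiply and divide collapse to a single shift, which is in keeping with the additions-and-shifts description above.

```python
def diamond_span(centre_x, centre_y, half_w, half_h, y):
    """Return (left_x, right_x) of a tile diamond on scanline y, or None."""
    dy = abs(y - centre_y)
    if dy >= half_h:
        return None  # scanline is outside the diamond (or on a vertex)
    # Width shrinks linearly from half_w at the centre row to zero at the tips.
    reach = half_w * (half_h - dy) // half_h
    return centre_x - reach, centre_x + reach
```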

The span coordinates feed directly into the blitter as a textured span fill job: fetch the tile's pixel row from the tile texture at the correct V coordinate, write it to the output bitmap between left_x and right_x. The blitter's existing masked textured span primitive handles this exactly.

20.4 Depth Buffer

The depth buffer eliminates painter's algorithm sorting entirely. Each pixel drawn to the isometric bitmap layer carries a Z value — the depth_z from the isometric transform. The blitter tests each incoming pixel against the stored depth; closer pixels write, further pixels are discarded.

Buffer specification:

Parameter	Value
Format	16-bit per pixel
Resolution	Matches output bitmap (e.g. 480×270 Standard)
Storage	Graphics SRAM (~253KB)
Precision	16 bits — more than sufficient for isometric scene depths
Shared with	Shadow map depth buffer (Section 19) — same buffer, different frames or separate regions

Depth buffer flags on blit job descriptor:

Flag	Meaning
`BLT_DEPTH_WRITE`	Write depth_z to depth buffer when writing a pixel
`BLT_DEPTH_TEST`	Discard pixel if depth_z ≥ stored depth at that pixel position
`BLT_DEPTH_CLEAR`	Fill job targeting depth buffer with maximum Z — used at frame start

These flags work on any blitter primitive — tile span fills, sprite blits, shape fills — making the depth buffer available to the full primitive set.
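
The per-pixel behaviour implied by the test and write flags can be modelled as follows. The flag bit values and the blit_pixel helper are placeholders invented for the sketch; only the flag names and the "discard if depth_z ≥ stored depth" rule come from the table above.

```python
BLT_DEPTH_WRITE = 0x1  # placeholder bit values, not the real descriptor layout
BLT_DEPTH_TEST = 0x2
DEPTH_MAX = 0xFFFF     # value written by a depth clear (16-bit buffer)

def blit_pixel(colour_buf, depth_buf, x, y, colour, depth_z, flags):
    """Write one pixel, honouring the depth flags. Returns True if written."""
    if (flags & BLT_DEPTH_TEST) and depth_z >= depth_buf[y][x]:
        return False                      # something nearer is already there
    colour_buf[y][x] = colour
    if flags & BLT_DEPTH_WRITE:
        depth_buf[y][x] = depth_z
    return True
```

Opaque primitives would set both flags; a pass that sets only BLT_DEPTH_TEST is occluded by existing geometry but does not change the stored depth.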

What the depth buffer eliminates:

Unified draw list sorting for opaque objects — no longer needed
Split-sprite calculation for tall buildings occluding sprites — depth buffer arbitrates per-pixel automatically
Stencil mask pass for occlusion — depth test replaces it
Most EE overhead managing draw order

Transparency: semi-transparent objects (windows, foliage, rain, shadows) use BLT_DEPTH_TEST without BLT_DEPTH_WRITE and draw after all opaque geometry. Transparent objects still need painter's order among themselves, but this is a small fraction of scene objects.

20.5 Hardware Sprite Layer — No Depth Buffer

The hardware sprite layer composites sprites at the output pixel clock in the scanline mixer — a real-time pipeline operating against a live display stream. Depth buffer testing would require a Graphics SRAM read per sprite pixel synchronised to the pixel clock, which is architecturally incompatible with the scanline pipeline's timing and latency constraints.

More fundamentally, the hardware sprite layer's purpose is guaranteed-foreground compositing — the cursor, system overlays, DeMon's reserved region, HUD elements. These are precisely the objects that should always appear on top, regardless of scene depth. Depth testing would be counterproductive for this use case.

The division of labour is therefore natural:

Layer	Depth buffer	Best for
Hardware sprite layer	No	Cursor, system overlays, HUD, always-foreground UI
Blitter bitmap layer	Yes	All world objects — tiles, buildings, characters, items

World objects in an isometric game live in the blitter bitmap layer with full depth testing. The one-frame pipeline latency of the blitter is imperceptible at 60fps and is a worthwhile trade for correct per-pixel occlusion of all world geometry.

20.6 Isometric Sprites

Isometric sprites — characters, vehicles, animals, equipment, loose items — are drawn by the blitter as standard masked sprite blits into the isometric bitmap layer, with BLT_DEPTH_TEST | BLT_DEPTH_WRITE. The depth value is world_x + world_y - world_z computed from the object's world position by the EE.

The blitter draws sprites in any convenient order. The depth buffer arbitrates occlusion automatically:

A character behind a building loses the depth test for occluded pixels — correct occlusion without split-sprite calculations
Two characters overlapping are resolved per-pixel — correct for any arbitrary overlap
A character partly behind a tree, partly in front — resolved correctly per pixel with no special case handling

Isometric sprite attributes extend the standard sprite descriptor with world-space position:

Field	Notes
world_x, world_y, world_z	World position — EE computes screen_x, screen_y, depth_z from these
Sprite sheet address	Tile in Graphics SRAM or DDR3
Animation frame	Selects tile within sheet
Direction	4 or 8-way facing — selects row in sprite sheet
Scale	Optional — blitter handles nearest/bilinear
Depth_z override	For objects that should sort differently than their position implies

The EE maintains a world-object list. Each frame it computes screen positions for all visible objects, builds a blitter primitive list, and dispatches it as a single job. The blitter draws all objects; the depth buffer resolves occlusion.

20.7 Tall Objects and Multi-Tile Buildings

A Theme Hospital ward block or RCT roller coaster may occupy multiple tiles and extend several floors upward. On classic hardware this required split-sprite rendering — draw the lower part of the sprite, draw the tiles in front, draw the upper part. With a depth buffer this is handled automatically:

The building is either:

A single sprite with depth_z set to its base tile depth. Floors of the building at higher world_z will correctly occlude objects at lower world_z behind them.
A set of sprites per floor each with their own world_z-derived depth_z. Characters on upper floors composite correctly against the building geometry at each height level.

The depth buffer resolves every pixel correctly regardless of which approach is used.

20.8 Frame Render Sequence

Frame N job queue (dispatched at V-blank start):

Job 1: Depth buffer clear → max Z          (memory/copy — fast fill)
Job 2: Tile pass          → bitmap + depth  (BLT_DEPTH_WRITE)
       Isometric scanline mapper feeds span list
       Blitter draws textured diamond spans with depth_z per tile
Job 3: Sprite pass        → bitmap + depth  (BLT_DEPTH_TEST | BLT_DEPTH_WRITE)
       EE-built primitive list, any order
       Depth buffer arbitrates occlusion automatically
Job 4: Transparent pass   → bitmap only     (BLT_DEPTH_TEST, no BLT_DEPTH_WRITE)
       Sorted transparent objects (foliage, windows, weather effects)
Job 5: Composite          → output layer    (clip table for UI border if needed)

→ Jobs 2 and 3 can overlap on different sub-units if depth buffer regions don't conflict
→ Job 4 depends on Jobs 2 and 3
→ Hardware sprite layer composites cursor and HUD on top at scanline output — zero cost

Jobs 2 and 3 together replace everything that was software in the classic isometric games — the tile loop, the sprite sort, the depth ordering, the occlusion handling. On FireStorm the EE's role is building the sprite primitive list (world position → screen position, one transform per object) and dispatching the jobs. Game logic — pathfinding, AI, economy — runs on the FireStorm EE entirely unaffected.

20.9 Scale Assessment — Theme Hospital / RollerCoaster Tycoon

Parameter	Classic hardware	FireStorm
Visible tiles	Full redraw every frame, CPU cost per tile	Scanline mapper + blitter, EE not involved per tile
Dynamic objects	Software painter sort + blit every frame	Blitter primitive list, depth buffer arbitrates
Occlusion	Split-sprite, stencil, or incorrect	Per-pixel depth test, always correct
Scroll	Full repaint	Scroll register increment, only edge tiles update
CPU role	Rendering	Game logic only
Sprite budget	Limited by CPU blit speed	Several thousand blitter sprites per frame

A RollerCoaster Tycoon-scale scene — 50×50 visible tiles, 500 visible peeps and vehicles — fits comfortably within a single frame's blitter budget at 480×270 Standard resolution. The FireStorm EE spends its time on guest AI, ride physics, and economy rather than pixel pushing.

Feature	Amiga OCS/ECS	Atari ST	Ant64 FireStorm
Playfields / layers	2 (dual playfield)	1	Multiple (configurable)
Hardware sprites	8 per scanline (fixed width)	No hardware sprites	V_scale dependent, hundreds to thousands per scanline
Copper coprocessor	Yes — fixed function	No	Yes — general register write engine
Blitter	Yes — 3-source	No	Yes — hardware blitter with full primitive set
Ray casting / BSP / voxel	No	No	Yes — DDA units (2D/3D), BSP engine, SVO engine, height map sampler, reciprocal, slab test
Shadow / ray trace	No	No	Shadow maps, SSAO, contact shadows, BVH traversal, ray-triangle intersection
Isometric rendering	Software only	Software only	Scanline mapper, diamond span unit, depth buffer, blitter sprites — no CPU render cost
Hardware depth buffer	No	No	Per-pixel depth test/write on all blitter primitives
Scanline effects	Via Copper	No	Via Copper, any register
Mixed H resolutions	Yes (lo/hi/HAM)	No	Yes — per layer independently
Native palette	4,096 colours, 32 on screen (64 in EHB mode)	512 colours, 16 on screen	16,384 entries flat RAM, 256 palette descriptors, dual RGB/HSV access
Tilemap scroll	Per-tile-row H / per-2-col V	No	Per-tile-row H / per-tile-col V / per-line H — full independent axes
HAM mode	HAM6 (4-bit/channel) / HAM8 (6-bit/channel)	No	HAM24 (8-bit/channel, ~invisible fringing) per layer, Copper switchable
Max output res	1280×512 (interlaced)	640×400 (mono)	3840×2160 @ 60Hz (480×270 Standard / 960×540 Hires internally)
Audio integration	Yes (Paula 4-channel)	Yes (YM2149)	Yes — FireStorm DSP, 128+ voices
Programmable logic	No — fixed silicon	No	Yes — full FPGA, reconfigurable per cartridge

The key difference: Amiga and Atari programmers had to work around the fixed capabilities of their custom chips. On the Ant64, the custom chip is the FPGA itself — personality cartridges can reconfigure the entire display pipeline for a completely different architecture if desired.

The Amiga parallel is intentional: Standard (480×270) closely matches Amiga low res pixel size; Hires (960×540) doubles both axes exactly as Amiga hires did. The difference is that Amiga hires needed interlace to hit 512 lines and flickered doing it. The Ant64 does it cleanly at 4K/60.

21. Workstation App Rendering — ImGui Backend

FireStorm acts as the Dear ImGui rendering backend for the Music Workstation App running on the FireStorm EE. The CPU builds the ImGui draw list (a compact buffer of vertices, indices, and draw commands) and DMAs it to FireStorm, which rasterises it into the framebuffer in hardware. The CPU does zero pixel work.

┌─────────────────────────────────────────────────────────────────────┐
│              Application Processor (FireStorm EE, bare metal)       │
│                                                                     │
│  Music Workstation App (C++)                                        │
│  Page A·D·G·K·L·V·W·H·S·R·E·F·M                                     │
│                                                                     │
│  Dear ImGui::NewFrame() → ImGui::Render() → ImDrawData              │
│  (vertex buffer + index buffer + draw commands)                     │
│         │                                                           │
│         │  DMA — draw list  (not pixels — just geometry + colour)   │
└─────────┼───────────────────────────────────────────────────────────┘
          │
┌─────────▼────────────────────────────────────────────────────────────┐
│              FireStorm FPGA — Dual Role                              │
│                                                                      │
│  ┌─────────────────────────────┐  ┌───────────────────────────────┐  │
│  │    AUDIO DSP PIPELINE       │  │    2D RASTERIZER              │  │
│  │                             │  │                               │  │
│  │  128 voices time-multiplexed│  │  Triangle setup               │  │
│  │  VA · FM · Sample engines   │  │  Scanline rasterizer          │  │
│  │  Filters · BBD chorus       │  │  Gouraud colour interp        │  │
│  │  48kHz sample rate          │  │  Font texture sampler (BRAM)  │  │
│  │                             │  │  Framebuffer write (SRAM/DDR3)│  │
│  │  Uses: DSP blocks + BSRAM   │  │  Uses: logic + BRAM + SRAM    │  │
│  │  Orthogonal FPGA resources  │  │  Orthogonal FPGA resources    │  │
│  └─────────────────────────────┘  └──────────┬────────────────────┘  │
│                                              │                       │
│                                    ┌─────────▼──────────┐            │
│                                    │  Display output    │            │
│                                    │  (HDMI / VGA)      │            │
│                                    └────────────────────┘            │
└──────────────────────────────────────────────────────────────────────┘

The two roles are orthogonal — audio DSP uses DSP multiply blocks; the rasteriser uses logic cells and BRAM. They run simultaneously on independent fabric resources; their working data is placed across BSRAM, the SRAM bus, and DDR3 by best fit and arbitrated where it shares a bus.

Rasterizer Performance

The rasteriser runs on completely independent fabric — it does not share fabric resources with the audio DSP and does not consume any of the audio cycle budget.

  Rasterizer clock:         200MHz (same fabric)
  Output pixel clock:       74.25MHz (720p) · 148.5MHz (1080p)
  Fabric cycles per pixel:  200 / 74.25 = 2.7 fabric cycles per pixel output

  Per-frame budget (60fps):  200,000,000 / 60 = 3,333,333 cycles

  Spectrogram render (512 × 256 pixels):
    131,072 pixels × 3 cycles (LUT + write) = 393,216 cycles = 11.8% of frame
    → runs in 1.97ms — well within 16.67ms frame budget

  ImGui UI (typical complex panel, avg triangle 50×50 px):
    Setup: 15 cycles · Fill: 2,500 px × 3 cycles = 7,515 cycles/triangle
    3,333,333 / 7,515 = 443 triangles per frame at 60fps
    → typical ImGui draw list: 200–1000 triangles; comfortable only if most are well under 50×50 px (e.g. glyph quads)

  At 250MHz:
    Frame budget: 4,166,667 cycles → 554 triangles/frame

The spectrogram and a full ImGui UI panel render simultaneously within the same frame with substantial headroom.
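
The budget figures above can be reproduced with a few lines of integer arithmetic. The function names are illustrative; the 15-cycle setup, 3 cycles per pixel and the 50×50 bounding-box fill are the assumptions stated in the estimate itself.

```python
def frame_cycles(clock_hz, fps=60):
    """Fabric cycles available per frame (rounded down)."""
    return clock_hz // fps

def triangle_cycles(width_px, height_px, setup=15, cycles_per_px=3):
    """Cost of one triangle, charged for its full bounding box."""
    return setup + width_px * height_px * cycles_per_px

def triangles_per_frame(clock_hz, width_px=50, height_px=50):
    return frame_cycles(clock_hz) // triangle_cycles(width_px, height_px)

# Spectrogram: 512 x 256 pixels at 3 cycles each.
spectrogram_cycles = 512 * 256 * 3
spectrogram_share = spectrogram_cycles / frame_cycles(200_000_000)
```

Because the triangle count scales inversely with average triangle area, halving the average edge length to 25×25 px roughly quadruples the number of triangles that fit in a frame.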

The ImGui rasteriser pipeline (triangle fetch → setup → bounding box clip → scanline fill → Gouraud colour → font atlas BRAM sample → framebuffer write) is documented in Section 20 of this document.

3D Rendering Architecture — Two-Pass Split with LOD

The FireStorm rasteriser supports 3D world rendering using a two-pass technique that solves the fundamental z-buffer precision problem for large-scale scenes. The approach is well documented in James Lambert's N64 world renderer series — the same constraints apply here: limited z-buffer precision, constrained memory, and the need to render both a vast far scene and a geometrically precise near scene in the same frame.

The Problem with a Single Full-Range Z-Buffer

A perspective-projected z-buffer stores depth non-linearly. The precision distribution is heavily biased toward the near plane:

  Linear world depth:   0.1m ──────────────────────────────→ 1000m
  Z-buffer values:      ████████████████████░░░░░░░░░░░░░░░░░ (16-bit)
                        ▲ near: very precise     far: almost none ▲

  Most precision wasted on the range 0.1–2m.
  Objects at 200m vs 201m may map to the same z-buffer value → z-fighting.
  Objects at 1000m+ are all quantised to maximum depth — no distinction possible.

A full-range z-buffer covering 0.1m to infinity simply cannot resolve distant geometry reliably, regardless of bit depth. The solution is to not use the z-buffer for far geometry at all.
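
The effect is easy to reproduce numerically. The sketch below assumes a conventional perspective depth mapping to the range 0..1 followed by rounding to 16 bits; the exact projection FireStorm uses is not specified here, so treat the mapping and the function name as assumptions.

```python
def perspective_z16(z, near=0.1, far=1000.0):
    """Quantise view-space distance z to a 16-bit perspective depth value."""
    ndc = (far / (far - near)) * (1.0 - near / z)  # 0 at near, 1 at far
    return round(ndc * 65535)

near_half = perspective_z16(0.2)   # depth code reached only 10 cm past the near plane
far_a = perspective_z16(200.0)
far_b = perspective_z16(201.0)     # one metre further away
```

With these parameters more than half of the 65,536 codes are consumed between 0.1 m and 0.2 m, and the two far samples a metre apart quantise to the same code.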

The Two-Pass Solution

The scene is divided at a split distance D into two zones. Each zone uses a completely different visibility technique:

  Camera
    │
    │◄──────── NEAR ZONE ───────────►│◄────────── FAR ZONE ───────────────►│
    │          [0 ... D]             │           [D ... ∞]                 │
    │                                │                                     │
    │  Full z-buffer precision       │  No z-buffer                        │
    │  All 16 bits used for [0..D]   │  Painter's algorithm (back to front)│
    │  Resolves cm-level depth       │  Pre-baked LOD chunks               │
    │  Dynamic geometry welcome      │  Baked meshes, sorted at load time  │
    │                                │                                     │
    │  Rendered SECOND               │  Rendered FIRST                     │

D is chosen per-scene — typically the distance at which individual polygon edges become sub-pixel and LOD simplification is invisible to the player. 200–500 world units is typical for a Mega Drive / N64-style game world.

Pass 1 — Far Zone (Back to Front, No Z-Buffer)

Rendered first, farthest to nearest, painter's algorithm. No z-buffer reads or writes. Each element simply overwrites whatever was drawn before it. Correct because the elements are sorted by depth — a closer sky element will always be drawn on top of a farther one.

  Order of rendering (far pass):
  ┌──────────────────────────────────────────────────────┐
  │  1. Skybox / sky dome                                │  ← always furthest
  │     Single fullscreen quad or cube faces             │
  │     Fills every pixel — clears the framebuffer       │
  │                                                      │
  │  2. Very distant terrain / world chunks              │
  │     Pre-baked to billboard textures at horizon       │
  │     Essentially flat — no z needed                   │
  │                                                      │
  │  3. Far LOD chunks (distance D to 2D)                │
  │     Low-polygon baked meshes, sorted far→near        │
  │     Each chunk drawn as a simple triangle list       │
  │                                                      │
  │  4. Mid LOD chunks (distance D/2 to D)               │
  │     Medium detail, still painter's algorithm         │
  │     Sorted far→near within chunk grid                │
  └──────────────────────────────────────────────────────┘

At the end of pass 1, the framebuffer contains the entire far scene — sky, distant terrain, far geometry — without a single z-buffer comparison having been made. The z-buffer is completely unused and memory is not allocated for it during this pass.

Pass 2 — Near Zone (Z-Buffer Active, Distance 0 to D)

Rendered second, on top of the far pass result. The z-buffer is now active for depth testing and writing. Crucially: the z-buffer only needs to represent depths in [0, D]. All 15 or 16 bits of precision are concentrated in this near range.

  Z-buffer precision in near zone only:

  Signed 15-bit (S15, range −16384 to +16383 in clip space):
  Near plane 0.1m → D (say, 300m):
  Precision: 32768 levels across 300m → ~9mm per z-level (assuming depth is distributed linearly)

  For a game where near geometry is furniture, characters, walls:
  9mm precision is completely invisible. No z-fighting anywhere.

The near pass draws all dynamic and high-detail geometry: characters, objects, foreground tiles, particle effects, UI elements in world space.

  Near pass render order:
  ┌──────────────────────────────────────────────────────┐
  │  Near world geometry (distance < D)                  │
  │  → z-test against buffer, write on pass              │
  │                                                      │
  │  Dynamic objects (characters, enemies, projectiles)  │
  │  → z-test, write                                     │
  │                                                      │
  │  Transparent/alpha geometry (sort back-to-front      │
  │  within near zone, no z-write but z-test)            │
  │                                                      │
  │  Particle effects, billboards                        │
  │  → z-test only, no z-write (additive blend)          │
  │                                                      │
  │  HUD / UI (screen-space, drawn last, no z)           │
  └──────────────────────────────────────────────────────┘

Z-Buffer Format and Memory

15-bit signed (S15) is the natural choice — clip-space depth is naturally signed (in front of / behind the near plane), and 15 bits leaves 1 bit for a stencil flag or can be rounded to 16-bit aligned storage with the top bit unused.

16-bit (U16) is the practical storage format — 16-bit aligned reads and writes are native to SRAM, and the top bit is either the sign or a stencil/flag bit.

Memory cost:

  Resolution    Pixels      Z-buffer (16-bit)   Framebuffer RGB12 (36-bit/px)
  640×480       307,200     600 KB              1.3 MB
  1280×720      921,600     1.8 MB              4.0 MB
  1280×720      921,600     1.8 MB packed       → the SRAM bus can hold both at 36-bit

  SRAM bus: 36-bit wide. At 1280×720:
  ├─ Framebuffer: 921,600 × 36 bits = 4.0 MB (only just fits in a typical 4MB 36-bit SRAM)
  └─ Z-buffer:    921,600 × 16 bits = 1.8 MB — separate region or DDR3

Option A — Z-buffer in DDR3: Framebuffer RGB12 in the SRAM bus (fast, deterministic latency). Z-buffer in DDR3 — accessed in a tile-coherent pattern by the rasteriser, so DDR3 latency is amortised. The rasteriser processes tiles of (e.g.) 8×8 pixels; z-reads within a tile are spatially coherent and burst well from DDR3.

Option B — Packed 36-bit word (near pass only):

The 36-bit SRAM word can pack colour and depth together for the near pass:

  Bits [35:21] — 15-bit signed z-value (S15)
  Bit  [20]    — stencil / flag bit
  Bits [19:16] — alpha (4-bit)
  Bits [15:12] — R (4-bit)
  Bits [11:8]  — G (4-bit)
  Bits  [7:4]  — B (4-bit)
  Bits  [3:0]  — spare / extended alpha

This gives RGB4 colour + 15-bit z + stencil in a single 36-bit SRAM word — one read or write per pixel per pass, zero bus overhead. RGB4 is coarser than RGB12, but for depth-tested near geometry where colour precision matters, the colour can be stored in the framebuffer separately and the 36-bit word used purely as a z+stencil+alpha buffer alongside a full RGB12 framebuffer in another region of the same SRAM bank.
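
Packing and unpacking the Option B word is plain shift-and-mask work. The field positions below follow the bit layout above; the function names and the use of two's complement for the S15 field are assumptions made for the sketch.

```python
def pack_near_pixel(z, stencil, alpha, r, g, b, spare=0):
    """Pack S15 depth, stencil, alpha and RGB4 into one 36-bit word."""
    return ((z & 0x7FFF) << 21          # bits [35:21], two's complement S15
            | (stencil & 0x1) << 20     # bit  [20]
            | (alpha & 0xF) << 16       # bits [19:16]
            | (r & 0xF) << 12           # bits [15:12]
            | (g & 0xF) << 8            # bits [11:8]
            | (b & 0xF) << 4            # bits [7:4]
            | (spare & 0xF))            # bits [3:0]

def unpack_near_pixel(word):
    """Inverse of pack_near_pixel; returns (z, stencil, alpha, r, g, b, spare)."""
    z15 = (word >> 21) & 0x7FFF
    z = z15 - 0x8000 if z15 & 0x4000 else z15  # sign-extend the 15-bit field
    return (z, (word >> 20) & 0x1, (word >> 16) & 0xF,
            (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF,
            word & 0xF)
```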

Option C — RGB12 framebuffer in SRAM, Z16 in the other half of the SRAM bank:

A 4MB 36-bit SRAM holds:

Framebuffer at RGB12: 640×480 = 1.3MB, or 1280×720 = 4.0MB
Z-buffer at 16-bit: 640×480 = 0.6MB, or 1280×720 = 1.8MB

At 640×480: both fit in just under 2MB of a 4MB SRAM, leaving the other half free. At 1280×720: the two together come to about 5.7MB, which needs either a larger SRAM or DDR3 overflow.

The practical recommendation: 640×480 near z-buffer in SRAM alongside the framebuffer; 1280×720 z-buffer in DDR3 with tile-burst access.
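
The sizing behind this recommendation is simple enough to check by hand or in a few lines. The helper below is illustrative; it counts raw bits per pixel and takes the SRAM capacity as 4 × 1024 × 1024 bytes, which is an assumption about what "4MB" means here.

```python
SRAM_BYTES = 4 * 1024 * 1024  # assumed capacity of the "4MB" bank

def buffer_bytes(width, height, bits_per_pixel):
    """Storage needed for one buffer, in bytes."""
    return width * height * bits_per_pixel // 8

def fits_in_sram(width, height):
    """True if an RGB12 framebuffer plus a 16-bit z-buffer fit together."""
    total = buffer_bytes(width, height, 36) + buffer_bytes(width, height, 16)
    return total <= SRAM_BYTES
```

On these assumptions fits_in_sram(640, 480) is true and fits_in_sram(1280, 720) is false, which matches the split between SRAM and DDR3 suggested above.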

Pre-Baked LOD Chunks (Far Zone Geometry)

The world beyond distance D is divided into a regular chunk grid. Each chunk is pre-baked at multiple LOD levels — the geometry is simplified offline and stored as static triangle lists. At runtime, the game logic (on DeMon) or the FireStorm EE selects the appropriate LOD for each visible chunk and submits it to FireStorm's draw list.

  World chunk grid (top-down view):

  ┌─────┬─────┬─────┬─────┬─────┐
  │ 4,4 │ 4,3 │ 4,2 │ 4,3 │ 4,4 │  LOD 3 (lowest detail, horizon)
  ├─────┼─────┼─────┼─────┼─────┤
  │ 3,3 │ 3,2 │ 3,1 │ 3,2 │ 3,3 │  LOD 2
  ├─────┼─────┼─────┼─────┼─────┤
  │ 2,2 │ 2,1 │ CAM │ 2,1 │ 2,2 │  LOD 1 (highest far detail)
  ├─────┼─────┼─────┼─────┼─────┤
  │ 3,3 │ 3,2 │ 3,1 │ 3,2 │ 3,3 │  LOD 2
  └─────┴─────┴─────┴─────┴─────┘

  Distance ring → LOD level:
  D/4  to D/2:  LOD 0 (near pass — full detail, z-buffer)
  D/2  to D:    LOD 1 (far pass — medium detail, painter's)
  D    to 2D:   LOD 2 (far pass — low detail)
  2D   to ∞:    LOD 3 (far pass — very low detail or billboard)
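
Selecting a LOD from the ring table reduces to a handful of comparisons against multiples of D. The function name is hypothetical, and treating everything closer than D/2 (including distances below D/4, which the table does not list) as LOD 0 is an assumption.

```python
def lod_for_distance(distance, split_d):
    """Return the LOD level (0..3) for a chunk at the given distance."""
    if distance < split_d / 2:
        return 0   # near pass: full detail, z-buffered
    if distance < split_d:
        return 1   # far pass: medium detail
    if distance < 2 * split_d:
        return 2   # far pass: low detail
    return 3       # far pass: very low detail or billboard
```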

LOD baking process (offline or load-time):

Full-resolution mesh → quadric error simplification (standard mesh reduction)
Each LOD stored as a compact triangle list in DDR3 (indices + quantised vertices)
Normals and UVs quantised to fit in a compact vertex format
Sky-distant chunks can be baked to a single billboard texture — a flat quad at the horizon that looks correct from all viewing angles at that distance

Chunk streaming: only chunks within the view frustum are submitted to FireStorm. Frustum culling runs on the FireStorm EE in C++, intersecting the chunk grid against the camera frustum planes. Hidden chunks generate no draw calls. The visible list is sorted back-to-front and submitted as a single DMA draw list batch.
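
The cull-then-sort step can be sketched as below. For brevity it is shown in Python rather than the C++ the EE would run, it approximates each chunk by a bounding sphere, and it represents each frustum plane as (nx, ny, nz, d) with the inside on the positive side; all of these are assumptions for the example.

```python
def visible_chunks(chunks, planes, camera):
    """Return chunks inside the frustum, sorted back-to-front."""
    def inside(centre, radius):
        # Keep the chunk unless it lies wholly behind some plane.
        return all(nx * centre[0] + ny * centre[1] + nz * centre[2] + d >= -radius
                   for nx, ny, nz, d in planes)

    def distance_sq(centre):
        return sum((c - e) ** 2 for c, e in zip(centre, camera))

    kept = [ch for ch in chunks if inside(ch["centre"], ch["radius"])]
    # Farthest first, so later (nearer) chunks overwrite earlier ones.
    return sorted(kept, key=lambda ch: distance_sq(ch["centre"]), reverse=True)
```

The returned order is the order in which the far-pass draw list would be emitted.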

The Complete Frame

  Frame N (16.67ms at 60fps):
  ┌───────────────────────────────────────────────────────────────┐
  │  FireStorm EE (bare metal):                                  │
  │  ├─ Frustum cull chunk grid → visible far chunks (sorted)     │
  │  ├─ Sort near objects by state (texture, shader)              │
  │  ├─ Build ImGui draw list (UI overlay)                        │
  │  └─ DMA draw list → FireStorm via QSPI                        │
  │                                                               │
  │  FireStorm rasteriser (200MHz, SRAM/DDR3 framebuffer):        │
  │                                                               │
  │  PASS 1 — FAR ZONE (no z-buffer):                             │
  │  ├─ Draw skybox (fullscreen quad)                             │
  │  ├─ Draw far LOD chunks (sorted back-to-front)                │
  │  └─ Far pass complete — framebuffer contains full far scene   │
  │                                                               │
  │  PASS 2 — NEAR ZONE (z-buffer active):                        │
  │  ├─ Clear z-buffer (near region only)                         │
  │  ├─ Draw near world geometry (z-test + write)                 │
  │  ├─ Draw dynamic objects (z-test + write)                     │
  │  ├─ Draw alpha geometry (z-test, no write, back-to-front)     │
  │  ├─ Draw particles (z-test, additive blend)                   │
  │  └─ Draw UI / HUD (no z, screen-space, last)                  │
  │                                                               │
  │  DISPLAY READOUT:                                             │
  │  └─ HDMI timing generator scans completed framebuffer         │
  └───────────────────────────────────────────────────────────────┘

Audio DSP runs throughout on completely independent fabric. Audio and rasteriser working data is placed across BSRAM, the SRAM bus, and DDR3 by best fit, arbitrated where they share a bus.

FireStorm Vertex Format (3D)

A compact fixed-point vertex format for both passes:

  3D vertex (near pass — full precision):
  ├─ X: S16.8  (24-bit fixed-point clip-space x)
  ├─ Y: S16.8  (24-bit fixed-point clip-space y)
  ├─ Z: S15    (15-bit clip-space depth — maps to z-buffer)
  ├─ U: U12    (12-bit texture coordinate)
  ├─ V: U12    (12-bit texture coordinate)
  ├─ R, G, B:  4-bit each (vertex colour for Gouraud)
  └─ Total: 99 bits (~13 bytes), padded to 16 bytes per vertex

  LOD chunk vertex (far pass — reduced precision):
  ├─ X, Y, Z:  S8.4 each (12-bit fixed-point — sufficient at far distance)
  ├─ U, V:     U8 each (8-bit UVs — low-res textures at LOD 2+)
  ├─ R, G, B:  4-bit each
  └─ Total: 64 bits (8 bytes) per vertex, so 2× the geometry in the same bandwidth

The reduced far-pass vertex format means twice as many far-geometry triangles fit in the same DMA transfer budget — important since the far scene often has more surface area than the near scene even at lower polygon counts.
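As a rough illustration of the far-pass format, the sketch below packs one LOD chunk vertex into 8 bytes. The field widths come from the listing above; the field order within the 64-bit word, the big-endian byte order, the saturating conversion, and the function names are assumptions made for this example, not the documented FireStorm layout. S8.4 is read here as a 12-bit two's-complement value with 4 fractional bits.

```python
import struct

def to_fixed(value, frac_bits, total_bits):
    """Convert a float to two's-complement fixed point, saturating at the range limits."""
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    raw = max(lo, min(hi, round(value * (1 << frac_bits))))
    return raw & ((1 << total_bits) - 1)

def pack_far_vertex(x, y, z, u, v, r, g, b):
    """Pack one LOD chunk vertex into 8 bytes (field order is an assumption)."""
    word = 0
    for field, bits in (
        (to_fixed(x, 4, 12), 12),   # X: S8.4
        (to_fixed(y, 4, 12), 12),   # Y: S8.4
        (to_fixed(z, 4, 12), 12),   # Z: S8.4
        (u & 0xFF, 8), (v & 0xFF, 8),            # U, V: U8
        (r & 0xF, 4), (g & 0xF, 4), (b & 0xF, 4),  # R, G, B: 4 bits each
    ):
        word = (word << bits) | field
    return struct.pack(">Q", word)

# Bit budgets taken from the two format listings.
NEAR_BITS = 24 + 24 + 15 + 12 + 12 + 3 * 4   # 99 bits, padded to 16 bytes
FAR_BITS = 3 * 12 + 2 * 8 + 3 * 4            # 64 bits, exactly 8 bytes
```

With 16-byte near vertices and 8-byte far vertices, a fixed transfer budget carries twice as many far vertices, which is where the 2× figure comes from.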

Reference

The two-pass painter's / z-buffer split technique, LOD chunk baking, and the N64-era reasoning behind these decisions are documented in detail in James Lambert's 3D world renderer series for the Nintendo 64. The constraints are analogous: limited z-buffer precision, fixed-point arithmetic throughout, tile-based memory access patterns, and the need to render a convincing large-scale world at 60fps on constrained silicon.

The Ant64 applies the same technique on more capable hardware: FireStorm has more rasteriser throughput and faster on-chip memory (BSRAM) than the N64's RCP, and the 36-bit SRAM architecture avoids the N64's shared framebuffer / z-buffer contention entirely. The technique still applies because the underlying problem, the way a perspective z-buffer concentrates its precision near the near plane, is independent of hardware capability.
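The precision argument can be made concrete with a small calculation. The sketch below uses the standard perspective depth mapping, depth(z) = far·(z − near) / (z·(far − near)), and estimates how much world-space distance one z-buffer step covers at a given distance. The 15-bit depth comes from the vertex format above (all 15 bits are assumed to be usable depth levels); the near and far plane values in the usage note are hypothetical and chosen only to show the trend.

```python
def depth_value(z, near, far):
    """Standard perspective depth: 0.0 at the near plane, 1.0 at the far plane."""
    return far * (z - near) / (z * (far - near))

def depth_step(z, near, far, bits=15):
    """Approximate world-space distance covered by one z-buffer step at distance z."""
    levels = (1 << bits) - 1
    # Derivative of depth_value with respect to z: precision falls off as 1/z^2.
    slope = far * near / (z * z * (far - near))
    return 1.0 / (levels * slope)
```

For example, with a single z-buffer spanning near = 1 to far = 1000, `depth_step(500, 1, 1000)` is several world units, far too coarse to resolve overlapping surfaces. Restricting the z-buffer to a near zone ending at 50 units, `depth_step(50, 1, 50)` stays below a tenth of a unit even at the zone's far edge, which is why the far zone is left to painter's ordering.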

22. Light Synth — Audio-Reactive Video

The Ant64 integrates audio-reactive video output driven directly by the synthesis engine — a light synth in the tradition of the Atari ST demo scene, early VJ culture, and the Edirol CG-8 Visual Synthesizer. This is a completely empty market in hardware synthesis: no current production synthesizer has integrated video output.

All four modes render into FireStorm's compositor as standard layers, benefiting from the full layer stack — CRT simulation, Copper effects, alpha blending, and all output paths simultaneously.

Mode 1 — Audio-Reactive Visualiser

Waveform and spectrum displays driven by live audio output from FireStorm:

Classic oscilloscope view — audio waveform drawn in real time as a vector trace
FFT spectrum analyser — frequency domain magnitude as a bar or line display
Lissajous (XY) display — L/R audio channels on X/Y axes, generating complex Lissajous figures from stereo synthesis. Especially striking with chorus and FM — the phase relationships produce rotating, evolving geometric forms.
Spectrogram — the Page D STFT display rendered as a scrolling layer
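The Lissajous mapping is simple enough to sketch directly. The Python below is an illustration, not FireStorm's implementation: `lissajous_points` and its parameters are hypothetical, and samples are assumed to be normalised to the range -1.0 to 1.0.

```python
import math

def lissajous_points(left, right, width, height):
    """Map stereo samples in [-1.0, 1.0] to pixel coordinates: L drives X, R drives Y."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    points = []
    for l, r in zip(left, right):
        x = round(cx + max(-1.0, min(1.0, l)) * cx)
        y = round(cy - max(-1.0, min(1.0, r)) * cy)   # screen Y grows downward
        points.append((x, y))
    return points

# Two sines a quarter cycle apart trace a circle; identical channels trace a diagonal.
n = 64
left = [math.sin(2 * math.pi * i / n) for i in range(n)]
right = [math.sin(2 * math.pi * i / n + math.pi / 2) for i in range(n)]
circle = lissajous_points(left, right, 257, 257)
```

Any effect that shifts the phase between the two channels, such as chorus, changes the shape traced, which is the source of the rotating figures described above.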

Mode 2 — Synth Parameter Visualiser

Each playing voice rendered as a distinct visual element:

Audio parameter	Visual mapping
Voice pitch	Screen position (low = bottom, high = top) or hue
Amplitude / VCA envelope	Element size or brightness
LFO modulation	Oscillating motion / pulsing
Filter cutoff	Colour temperature (cool = closed, warm = open)
Filter resonance	Shape sharpness / saturation
Note velocity	Impact size at note-on
Polyphony	Up to 128 simultaneous visual elements
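A minimal sketch of how a few rows of this table might translate into code follows. The `Voice` fields, the normalised ranges, the default screen height, and the blue-to-red hue ramp used for colour temperature are all assumptions made for illustration; the actual mappings are not specified here.

```python
import colorsys
from dataclasses import dataclass

@dataclass
class Voice:
    note: int          # MIDI note number, 0-127
    amplitude: float   # envelope level, 0.0-1.0
    cutoff: float      # normalised filter cutoff, 0.0 (closed) to 1.0 (open)

def voice_to_element(v, screen_h=1080, max_radius=64):
    """Map one voice to (y, radius, (r, g, b)) following the table above."""
    y = round((1 - v.note / 127) * (screen_h - 1))   # low notes at the bottom
    radius = round(v.amplitude * max_radius)          # louder = larger
    hue = (2 / 3) * (1 - v.cutoff)                    # blue when closed, red when open
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return y, radius, (round(r * 255), round(g * 255), round(b * 255))
```

Calling `voice_to_element` once per active voice each frame gives one visual element per voice, up to the polyphony limit.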

Mode 3 — Light Synth / Generative Video

Procedural visual synthesis driven by MIDI and sequencer events:

Each note trigger generates a visual event — flash, shape spawn, particle burst
Pattern and rhythm of the sequencer drives visual rhythm
Synthesis parameters modulate visual parameters in real time
Programmable via AntOS scripting bindings — audio data feeds (voice states, FFT bins, MIDI events, sequencer position) available as inputs to video scripts
The scripting API exposes the same audio data that Page D uses for spectral analysis — video scripts can respond to spectral content, not just note events
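The note-trigger behaviour can be modelled as a small event list that is advanced once per video frame. This is a Python sketch of the idea only; `LightSynth`, `VisualEvent`, the linear fade, and the half-second default decay are hypothetical and are not part of the AntOS API.

```python
from dataclasses import dataclass

@dataclass
class VisualEvent:
    x: float
    y: float
    size: float
    age: float = 0.0

class LightSynth:
    def __init__(self, decay_s=0.5):
        self.decay_s = decay_s
        self.events = []

    def on_note(self, note, velocity):
        """Note-on spawns one event: pitch sets height, velocity sets impact size."""
        self.events.append(VisualEvent(x=0.5, y=1 - note / 127, size=velocity / 127))

    def tick(self, dt):
        """Advance one video frame: age every event and drop those fully decayed."""
        for e in self.events:
            e.age += dt
        self.events = [e for e in self.events if e.age < self.decay_s]

    def brightness(self, e):
        """Linear fade from full brightness at spawn to zero at decay_s."""
        return e.size * (1 - e.age / self.decay_s)
```

Because events are spawned only by note triggers, the sequencer's rhythm carries over to the visuals without any extra synchronisation logic.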

Mode 4 — VJ Tool

Pre-rendered or procedurally generated visual clips triggered by MIDI notes
BPM-synced transitions and cuts
Visual patterns stored in DBFS, loaded on demand
MIDI CC → visual parameter control (opacity, colour, zoom, position)
Drivable externally from any MIDI controller or sequencer
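Two of the items above reduce to simple arithmetic: scaling a 7-bit MIDI CC value into a parameter range, and converting tempo into a frame count for BPM-synced cuts. The sketch below is illustrative Python; the controller numbers in `CC_MAP`, the parameter ranges, and the function names are hypothetical, and a 60 fps output is assumed.

```python
def cc_to_param(value, lo, hi):
    """Scale a 7-bit MIDI CC value (0-127) linearly into [lo, hi]."""
    value = max(0, min(127, value))
    return lo + (hi - lo) * value / 127

def frames_per_beat(bpm, fps=60):
    """Video frames in one beat; cuts scheduled on multiples of this stay in tempo."""
    return fps * 60 / bpm

# Hypothetical CC assignments; the controller numbers are illustrative only.
CC_MAP = {
    1: ("opacity", 0.0, 1.0),
    74: ("zoom", 0.5, 4.0),
}

def apply_cc(params, cc, value):
    """Update the visual parameter bound to this controller, if any."""
    if cc in CC_MAP:
        name, lo, hi = CC_MAP[cc]
        params[name] = cc_to_param(value, lo, hi)
    return params
```

At 120 BPM and 60 fps a beat lasts 30 frames, so a cut every bar of 4/4 lands every 120 frames.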

Synthesis Analogy

Audio concept	Visual equivalent
Oscillator frequency	Shape oscillation / rotation speed
Filter cutoff	Colour saturation / blur radius
Envelope	Visual event size and decay
LFO	Periodic motion (wave, pulse, spin)
Reverb	Trail / persistence / echo of visual events
Chorus	Shape duplication with spatial offset
Waveform type	Visual pattern morphology
Note velocity	Brightness and impact scale
Polyphony	Simultaneous independent visual voices

AntOS Scripting API

The audio system makes its live data available to video scripts running on the little core. Scripts receive callbacks on audio events and can read live audio state:

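-- Register a callback for note events.
-- spawn_particle, note_to_y, draw_bar and draw_voice_element are assumed to be helpers defined elsewhere in the script.
video.on_note(function(voice, note, velocity)
    spawn_particle(note_to_y(note), velocity / 127)  -- scale MIDI velocity (0-127) to 0..1
end)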

-- Read live FFT data from the audio engine
local bins = audio.get_fft_bins(512)
for i, magnitude in ipairs(bins) do
    draw_bar(i, magnitude)
end

-- Read voice state
local voices = audio.get_active_voices()
for _, v in ipairs(voices) do
    draw_voice_element(v.pitch, v.amplitude, v.filter_cutoff)
end

Real-time audio processing remains entirely within FireStorm (the FireStorm EE and the audio DSP fabric). Scripts running on the DeMon read audio state asynchronously: they cannot affect audio timing or introduce latency into the audio pipeline.