Ant64 Display System — More information...


Memory Architecture

The Ant64 uses a federated memory architecture — BSRAM (on-chip, 380MHz), Graphics SRAM (4.5MB, 36-bit pipeline, shared with audio DSP), EE Code SRAM (4.5MB, 36-bit pipeline, EE exclusive), FPGA flash (read-only system assets), and FireStorm DDR3 (bulk store, blitter and audio DSP). The SG2000 has its own separate 512MB DDR3 and accesses FireStorm memory through the QSPI FRAM bridge, MIPI TX (bulk in), and LVDS (bulk out). Pulse and DeMon each have their own independent FRAM interfaces to FireStorm.

For full details — chip specifications, 36-bit format table, bus isolation rationale, FRAM programming model, inter-subsystem data paths, and the complete memory map — see the Memory Architecture reference.


1. Output Clocking & HBR2 Specification

The primary display output is DisplayPort HBR2, driven from the GoWin GW5AT-138 (FireStorm) hardware transceivers.

Parameter Value
Reference oscillator 135 MHz (shared with Colony Connection)
Lane count 4
Lane rate 5.4 Gbps (135 MHz × 40)
Raw aggregate 21.6 Gbps
Effective after 8b/10b 17.28 Gbps
Max pixel clock at 24bpp ~720 MHz

The 135 MHz oscillator is shared between FireStorm's two transceiver banks. Bank 1 drives DP output; Bank 2 drives Colony Connection (inter-machine network) on the Ant64 and Ant64C — the Ant64S does not include Colony Connection. Both PLLs derive from the same reference, eliminating inter-subsystem clock domain issues.

Important: Standard 4K/60 with full CEA-861 blanking requires a ~594 MHz pixel clock. At 30bpp that is ~17.82 Gbps, which slightly exceeds HBR2's 17.28 Gbps effective ceiling (at 24bpp it is ~14.26 Gbps, which fits). The 4K/60 baseline therefore targets CVT-RB (Reduced Blanking) timing, bringing the pixel clock to ~533 MHz, which is comfortably within budget at either depth.
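
The link budget in the table can be checked with a few lines of arithmetic. This is an illustrative sketch only: the function and constant names are invented for the example, and it assumes 8b/10b coding is the only overhead (no allowance for DP framing or audio).

```python
# Illustrative HBR2 link-budget arithmetic (all names are hypothetical).
REF_MHZ = 135      # reference oscillator
LANE_MULT = 40     # 135 MHz x 40 = 5400 Mbps per lane
LANES = 4

def effective_mbps() -> int:
    """Payload bandwidth after 8b/10b coding (8 payload bits per 10 line bits)."""
    raw = REF_MHZ * LANE_MULT * LANES      # 21600 Mbps raw aggregate
    return raw * 8 // 10                   # 17280 Mbps

def max_pixel_clock_mhz(bpp: int) -> float:
    """Highest pixel clock the link can carry at the given bits per pixel."""
    return effective_mbps() / bpp

def fits(pixel_clock_mhz: float, bpp: int) -> bool:
    """True if the pixel stream fits inside the effective link bandwidth."""
    return pixel_clock_mhz * bpp <= effective_mbps()
```

For example, `fits(533.25, 24)` is true for the CVT-RB target, while `fits(594, 30)` is false, which is the ~17.82 Gbps case.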


2. VRR (Adaptive Sync)

VRR is implemented by stretching the vertical blanking interval — no special protocol layer is needed. This is how DisplayPort Adaptive Sync works at the protocol level.

Since the Ant64's internal render target is small (see Section 4), frame completion times are highly predictable. VRR is therefore useful for:

  • Smooth sub-60 Hz output when the application is doing heavy work
  • Precise cadence locking: PAL (50 Hz), NTSC (59.94 Hz), or arbitrary rates

On the FireStorm side there is no minimum refresh rate: blanking can be stretched indefinitely.
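
Stretching vertical blanking amounts to choosing a larger total line count at a fixed pixel clock. The sketch below shows that relationship. The 4K CVT-RB totals used in the usage note (4000 pixels per line, 2222 lines, 533.25 MHz) are commonly quoted figures and an assumption here, not values taken from this document; the function names are hypothetical.

```python
# Sketch: refresh rate as a function of total lines per frame (names are hypothetical).

def refresh_hz(pixel_clock_hz: float, h_total: int, v_total: int) -> float:
    """Refresh rate for a frame of h_total x v_total clocks (active + blanking)."""
    return pixel_clock_hz / (h_total * v_total)

def v_total_for(pixel_clock_hz: float, h_total: int, target_hz: float) -> int:
    """Total lines per frame that give target_hz when only vertical blanking is stretched."""
    return round(pixel_clock_hz / (h_total * target_hz))
```

With the assumed totals, `refresh_hz(533_250_000, 4000, 2222)` is about 60 Hz, and `v_total_for(533_250_000, 4000, 50)` gives 2666 lines, i.e. 444 extra blanking lines for a PAL-style 50 Hz cadence.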

Maximum Refresh Rates by Resolution

Resolution Max VRR (approx) Notes
3840×2160 (4K) ~75 Hz CVT-RB blanking
2560×1440 (1440p) ~144 Hz Non-integer scale from native — see below
1920×1080 (1080p) ~240 Hz

3. Output Paths

Three external display outputs plus one internal video bus, each driven from the same FireStorm composite independently:

Output Interface Signal type Notes
Primary DisplayPort (Bank 1) Digital — SERDES HBR2, 4K primary target
Secondary HDMI Digital — FPGA I/O Broad TV/monitor compat
Legacy VGA Analogue — DAC CRT/retro monitor support
Internal LVDS → Sub-LVDS → SG2000 Digital — LVDS Inter-chip video bus

Each output path can have independent simulation modes applied (see Section 8). For example, DP could run aperture grille + bloom, HDMI simple scanlines, and VGA with no effect (analogue signal softness does the work naturally). The LVDS internal path carries a clean unmodified pixel stream — no CRT simulation applied.

LVDS Inter-Chip Video Bus (FireStorm → SG2000)

FireStorm sends its composited pixel stream to the SG2000 via an LVDS → Sub-LVDS link. The SG2000 receives this into its hardware Video Output (VO) compositor, which adds a CPU-generated graphics overlay before driving an embedded panel or feeding the hardware encoder.

Physical layer: GoWin GW5AT-138 LVDS TX → bridge chip → SG2000 Sub-LVDS RX. The bridge handles electrical translation between LVDS voltage swing (~350mV differential) and Sub-LVDS (~150mV differential), and common mode shift (~1.2V → ~0.9V). Bridge part selection is a parking-lot item (TI DS90 series is a candidate).

Resolution: Selects from the same H and V tables as all other outputs. 480×270 (Standard) is the natural default. No CRT simulation applied — clean composite only.

SG2000 VO compositor:

The SG2000 contains a hardware Video Output (VO) module with two independent layers that it composites internally before driving the panel:

  • VHD0 (video layer) — receives the FireStorm LVDS feed as its video input, up to 1080p@60
  • G0 (graphics layer) — a hardware overlay the SG2000 CPU renders into independently

The VO hardware composites G0 over VHD0 in hardware at zero CPU cost once configured, and outputs the combined result via MIPI DSI to the embedded panel. This is a simple fixed compositor — one video layer, one graphics overlay — but it is entirely sufficient for adding OS UI, boot status, touch menus, or system notifications over the FireStorm content on the embedded display only. FireStorm's external outputs (DP, HDMI, VGA) are completely unaffected.

FireStorm composite
    ↓ LVDS
SG2000 Sub-LVDS RX → VHD0 (video layer)
                              ↓
SG2000 CPU renders UI → G0 (graphics overlay)
                              ↓
               VO hardware compositor
                              ↓
                    MIPI DSI → embedded panel
                              │
                              └→ MJPEG encoder → storage / network

Use cases:

  • Embedded panel with UI overlay — VO composites VHD0 (FireStorm content) + G0 (SG2000 UI layer) and drives a small built-in panel via MIPI DSI
  • Hardware MJPEG compression — the SG2000 captures the VHD0 input directly into its hardware MJPEG encoder for recording or network streaming, with no CPU involvement
  • Computer vision / TPU — the SG2000 TPU processes the VHD0 input frame for object detection, gesture recognition, or physics processing (see TPU notes)
  • Screenshot / frame capture — single frame grab from the live display output

VGA — Special Properties

VGA is analogue with a 5-bit-per-channel DAC, giving 32 levels per channel and 32³ = 32,768 possible colours (effectively 15-bit colour at the output stage). The internal palette RAM stores RGBA32 throughout — the 5-bit limitation is purely a property of the VGA output path. Truncation happens at the very last stage:

palette R[7:0] → DAC R[4:0]  (bits [7:3], discard [2:0])
palette G[7:0] → DAC G[4:0]
palette B[7:0] → DAC B[4:0]
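
The truncation above is a plain right shift by three bits. A minimal sketch, with hypothetical function names:

```python
# Sketch of the palette-to-DAC truncation: keep bits [7:3], discard bits [2:0].

def to_dac5(channel8: int) -> int:
    """Reduce an 8-bit palette channel to the 5-bit VGA DAC value."""
    return (channel8 & 0xFF) >> 3

def palette_to_vga(r: int, g: int, b: int) -> tuple:
    """Apply the same truncation to all three colour channels."""
    return (to_dac5(r), to_dac5(g), to_dac5(b))
```

Palette values 0 to 7 all map to DAC level 0, 8 to 15 map to level 1, and so on up to level 31.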

The analogue signal path gives VGA unique additional characteristics:

  • Natural bandwidth limiting from cable and monitor input stage provides free horizontal softness, partially simulating limited CRT bandwidth
  • DAC modulation for scanline effects costs zero framebuffer bandwidth — brightness can be modulated at the analogue stage per row by a row-counter output from the FPGA
  • Composite colour bleed simulation via a short FIR on Cb/Cr channels (after RGB→YCbCr) replicates the colour bandwidth limitation of composite video
  • The analogue softness means dithered 5-bit output blends more smoothly than on a digital display — the 32,768-colour ceiling is less visible in practice than on paper
  • The "old" output may paradoxically produce the most authentic retro CRT feel

Colony Connection — Ant64 and Ant64C

Colony Connection is available on the Ant64 (Power) and Ant64C (Creative) models. The Ant64S (Starter) uses a smaller GoWin 60k FPGA with fewer transceiver resources and does not include Colony or DisplayPort.

The Ant64 and Ant64C share essentially the same motherboard with selective connector population. The only functional difference between them is that the Ant64C adds DIN MIDI, optical audio, and Ethernet connectors. Both have DisplayPort, Colony IO ports, and Colony RX ports fitted.

Both Ant64 and Ant64C have:

  • Two full bidirectional Colony IO ports (IO and IO2) — 2 TX + 2 RX lanes each at up to 7.83Gbps per lane
  • Two receive-only Colony RX ports (RX and RX2) from the spare Bank 2 RX lanes: the Ant64C has connectors fitted, the Ant64 has unpopulated pads only

The RX and RX2 ports are inherently receive-only — peripherals can inject data into the Colony network but cannot read from it, making them architecturally secure input-only nodes.

The primary use case for the RX ports is an HDMI frame grabber — a small standalone Colony peripheral with an HDMI receiver IC that captures video from any HDMI source (games console, PC, camera) and streams it into the Colony network. FireStorm receives that stream as a Colony video layer and can process it through the full layer system — Copper effects, chroma keying, blending with generated content — outputting the result on any external display output.

With two RX ports, two independent HDMI frame grabbers can feed the machine simultaneously, giving FireStorm two live video streams as separate layers:

  • Picture-in-picture — two HDMI sources composited on screen simultaneously
  • Side-by-side comparison
  • Chroma key one stream over the other
  • Copper-driven wipe or blend transitions between streams
  • One stream as background, FireStorm-generated sprites and tilemaps over the top, second stream as overlay

Since the frame grabbers connect via Colony, they don't need to be physically attached to the machine — they can sit anywhere in the Colony string and their video streams are forwarded over the network.

See ant64.com/colony for full Colony Connection documentation.


4. Native Resolution System

FireStorm renders internally at a low native resolution which is then pixel-replicated to the output resolution. This eliminates most framebuffer bandwidth pressure and provides the scale budget needed for CRT simulation effects.
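
Pixel replication is nearest-neighbour scaling by whole-number factors. The following software model is only a sketch of the idea (the hardware does this on the fly per scanline, and the function name is hypothetical):

```python
# Software model of integer pixel replication (hypothetical helper, not the hardware design).

def replicate(native, x_scale: int, y_scale: int):
    """Expand each native pixel into an x_scale x y_scale block of identical pixels."""
    out = []
    for row in native:
        wide = [p for p in row for _ in range(x_scale)]    # repeat each pixel across
        out.extend(list(wide) for _ in range(y_scale))      # repeat each row down
    return out
```

A 480×270 frame passed through `replicate(frame, 8, 8)` yields 3840×2160 output pixels.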

Primary Native Resolutions

The Ant64 has two primary named native resolutions, directly echoing the Amiga's low res / high res relationship:

480×270 — Standard

  • Closely matches the Amiga's low res pixel size at its respective output resolution (Amiga low res: 320×256 PAL)
  • Exactly 1/8 of 4K in each axis (480×8 = 3840, 270×8 = 2160)
  • Perfect ×4 integer scale to 1080p
  • The natural home for games, demos, and content that wants the classic retro pixel feel

960×540 — Hires

  • Exactly double 480×270 in both axes — the same relationship as Amiga low res to high res
  • ×4 to 4K, ×2 to 1080p
  • Sharp UI, detailed backgrounds, text rendering, productivity use
  • The natural home for applications where pixel density matters more than the retro aesthetic

These two resolutions are peers, not a hierarchy. Which one is "right" depends entirely on what you're making. A game running at Standard gets the full 8×8 CRT simulation block budget on 4K output. An application running at Hires gets four times the pixel count (twice the canvas in each axis) with a still-respectable 4×4 block budget.

All other valid resolutions from the H and V tables are equally available — Standard and Hires are named reference points, not constraints.

The Two Independent Axes

The horizontal and vertical scale factors are completely independent. Any horizontal width from the H table can be paired with any vertical height from the V table to form a valid native resolution.


5. Valid Horizontal Native Widths

Must divide GCD(1920, 3840) = 1920 for integer scaling to both outputs. 3840 is a special case — native 4K width, DP output only (no integer downscale to 1080p).

Width →1080p scale →4K scale Notes
3840 — (DP only) ×1 Native 4K — tilemap/HAM24 DP output only
1920 ×1 ×2 Full HD native
960 ×2 ×4 ← Ant64 Hires
640 ×3 ×6 VGA, Amiga hires, DOS
480 ×4 ×8 ← Ant64 Standard
384 ×5 ×10 Atari ST low res
320 ×6 ×12 SNES, Mega Drive, DOS Mode 13h
240 ×8 ×16 Half-width Ant64, fat pixel mode
192 ×10 ×20
160 ×12 ×24 GBA, Atari Lynx
128 ×15 ×30 ZX81, early micros
120 ×16 ×32
96 ×20 ×40
80 ×24 ×48 BBC Micro MODE 0, text columns
64 ×30 ×60 Commodore PET
60 ×32 ×64
48 ×40 ×80
40 ×48 ×96 ZX Spectrum text cols, BBC MODE 1
32 ×60 ×120
24 ×80 ×160
16 ×120 ×240 Impractical
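
The scale columns follow directly from divisibility. A small sketch (hypothetical function name) that reproduces them for any candidate width:

```python
# Sketch: integer scale factors for a native width, or None where the ratio is not whole.

def h_scales(width: int):
    """Return (scale to 1920-wide output, scale to 3840-wide output)."""
    to_1080p = 1920 // width if 1920 % width == 0 else None
    to_4k = 3840 // width if 3840 % width == 0 else None
    return to_1080p, to_4k
```

`h_scales(480)` returns `(4, 8)`, and `h_scales(3840)` returns `(None, 1)`, matching the DP-only special case. The same function with 1080 and 2160 substituted covers the vertical table.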

6. Valid Vertical Native Heights

Must divide GCD(1080, 2160) = 1080 for integer scaling to both outputs. 2160 is a special case — native 4K height, DP output only.

Height →1080p scale →4K scale Notes
2160 — (DP only) ×1 Native 4K — tilemap/HAM24 DP output only
1080 ×1 ×2 Full HD native
540 ×2 ×4 ← Ant64 Hires
360 ×3 ×6
270 ×4 ×8 ← Ant64 Standard
216 ×5 ×10
180 ×6 ×12
135 ×8 ×16
120 ×9 ×18
108 ×10 ×20
90 ×12 ×24
72 ×15 ×30
60 ×18 ×36
54 ×20 ×40
45 ×24 ×48
40 ×27 ×54
36 ×30 ×60
30 ×36 ×72
27 ×40 ×80
24 ×45 ×90
18 ×60 ×120
15 ×72 ×144
12 ×90 ×180
10 ×108 ×216
9 ×120 ×240 Impractical

7. Notable Pixel Aspect Ratio Modes

By pairing different entries from the H and V tables, a wide range of pixel aspect ratios (PAR) can be constructed: the PAR is simply the X scale divided by the Y scale.

Fat Pixels — PAR 2:1 (pixel twice as wide as tall)

Classic multicolour modes — C64 multicolour, many arcade games, CGA multicolour.

Notable examples at 4K:

Native W Native H →4K (X×Y) Feel
240 270 ×16 × ×8 Half-width Ant64 — C64 multicolour
160 180 ×24 × ×12 Fat sprite machines
320 270 ×12 × ×8 Wide multicolour (PAR 3:2 rather than 2:1)

Tall Pixels — PAR 1:2 (pixel twice as tall as wide)

BBC Micro MODE 0, some teletext-style displays, and Tate mode vertical arcade games.

Native W Native H →4K (X×Y) Feel
960 270 ×4 × ×8 Cinematic widescreen, panoramic
480 135 ×8 × ×16 Very tall pixels, distinctive
270 480 ×8 × ×8 (per half-screen) Tate mode — portrait arcade on landscape screen

The 270×480 mode is the natural home for Tate mode. At 4K output, two 270×480 layers with Tate simulation enabled fit side by side in a perfect 50/50 split — 1920 output pixels each. Two independent vertical arcade games on one screen, each with rotated CRT scanlines. See Section 8 (Tate Mode).

4:3 Content on 16:9 Screen — PAR 4:3

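Pixel AR exactly compensates for screen AR, so a 4:3 native grid fills the 16:9 screen with no borders. Geometry is undistorted for content drawn with 4:3 pixels in mind: a shape that is circular on screen spans 3 native pixels horizontally for every 4 vertically.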

Check: native 480×360 × PAR 4:3 → (480×4)/(360×3) = 1920/1080 = 16:9 ✓

Native →4K (X×Y) Retro equivalent
480 × 360 ×8 × ×6 Classic VGA-era 4:3
320 × 240 ×12 × ×9 SNES, PlayStation, DOS, CPS2 arcade (4K only: 240 does not divide 1080)
240 × 180 ×16 × ×12 Low-res
160 × 120 ×24 × ×18 GBA-ish
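
The PAR of any mode in these tables is the ratio of its X and Y scale factors. A sketch for 4K output, with a hypothetical function name:

```python
# Sketch: pixel aspect ratio (width:height of one native pixel) at 3840x2160 output.
from fractions import Fraction

def par_at_4k(native_w: int, native_h: int) -> Fraction:
    """Return X scale / Y scale, rejecting modes that do not integer-scale to 4K."""
    x_scale, x_rem = divmod(3840, native_w)
    y_scale, y_rem = divmod(2160, native_h)
    if x_rem or y_rem:
        raise ValueError("native resolution does not integer-scale to 4K")
    return Fraction(x_scale, y_scale)
```

For instance, `par_at_4k(240, 270)` is 2 (fat pixels), `par_at_4k(960, 270)` is 1/2 (tall pixels), and `par_at_4k(480, 360)` is 4/3.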

Famous Resolutions That Don't Integer Scale to 1080p/4K

These iconic resolutions have widths or heights that do not divide 1920 or 1080, and therefore cannot be integer-scaled cleanly:

Resolution Problem Famous uses
256 wide 1920÷256 = 7.5 NES, ZX Spectrum
512 wide 1920÷512 = 3.75 Amiga lo-res
224 tall 1080÷224 = 4.82 NES, SNES active area
240 tall 1080÷240 = 4.5 NTSC standard, NES, SNES
200 tall 1080÷200 = 5.4 DOS/CGA/EGA, C64
192 tall 1080÷192 = 5.625 ZX Spectrum, Master System

The closest clean Ant64 equivalent to the common NES/SNES feel would be 240 wide × 270 tall (fat pixel, ×16×8 to 4K).


8. CRT Simulation Effects

Applied in the FireStorm output pipeline after pixel replication, before the output encoder. Different effects can be enabled per output path independently — DP, HDMI, and VGA can each run a different simulation mode simultaneously with no interaction between paths.

Scale Budget Per Output

The fidelity of CRT simulation is directly proportional to the scale factor — more output pixels per native pixel means more room to model the phosphor structure. At 8×8 the simulation can be highly authentic; at 4×4 basic effects work well; at 2×2 only crude scanline darkening is possible.

Output Target res X scale Y scale Pixel block Simulation quality
DisplayPort 3840×2160 ×8 ×8 8×8 Highest — full phosphor simulation
HDMI 1920×1080 ×4 ×4 4×4 Good — scanlines + basic mask
HDMI 3840×2160 ×8 ×8 8×8 Highest
VGA 1920×1080 ×4 ×4 4×4 Good — analogue path adds natural softness
VGA 960×540 ×2 ×2 2×2 Basic — scanline only

Primary simulation targets are 4K DP and 4K HDMI. The VGA analogue signal path provides free horizontal softness that partially substitutes for the phosphor mask simulation at lower scale factors.

Pipeline Ordering

The full simulation pipeline per output path. Bloom runs at native resolution before pixel replication; every later stage runs on the replicated output pixels:

Native pixel (from framebuffer / tilemap / sprite composite)
    ↓
[1] Bloom pre-pass       — separable blur, models phosphor glow
    ↓
[2] Pixel replicator     — expands native pixel to H_scale × V_scale block
    ↓
[3] Row brightness mask  — scanline simulation (Y axis)
    ↓
[4] Column brightness mask — pixel boundary darkening (X axis)
    ↓
[5] Phosphor mask        — RGB aperture pattern (X+Y combined)
    ↓
[6] Final multiply/blend
    ↓
Output encoder (DP / HDMI / VGA DAC)

Bloom is applied before replication because it operates on native-resolution pixels — blurring the expanded block would give wrong results. All other stages operate on the replicated output pixels.

Scanline Simulation (Y axis) — Stage 3

A brightness multiplier applied per row within each logical pixel block, simulating the dark gaps between CRT electron beam scan passes. At 8× Y scale, 8 output rows represent one native pixel height — the mask profile determines how many are lit and at what brightness.

8× Y scale profiles (one value per output row within the block):

Profile Row brightnesses [0..7] Fill % Reference
TV Thick 100,100,100,100,0,0,0,0 50% C64 / Spectrum on domestic SCART TV
TV Soft 100,100,100,100,100,60,20,60 80% BBC Micro on domestic TV
Monitor Sharp 100,100,100,100,100,100,30,0 ~79% Amiga on Philips CM8833
Arcade 100,100,100,100,100,100,100,20 90% JAMMA arcade CRT
PVM 100,100,100,100,100,100,80,40 90% Sony PVM broadcast monitor
Off 100,100,100,100,100,100,100,100 100% Flat LCD — no simulation

4× Y scale profiles (4 rows per native pixel):

Profile Row brightnesses [0..3] Fill %
TV 100,100,0,0 50%
Monitor 100,100,100,30 ~83%
Arcade 100,100,100,80 ~95%
Off 100,100,100,100 100%

The profile is stored as an 8-entry LUT (one byte per row) in BSRAM — trivially small. The active profile selects which LUT is used. At 4× scale only the first 4 entries are used.
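
The row mask lookup can be modelled as follows. This is a sketch under assumptions: the LUT entries are kept as percentages for readability (the hardware would hold one byte per row), only two of the profiles are included, and all names are hypothetical.

```python
# Sketch of the per-row scanline mask (percent values; hypothetical names).
PROFILES_8X = {
    "TV Thick":      [100, 100, 100, 100, 0, 0, 0, 0],
    "Monitor Sharp": [100, 100, 100, 100, 100, 100, 30, 0],
    "Off":           [100] * 8,
}

def row_gain(profile: str, out_row: int, y_scale: int = 8) -> int:
    """Brightness percentage for an output row; the row's position in its block indexes the LUT."""
    return PROFILES_8X[profile][out_row % y_scale]

def apply_gain(value: int, gain_pct: int) -> int:
    """Scale an 8-bit channel value by a percentage gain."""
    return value * gain_pct // 100

def fill_percent(profile: str) -> float:
    """Average brightness of the profile, i.e. the Fill % column."""
    lut = PROFILES_8X[profile]
    return sum(lut) / len(lut)
```

`fill_percent("TV Thick")` gives 50.0 and `fill_percent("Monitor Sharp")` gives 78.75, in line with the table.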

Phosphor Mask Simulation (X+Y) — Stage 5

Real CRT phosphors are shaped apertures arranged in repeating patterns — not square pixels. Three mask types are supported, each representing a different physical display technology.

Shadow Mask — used in most consumer TVs and many monitors. Phosphor triads arranged in a triangular dot pattern. The repeat unit tiles at approximately 3 columns × 2 rows. At 8× X scale, roughly two full RGB triads fit across one native pixel width:

Row 0: R . G . B . R .    (dots at positions 0,2,4,6)
Row 1: . G . B . R . G    (dots offset by 1)
Row 2: B . R . G . B .    (row 0 shifted right by 2)
Row 3: . R . G . B . R    (row 1 shifted)

Each dot position is full brightness for that channel; off-dot positions blend neighbouring colours at reduced brightness (~30–40%) to simulate phosphor bleed. The result is a warm, slightly soft look — the default "TV" aesthetic.

Aperture Grille — Sony Trinitron / Mitsubishi Diamondtron. Vertical phosphor stripes with no horizontal structure except thin damper wires. At 8× X scale with 24bpp content:

Cols 0,1: Red channel full, G+B reduced (~20%)
Cols 2,3: Green channel full, R+B reduced (~20%)
Cols 4,5: Blue channel full, R+G reduced (~20%)
Cols 6,7: Red channel full (repeat)

Damper wires appear as faint (~85% brightness) horizontal bands at approximately 1/3 and 2/3 of the screen height, which is around output rows 720 and 1440 at 4K. These are applied as a global Y-position modulation separate from the per-block row mask.

The aperture grille look is sharper and more saturated than shadow mask — the "Trinitron look" strongly associated with high-quality retro computing displays (Amiga, Mac, workstations).

Slot Mask — common in 90s PC monitors. Rectangular apertures in a brick pattern — a hybrid between shadow mask and aperture grille. Slightly more structured than shadow mask, slightly warmer than aperture grille:

Row 0: RR GG BB RR GG BB RR GG    (2-wide slots)
Row 1: RR GG BB RR GG BB RR GG    (same)
Row 2: BB RR GG BB RR GG BB RR    (offset by 2 — the "brick" shift)
Row 3: BB RR GG BB RR GG BB RR

Each mask is stored as a small LUT indexed by (output_col mod pattern_width, output_row mod pattern_height) — typically 6–8 bytes for the repeat unit. The entire phosphor mask LUT for all three types fits in well under 1KB.
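
The modulo indexing can be illustrated with the slot mask pattern shown above. This sketch only reports which channel is dominant at each output position; the per-channel brightness values for the slot mask are not given in this document, so they are left out. Names are hypothetical.

```python
# Sketch: slot mask repeat unit (6 columns x 4 rows) indexed by output position.
SLOT_MASK = [
    "RRGGBB",
    "RRGGBB",
    "BBRRGG",   # the "brick" shift: offset by 2 columns
    "BBRRGG",
]

def dominant_channel(out_col: int, out_row: int) -> str:
    """Return 'R', 'G' or 'B' for the phosphor slot covering this output pixel."""
    pattern_row = SLOT_MASK[out_row % len(SLOT_MASK)]
    return pattern_row[out_col % len(pattern_row)]
```

Because the lookup wraps in both axes, the same 24-entry table covers the whole screen.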

Pixel Boundary Darkening (X axis) — Stage 4

Simulates the dark gap between adjacent pixels even on flat-emissive displays, giving a sense of discrete pixel structure reminiscent of PVM monitors. Applied as a column brightness profile within each logical pixel block:

8× X scale profile:

Col 0: 70%   (rising edge)
Col 1: 100%
Col 2: 100%
Col 3: 100%
Col 4: 100%
Col 5: 100%
Col 6: 100%
Col 7: 70%   (falling edge)

The soft edge on cols 0 and 7 rather than a hard cutoff avoids an overly mechanical look. The exact rolloff values are stored in the column mask LUT alongside the row mask.
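
Since the column mask sits next to the row mask, the two can be combined into one gain per output pixel. The sketch below assumes they combine multiplicatively (suggested by the "Final multiply/blend" stage, but not stated explicitly) and uses the Monitor Sharp row profile as an example; names are hypothetical.

```python
# Sketch: combined row x column gain inside an 8x8 block (assumes multiplicative combination).
COL_MASK_8X = [70, 100, 100, 100, 100, 100, 100, 70]   # percent, from the profile above
ROW_MASK_8X = [100, 100, 100, 100, 100, 100, 30, 0]    # Monitor Sharp scanline profile

def block_gain(out_x: int, out_y: int) -> int:
    """Brightness percentage for an output pixel from its position within the block."""
    return COL_MASK_8X[out_x % 8] * ROW_MASK_8X[out_y % 8] // 100
```

Corner pixels of each block end up darkest, for example `block_gain(7, 6)` is 21%.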

Bloom / Phosphor Glow — Stage 1

Bright pixels on a real CRT bleed light into neighbouring pixels: the phosphor glows beyond its aperture when driven hard. Implemented as a separable 5-tap horizontal + 5-tap vertical blur applied at native resolution before pixel replication:

Horizontal: [0.05, 0.15, 1.0, 0.15, 0.05]  (centre + 2 neighbours each side)
Vertical:   [0.05, 0.15, 1.0, 0.15, 0.05]

Requires a line buffer of native-resolution pixels in BSRAM, one row per vertical neighbour tap (480 pixels × 32bpp ≈ 1.9KB per row at Standard resolution, negligible). The result is a faint coloured halo around bright pixels on dark backgrounds, most visible on white text on black, bright sprites against dark backgrounds, and raster bar effects.

Bloom strength is a configurable register value (0 = off, 255 = maximum glow). At moderate settings (~64–96) it adds authenticity without being visually distracting.
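
One horizontal pass of the blur might look like the sketch below. It uses the kernel weights above in integer hundredths. How the strength register scales the kernel is not specified here, so the sketch assumes it scales the four neighbour taps linearly; edge pixels are clamped, which is also an assumption. Names are hypothetical.

```python
# Sketch of one horizontal bloom pass on a single channel (assumptions noted above).
KERNEL = [5, 15, 100, 15, 5]   # weights x100: centre plus two neighbours each side

def bloom_row(row, strength: int = 255):
    """Blur a row of 0..255 values; strength 0..255 scales the neighbour taps."""
    n = len(row)
    out = []
    for x in range(n):
        acc = 0
        for k, weight in enumerate(KERNEL):
            src = min(max(x + k - 2, 0), n - 1)      # clamp at the row edges
            if k != 2:                               # centre tap is never attenuated
                weight = weight * strength // 255
            acc += weight * row[src]
        out.append(min(acc // 100, 255))             # kernel sums above 1.0, so saturate
    return out
```

The vertical pass is the same operation applied down each column, which is where the line buffer is needed.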

Tate Mode — 90° Rotated Scanline Simulation

Tate mode (from 縦, tate — Japanese for "vertical") is a single bit in the simulation mode register that rotates the entire CRT simulation pipeline 90°. It is designed for vertical arcade games (shoot-em-ups, platformers, Donkey Kong-style) displayed on a horizontal screen.

Normally the scanline brightness profile runs horizontally — dark gaps between rows. In Tate mode the profile is transposed: the dark gaps run vertically instead, simulating how a real CRT would look if the monitor were rotated 90° to display the game in portrait orientation. The phosphor mask patterns are similarly transposed — the column index and row index are swapped in the LUT lookup.

Implementation cost: The LUTs are already indexed by (col_within_block, row_within_block). Tate mode simply swaps those indices to (row_within_block, col_within_block): a pair of small multiplexers and a register bit. Essentially free.
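
The index swap can be sketched as follows, using a mask table built from the TV Thick scanline profile as an example. The table layout and names are hypothetical.

```python
# Sketch: Tate mode as an index swap on the mask lookup (hypothetical names and layout).
TV_THICK = [100, 100, 100, 100, 0, 0, 0, 0]
MASK = [[TV_THICK[r]] * 8 for r in range(8)]    # MASK[row][col], dark gaps on rows 4..7

def mask_lookup(col_in_block: int, row_in_block: int, tate: bool = False) -> int:
    """Brightness percentage; with tate set, the column and row indices are exchanged."""
    if tate:
        col_in_block, row_in_block = row_in_block, col_in_block
    return MASK[row_in_block][col_in_block]
```

With `tate=False` the dark band depends on the row within the block; with `tate=True` it depends on the column, so the gaps run vertically.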

The result is a portrait arcade game on a landscape screen that reads authentically as "rotated CRT" rather than "pillarboxed LCD." The vertical dark column gaps give the same visual weight and texture as scanlines do on a normal horizontal game.

Dual Tate — Two Vertical Games Side by Side

At 4K output, two portrait games fit side by side in a clean 50/50 split using the 270×480 native resolution — tall pixels (PAR 1:2), each half exactly 1920 output pixels wide:

Layer Native PAR X_offset Output width Notes
Game A 270×480 1:2 (tall pixels) 0 1920 Left half — Tate mode
Game B 270×480 1:2 (tall pixels) 1920 1920 Right half — Tate mode

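270×480 is the Standard 480×270 frame with its axes swapped. Along the rotated axes the factors are integers: 270 divides 1080 (×4 to 1080p, ×8 to 4K) and 480 divides 1920 (×4 to 1080p, ×8 to 4K). Note that 270 is not in the H table and 480 is not in the V table, so mapped unrotated onto a 1920×2160 half-screen the ratios are 1920÷270 ≈ 7.11 and 2160÷480 = 4.5; the exact per-half scaling therefore still needs to be pinned down. Each game gets its own layer, its own sprite layer, its own palette, its own rotated CRT simulation. The layers simply have different X_offsets, which the layer system handles with no special casing.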

A narrow separator layer between them (a 2-pixel wide border or a thin HAM24 marquee strip) can be added as another layer at X_offset=1919, costing nothing.

Both games run simultaneously and completely independently, composited in hardware.

Simulation Presets

A preset register selects a named combination rather than requiring individual effect configuration. Presets are starting points; individual effects can be overridden:

Preset Scanline Phosphor mask Boundary Bloom Tate Reference
Off Off Off Off Off Off Flat LCD, no simulation
TV TV Thick Shadow Mask Off Low Off C64 / Spectrum on domestic TV
TV Soft TV Soft Shadow Mask Off Medium Off BBC Micro on domestic TV
RGB Monitor Monitor Sharp Aperture Grille On Low Off Amiga on CM8833 / Trinitron
PVM PVM Aperture Grille On Off Off Sony PVM broadcast monitor
Arcade Arcade Slot Mask Off Medium Off JAMMA arcade cabinet
PC Monitor Monitor Sharp Slot Mask On Off Off 90s PC VGA monitor
Tate Arcade Arcade Slot Mask Off Medium On Vertical arcade cabinet — rotated CRT
Tate PVM PVM Aperture Grille On Off On Vertical game on rotated broadcast monitor

Named Monitor Profiles

Beyond the generic presets, specific iconic displays had highly distinctive looks that are worth documenting as named profiles. Each is a precise combination of scanline profile, phosphor mask, boundary darkening, and bloom settings that together reproduce the character of that specific hardware.


Commodore 1084S / 1084SD

The standard monitor for the Amiga and C64. A medium-quality shadow mask consumer monitor with a relatively thick beam and warm colour temperature.

Setting Value
Scanline TV Thick (4 lit, 4 dark)
Phosphor Shadow Mask — warm, wide dot pitch
Boundary darkening Off
Bloom Medium (~80)
Character Warm, slightly soft, strong scanline gaps — the definitive C64 look

Philips CM8833 / CM8833-II

The Amiga enthusiast's monitor of choice. A higher-quality shadow mask with a sharper beam than the 1084, slightly cooler colour temperature, and notably tighter dot pitch.

Setting Value
Scanline Monitor Sharp (6 lit, 2 dark, hard edge)
Phosphor Shadow Mask — cooler, tighter dot pitch than 1084
Boundary darkening Off
Bloom Low (~32)
Character Crisp, slightly cool, visible but not dominant scanlines — the definitive Amiga demo look

Sony Trinitron (PVM-14L2, KV series, etc.)

The Trinitron aperture grille technology was used across Sony's entire range from budget TVs to professional PVMs. The defining characteristic is vertical phosphor stripes (no shadow mask dot structure), giving higher brightness and more saturated colours than any shadow mask monitor. The thin horizontal damper wires (one or two faint lines across the screen) are the giveaway.

Setting Value
Scanline Monitor Sharp to Arcade depending on model
Phosphor Aperture Grille — vertical RGB stripes
Damper wires ~85% brightness at 1/3 and 2/3 screen height
Boundary darkening On — the stripe structure gives natural column separation
Bloom Low-Medium (~48)
Character Sharp, saturated, slightly clinical — the "professional" retro look

Consumer Trinitron TVs (KV series) had thicker scanlines and softer bloom. Professional PVMs had extremely tight, precise beams with minimal bloom — the "PVM look" beloved by retro gaming enthusiasts for its sharpness and accuracy.


Sony PVM / BVM (Professional/Broadcast Video Monitor)

The gold standard for retro game video quality. PVMs used Trinitron tubes with extremely precise electronics: the beam was tighter, the colour more accurate, and the scanlines sharper than any consumer display. BVMs (Broadcast Video Monitors) were even more precise.

Setting Value
Scanline PVM (6 lit, then 80%, 40% rolloff)
Phosphor Aperture Grille — tightest stripe pitch of any Trinitron
Damper wires Present but very faint (~92% brightness)
Boundary darkening On — strong, precise column edges
Bloom Off or minimal (~16)
Character Razor sharp, clinically accurate, minimal glow — content looks exactly as the developer intended

The PVM look is simultaneously the most "accurate" and the most distinctive — it makes low-resolution pixel art look structured and intentional rather than fuzzy.


Mitsubishi Diamondtron (NF series)

Mitsubishi's answer to Trinitron, also an aperture grille technology. Slightly warmer colour temperature than Sony, marginally wider stripe pitch on some models, and slightly more bloom. Used in Mitsubishi's Diamond Plus and Diamond Pro monitor ranges popular with Amiga and PC users in the 90s.

Setting Value
Scanline Monitor Sharp
Phosphor Aperture Grille — slightly wider pitch than Trinitron
Damper wires Present, slightly more visible than Trinitron (~82% brightness)
Boundary darkening On
Bloom Low-Medium (~56)
Character Similar to Trinitron but marginally warmer and softer — slightly more forgiving on harsh colours

Generic SCART TV (PAL domestic)

A typical mid-range European CRT television of the 80s/90s connected via RGB SCART, the primary display for the vast majority of British home computer and console users. Shadow mask, fairly thick beam, warm colour temperature, visible overscan, and strong scanlines.

Setting Value
Scanline TV Thick (4 lit, 4 dark)
Phosphor Shadow Mask — wide dot pitch, warm
Boundary darkening Off
Bloom High (~128)
Overscan simulation On — slight border bleed
Character Warm, fuzzy, nostalgic — the actual look most people experienced their retro games in

This is probably the most emotionally resonant profile for European users — not the sharpest, but the most authentically "what it actually looked like in your living room in 1987."


JAMMA Arcade CRT (Wells Gardner, Electrohome, etc.)

Arcade monitors ran their CRTs harder than consumer displays: higher brightness, tighter convergence, and a very characteristic slightly-green tint from the phosphor mix. Shadow or slot mask depending on manufacturer. Scanlines were visible but thin due to the high brightness drive.

Setting Value
Scanline Arcade (7 lit, 1 dark)
Phosphor Slot Mask (most common in JAMMA era)
Boundary darkening Off
Bloom Medium-High (~96) — phosphors driven hard
Colour tint Very slight green bias (+4 green channel)
Character Bright, punchy, slightly raw — the smell of a fish and chip shop optional

Commodore 64 composite (PAL)

The C64 connected to a domestic TV via RF or composite, not RGB. This is the lowest-quality signal path, adding colour smearing, luminance-chroma crosstalk, and significant softness. Many classic C64 games were designed with composite artefacts in mind; the colour bleeding was sometimes used deliberately as a feature.

Setting Value
Scanline TV Thick
Phosphor Shadow Mask — wide pitch
Boundary darkening Off
Bloom High (~112)
H chroma blur On — 2–3 pixel chroma bleed (composite artefact simulation)
Character Soft, blurry, colourful — the way most C64 owners actually saw their machine

The composite chroma blur is a separate effect from phosphor bloom — it's a horizontal low-pass on the Cb/Cr channels only (the YCbCr composite simulation mentioned in Section 3), deliberately mimicking the limited chroma bandwidth of PAL composite video.
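
A chroma-only low-pass can be sketched on a row of (Y, Cb, Cr) samples. The [1, 2, 1]/4 taps below are a placeholder chosen for the example, not the filter used by FireStorm, and the function name is hypothetical.

```python
# Sketch: horizontal low-pass on Cb/Cr only, leaving luma untouched (placeholder taps).

def chroma_blur(row):
    """row is a list of (y, cb, cr) tuples; returns the row with smeared chroma."""
    n = len(row)
    out = []
    for x in range(n):
        left = row[max(x - 1, 0)]          # clamp at the row edges
        mid = row[x]
        right = row[min(x + 1, n - 1)]
        cb = (left[1] + 2 * mid[1] + right[1]) // 4
        cr = (left[2] + 2 * mid[2] + right[2]) // 4
        out.append((mid[0], cb, cr))       # Y passes through unchanged
    return out
```

A single saturated pixel between neutral neighbours keeps its brightness but leaks colour into both sides, which is the composite-style bleed described above. A wider kernel would be needed for a full 2–3 pixel bleed.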


These profiles are selectable by name via a display.setMode() system call, with individual parameters overridable after selection. DeMon applies the profile at boot based on a stored configuration; the Copper can switch profiles mid-frame for mixed-monitor split-screen effects.

The simulation pipeline is LUT-trivial excluding the bloom pre-pass:

Stage Resource
Row mask LUT 8-entry × 8-bit = 64 bits — a few flip-flops
Column mask LUT 8-entry × 8-bit = 64 bits
Phosphor mask LUT ~64 bytes per pattern — distributed RAM
Multiply/blend ~3 DSP blocks per output path
Bloom line buffer ~1.9KB BSRAM per output path
Total per output path ~2 BSRAM blocks + ~3 DSP blocks + handful of LUTs

Three output paths (DP, HDMI, VGA) running independent simulation modes costs ~6 BSRAM blocks and ~9 DSP blocks — rounding error on a device with 340 BSRAM blocks and 298 DSP blocks.


9. Display Mode Registers (Initial Concept)

Each output path has a small register bank (target: 8–16 bits per output) written by DeMon at boot or dynamically via a display.setMode() system call from the SG2000.

Per-Output Control Register (proposed)

Bits Field Notes
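[2:0] Native H width select Index into H resolution table (3 bits address 8 entries, so a subset of the table)
[5:3] Native V height select Index into V resolution table (3 bits address 8 entries, so a subset of the table)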
[7:6] Scanline profile 0=Off, 1=TV, 2=Monitor, 3=Arcade
[9:8] Phosphor mask 0=Off, 1=Shadow, 2=Aperture Grille, 3=Slot
[10] Pixel boundary darkening Enable/disable
[11] Bloom pre-pass Enable/disable
[13:12] PAR mode 0=Square, 1=2:1 fat, 2=1:2 tall, 3=4:3
[14] Tate mode Rotate simulation 90° for vertical arcade games
[15] Reserved

This allows different simulation configurations on DP, HDMI, and VGA simultaneously without any CPU intervention during the frame — DeMon sets the registers at mode-set time and FireStorm handles the rest autonomously.
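
As a rough illustration of the proposed bit layout, the following C sketch packs the fields from the table above into a 16-bit word. The struct, enum, and function names are hypothetical; only the bit positions and value encodings come from the table.

```c
#include <stdint.h>

/* Value encodings copied from the proposed register table; names are illustrative. */
enum scanline_profile { SCAN_OFF = 0, SCAN_TV = 1, SCAN_MONITOR = 2, SCAN_ARCADE = 3 };
enum phosphor_mask    { MASK_OFF = 0, MASK_SHADOW = 1, MASK_APERTURE = 2, MASK_SLOT = 3 };
enum par_mode         { PAR_SQUARE = 0, PAR_FAT_2_1 = 1, PAR_TALL_1_2 = 2, PAR_4_3 = 3 };

struct display_ctrl {
    unsigned h_sel;     /* [2:0]   index into H resolution table */
    unsigned v_sel;     /* [5:3]   index into V resolution table */
    unsigned scanline;  /* [7:6]   scanline profile */
    unsigned mask;      /* [9:8]   phosphor mask */
    unsigned boundary;  /* [10]    pixel boundary darkening */
    unsigned bloom;     /* [11]    bloom pre-pass */
    unsigned par;       /* [13:12] PAR mode */
    unsigned tate;      /* [14]    tate mode; bit 15 is reserved and left at 0 */
};

/* Pack the fields into the 16-bit per-output control word. */
static uint16_t display_ctrl_pack(const struct display_ctrl *c)
{
    return (uint16_t)(((c->h_sel    & 0x7u) << 0)  |
                      ((c->v_sel    & 0x7u) << 3)  |
                      ((c->scanline & 0x3u) << 6)  |
                      ((c->mask     & 0x3u) << 8)  |
                      ((c->boundary & 0x1u) << 10) |
                      ((c->bloom    & 0x1u) << 11) |
                      ((c->par      & 0x3u) << 12) |
                      ((c->tate     & 0x1u) << 14));
}
```

Each field is masked before shifting, so an out-of-range value cannot spill into a neighbouring field.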


10. Scanline Mixer and Layer System

FireStorm composites multiple independent layers in hardware on every scanline. No CPU cost for final composition — the application writes to layer framebuffers/registers and FireStorm handles mixing autonomously.

Per-Layer Independent Resolution — A Key Capability

Every layer in FireStorm has its own native resolution, independently configured. A game background, a sprite layer, a UI overlay, and a text console can all run at different native resolutions simultaneously — and all are composited cleanly to the single output resolution by the blender. No special casing, no CPU involvement, just independent pixel replication per layer before blending.

This is unusual. Most display hardware has a single global resolution — everything on screen shares it, and mixing different resolutions means either scaling in software or accepting borders. FireStorm has no such constraint. Each layer picks the resolution that suits its content:

  • A background tilemap at 320×180 for a classic scrolling game feel
  • Sprites at 480×270 Standard resolution
  • A UI overlay at 960×540 Hires for sharp text and controls
  • A border/static layer at 240×270 for fat-pixel decorative elements
  • A Colony video layer at whatever resolution the incoming stream carries

All of these composite at the output pixel clock simultaneously. The blender sees every layer arriving at the same rate regardless of its native resolution — the pixel replicator for each layer handles the difference transparently.

The Amiga could mix lo-res, hi-res, and HAM bitplanes. The Ant64 can mix any valid combination from the resolution tables, per layer, with fully independent H and V scale factors. It is a direct evolution of that philosophy with no arbitrary constraints.

Layer Types (initial concept)

Layer Type Notes
Framebuffer 0/1 Dynamic Primary application render target — SG2000 big core
Tilemap 0/1 Dynamic Classic scrolling game backgrounds, per-tile palette
Sprite layer Dynamic Hardware sprites — user and system partitions, see Section 14
Text/console Static/dynamic Always-available terminal, character cell with attribute byte
ImGui / OS overlay Dynamic GUI overlaid over application content
Cursor Dynamic Mouse pointer, always highest priority
Border/static Static Fixed graphics, colour fills, overscan regions
Colony video (Ant64/Ant64C) External Video stream received via Colony network from HDMI frame grabber or other peripheral
SG2000 MIPI layer Internal CPU-generated content from SG2000 MIPI TX → FireStorm MIPI RX hardcell

Layer Properties (per layer)

  • Enable/disable
  • Priority (Z-order)
  • Colour key / transparency
  • Palette select
  • H_scale — output clocks per native pixel (horizontal pixel replication)
  • V_scale — output lines per native row (vertical line replication)
  • H_scroll — native pixel offset (horizontal scroll)
  • V_scroll — native row offset (vertical scroll)
  • X_offset — output pixel position of the layer's left edge
  • Y_offset — output line position of the layer's top edge
  • Width — active width of the layer in output pixels (clips the right edge)
  • Height — active height of the layer in output lines (clips the bottom edge)

H_scale and V_scale are fully independent, allowing any pixel aspect ratio. All parameters are writable by the Copper for per-scanline effects.

X_offset, Y_offset, Width, and Height together define a positioned bounding rectangle for each layer in output pixel coordinates. The blender only samples a layer within its active rectangle — outside it the layer contributes nothing and the next lower priority layer shows through.

Pixel Replication Architecture

Each layer has its own pixel replicator running at the output pixel clock. A register holds the current native pixel value for H_scale output clocks before advancing; a line counter holds the current native row for V_scale output lines before fetching the next row. The blender always sees all layers arriving at the output pixel clock rate — uniform, no rate-matching logic required.

This means:

  • The blender has one implementation at one clock rate regardless of per-layer native resolution
  • Adding a new layer is instantiating the same parameterised module again
  • Copper writes take effect at precise output pixel boundaries with no rate-translation bookkeeping
  • V_scale line repetition re-reads the same BRAM row on each repeated output line — at native widths ≤480 pixels, BRAM bandwidth is nowhere near a constraint
  • Layers outside their X_offset/Y_offset bounding rectangle output a transparent pixel — the replicator simply doesn't run for those positions
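
A software model of the horizontal half of the replicator might look like the sketch below. It is only an illustration: the hardware holds a register for H_scale clocks rather than dividing, and the struct layout, the function name, and the wrap-around at native_w are assumptions not stated above. H_scale is assumed to be at least 1.

```c
#include <stdint.h>
#include <stdbool.h>

struct layer_geom {
    uint16_t h_scale;   /* output clocks per native pixel (>= 1) */
    uint16_t h_scroll;  /* native pixel offset */
    uint16_t x_offset;  /* output x of the layer's left edge */
    uint16_t width;     /* active width in output pixels */
    uint16_t native_w;  /* native row width; scroll wrap here is an assumption */
};

/* Returns false when out_x lies outside the layer (the layer is transparent there);
   otherwise stores the native pixel column that the replicator would be holding. */
static bool layer_native_x(const struct layer_geom *g, uint16_t out_x, uint16_t *native_x)
{
    if (out_x < g->x_offset || out_x >= (uint32_t)g->x_offset + g->width)
        return false;
    uint32_t rel = (uint32_t)(out_x - g->x_offset);
    *native_x = (uint16_t)((rel / g->h_scale + g->h_scroll) % g->native_w);
    return true;
}
```

The vertical direction follows the same pattern with V_scale, V_scroll, Y_offset, and Height.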

Layer Positioning and Overlap

Each layer is a fully positioned rectangle on the output. Layers at different Y positions simply don't contribute pixels outside their active height. Layers at overlapping positions composite normally — priority, colour key, and per-palette alpha resolve which pixels win, exactly as if the layers filled the whole screen.

This is a direct evolution of how the Amiga used the Copper to drag screens of different resolutions up and down the display. On the Amiga those regions were vertically exclusive — a lo-res region and a hi-res region could not overlap, they could only be stacked with a hard boundary between them. The Copper moved the boundary, but never allowed content from two regions to occupy the same output line.

On the Ant64 there is no such constraint. Any layer can be placed anywhere, at any native resolution, and overlap any other layer freely. The blender handles the compositing per pixel regardless of whether the layers' bounding rectangles intersect. Examples:

  • Game HUD over tilemap: a 960-wide Hires score panel at Y_offset=0 over a 320-wide lo-res tilemap filling the full screen. The HUD is sharp pixel-perfect text; the game is high-colour lo-res with full scroll. Neither compromises the other. The HUD only updates when the score changes; it isn't part of the game framebuffer, shares no bitplane bandwidth, and costs nothing at runtime
  • A 320×180 lo-res background layer behind a 480×270 sprite layer, with a small 960×96 Hires status bar overlapping both at the top
  • Two HAM24 photograph layers at different Y offsets with a soft alpha blend between them where they overlap — a hardware crossfade without touching either layer's content
  • A Hires popup dialog floating over a Standard game scene mid-play, composited entirely in hardware

The Copper can write X_offset or Y_offset on every output line, allowing layers to move smoothly without any CPU frame rendering involved. This is the direct descendant of the Amiga's screen-drag mechanic, generalised to every layer independently and extended to allow full overlap.

Layer Clip Table

Each layer has a CLIP_TABLE_PTR register pointing to a memory block that defines per-scanline left and right clip boundaries. This replaces the simple Width/Height rectangle with an arbitrarily shaped clip region — any contour expressible as horizontal left/right pairs.

Clip table entry format:

struct clip_entry {
    uint16_t left_x;    // left visible boundary in output pixels
    uint16_t right_x;   // right visible boundary in output pixels
    uint16_t height;    // number of output scanlines this entry covers
};

6 bytes per entry. The display engine walks the table during output — consuming one entry per height scanlines, advancing to the next automatically. When the table is exhausted the layer reverts to fully transparent.
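
The table walk can be modelled in C as follows. This is a sketch, not the display engine's logic: the hardware advances through the table incrementally, whereas this function recomputes the position for an arbitrary line, and the explicit entry count is an assumption (the text does not say how the end of the table is marked).

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct clip_entry {
    uint16_t left_x;    /* left visible boundary in output pixels */
    uint16_t right_x;   /* right visible boundary in output pixels */
    uint16_t height;    /* number of output scanlines this entry covers */
};

/* Find the clip span for one output scanline by accumulating entry heights.
   Returns false once the table is exhausted (layer fully transparent). */
static bool clip_lookup(const struct clip_entry *table, size_t count,
                        uint32_t line, uint16_t *left, uint16_t *right)
{
    uint32_t first = 0;  /* first output line covered by table[i] */
    for (size_t i = 0; i < count; i++) {
        if (line < first + table[i].height) {
            *left  = table[i].left_x;
            *right = table[i].right_x;
            return true;
        }
        first += table[i].height;
    }
    return false;
}
```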

Table sizes:

Mode Entries Table size Use case
No clip 1 6 bytes left=0, right=output_width, height=output_height
Rectangle 1 6 bytes Fixed rectangular clip
Diagonal split N (one per scanline) N×6 bytes Straight angled split
Curved boundary N N×6 bytes Any smooth curve
Per-line full precision 1080 ~6.5KB Full 1080p per-scanline
Per-line 4K 2160 ~13KB Full 4K per-scanline

All table sizes fit comfortably in BSRAM.

Key properties:

  • CLIP_TABLE_PTR is a Copper target — the Copper can swap the pointer mid-frame to change clip regions between screen areas
  • The FireStorm EE builds the clip table each frame for dynamic clips (player-tracking split screen, animated portals, etc.)
  • The table is consumed top-to-bottom in scanline order: the EE writes it from top to bottom, and the display engine reads it the same way
  • A single-entry table covering the full screen is equivalent to no clip — zero overhead

Diagonal split-screen example:

For a straight diagonal split at a given angle, each entry has height=1 and the left/right values advance by a fixed delta per line. The EE computes this in a trivial loop each frame — a multiply and add per line. The Copper swaps CLIP_TABLE_PTR at the frame boundary. Two layers with complementary tables (one's right boundary = the other's left boundary) produce a seamless join with no gap and no overlap.
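
A minimal sketch of that loop, assuming 16.16 fixed-point for the split position and per-line delta (the number format, function name, and parameter names are illustrative, not taken from the EE's actual code):

```c
#include <stdint.h>

struct clip_entry { uint16_t left_x, right_x, height; };

/* Build complementary height=1 tables for the two layers of a diagonal split.
   x0_fp is the split position on line 0 and dx_fp the change per line, both 16.16. */
static void build_diagonal_split(struct clip_entry *left_tbl, struct clip_entry *right_tbl,
                                 uint16_t lines, uint16_t out_width,
                                 int32_t x0_fp, int32_t dx_fp)
{
    int64_t x_fp = x0_fp;                       /* 64-bit so long tables cannot overflow */
    for (uint16_t y = 0; y < lines; y++) {
        int64_t x = (x_fp < 0) ? 0 : (x_fp >> 16);
        if (x > out_width)
            x = out_width;                      /* clamp the split to the screen */
        left_tbl[y]  = (struct clip_entry){ 0, (uint16_t)x, 1 };
        right_tbl[y] = (struct clip_entry){ (uint16_t)x, out_width, 1 };
        x_fp += dx_fp;                          /* one add per line */
    }
}
```

Because both tables use the same clamped x on every line, one layer's right boundary always equals the other's left boundary, which is the seamless-join condition described above.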

Dynamic split tracking (Lego-style):

The game logic computes the split geometry each frame — midpoint between players, rotation perpendicular to the player vector — and the EE writes a new clip table. When players are close, a single full-screen entry replaces the per-scanline table and the split disappears in one frame. No transition artefact because both layers contain valid full-screen content at all times.

Bitmap Layers and the Blitter

In addition to the hardware tilemap and sprite layers, FireStorm supports bitmap layers — intermediate framebuffers (8bpp, 16bpp, or 32bpp) that the FireStorm Blitter renders into. The scanline mixer composites them alongside the hardware layers exactly as it would any other layer.

The blitter processes an EE-defined job queue with no fixed pipeline depth — one job for simple cases, many jobs for complex multi-pass rendering. The display engine is decoupled: it reads from the front buffer at the output pixel clock while the blitter writes to the back buffer. Double and triple buffering are supported. If a layer's inputs haven't changed since the last frame, the EE skips its jobs and the front buffer retains the previous frame's content — a static HUD costs nothing until it needs updating.

Texture source hierarchy: Any blitter primitive that samples pixel data — sprites, textured triangles, tilemaps, pattern fills — uses a unified texture source system. Texture data can come from:

  • Permanent BSRAM — frequently used assets at fixed BSRAM addresses, zero cache overhead, 380MHz
  • BSRAM texture cache (~128–256KB) — hot working set backed by Graphics SRAM. Hits at full BSRAM bandwidth
  • Graphics SRAM (Ant64/Ant64C, 4.5MB) — fast pipeline SRAM with no page-miss penalty, ideal for the non-sequential UV sampling that DDR3 handles poorly. Also enables high-colour intermediate buffers (R12G12B12) for precision bloom and compositing before conversion to output format. A typical scene's textures and intermediate buffers fit within 4.5MB with no DDR3 involvement
  • DDR3 — full art library, potentially megabytes. Accessed via cache for normal use; direct bypass for streaming

Textured triangles use the same system — UV coordinates are interpolated across the triangle and the sampler reads from the texture source hierarchy per pixel. Nearest-neighbour, bilinear, and perspective-correct sampling are all supported. This gives the blitter basic textured 3D rendering capability entirely within the FPGA.

Software sprite throughput: A primitive list of 500 × 16×16 4bpp masked sprites from BSRAM cache takes approximately 42 microseconds at 380MHz. Several thousand blitter sprites per frame are readily achievable, in addition to the hardware sprite layer's own budget of hundreds per scanline.

Vector CRT simulation: The blitter supports bloom lines and bloom particles — antialiased lines and point sprites with phosphor glow falloff and additive blending, designed specifically for colour vector game aesthetics. Tempest. Asteroids. Star Wars. Each primitive has its own intensity and falloff. Line crossings and particle clusters get brighter. The bitmap saturates at hot spots. It looks like a vector CRT because the maths is the same maths. Named presets for the classic arcade titles are built in. A full Tempest-style frame — web wireframe, enemy lines, shot particles, explosion particles — renders in well under 500 microseconds.

Clip table at composite stage: When the blitter composites an intermediate bitmap to the output layer, the layer's clip table is applied — pixels outside the left/right boundaries per scanline are not written. This is the split-screen and portal mechanism.

The full blitter documentation covers the pipeline model, buffering, dirty tracking, texture source system, bloom lines, and all primitive types.

Since H_scale, V_scale, X_offset, Y_offset, Width, and Height are all per-layer, layers at different native resolutions can be freely positioned and composited. Example at 1080p output:

Layer Native W H_scale Native H V_scale Position Feel
Background tilemap 320 ×6 180 ×6 Full screen Classic game background
Sprite layer 480 ×4 270 ×4 Full screen Default Ant64 resolution
UI overlay 960 ×2 160 ×2 Y=860, H=220 Sharp Hires status bar at bottom
Popup panel 960 ×2 270 ×2 X=600, Y=200 Hires overlay, partial screen width
Border 240 ×8 270 ×4 Full screen Fat pixel decorative border

All arrive at the blender at the output pixel clock. Outside each layer's active rectangle, the layer is transparent. The Amiga could mix lo/hi/HAM in vertical strips with no overlap. The Ant64 can mix any resolution at any position with full overlap — the same idea with no remaining constraints.


11. Copper — Per-Scanline Register Control

The Copper executes a command list in sync with the display beam, allowing any FireStorm register to change at any horizontal or vertical position. This is the mechanism behind virtually all classic demo effects.

Key capabilities:

  • Raster bars — change background colour register every N lines
  • Split screen — switch layer enable/mode mid-frame
  • Horizontal scroll per line — sine wave over a tilemap (classic water effect)
  • Palette cycling — change colour registers on specific scanlines
  • Mid-frame resolution switch — different H or V native resolution above/below a line
  • Layer priority reorder — mid-frame compositor configuration change
  • VRR blanking control — stretch or compress vertical blanking from the Copper list
  • Colony video mix ratio — blend Colony-sourced video content against generated content, per scanline (Ant64 and Ant64C)

The Copper concept is a direct spiritual descendant of the Amiga's Copper coprocessor, but implemented as a general register-write engine rather than a fixed-function chip. Any FireStorm register is a valid Copper target.


12. Named Display Modes (initial set)

Pulling together the most useful configurations:

Mode name Native res PAR Scale to 4K Primary character
Standard 480×270 1:1 ×8×8 Amiga low res equivalent — retro pixel feel
Hires 960×540 1:1 ×4×4 Amiga hires equivalent — double detail
Desktop 1920×1080 1:1 ×2×2 Full 1080p native passthrough
Multicolour 240×270 2:1 ×16×8 C64 multicolour / fat pixel sprites
Classic 4:3 480×360 4:3 ×8×6 Undistorted 4:3 geometry on 16:9
Retro Platform 320×240 4:3 ×12×9 SNES/PlayStation/DOS authentic feel
Panorama 960×270 1:2 ×4×8 Tall pixel cinematic widescreen
Retro Low 160×180 2:1 ×24×12 Maximum fat-pixel simulation space
Universal 640×360 1:1 ×6×6 Integer scales to 4K, 1080p, and 1440p

14. Sprite System

Sprite Attribute Table

A fixed-size table of sprite attribute slots, each holding:

Attribute Notes
X position Determines line buffer write address
Y position Used during sort/pick to determine if sprite intersects next row
Width / height Pixel fetch range
BRAM data pointer Address of sprite pixel data
Palette select Applied during pixel fetch
Priority Blender Z-order
H flip / V flip Applied during pixel fetch
Enable flag Skips disabled sprites during sort

The true hardware limit is a compile-time constant baked into the HDL — the absolute ceiling no register can exceed.

Table Partitioning

The attribute table is split from both ends:

Slot 0                                              Slot N (true limit)
│◄──── user sprites (0 → USER_SPRITE_LIMIT) ───────►│◄─── system ────►│
                                                    ▲
                                               SYS_SPRITE_BASE

USER_SPRITE_LIMIT — maximum slots the user application can populate, counting from slot 0 upward. The sprite engine stops scanning user sprites at this index. Writable by the SG2000 big core or FireStorm EE via live register; takes effect at next snapshot.

SYS_SPRITE_BASE — slot index where system sprites begin, counting from the top of the table. Writable by DeMon only — the SG2000 has no write path to this register (not a software permission check; the bus connection physically does not exist). System sprites always fetch regardless of USER_SPRITE_LIMIT.

If USER_SPRITE_LIMIT is set at or above SYS_SPRITE_BASE, the hardware clamps it silently. A status register exposes the effective limit back to software.

DeMon can adjust SYS_SPRITE_BASE dynamically — reserving more system slots during a debug session, releasing them afterward — without any SG2000 involvement.

System Sprite Use Cases

System sprite Notes
Mouse cursor Always topmost priority, updated by DeMon from Sticky input
Debug overlay marker Highlights screen regions during development
AntOS notification icon OS-level status independent of application
Personality cartridge indicator Hardware-level status, always visible

System sprites use a separate palette owned by DeMon, ensuring the cursor always renders correctly regardless of what the user application has done to colour registers.

Shadow Registers

All sprite attributes have a parallel shadow copy. At the start of each native scanline's V_scale window, the hardware atomically snapshots all live registers into shadow:

shadow[] ← live[]   (atomic, at native scanline boundary)

The sprite engine reads only from shadow registers for the entire V_scale window. The SG2000 big core, FireStorm EE, and the Copper may freely write live registers at any time — writes are queued implicitly and take effect at the next snapshot. This eliminates all mid-scanline corruption races with no software involvement.

The Copper writes sprite X positions to live registers on every native scanline without any special timing concern — the snapshot boundary is entirely the hardware's responsibility, transparent to software.

Prefetch Pipeline

The sprite engine works one native row ahead — during the V_scale window for native row N, it prepares sprites for native row N+1. This gives the full V_scale window for sort and fetch:

Native row N begins (output line 0 of V_scale window):
    shadow[] ← live[]              ← snapshot for row N+1
    Buffer A → blender             ← displaying row N (prepared last window)
    Sort: scan shadow[] for sprites active on row N+1
    Fetch: load sprite pixels into Buffer B for row N+1

Native row N+1 begins:
    shadow[] ← live[]              ← snapshot for row N+2
    Buffer B → blender             ← displaying row N+1
    Sort + fetch into Buffer A for row N+2
    (buffers ping-pong each native row)

The blender and sprite engine never touch the same buffer. The pipeline is primed during vertical blanking — row 0 sprites are prepared during the last V-blank lines so the first display row has a valid buffer ready with no special cases.

Sprite Budget

The binding constraint is not output line count but pixel fetch throughput — BRAM read bandwidth and line buffer write bandwidth.

GW5AT-138 BSRAM key specs:

  • 340 BSRAM blocks, each up to 18Kbits (6,120Kbits / ~765KB total)
  • Clock frequency up to 380 MHz
  • Dual Port mode with independent clocks and up to 72-bit data width
  • Semi Dual Port mode: dedicated write port (A) and read port (B), independent clocks

At 380 MHz and 72-bit port width, a single BSRAM can deliver 72 bits per cycle = ~3.4GB/s read bandwidth. Multiple BSRAMs running in parallel multiply this further.

For sprite pixel fetching, a practical design would use a dedicated BSRAM for the sprite sheet data, running its read port at the BSRAM clock (up to 380 MHz) independently of the output pixel clock. At 380 MHz with a 32-bit read port (one 8bpp sprite row of 4 pixels per cycle), the fetch engine can pull ~1,520 million pixels per second — far more than any scanline can consume.
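
The bandwidth figures quoted above follow from simple arithmetic, which can be checked with a small C sketch (the helper names are illustrative):

```c
#include <stdint.h>

/* Peak read bandwidth of one BSRAM port in bytes per second. */
static uint64_t port_bytes_per_sec(uint64_t clock_hz, unsigned port_bits)
{
    return clock_hz * port_bits / 8u;
}

/* Peak pixel fetch rate when each read returns port_bits / bpp whole pixels. */
static uint64_t port_pixels_per_sec(uint64_t clock_hz, unsigned port_bits, unsigned bpp)
{
    return clock_hz * (port_bits / bpp);
}
```

With a 380 MHz clock, a 72-bit port gives 3,420,000,000 bytes per second (the ~3.4GB/s figure), and a 32-bit port at 8bpp gives 1,520,000,000 pixels per second.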

The practical limit then becomes the line buffer write bandwidth and the sort logic throughput rather than raw BRAM speed. With a sensible architecture:

  • Sort scan (check Y bounds of all attribute slots): at 380MHz, 64 sprites × ~3 cycles each ≈ ~200 cycles — under 1 output line period at any resolution
  • Pixel fetch: per-sprite BRAM read + line buffer write, perhaps 8–16 cycles per sprite row at 380MHz

This gives a realistic per-V_scale slot budget of hundreds of sprites, not dozens. The full V_scale window is available since the pipeline runs one row ahead:

V_scale Total fetch slots Conservative budget Notes
×2 2 200–400 Sort + fetch at 380MHz
×4 4 400–800
×8 8 800–1,600
×16 16 1,600–3,200

The budget is more naturally expressed as pixels fetched per native scanline — a wide-sprite scene uses the same bandwidth as a narrow-sprite scene with more objects. USER_SPRITE_LIMIT acts as a ceiling so software can control worst-case fetch time; the engine fits as many sprites as it can up to that limit within each V_scale window.


15. High-Resolution Native Modes and Colour Encoding

Multi-Resolution Output — Per-Path Native Resolution

The three output paths generate independently. A 4K native layer (tilemap or HAM24) renders once and is simultaneously mixed down to the lower-resolution outputs in the same scanline pass — no second render, no framebuffer store:

Output Native res Mode Delivery
DisplayPort 3840×2160 Tilemap or HAM24 Full 4K, direct to DP encoder
HDMI 1920×1080 2:1 mixdown from 4K render Filtered, clock-domain bridged
VGA 960×540 or 640×480 N:1 mixdown from 4K render Filtered, 5-bit DAC, optional dither

FireStorm generates each output from the same layer data. A game running genuine 4K tilemap on DisplayPort simultaneously outputs clean 1080p on HDMI for a capture card with no extra CPU involvement.

4K Mixdown Pipeline

Since the 4K render produces pixels scanline-sequentially, the mixdown sits inline between the 4K pixel stream and the lower-resolution encoders. The blender composites all layers first, then the composite is mixed down — simpler than mixing individual layers separately and gives correct blending results:

4K render + composite blender
    │
    ├──────────────────────────────→ DP encoder (full 4K, no processing)
    │
    ├── [H 2:1 filter] → [line pair accumulator] → [CDC FIFO] → HDMI encoder
    │
    └── [H N:1 filter] → [line pair accumulator] → [CDC FIFO] → [dither] → [5-bit truncation] → VGA DAC

Horizontal filter — averages pairs (or N-tuples) of adjacent pixels across the line. At 2:1 for HDMI this is a 2-tap box filter (one add + shift per pixel pair), essentially free in logic. Higher quality options use DSP blocks.

Line pair accumulator — holds one line of output-resolution pixels in BSRAM (~1920 × 32bpp = ~7.5KB for HDMI) and averages it with the next native line before outputting. This provides the vertical 2:1 reduction.
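
The two stages can be sketched together in C for a single 8-bit channel. This is an illustrative model under a few assumptions: the function name is hypothetical, the accumulator keeps unrounded pair sums rather than pre-shifted averages, and the final divide rounds to nearest.

```c
#include <stdint.h>
#include <stddef.h>

/* Process one native line of a vertical pair.
   First line (second_line == 0): store horizontal pair sums in the accumulator.
   Second line (second_line != 0): add its pair sums and emit the 2x2 average. */
static void mixdown_line(const uint8_t *native, size_t native_w,
                         uint16_t *acc, uint8_t *out, int second_line)
{
    for (size_t x = 0; x < native_w / 2; x++) {
        /* horizontal 2-tap box filter: sum of two adjacent native pixels */
        uint16_t pair = (uint16_t)(native[2 * x] + native[2 * x + 1]);
        if (!second_line)
            acc[x] = pair;
        else
            out[x] = (uint8_t)((acc[x] + pair + 2) >> 2);  /* average of 4, rounded */
    }
}
```

Deferring the shift until all four samples are summed avoids losing a bit of precision at the intermediate step, at the cost of a slightly wider accumulator.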

CDC FIFO — a small clock-domain crossing FIFO bridges the 4K pixel clock (~533MHz) to each output's own pixel clock (HDMI ~148.5MHz, VGA ~40MHz). Only a few pixels deep — absorbing the clock ratio difference, not buffering a full line. Standard synchroniser practice.

Mixdown Filter Quality

The filter is selectable per output via a register field:

Filter Cost Quality Best for
Box 2×2 Trivial — 2 adds per pixel Adequate, slight softness VGA (analogue path does the rest)
Bilinear Small multiplier per pixel Good, smooth edges HDMI general use
Lanczos 4-tap A few DSP blocks Excellent, sharp HDMI high-quality / photographic HAM24

The GW5AT-138 has 298 DSP blocks — a Lanczos filter is easily affordable. The Copper can change the filter register mid-frame if different screen regions benefit from different filter approaches.

VGA DAC — 5-Bit Precision, Spatial Dithering, and Temporal Half-Bit

The VGA path applies 8→5 bit truncation after the mixdown filter, right at the DAC:

Mixdown output (8bpp per channel, full precision)
    ↓
[Spatial ordered dither — Bayer matrix indexed by pixel position]
    ↓
[Temporal half-bit — alternation on line parity × frame parity]
    ↓
5-bit truncation (keep bits [7:3], discard [2:0])
    ↓
VGA DAC (5bpp per channel → analogue signal)

Truncating after the mixdown filter is critical — truncating before would introduce quantisation errors that accumulate through the averaging arithmetic.

Spatial ordered dithering recovers sub-half-step precision at negligible cost. A Bayer matrix (4×4 or 8×8) adds a threshold to the low 3 bits before truncation, spreading quantisation error spatially so smooth gradients appear continuous rather than banded. The matrix is 64 entries × 3 bits = 192 bits — fits in a handful of flip-flops.
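
A minimal sketch of the spatial stage, using the standard 4×4 Bayer matrix scaled down to a 3-bit threshold. The function name and the saturation at 255 are assumptions added for the example.

```c
#include <stdint.h>

/* Standard 4x4 Bayer ordered-dither matrix, values 0..15. */
static const uint8_t bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* Add a position-dependent 3-bit threshold, then keep bits [7:3]. */
static uint8_t dither_to_5bit(uint8_t v, unsigned x, unsigned y)
{
    unsigned t   = bayer4[y & 3][x & 3] >> 1;   /* scale 0..15 down to 0..7 */
    unsigned sum = v + t;
    if (sum > 255)
        sum = 255;                              /* saturate instead of wrapping */
    return (uint8_t)(sum >> 3);
}
```

For an input of 4 (exactly half a DAC step), half of the positions in each 4×4 cell round up to 1 and half stay at 0, so the area average lands on the half-step.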

Temporal half-bit dithering recovers the half-step between adjacent 5-bit DAC values by alternating between floor and ceiling values on a line/frame pattern. The eye integrates brightness over time and perceives the average — effectively gaining one extra bit of DAC resolution. This technique was used on the Amiga to achieve effective 5-bit output from a 4-bit DAC; the Ant64 applies the same principle to gain effective 6-bit from its 5-bit DAC.

Bit [2] of the 8-bit internal value (the most significant discarded bit) acts as the half-step indicator. The alternation pattern is a 2×2 temporal-spatial checkerboard combining line parity and frame parity:

              Even frame    Odd frame
Even line:    base          base+1
Odd line:     base+1        base

Over two frames × two lines, the pattern outputs base twice and base+1 twice, averaging to exactly base+0.5 (each individual pixel alternates between the two values on successive frames). The checkerboard pattern minimises visible flicker compared to a purely frame-alternating scheme.

The DAC output logic:

half_step = internal_value[2]               // half-step indicator
toggle    = frame_count[0] ^ line_count[0]  // 2×2 alternation pattern
dac_value = internal_value[7:3] + (half_step & toggle)

One XOR, one AND, one conditional increment — approximately 5 LUTs per channel. Essentially free.
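
The same logic as a C function, for one channel. The clamp at 31 is an assumption added here: without it, an internal value of 252 or above would increment past the top of the 5-bit range on toggled lines.

```c
#include <stdint.h>

/* Temporal half-bit: alternate between floor and floor+1 on a line/frame checkerboard. */
static uint8_t vga_dac_value(uint8_t internal, unsigned frame_count, unsigned line_count)
{
    unsigned base      = internal >> 3;                  /* bits [7:3] */
    unsigned half_step = (internal >> 2) & 1u;           /* bit [2], the half-step indicator */
    unsigned toggle    = (frame_count ^ line_count) & 1u;
    unsigned value     = base + (half_step & toggle);
    return (uint8_t)(value > 31u ? 31u : value);         /* saturation: an assumption */
}
```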

This works reliably on CRT monitors: phosphor persistence naturally blends the line alternation, and 50/60Hz frame rate is above the flicker fusion threshold so frame alternation is invisible. On LCD panels connected via VGA the result is display-dependent — panels with temporal noise reduction will average correctly; panels that treat each frame independently may show shimmer. Hence "try, it may work" for non-CRT targets.

The spatial and temporal tricks are complementary: spatial dither handles sub-half-step precision, temporal handles the half-step itself. Together they recover roughly half of the 3 bits discarded by truncation:

Method Effective bits/channel Perceived colours Notes
Raw 5-bit truncation 5 32,768 Baseline
+ Spatial Bayer dither ~5.5 ~100K Spatial only
+ Temporal half-bit ~6 ~262K Frame + line alternation
+ Both combined ~6.5 ~600K Maximum — recommended for CRT

Spatial and temporal dithering can each be enabled or disabled independently via VGA output register bits. On CRT VGA monitors both should be enabled. On digital-analogue-digital capture chains both should be disabled to avoid introducing noise into the captured signal.

The Copper can write the dither enable register mid-frame — temporal dithering could be restricted to specific screen regions if desired, though in practice enabling it globally is correct.

HAM24 on VGA

HAM24's 8-bit channel modify precision ultimately reaches the VGA DAC at 5 bits: 32 effective steps per channel before dithering, up to ~64 effective steps with the temporal half-bit trick enabled. That is well beyond the original HAM6's 16 steps and, with dithering, on par with AGA HAM8's 64; the analogue signal path blurs transitions further. YCbCr HAM24 on VGA retains its perceptual advantage over RGB HAM24 since chroma errors remain less visible even at 5-bit depth.

Memory Constraints at 4K Native

Raw framebuffers at 4K are impractical from on-chip BSRAM:

Format Resolution Size Fits in BSRAM?
Raw 24bpp 3840×2160 ~24.9MB No — needs DDR3
Raw 8bpp indexed 3840×2160 ~7.9MB No — needs DDR3
HAM24 (10bpp) 3840×2160 ~9.9MB No — needs DDR3
Tilemap map data (8×8 tiles) 480×270 entries × 2B ~259KB Yes ✓
Tilemap pixel data (256 tiles × 8×8 × 8bpp) ~16KB Yes ✓
Total BSRAM available ~765KB

Tilemap mode fits entirely in BSRAM. HAM24 needs DDR3 but at roughly 40% of raw 24bpp bandwidth.

Tilemap Mode at 4K

With 8×8 tiles the 4K screen is 480×270 tiles — exactly the Ant64 Standard native resolution. The entire tilemap and tile pixel data lives in BSRAM with room to spare.

With 16×16 tiles the map shrinks to 240×135 entries (~64KB), allowing more BSRAM headroom for larger tile sets or deeper colour.

Per-tile properties carried in the map entry:

Field Bits Notes
Tile index 12 Up to 4,096 unique tiles
Palette ID 8 Which of the 256 palette descriptors to use
H flip 1 Mirror horizontally
V flip 1 Mirror vertically
Priority 2 Layer sub-ordering
Reserved 8 Future use

Tilemap Scroll System

Each tilemap layer has independent horizontal and vertical scroll, with optional per-tile-row H scroll (Copper-driven) and per-tile-column V scroll (fixed hardware register file), modelled on the Mega Drive's scroll system but with configurable granularity in both axes.

Variable Tile Size

Tile size is a per-layer register — any power-of-two in both axes independently:

Field Bits Values Notes
TILE_W 3 4, 8, 16, 32, 64, 128 H tile size in native pixels
TILE_H 3 4, 8, 16, 32, 64, 128 V tile size in native pixels

Non-square tiles are free — 8×16 for character graphics, 16×8 for wide landscape strips, etc. Power-of-two sizes replace division with right-shift and modulo with bitwise AND in the renderer:

tile_col = pixel_x >> TILE_W_SHIFT
local_x  = pixel_x &  TILE_W_MASK

Larger tiles at the same native resolution mean fewer unique tile designs fit in BSRAM — a 64×64 tile at 8bpp is 4KB versus 64 bytes for an 8×8 tile. The tile size register lets each layer choose the right trade-off independently.
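
In C, the shift-and-mask split might be sketched as below. The register encoding is an assumption: the text gives the legal sizes (4 to 128) but not how the 3-bit field maps to them, so the sketch assumes a field value n means a tile size of 4 << n.

```c
#include <stdint.h>

/* Assumed encoding: field value n selects a tile size of 4 << n, so shift = n + 2. */
static unsigned tile_shift(unsigned field)
{
    return 2u + field;
}

struct tile_addr {
    uint32_t tile_col;  /* which tile column the pixel falls in */
    uint32_t local_x;   /* pixel column within that tile */
};

/* Split a pixel coordinate without any divide or modulo. */
static struct tile_addr split_x(uint32_t pixel_x, unsigned shift)
{
    struct tile_addr a;
    a.tile_col = pixel_x >> shift;
    a.local_x  = pixel_x & ((1u << shift) - 1u);
    return a;
}
```

For example, pixel 37 with 8-pixel tiles lands in tile column 4 at local offset 5.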

H_scroll — Single Register with HSCROLL_STEP

A single H_SCROLL register per layer applies the same H offset to all tiles on the current tile row. HSCROLL_STEP controls the granularity:

HSCROLL_STEP Meaning Copper updates per frame
0 Global — one H scroll value for entire layer 0 (set once)
1 Per tile row — Copper updates every tile row ceil(native_height / TILE_H) + 1
2 Per 2 tile rows ceil(native_height / (TILE_H×2)) + 1
4 Per 4 tile rows ceil(native_height / (TILE_H×4)) + 1

The Copper fires at the output line corresponding to each tile row group boundary:

Copper at output line (group × TILE_H × HSCROLL_STEP × V_scale): H_SCROLL = value

HSCROLL_STEP=0 is purely global scroll with no Copper involvement. HSCROLL_STEP=1 gives full per-tile-row resolution — the classic sine wave water effect. Higher values reduce Copper list length for simpler parallax effects that don't need per-row precision.
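
The Copper list length and trigger lines follow directly from the formulas above. A small C sketch, with hypothetical helper names:

```c
/* Output line at which the Copper writes H_SCROLL for a given tile row group. */
static unsigned hscroll_copper_line(unsigned group, unsigned tile_h,
                                    unsigned step, unsigned v_scale)
{
    return group * tile_h * step * v_scale;
}

/* Copper H_SCROLL writes per frame: ceil(native_height / (TILE_H * step)) + 1,
   or 0 when HSCROLL_STEP is 0 (global scroll, set once outside the list). */
static unsigned hscroll_updates_per_frame(unsigned native_h, unsigned tile_h, unsigned step)
{
    if (step == 0)
        return 0;
    unsigned rows = tile_h * step;
    return (native_h + rows - 1) / rows + 1;
}
```

At 270 native lines with 8-line tiles, HSCROLL_STEP=1 needs 35 writes per frame and HSCROLL_STEP=2 needs 18.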

V_scroll — Fixed Register File with VSCROLL_STEP

A fixed hardware register file of N entries (suggested initial N=64) holds signed V scroll values. VSCROLL_STEP controls how many tile columns each register covers:

VSCROLL_STEP Meaning Registers used
0 Global — one V scroll value for entire layer 1
1 Per tile column ceil(columns) + 1
2 Per 2 tile columns ceil(columns/2) + 1
4 Per 4 tile columns ceil(columns/4) + 1
8 Per 8 tile columns ceil(columns/8) + 1

VSCROLL_STEP=0 collapses to global V scroll using only V_scroll_file[0] — it replaces the separate global/per-column mode entirely. The feature never disables — it just becomes less granular as VSCROLL_STEP increases.

The programmer chooses VSCROLL_STEP to fit their column count within N registers:

Native width TILE_W Columns VSCROLL_STEP Registers used Granularity
480 8 61 1 61 Per tile column ✓
480 4 121 2 61 Per 2 columns
3840 64 61 1 61 Per tile column ✓
3840 32 121 2 61 Per 2 columns
3840 8 481 8 61 Per 8 columns

Even the worst case (3840 native, 8×8 tiles, VSCROLL_STEP=8) gives 60 independently scrolling column groups across the screen, three times the 20 column pairs of the Mega Drive's fixed per-2-column scheme. A status register (VSCROLL_STATUS) flags whether the current configuration overflows N entries, so software can detect this and adjust VSCROLL_STEP accordingly.

The V_scroll lookup:

v_index    = (VSCROLL_STEP == 0) ? 0 : (tile_col >> VSCROLL_STEP_SHIFT)
v_index    = min(v_index, N-1)         // clamp — never undefined
scrolled_y = pixel_y + V_scroll_file[v_index]

Clamping to N-1 means overflow columns silently use the last register's value rather than producing undefined behaviour.
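
A small software model of the register budgeting and the clamped lookup, assuming N=64 as suggested above. The helper names are hypothetical; `registers_used` follows the "ceil(columns/step) + 1" rule from the table.

```python
N_REGS = 64  # suggested initial register file size from the text


def registers_used(native_width, tile_w, vscroll_step):
    """Register file entries needed for a layer at a given VSCROLL_STEP."""
    if vscroll_step == 0:
        return 1                                  # global: V_scroll_file[0] only
    columns = native_width // tile_w
    return -(-columns // vscroll_step) + 1        # ceil(columns / step) + 1


def pick_step(native_width, tile_w, n=N_REGS):
    """Smallest per-column step (finest granularity) that fits in n registers."""
    for step in (1, 2, 4, 8):
        if registers_used(native_width, tile_w, step) <= n:
            return step
    return None                                   # overflow: VSCROLL_STATUS would flag this


def v_index(tile_col, vscroll_step, n=N_REGS):
    """Register index for a tile column, clamped so it is never out of range."""
    if vscroll_step == 0:
        return 0
    shift = vscroll_step.bit_length() - 1         # plays the role of VSCROLL_STEP_SHIFT
    return min(tile_col >> shift, n - 1)
```

`pick_step(480, 8)` gives 1 and `pick_step(3840, 8)` gives 8, matching the table rows above.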

Renderer Evaluation Order

For each output pixel:
    tile_col    = pixel_x >> TILE_W_SHIFT          // pre-scroll column
    tile_row    = pixel_y >> TILE_H_SHIFT          // pre-scroll row
    h_off       = H_SCROLL                         // single register, Copper-maintained
    v_idx       = (VSCROLL_STEP==0) ? 0 : min(tile_col >> VSCROLL_STEP_SHIFT, N-1)
    v_off       = V_scroll_file[v_idx]
    scrolled_x  = pixel_x + h_off
    scrolled_y  = pixel_y + v_off
    tile_col_s  = (scrolled_x >> TILE_W_SHIFT) & map_width_mask
    tile_row_s  = (scrolled_y >> TILE_H_SHIFT) & map_height_mask
    local_x     = scrolled_x & TILE_W_MASK         // pixel offset inside the tile
    local_y     = scrolled_y & TILE_H_MASK
    tile_index  = tilemap[tile_row_s][tile_col_s]
    pixel_data  = tile_data[tile_index][local_y][local_x]
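
The same evaluation order as an executable sketch. The `layer` dictionary and its key names are hypothetical stand-ins for the layer registers and BSRAM contents; the in-tile offsets are taken from the scrolled coordinates.

```python
def fetch_pixel(pixel_x, pixel_y, layer):
    """Software model of the per-pixel tilemap lookup for one layer."""
    tw_shift, th_shift = layer["tile_w_shift"], layer["tile_h_shift"]
    tile_col = pixel_x >> tw_shift                      # pre-scroll column
    step = layer["vscroll_step"]
    if step == 0:
        v_idx = 0
    else:
        v_idx = min(tile_col >> (step.bit_length() - 1),
                    len(layer["v_scroll_file"]) - 1)    # clamp to N-1
    sx = pixel_x + layer["h_scroll"]
    sy = pixel_y + layer["v_scroll_file"][v_idx]
    # Python's >> and & on negative ints behave like two's-complement hardware,
    # so negative scroll offsets wrap around the map correctly.
    col = (sx >> tw_shift) & layer["map_width_mask"]
    row = (sy >> th_shift) & layer["map_height_mask"]
    local_x = sx & ((1 << tw_shift) - 1)
    local_y = sy & ((1 << th_shift) - 1)
    tile_index = layer["tilemap"][row][col]
    return layer["tile_data"][tile_index][local_y][local_x]
```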

Scroll Control Registers Summary

Register Width Notes
H_SCROLL 16-bit signed Current H scroll offset — written by Copper
HSCROLL_STEP 3-bit 0=global, 1/2/4/8...=tile row grouping
V_scroll_file[0..N-1] 16-bit signed × N V scroll offsets — per column group
VSCROLL_STEP 3-bit 0=global, 1/2/4/8...=tile column grouping
VSCROLL_STATUS read-only Flags if column count exceeds N at current VSCROLL_STEP

Comparison with Mega Drive

Feature Mega Drive Ant64
H scroll Full / per-tile-row / per-line HSCROLL_STEP: 0=global, 1=per-row, N=per-N-rows
V scroll Full / per-2-tile-column (40-entry VSRAM) VSCROLL_STEP: 0=global, 1=per-col, N=per-N-cols
V scroll granularity Fixed per-2-columns Configurable — degrades gracefully, never disables
Tile size Fixed 8×8 Per-layer, power-of-two, non-square supported
Scroll source VRAM DMA Single register (H, Copper-driven) + register file (V)

HAM24 is a modern equivalent of the Amiga's Hold And Modify mode, designed for continuous-tone photographic content at 4K.

Mode Bits/pixel Direct palette Channel precision Fringing
OCS HAM6 6 16 entries 4-bit (16 steps) Very visible
AGA HAM8 8 64 entries 6-bit (64 steps) Noticeable in some content
Ant64 HAM24 12 1,024 entries 8-bit (256 steps) Essentially invisible
Raw 24bpp 24 n/a 8-bit None

Each 12-bit HAM24 pixel:

Bits Field Meaning
[11:10] Mode 00 = palette lookup, 01 = modify channel A, 10 = modify channel B, 11 = modify channel C
[9:0] Value 10-bit palette index (in palette mode) or 8-bit channel value + 2 padding bits

At 8-bit channel modify precision, fringing artefacts are 1/256th of the full channel range per pixel — essentially invisible at 4K viewing distances and imperceptible at 1080p. The 1,024-entry direct palette uses the standard palette descriptor system (see Section 16).
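
A sketch of a software HAM24 decoder for one scanline. Two details are assumptions, since the text does not fix them: the 8-bit channel value is taken from bits [9:2] (padding in the low two bits), and the held colour at the start of a line is supplied by the caller.

```python
def decode_ham24(pixels, palette, start=(0, 0, 0)):
    """Decode 12-bit HAM24 pixels into (A, B, C) channel triples."""
    a, b, c = start                     # assumed start-of-line colour
    out = []
    for p in pixels:
        mode = (p >> 10) & 0b11         # bits [11:10]
        value = p & 0x3FF               # bits [9:0]
        if mode == 0b00:
            a, b, c = palette[value]    # 10-bit palette lookup replaces all channels
        else:
            chan = value >> 2           # assumption: 8-bit value sits in bits [9:2]
            if mode == 0b01:
                a = chan                # modify channel A, hold B and C
            elif mode == 0b10:
                b = chan
            else:
                c = chan
        out.append((a, b, c))
    return out
```

A palette pixel followed by three modify pixels can reach any 24-bit colour, which is why sharp colour transitions take up to three pixels to settle.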

HAM24 Colour Space — RGB or YCbCr

A single colour space select bit in the layer register switches between two channel interpretations:

RGB mode (bit = 0) — channels A/B/C map to R/G/B. Natural for synthetic/generated content where the artist thinks in RGB. Simple, no conversion needed.

YCbCr mode (bit = 1) — channels A/B/C map to Y (luminance), Cb (blue chroma), Cr (red chroma). Better for photographic or video-derived content.

YCbCr is superior for photographic HAM content because human vision is far more sensitive to luminance errors than to chroma errors. This is the same principle exploited by JPEG, H.264, and most other image and video codecs, which typically store chroma at half resolution without perceptible quality loss. In RGB HAM, modifying G causes a large perceived luminance jump because G contributes ~72% of perceived brightness under the BT.709 weights below. In YCbCr HAM, luma is a dedicated channel, so chroma-only fringe pixels (wrong colour, correct brightness) are near-invisible.

Encoding follows BT.709 (the HD/UHD standard):

Y  =  0.2126·R + 0.7152·G + 0.0722·B
Cb = (B - Y) / 1.8556
Cr = (R - Y) / 1.5748

Cb and Cr are signed, stored as offset-binary (128 = zero chroma) matching standard video convention. The HAM decoder resolves the final YCbCr pixel value then passes it through a YCbCr→RGB converter (3 multiply-accumulates, a handful of DSP blocks) before the blender. Palette entries are always stored as RGBA — the YCbCr conversion is applied only to the resolved output pixel, not to the palette data itself.
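
A floating-point sketch of the BT.709 encode and decode using the equations above, with Cb and Cr stored offset-binary. Full-range 0–255 values are an assumption here (the text does not say whether limited range is used), and real hardware would use fixed-point DSP arithmetic.

```python
def clamp8(v):
    """Round and clamp to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, round(v)))


def rgb_to_ycbcr709(r, g, b):
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    cb = (b - y) / 1.8556
    cr = (r - y) / 1.5748
    return clamp8(y), clamp8(cb + 128), clamp8(cr + 128)   # 128 = zero chroma


def ycbcr709_to_rgb(y, cb, cr):
    cb, cr = cb - 128, cr - 128                            # back to signed chroma
    r = y + 1.5748 * cr                                    # inverse of the Cr equation
    b = y + 1.8556 * cb                                    # inverse of the Cb equation
    g = (y - 0.2126 * r - 0.0722 * b) / 0.7152             # solve the Y equation for G
    return clamp8(r), clamp8(g), clamp8(b)
```

With 8-bit storage the round trip is not exact; it typically lands within a step or two of the original value per channel.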

HAM24 is a per-layer property, switchable by the Copper mid-frame. The top half of the screen could be a HAM24 photograph; the bottom half a tilemap game area — exactly the kind of split-screen mode Amiga coders achieved via Copper, but at 4K.


16. Palette System

Architecture — Flat RAM with Descriptor Table

Palettes are implemented as two independent structures:

Flat palette RAM — a single array of RGBA32 entries. The only place colour data physically lives. Accessible via two address windows:

Base address + 0x00000:  RGBA access — reads/writes raw values directly
Base address + 0x10000:  HSVA access — hardware converts RGB↔HSV on the fly

Palette descriptor table — 256 entries, each containing a base offset into the flat RAM. The pixel lookup is:

final_colour = palette_RAM[palette_descriptor[palette_id].base + pixel_value]

The base offset is one adder in hardware — essentially free.

Flat Palette RAM Sizing

Base field Addressable entries RAM size at 32bpp Status
14 bits 16,384 64KB Current implementation
16 bits 65,536 256KB Reserved address space

The register layout reserves address space for 16-bit base offsets. The current implementation uses 14 bits (bits [15:14] of the base field are reserved/zero). No register map changes are needed to expand to full 16-bit in a future revision.

Palette Descriptor Entry (32 bits)

Bits Field Notes
[13:0] Base offset 14-bit index into flat palette RAM (current)
[15:14] Reserved Zero for now — expands base to 16-bit in future
[31:16] Reserved Future flags — wrap limit, mode bits, etc.

256 descriptors × 4 bytes = 1KB total for the descriptor table. Fits in a tiny BSRAM or distributed RAM.

Variable-Size Palettes

Because each palette is just a base offset, pixel depth determines how many entries are consumed — not any field in the descriptor:

Pixel depth Max pixel value Entries used from flat RAM
1bpp 1 2
2bpp 3 4
4bpp 15 16
6bpp 63 64
8bpp 255 256
HAM24 direct 1023 1,024

A 4-colour sprite occupies exactly 4 flat RAM entries. A 256-colour background occupies 256. Multiple palettes can alias — two descriptors with the same or overlapping base offsets share entries, which enables:

  • Shared transparency — all sprite palettes arranged so pixel value 0 always lands on the same flat RAM entry (the transparent colour)
  • Gradient windows — a long gradient stored once, multiple descriptors pointing at different 16-entry windows of it
  • Sprite recolouring — same pixel data BRAM, different palette IDs, instant team colour swaps
  • Palette animation — Copper writes a new base offset mid-frame, flipping which colour set is active for every object using that palette ID simultaneously
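
The lookup and aliasing behaviour can be modelled in a few lines. The class and method names are hypothetical, and the wrap at the end of the flat RAM is an assumption, since the text leaves out-of-range behaviour unspecified.

```python
class PaletteModel:
    """Toy model of the flat palette RAM plus the 256-entry descriptor table."""

    def __init__(self, entries=16384):          # 14-bit base field: 16,384 entries
        self.ram = [0] * entries                # RGBA32 values
        self.base = [0] * 256                   # descriptor base offsets

    def set_base(self, palette_id, base):
        self.base[palette_id] = base & 0x3FFF   # bits [15:14] are reserved/zero

    def lookup(self, palette_id, pixel_value):
        # Assumption: indices past the end wrap; the source does not specify this.
        return self.ram[(self.base[palette_id] + pixel_value) % len(self.ram)]
```

Pointing two descriptors at bases 0 and 16 of the same stored gradient gives two overlapping windows onto it, and rewriting a single base offset (as the Copper would mid-frame) re-colours every object that uses that palette ID.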

HSV Dual-Access Window

The flat palette RAM presents two address windows. Writes to the HSVA window convert H/S/V/A → R/G/B/A in hardware before storing. Reads from the HSVA window read R/G/B/A and convert to H/S/V/A before returning. The RAM always stores RGBA — HSV is a view, not a storage format.

HSV encoding — all components 0–255 mapped linearly:

  • H: 0=0°, 255=~359° (full hue circle, 1.41°/step)
  • S: 0=greyscale, 255=fully saturated
  • V: 0=black, 255=full brightness

This makes hue rotation arithmetic natural in integer registers — add a fixed value to H across a range of palette entries to shift the entire palette around the colour wheel. The Copper can do this per-scanline for animated colour cycling effects.
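
A software analogue of a hue rotation through the HSVA window, using Python's standard `colorsys` module in place of the conversion hardware. The 256-steps-per-circle hue scale follows the encoding above; the rounding behaviour is an assumption and may differ from the hardware's.

```python
import colorsys


def rgb_to_hsv255(r, g, b):
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 256) % 256, round(s * 255), round(v * 255)


def hsv255_to_rgb(h, s, v):
    r, g, b = colorsys.hsv_to_rgb(h / 256, s / 255, v / 255)
    return round(r * 255), round(g * 255), round(b * 255)


def rotate_hue(entries, delta):
    """Add delta to H (wrapping at 8 bits) for a list of (R, G, B, A) entries."""
    out = []
    for r, g, b, a in entries:
        h, s, v = rgb_to_hsv255(r, g, b)
        out.append(hsv255_to_rgb((h + delta) & 0xFF, s, v) + (a,))
    return out
```

Adding 85 to H moves pure red roughly a third of the way round the wheel, to near-pure green; greys are unaffected because their saturation is zero.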

Conversion hardware uses DSP blocks for the multiply operations (the GW5AT-138 has 298 DSP blocks — this costs a handful). RGB→HSV read latency is 1–2 extra cycles versus raw RGBA read, which is acceptable since palette reads are not in the display critical path.

Palette Assignment

Object type Palette ID source Notes
Sprite Palette ID field in sprite attribute table Per-sprite, from descriptor table
Tilemap tile Palette ID field in tile map entry Per-tile
HAM24 layer Palette ID field in layer register Per-layer, uses up to 1,024 entries
Global/background Layer register Solid colour or palette index 0
System sprites DeMon-owned palette IDs Protected from user writes

The Copper can write any palette descriptor's base offset or any flat RAM entry on any scanline, making per-scanline palette changes a standard zero-cost operation.


18. Hardware Ray Casting and BSP Acceleration

FireStorm includes dedicated hardware units to accelerate the core inner-loop operations of ray cast and BSP-style 3D rendering — the techniques behind Wolfenstein 3D, Doom, Quake, and their descendants. These workloads are characterised by tight, repetitive fixed-point arithmetic loops that are ideal for FPGA implementation.

18.1 Ray DDA Units

Ray casting shoots one ray per screen column, stepping through a grid cell by cell using a Digital Differential Analyser (DDA) algorithm until a solid cell is hit. Since every column's ray is independent, N parallel DDA units give N× throughput: N columns are cast at a time rather than one at a time on the EE.

Each DDA unit takes:

  • Ray origin — (x, y) in fixed-point 16.16 map coordinates
  • Ray direction — (dx, dy) normalised fixed-point
  • Map base address — pointer to the cell grid in Graphics SRAM or BSRAM

Each DDA unit outputs:

  • Hit distance — perpendicular wall distance in fixed-point (for column height calculation)
  • Hit cell coordinates — (cell_x, cell_y)
  • Hit face — N / S / E / W wall, for texture selection
  • Texture column offset — the fractional position along the hit wall face (0.0–1.0 fixed-point), directly indexing the texture column

The EE dispatches a batch of rays — one per screen column — to the DDA unit pool. Each unit processes its ray independently, stepping through the grid at full clock rate, and raises a completion flag or interrupt when done. With 8 parallel DDA units and a typical map depth of 10–20 steps per ray, all 480 columns of a Standard-resolution frame can be cast in well under a frame period.
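
What a single DDA unit computes can be sketched in software as follows. This is the textbook grid DDA in floating point rather than 16.16 fixed point. Assumptions: the map is enclosed by solid cells, Y grows southward (so a ray moving +X hits a cell's west face), and the returned distance is measured along the ray; the perpendicular distance is this value multiplied by the cosine of the angle between the ray and the view direction.

```python
import math


def cast_ray(grid, ox, oy, dx, dy, max_steps=64):
    """Step through grid[y][x] until a non-zero cell is hit.

    Returns (distance, (cell_x, cell_y), face, texture_u) or None.
    """
    cx, cy = math.floor(ox), math.floor(oy)
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray length needed to cross one whole cell on each axis.
    delta_x = abs(1.0 / dx) if dx else math.inf
    delta_y = abs(1.0 / dy) if dy else math.inf
    # Ray length to the first X and Y grid boundaries.
    t_max_x = math.inf if dx == 0 else ((cx + 1 - ox) if dx > 0 else (ox - cx)) * delta_x
    t_max_y = math.inf if dy == 0 else ((cy + 1 - oy) if dy > 0 else (oy - cy)) * delta_y
    for _ in range(max_steps):
        if t_max_x < t_max_y:                 # next boundary crossed is vertical
            t, t_max_x, cx = t_max_x, t_max_x + delta_x, cx + step_x
            face = "W" if step_x > 0 else "E"
            u = (oy + t * dy) % 1.0           # fractional position along the wall
        else:                                 # next boundary crossed is horizontal
            t, t_max_y, cy = t_max_y, t_max_y + delta_y, cy + step_y
            face = "N" if step_y > 0 else "S"
            u = (ox + t * dx) % 1.0
        if grid[cy][cx]:
            return t, (cx, cy), face, u
    return None
```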

Graphics SRAM cell map packing: At 4 bits per cell (16 cell types), a 9-wide corridor row fits in one SRAM word. At 8 bits per cell (256 types), 4 cells per word with bits to spare for door states, floor/ceiling type, or lighting zones. The DDA stepping pattern is not purely sequential but benefits from the SRAM's 1-cycle random access latency — each grid step is a new address delivered in one cycle with no page miss.

18.2 Fixed-Point Reciprocal Unit

The fundamental operation converting ray hit distance to screen column height is:

column_height = PROJECTION_PLANE_DISTANCE / hit_distance

This is a divide, or equivalently a reciprocal followed by a multiply. The EE's divide is 16–32 cycles. A dedicated fixed-point reciprocal unit computes 1/x in 2–4 cycles using a Newton-Raphson refinement stage seeded from a lookup table. One reciprocal per screen column × 480 columns = 480 reciprocals per frame. At 2–4 cycles each versus 16–32 on the EE, the column height pass is roughly 4–16× faster, depending on which end of each range applies.
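
A sketch of a table-seeded Newton-Raphson reciprocal in 16.16 fixed point. The 64-entry seed table, the normalisation scheme, and the two iterations are illustrative assumptions, not the FireStorm design; the refinement step is the standard x ← x·(2 − d·x).

```python
FRAC = 16                      # 16.16 fixed point
ONE = 1 << FRAC
LUT_BITS = 6                   # 64-entry seed table (size is an assumption)
# Seed: 1/m for mantissa m in [0.5, 1.0), sampled at each interval's midpoint.
SEED = [round(ONE * (2 << LUT_BITS) / ((1 << LUT_BITS) + i + 0.5))
        for i in range(1 << LUT_BITS)]


def reciprocal(d, iterations=2):
    """Approximate 1/d for a positive 16.16 value d; the result is also 16.16."""
    if d <= 0:
        raise ValueError("d must be positive")
    k = d.bit_length()
    # Normalise so the mantissa m lies in [0.5, 1.0) as a 0.16 fraction.
    m = d << (16 - k) if k <= 16 else d >> (k - 16)
    x = SEED[(m >> (15 - LUT_BITS)) - (1 << LUT_BITS)]
    for _ in range(iterations):
        x = (x * (2 * ONE - ((m * x) >> FRAC))) >> FRAC   # x <- x * (2 - m*x)
    # Undo the normalisation: 1/d = (1/m) * 2^(16 - k).
    return x << (16 - k) if k <= 16 else x >> (k - 16)


def column_height(projection_plane_distance, hit_distance):
    """Both arguments and the result are 16.16 fixed point."""
    return (projection_plane_distance * reciprocal(hit_distance)) >> FRAC
```

Each iteration roughly squares the relative error, so a seed accurate to under 1% reaches the precision of the 16.16 format within two steps.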

The same unit accelerates BSP plane tests (see below) and any other workload requiring fast division — perspective-correct texture mapping (divide U and V by W per pixel), fog intensity falloff, and lighting attenuation all use reciprocals.

18.3 Dot Product / Half-Plane Test Unit

BSP traversal requires a half-plane test at every node: which side of a partition plane is the viewpoint on? This reduces to:

side = sign( plane_normal · (viewpoint - plane_point) )

A dot product unit computes this in 2–3 cycles — one cycle per multiply-accumulate for a 2D or 3D dot product, plus the sign extraction. The result is a single bit (front or back) plus the full signed value for soft cases (on-plane tolerance).

The same unit serves:

  • BSP traversal — which child to visit first
  • Frustum culling — is a BSP node's bounding box inside the view frustum?
  • Polygon backface culling — dot product of face normal with view direction
  • Lighting — dot product of surface normal with light direction for diffuse intensity
  • Collision response — dot product of velocity with surface normal for reflection

18.4 BSP Traversal Engine

The BSP traversal engine walks a BSP tree autonomously given a viewpoint, outputting leaf sector or subsector IDs in front-to-back (or back-to-front) order via a small hardware FIFO. The EE reads from the FIFO to get the rendering order without implementing the recursion itself.

Operation:

  1. EE writes viewpoint (x, y, z) and BSP tree root address to the engine registers
  2. Engine begins traversal — at each node, uses the dot product unit to determine which side the viewpoint is on, pushes the back subtree onto an internal stack, descends into the front subtree first
  3. Each reached leaf sector is output to the FIFO
  4. Engine raises interrupt when traversal is complete or FIFO reaches a threshold
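
The traversal steps above can be sketched with an explicit stack, as a hardware engine would use in place of recursion. The node layout (dictionaries with `x`, `y`, `dx`, `dy`, `front`, `back`) and the sign convention for "front" are hypothetical choices for this illustration.

```python
def point_side(node, vx, vy):
    """Signed side of the viewpoint relative to the node's partition line."""
    # 2D cross product of the partition direction with (viewpoint - line origin).
    return (vx - node["x"]) * node["dy"] - (vy - node["y"]) * node["dx"]


def front_to_back(nodes, root, vx, vy):
    """Return leaf IDs nearest-first. Child refs are ("node", i) or ("leaf", id)."""
    order, stack = [], [root]
    while stack:
        kind, ref = stack.pop()
        if kind == "leaf":
            order.append(ref)           # stands in for a write to the output FIFO
            continue
        node = nodes[ref]
        if point_side(node, vx, vy) >= 0:
            near, far = node["front"], node["back"]
        else:
            near, far = node["back"], node["front"]
        stack.append(far)               # visited after the whole near subtree
        stack.append(near)
    return order
```

Swapping the two `stack.append` calls gives back-to-front order for painter's-algorithm rendering.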

BSP node packing in Graphics SRAM: A 2D BSP node (Doom-style) requires:

  • Partition line: x, y, dx, dy (4 × 16-bit = 64 bits)
  • Right child pointer, left child pointer (2 × 16-bit = 32 bits)
  • Bounding boxes (optional, 4 × 16-bit per side = 128 bits)

Without bounding boxes, a compact 2D BSP node is 96 bits and fits in 3 SRAM words (108 bits), leaving 12 bits to spare for flags, sector type, and lighting data. A 1024-node BSP tree occupies 3,072 words (about 13.5KB at 36 bits per word) of Graphics SRAM, leaving the vast majority available for textures and cell maps.

18.5 Combined Rendering Pipeline

Ray casting and BSP acceleration work together for a Doom-style renderer:

Frame render sequence:

1. EE writes viewpoint to BSP traversal engine
2. BSP engine traverses tree → outputs visible sector list to FIFO
   (EE free to do other work during traversal)

3. EE reads sector list from FIFO
4. For each visible sector, EE dispatches column rays to DDA units
   (multiple sectors in flight simultaneously across DDA unit pool)

5. DDA units return hit distances, faces, texture offsets
6. Reciprocal unit converts distances to column heights
7. Blitter draws textured column spans to bitmap layer
   (TEX_GFXSRAM for texture atlas lookup)

8. Scanline mixer composites bitmap layer with hardware sprite layer
   (sprites for items, enemies, pickups — hardware sprites with zero blitter cost)

The hardware accelerators handle the pure arithmetic. The EE handles the scheduling and control flow. The blitter handles the pixel fill. The scanline mixer composites everything. Each unit does its job while the others run in parallel.

18.7 Height-Field Voxel Acceleration (Comanche / Delta Force Style)

Height-field voxel rendering casts rays across a 2D height map — each ray steps forward at ground level, sampling the height value at each grid position. When the sampled height projects higher on screen than the current column's drawn horizon, a vertical span is drawn upward to the new projected height. Rays step front-to-back; the column fills upward as taller features are found.

The inner loop per step:

  1. Step ray forward (DDA — same unit as wall ray caster)
  2. Sample height map at current (x,y) — one Graphics SRAM read
  3. Project height to screen: screen_y = horizon - (height - camera_z) × scale / distance — one reciprocal
  4. Compare projected height against current column top — if higher, draw span
  5. Sample colour/texture at (x,y) for the span — second Graphics SRAM read

Steps 1–4 repeat for every step along the ray, typically 200–500 steps per column at Comanche-era quality; step 5 runs only when a span is actually drawn. The entire column loop runs in hardware with no EE involvement per step.
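
The inner loop for one screen column, as a software sketch. It is simplified in two ways that are assumptions of this example: the ray advances in unit-length steps rather than through the DDA unit, and map coordinates wrap at the edges. Parameter names are hypothetical.

```python
import math


def render_column(height_map, colour_map, ox, oy, dx, dy,
                  camera_z, horizon, scale, screen_h, max_steps=300):
    """Return a list of screen_h colours (None = not drawn) for one column."""
    size_y, size_x = len(height_map), len(height_map[0])
    column = [None] * screen_h
    top = screen_h                        # current horizon; screen Y grows downward
    for step in range(1, max_steps + 1):
        dist = float(step)                # unit steps along the ray (simplification)
        x = math.floor(ox + dx * dist) % size_x
        y = math.floor(oy + dy * dist) % size_y
        height = height_map[y][x]         # height map read
        screen_y = max(0, math.floor(horizon - (height - camera_z) * scale / dist))
        if screen_y < top:                # projects above the current horizon
            colour = colour_map[y][x]     # colour read only when a span is drawn
            for row in range(screen_y, top):
                column[row] = colour
            top = screen_y
            if top == 0:
                break                     # column is full, stop early
    return column
```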

Height Map Sampler unit

A dedicated height map sampler takes a 2D (x,y) address in fixed-point and returns the height value and colour value in one pipeline operation. The Graphics SRAM holds the height map and colour map in adjacent banks.

Graphics SRAM height map packing: Multiple height values pack per SRAM word, e.g. 4 height values at 9-bit resolution per word. A 512×512 height map then occupies 64K words (~288KB); a 512×512 colour map alongside it stays well within 4.5MB. A 1024×1024 map uses DDR3 backing for the colour data with Graphics SRAM holding the active working region.

Span renderer

A vertical span renderer draws the filled column segments generated by the height-field scan, working from the hit (x, screen_y_top, screen_y_bottom, colour) tuples into the bitmap layer. One span write per height step that beats the current horizon — typically far fewer than the DDA steps, since only the first visible feature per height zone draws.

Column state registers: Each column maintains a "current horizon" register — the highest screen Y drawn so far. The height-field DDA unit reads and writes these per column as it steps. A bank of 480 horizon registers (one per Standard-resolution column) lives in BSRAM.

Throughput example: 480-column frame at 300 steps/column

480 columns × 300 steps = 144,000 DDA steps per frame
Each step: 1 SRAM read (height), 1 reciprocal, 1 compare
At 8 DDA units × 200MHz = ~1,600 million steps/second
144,000 steps / 1,600M = ~0.09ms — well under frame budget

The height-field renderer is among the cheapest 3D techniques in terms of hardware cost — the DDA and reciprocal units from wall ray casting handle it almost entirely, with only the height map sampler and span renderer as additions.


18.8 3D Voxel Grid Acceleration (Dense Grid / Minecraft Style)

3D voxel DDA extends the 2D wall ray caster to three axes. Each step through the grid determines which axis face (X, Y, or Z plane) the ray crosses next, which means selecting the minimum of three boundary distances per step rather than two.

3D DDA unit — extends the 2D DDA unit with a Z axis:

  • Input: ray origin (x,y,z), ray direction (dx,dy,dz), voxel grid base address
  • Per step: three t_max values (tMaxX, tMaxY, tMaxZ) — the ray parameter at which the next X, Y, or Z grid boundary is crossed. The smallest wins and the ray advances to that face
  • Output: hit voxel coordinates (cx,cy,cz), hit face (±X/±Y/±Z), distance

The EE writes the ray origin and direction to the unit; the 3D DDA unit then steps autonomously until a solid voxel is hit and returns the result.
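
A floating-point sketch of the three-axis stepping loop (the textbook grid-traversal method, not the hardware's fixed-point datapath). The `is_solid` callback and the face naming (a ray moving +X enters a voxel through its -X face) are assumptions of this example.

```python
import math


def cast_ray_3d(is_solid, origin, direction, max_steps=256):
    """Return (voxel, face, distance) for the first solid voxel hit, or None."""
    cell = [math.floor(c) for c in origin]
    step = [1 if d > 0 else -1 for d in direction]
    t_delta = [abs(1.0 / d) if d else math.inf for d in direction]
    t_max = []                                   # tMaxX, tMaxY, tMaxZ
    for o, c, d, td in zip(origin, cell, direction, t_delta):
        if d > 0:
            t_max.append((c + 1 - o) * td)
        elif d < 0:
            t_max.append((o - c) * td)
        else:
            t_max.append(math.inf)               # never crosses this axis
    for _ in range(max_steps):
        axis = t_max.index(min(t_max))           # smallest tMax wins
        t = t_max[axis]
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        if is_solid(*cell):
            face = ("-" if step[axis] > 0 else "+") + "XYZ"[axis]
            return tuple(cell), face, t
    return None
```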

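Graphics SRAM grid packing: A 64³ grid (262,144 voxels) at 4 bits per voxel packs 9 voxels per word and occupies ~29,000 words (~128KB); at 8 bits per voxel, 4 voxels per word, it occupies 64K words (~288KB). A 128³ grid at 8 bits needs 512K words (~2.25MB), half of the 4.5MB Graphics SRAM, so use 4-bit types or back with DDR3 for larger grids.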


18.9 Sparse Voxel Octree (SVO) Acceleration

Sparse voxel octrees skip empty space efficiently — the tree subdivides space into octants, and empty subtrees are single null pointers rather than arrays of empty voxels. High-resolution detailed voxel scenes (millions of voxels) that would be impractical as dense grids are tractable as SVOs.

The key operation per node: slab test (ray vs AABB)

At each octree node, the ray is tested against the node's axis-aligned bounding box:

t0   = (box_min - ray_origin) / ray_dir      ← per axis
t1   = (box_max - ray_origin) / ray_dir      ← per axis
tmin = max over axes of min(t0, t1)          ← entry distance
tmax = min over axes of max(t0, t1)          ← exit distance
hit  = (tmin <= tmax) and (tmax > 0)

Each of the three axis divisions is a subtract and a reciprocal-multiply. A slab test unit performs all three axis tests in parallel, delivering tmin, tmax, and hit in 3–4 cycles.
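
The slab test in executable form, taking a precomputed reciprocal direction as the hardware unit would. The per-axis swap handles negative ray directions. Exactly-zero direction components are assumed to be avoided by the caller, since they need special handling that this sketch omits.

```python
import math


def slab_test(box_min, box_max, origin, inv_dir):
    """Ray vs AABB. inv_dir holds 1/ray_dir per axis. Returns (tmin, tmax, hit)."""
    tmin, tmax = -math.inf, math.inf
    for lo, hi, o, inv in zip(box_min, box_max, origin, inv_dir):
        t0 = (lo - o) * inv          # subtract, then reciprocal-multiply
        t1 = (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0          # negative direction: near and far planes swap
        tmin = max(tmin, t0)         # latest entry across the three slabs
        tmax = min(tmax, t1)         # earliest exit across the three slabs
    return tmin, tmax, (tmin <= tmax and tmax > 0)
```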

Octant ordering: Given tmin per axis, the octant entry order is determined by sorting the three axis entry distances — which axis face is crossed first, second, third. This determines the order in which child octants are visited, ensuring front-to-back traversal for early termination.

SVO Traversal Engine

Analogous to the BSP traversal engine, the SVO traversal engine walks the octree autonomously:

  1. EE writes ray origin, direction, and SVO root address
  2. Engine performs slab test at each node using the slab test unit
  3. On hit: if leaf, output voxel hit (address, face, distance) to FIFO; if branch, push far children to stack, descend near child
  4. On miss: pop next entry from stack
  5. Raises interrupt when ray terminates (first solid hit or tree exhausted)

Multiple SVO engines running in parallel cast multiple rays simultaneously — one per screen column for full-frame rendering, or one per shadow ray for lighting.

SVO node packing: Each node contains a child presence mask (8 bits), a child pointer (20 bits), a leaf flag (1 bit), and colour/material data (7 bits) — 36 bits total. One node per Graphics SRAM word, no padding.

A 16K-node tree (sufficient for a detailed scene) occupies 16K words, about 72KB of Graphics SRAM. Larger trees spill to FireStorm DDR3 with Graphics SRAM serving as a node cache for the active ray front.
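
A sketch of packing and unpacking one node into a 36-bit word. The field widths come from the text; the bit positions chosen here (mask in the top bits, material in the bottom) are an assumption for illustration.

```python
def pack_svo_node(child_mask, child_ptr, is_leaf, material):
    """Pack 8 + 20 + 1 + 7 bits into one 36-bit word (field order is assumed)."""
    assert 0 <= child_mask < (1 << 8)
    assert 0 <= child_ptr < (1 << 20)
    assert is_leaf in (0, 1)
    assert 0 <= material < (1 << 7)
    return (child_mask << 28) | (child_ptr << 8) | (is_leaf << 7) | material


def unpack_svo_node(word):
    """Inverse of pack_svo_node: (child_mask, child_ptr, is_leaf, material)."""
    return (word >> 28) & 0xFF, (word >> 8) & 0xFFFFF, (word >> 7) & 1, word & 0x7F


def words_to_bytes(words):
    """Graphics SRAM footprint of a number of 36-bit words, in bytes."""
    return words * 36 // 8
```

A 20-bit child pointer addresses up to 1,048,576 node words.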


18.10 Shared Voxel / Ray Cast Resources

Hardware unit Wall ray cast Height-field 3D grid SVO
DDA unit (2D) ✓ primary ✓ primary
DDA unit (3D) ✓ primary
Reciprocal unit ✓ column height ✓ height projection ✓ column height ✓ slab test
Dot product unit BSP plane test
Slab test unit frustum cull ✓ primary
Height map sampler ✓ primary
Span renderer ✓ primary
BSP traversal engine ✓ sector order
SVO traversal engine ✓ primary
Graphics SRAM cell map height/colour map voxel grid node cache

All units dispatch work via EE register writes and signal completion via interrupt or FIFO. The EE schedules multiple engines simultaneously — a BSP sector walk can happen in parallel with DDA column casting, and SVO traversal for one scene region can overlap height-field sampling for another. The FPGA's task scheduler and parallel execution model make fine-grained overlap natural.


18.11 Renderer Style Capability

Renderer style Key operations FireStorm hardware path
Wolfenstein 3D 2D DDA per column, flat walls DDA units, reciprocal unit, blitter column fill
Doom BSP sector order, ray per column seg, textured walls BSP engine, DDA units, reciprocal unit, blitter textured span fill
Quake (software) BSP + PVS, affine texture mapping BSP engine, dot product unit, blitter affine textured spans
Height-field voxel 2D DDA + height map sample + span draw DDA units, height map sampler, reciprocal unit, span renderer
Dense 3D voxel 3D DDA per ray 3D DDA units, reciprocal unit, blitter face fill
Sparse voxel octree Slab test per node, front-to-back traversal SVO traversal engine, slab test unit, reciprocal unit
Isometric tilemap Scanline mapper, diamond spans, depth buffer Isometric scanline mapper, diamond span unit, dot product unit, depth buffer, blitter span fill
Isometric sprites World→screen transform, depth test Dot product unit, blitter masked sprite, depth buffer
Shadow maps Depth buffer render + depth compare Depth comparator, coord transform, dot product unit
SSAO / contact shadows / SSR Screen-space DDA + depth samples DDA units (reused), blitter memory unit
BVH + shadow rays BVH traversal + depth compare SVO engine (parameterised), slab test unit
Ray-triangle intersection Möller–Trumbore Cross product unit, dot product unit, reciprocal unit
Custom voxel / hybrid Mix of above Any combination dispatched by EE

The 480×270 Standard resolution is a natural fit for all of these renderers: it is close to the internal resolutions period engines actually ran at (typically 320×200) and is scaled up to the output resolution. At Standard resolution the DDA unit pool works through all 480 columns in batches, one ray per unit at a time; the blitter draws textured spans into the bitmap layer; the scanline mixer composites hardware sprites on top at zero additional render cost.


19. Ray Trace Acceleration and Shadow Functions

Full path tracing is not a realistic target for FPGA hardware at this scale. What is realistic — and sufficient to produce convincingly lit scenes — is a layered set of targeted accelerators, each adding shadow and lighting quality at incremental hardware cost. The following are ordered from cheapest to most capable.

19.1 Shadow Maps — Minimal New Hardware

Shadow mapping is a two-pass technique: render the scene from the light source's point of view into a depth buffer, then during the main render compare each pixel's light-space depth against the stored value. The comparison produces a single shadow/lit bit per pixel.

New hardware required: a depth comparator and a coordinate transform (world position → light space). The coordinate transform is a 4×4 matrix multiply — four dot products per pixel, using the existing dot product unit. The depth comparator is a subtract and sign check — trivial.

Depth buffer storage: a Standard-resolution depth buffer at 16-bit precision is 480×270×2 = ~253KB, fitting comfortably in Graphics SRAM. A separate depth buffer is maintained for each active light source. The shadow map render pass is a standard blitter job targeting the depth buffer rather than a colour bitmap.

Percentage-closer filtering (PCF): instead of a single depth comparison, sample several neighbouring texels and average the results, producing soft shadow edges whose smoothness increases with sample count. The blitter's memory unit performs the multiple samples as sequential reads; the EE accumulates and averages. No additional hardware needed.
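
A sketch of the PCF accumulation the EE would perform over the blitter's samples. The square kernel, the edge clamping, and the `bias` parameter are assumptions of this example.

```python
def pcf_shadow(depth_map, u, v, pixel_depth, radius=1, bias=0.0):
    """Return the lit fraction (0.0 = fully shadowed, 1.0 = fully lit)."""
    h, w = len(depth_map), len(depth_map[0])
    lit = total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x = min(max(u + dx, 0), w - 1)       # clamp samples at the map edge
            y = min(max(v + dy, 0), h - 1)
            # Lit if nothing nearer to the light was recorded at this texel.
            lit += (pixel_depth - bias) <= depth_map[y][x]
            total += 1
    return lit / total
```

With `radius=0` this reduces to the single hard shadow comparison; `radius=1` averages 9 samples.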

Result: proper directional shadows with controllable edge softness, using the blitter's existing render-pass model, at almost no additional silicon cost.

19.2 Screen-Space Techniques — Zero New Hardware

The DDA units designed for ray casting and voxel stepping are directly applicable to screen-space ray marching. Step a ray along the screen-space depth buffer, sample at each step, detect intersection. No new hardware — new uses of existing units.

Screen-space ambient occlusion (SSAO): sample depth values in a hemisphere around each pixel's world-space position. Count how many samples are occluded by nearby geometry — the ratio approximates how much ambient light reaches that point. Conventionally expensive; on FireStorm the depth buffer samples are burst reads from Graphics SRAM and the accumulation runs on the EE.

Contact shadows: very short ray marches (8–16 steps) along the screen-space depth buffer near geometry edges. Produces convincing shadows where surfaces meet — cracks, corners, where objects rest on floors. Extremely cheap, high visual impact. The DDA units step through screen space; the blitter composites the contact shadow mask over the main render.

Screen-space reflections (SSR): march a reflection ray along the depth buffer until it hits geometry, then sample the colour buffer at the hit point. Convincing for flat reflective surfaces (floors, wet ground, metal). Same DDA units, same depth buffer, different ray direction.

All three techniques share a single camera-view depth buffer, written alongside the main render pass rather than taken from the light-view shadow map, so the buffer itself adds no extra render cost.

19.3 BVH Traversal — Minimal New Hardware

A Bounding Volume Hierarchy is structurally identical to a sparse voxel octree for traversal purposes — both walk a tree using slab tests (ray vs AABB) at each node. The SVO traversal engine (Section 18.9) can be parameterised to handle BVH nodes with minimal additional logic — the node format changes, the traversal algorithm does not.

With BVH traversal, shadow rays for polygon geometry become feasible: trace one ray from each lit surface point toward each light source, test it against the BVH, output shadow/lit. This gives hard per-pixel ray-tested shadows for polygon scenes at a fraction of full path tracing cost.

BVH node format: partition plane or AABB bounds, left/right child pointers, leaf flag, primitive index. Fits in Graphics SRAM for scenes with up to ~16K nodes.

19.4 Ray-Triangle Intersection — Small New Hardware

The Möller–Trumbore algorithm: compute two edge vectors, two cross products, four dot products, a reciprocal, and bounds checks. The dot products and reciprocal reuse existing units. The one new operation is the cross product: six multiplies and three subtracts, approximately 8–10 DSP blocks.
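
For reference, a floating-point sketch of the textbook Möller–Trumbore test, showing where each cross product, dot product, and the reciprocal occur. The function and helper names are hypothetical; a hardware version would use fixed-point arithmetic.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(orig, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit in front of the origin, else None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)        # two edge vectors
    p = cross(direction, e2)                 # first cross product
    det = dot(e1, p)
    if abs(det) < eps:
        return None                          # ray parallel to the triangle plane
    inv_det = 1.0 / det                      # the single reciprocal
    tvec = sub(orig, v0)
    u = dot(tvec, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None                          # outside the first barycentric bound
    q = cross(tvec, e1)                      # second cross product
    v = dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return (t, u, v) if t > eps else None
```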

With a cross product unit added, the hardware can perform full ray-triangle intersection. Combined with BVH traversal (Section 19.3), this gives:

  • Hard shadows for polygon geometry via shadow rays
  • Primary ray casting for polygon scenes (not just DDA grid)
  • Reflection rays against polygon geometry

One intersection test per 4–6 cycles. A pool of parallel intersection units (parameterised at HDL build time) gives proportional throughput scaling.

19.5 Recommended Addition Order

Stage New hardware Technique enabled Visual result
1 Depth comparator + coord transform (~5 DSP) Shadow maps Proper directional shadows
2 None SSAO, contact shadows, SSR Ambient occlusion, reflections
3 Parameterise SVO engine for BVH BVH traversal Hard ray-tested shadows for polygons
4 Cross product unit (~10 DSP) Ray-triangle intersection Full polygon shadow rays, reflection rays

Stages 1 and 2 together — shadow maps plus screen-space techniques — produce the combination most people perceive as "ray traced looking" without any ray-triangle intersection hardware. Shadow maps give the directional shadows; SSAO gives the contact darkening and ambient occlusion that makes geometry feel grounded; contact shadows fill in the fine detail at surface intersections. The visual step from this combination to full ray traced shadows is smaller than the visual step from no shadows to this combination.

Stages 3 and 4 are the path toward a proper hybrid rasterise-and-raytrace pipeline — rasterise the primary view, cast shadow and reflection rays against a BVH for secondary lighting. Not a real-time path tracer, but convincingly lit geometry within the constraints of the hardware.

19.6 EE and Blitter Role

The EE schedules all shadow and ray passes as blitter jobs, exactly as it does for the main render:

Frame render with shadows:

Job 1: Shadow map pass    — render scene depth from light POV → depth buffer
Job 2: Main render pass   — render scene colour → world bitmap
Job 3: SSAO pass          — DDA screen-space samples → AO mask
Job 4: Contact shadows    — short DDA marches → contact shadow mask
Job 5: Composite          — world bitmap × AO mask × contact mask → output

→ Jobs 1 and 2 are sequentially dependent (shadow map before main render)
→ Jobs 3 and 4 can run in parallel (both read the same depth buffer, write different masks)
→ Job 5 depends on Jobs 2, 3, 4

The depth comparator and coordinate transform run as part of Job 2 — each pixel's shadow test is a per-pixel operation during the main render, not a separate pass. The blitter performs the comparison as it writes each pixel's colour value.


20. Isometric Rendering System

FireStorm includes dedicated acceleration for isometric tilemaps and sprites — the rendering approach behind Populous, Syndicate, Theme Hospital, and RollerCoaster Tycoon. The system is designed to eliminate the per-frame CPU overhead that constrained those games on period hardware, and to handle depth ordering automatically via a hardware depth buffer rather than requiring software painter's algorithm sorting.

20.1 The Isometric Transform

Every object in the isometric world — tile or sprite — has a world-space position (world_x, world_y, world_z). The screen position is a fixed linear transform:

screen_x = (world_x - world_y) × tile_half_width  + scroll_x
screen_y = (world_x + world_y) × tile_half_height - world_z × height_scale + scroll_y
depth_z  =  world_x + world_y  - world_z

This is handled by the existing dot product unit — three dot products and two additions per object, 2–3 cycles. The same transform positions both tiles and sprites. The FPGA computes screen positions for an entire row of tiles in a handful of cycles.
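
As a software reference for the three equations above, a minimal Python sketch follows. The function name and the example tile size (64×32 pixels, so half-width 32 and half-height 16) are assumptions chosen for illustration, not values from the hardware.

```python
def iso_transform(wx, wy, wz, half_w, half_h, height_scale,
                  scroll_x=0, scroll_y=0):
    """World position -> (screen_x, screen_y, depth_z), as in the equations above."""
    screen_x = (wx - wy) * half_w + scroll_x
    screen_y = (wx + wy) * half_h - wz * height_scale + scroll_y
    depth_z = wx + wy - wz
    return screen_x, screen_y, depth_z

# Ground-level tile at (3, 1) with an assumed 64x32 tile
print(iso_transform(3, 1, 0, 32, 16, 16))   # (64, 64, 4)
# Same tile position raised two height units
print(iso_transform(3, 1, 2, 32, 16, 16))   # (64, 32, 2)
```

Scrolling only changes the scroll_x and scroll_y terms, so the per-object products can be reused unchanged from frame to frame.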

Scrolling is an increment to scroll_x and scroll_y — no repaint of unchanged tiles, no full-screen redraw. Only tiles entering or leaving the visible region require new data.

20.2 Isometric Scanline Mapper

The isometric scanline mapper is a small state machine that, given a screen Y coordinate, returns the set of (world_x, world_y) tile positions whose diamonds intersect that scanline. This is the inverse of the isometric transform — instead of tile → screen, it computes screen_y → tile list.

The mapper walks the isometric grid scanline by scanline, analogous to the DDA ray caster walking a grid cell by cell. For each output scanline it produces a compact list of (tile_col, tile_row, span_left, span_right) tuples — the tiles visible on that line and the horizontal pixel extent of each diamond.

This replaces the classic tile-by-tile painter's algorithm with a scanline-coherent render that only processes tiles actually contributing pixels to each line. No invisible tiles are touched.

20.3 Diamond Span Renderer

Each tile contributes a horizontal span of pixels to each scanline it covers — the width of the diamond narrows toward the top and bottom of the tile. The diamond span unit computes left_x and right_x for a given tile at a given scanline Y from the tile's screen position and size. This is a small combinational unit — a handful of additions and shifts, no DSP blocks.

The span coordinates feed directly into the blitter as a textured span fill job: fetch the tile's pixel row from the tile texture at the correct V coordinate, write it to the output bitmap between left_x and right_x. The blitter's existing masked textured span primitive handles this exactly.
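
The scanline mapper (20.2) and the diamond span unit (20.3) can be modelled together in software. The sketch below is a brute-force Python model, not the hardware state machine. It assumes flat tiles (world_z = 0), that the transformed screen position is the centre of the diamond, and that span_right is exclusive. The function name is hypothetical.

```python
def tiles_on_scanline(screen_y, map_w, map_h, half_w, half_h,
                      scroll_x=0, scroll_y=0):
    """Return (tile_col, tile_row, span_left, span_right) for every flat
    tile whose diamond covers scanline screen_y."""
    out = []
    # All tiles on one diagonal (col + row == s) share the same centre Y.
    for s in range(map_w + map_h - 1):
        dy = screen_y - (s * half_h + scroll_y)
        if abs(dy) >= half_h:
            continue  # diamonds on this diagonal do not reach the scanline
        # The diamond half-width shrinks linearly away from its centre row.
        hw = half_w * (half_h - abs(dy)) // half_h
        for col in range(max(0, s - map_h + 1), min(map_w - 1, s) + 1):
            row = s - col
            cx = (col - row) * half_w + scroll_x
            out.append((col, row, cx - hw, cx + hw))
    return out

# 2x2 map of assumed 64x32 tiles, scanline 8: three diamonds tile the line
print(tiles_on_scanline(8, 2, 2, 32, 16))
# [(0, 0, -16, 16), (0, 1, -48, -16), (1, 0, 16, 48)]
```

Note that the spans abut without gaps or overlap, which is what lets each span go to the blitter as an independent textured fill.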

20.4 Depth Buffer

The depth buffer eliminates painter's algorithm sorting entirely. Each pixel drawn to the isometric bitmap layer carries a Z value — the depth_z from the isometric transform. The blitter tests each incoming pixel against the stored depth; closer pixels write, further pixels are discarded.

Buffer specification:

Parameter Value
Format 16-bit per pixel
Resolution Matches output bitmap (e.g. 480×270 Standard)
Storage Graphics SRAM (~253KB)
Precision 16 bits — more than sufficient for isometric scene depths
Shared with Shadow map depth buffer (Section 19) — same buffer, different frames or separate regions

Depth buffer flags on blit job descriptor:

Flag Meaning
BLT_DEPTH_WRITE Write depth_z to depth buffer when writing a pixel
BLT_DEPTH_TEST Discard pixel if depth_z ≥ stored depth at that pixel position
BLT_DEPTH_CLEAR Fill job targeting depth buffer with maximum Z — used at frame start

These flags work on any blitter primitive — tile span fills, sprite blits, shape fills — making the depth buffer available to the full primitive set.
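
A minimal software model of the per-pixel behaviour these flags describe is sketched below. The bit values and function names are hypothetical (the real descriptor encoding is not given here); only the test and write rules come from the table above.

```python
# Hypothetical bit assignments, for illustration only.
BLT_DEPTH_WRITE = 0x1
BLT_DEPTH_TEST = 0x2
MAX_Z = 0xFFFF  # 16-bit depth buffer cleared to maximum Z

def depth_clear(depth):
    """Model of a BLT_DEPTH_CLEAR fill job."""
    for i in range(len(depth)):
        depth[i] = MAX_Z

def blit_pixel(colour, depth, idx, src_colour, depth_z, flags):
    """Write one pixel. Returns True if the colour was written."""
    if (flags & BLT_DEPTH_TEST) and depth_z >= depth[idx]:
        return False                      # further away or equal: discard
    colour[idx] = src_colour
    if flags & BLT_DEPTH_WRITE:
        depth[idx] = depth_z
    return True

colour, depth = [None], [0]
depth_clear(depth)
blit_pixel(colour, depth, 0, "tile", 10, BLT_DEPTH_WRITE)
blit_pixel(colour, depth, 0, "far sprite", 12, BLT_DEPTH_TEST | BLT_DEPTH_WRITE)
print(colour[0], depth[0])  # tile 10
```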

What the depth buffer eliminates:

  • Unified draw list sorting for opaque objects — no longer needed
  • Split-sprite calculation for tall buildings occluding sprites — depth buffer arbitrates per-pixel automatically
  • Stencil mask pass for occlusion — depth test replaces it
  • Most EE overhead managing draw order

Transparency: semi-transparent objects (windows, foliage, rain, shadows) use BLT_DEPTH_TEST without BLT_DEPTH_WRITE and draw after all opaque geometry. Transparent objects still need painter's order among themselves, but this is a small fraction of scene objects.

20.5 Hardware Sprite Layer — No Depth Buffer

The hardware sprite layer composites sprites at the output pixel clock in the scanline mixer — a real-time pipeline operating against a live display stream. Depth buffer testing would require a Graphics SRAM read per sprite pixel synchronised to the pixel clock, which is architecturally incompatible with the scanline pipeline's timing and latency constraints.

More fundamentally, the hardware sprite layer's purpose is guaranteed-foreground compositing — the cursor, system overlays, DeMon's reserved region, HUD elements. These are precisely the objects that should always appear on top, regardless of scene depth. Depth testing would be counterproductive for this use case.

The division of labour is therefore natural:

Layer | Depth buffer | Best for
Hardware sprite layer | No | Cursor, system overlays, HUD, always-foreground UI
Blitter bitmap layer | Yes | All world objects: tiles, buildings, characters, items

World objects in an isometric game live in the blitter bitmap layer with full depth testing. The one-frame pipeline latency of the blitter is imperceptible at 60fps and is a worthwhile trade for correct per-pixel occlusion of all world geometry.

20.6 Isometric Sprites

Isometric sprites — characters, vehicles, animals, equipment, loose items — are drawn by the blitter as standard masked sprite blits into the isometric bitmap layer, with BLT_DEPTH_TEST | BLT_DEPTH_WRITE. The depth value is world_x + world_y - world_z computed from the object's world position by the EE.

The blitter draws sprites in any convenient order. The depth buffer arbitrates occlusion automatically:

  • A character behind a building loses the depth test for occluded pixels — correct occlusion without split-sprite calculations
  • Two characters overlapping are resolved per-pixel — correct for any arbitrary overlap
  • A character partly behind a tree, partly in front — resolved correctly per pixel with no special case handling

Isometric sprite attributes extend the standard sprite descriptor with world-space position:

Field Notes
world_x, world_y, world_z World position — EE computes screen_x, screen_y, depth_z from these
Sprite sheet address Tile in Graphics SRAM or DDR3
Animation frame Selects tile within sheet
Direction 4 or 8-way facing — selects row in sprite sheet
Scale Optional — blitter handles nearest/bilinear
Depth_z override For objects that should sort differently than their position implies

The EE maintains a world-object list. Each frame it computes screen positions for all visible objects, builds a blitter primitive list, and dispatches it as a single job. The blitter draws all objects; the depth buffer resolves occlusion.

20.7 Tall Objects and Multi-Tile Buildings

A Theme Hospital ward block or RCT roller coaster may occupy multiple tiles and extend several floors upward. On classic hardware this required split-sprite rendering — draw the lower part of the sprite, draw the tiles in front, draw the upper part. With a depth buffer this is handled automatically:

The building is either:

  • A single sprite with depth_z set to its base tile depth. Floors of the building at higher world_z will correctly occlude objects at lower world_z behind them.
  • A set of sprites per floor each with their own world_z-derived depth_z. Characters on upper floors composite correctly against the building geometry at each height level.

The depth buffer resolves every pixel correctly regardless of which approach is used.

20.8 Frame Render Sequence

Frame N job queue (dispatched at V-blank start):

Job 1: Depth buffer clear → max Z          (memory/copy — fast fill)
Job 2: Tile pass          → bitmap + depth  (BLT_DEPTH_WRITE)
       Isometric scanline mapper feeds span list
       Blitter draws textured diamond spans with depth_z per tile
Job 3: Sprite pass        → bitmap + depth  (BLT_DEPTH_TEST | BLT_DEPTH_WRITE)
       EE-built primitive list, any order
       Depth buffer arbitrates occlusion automatically
Job 4: Transparent pass   → bitmap only     (BLT_DEPTH_TEST, no BLT_DEPTH_WRITE)
       Sorted transparent objects (foliage, windows, weather effects)
Job 5: Composite          → output layer    (clip table for UI border if needed)

→ Jobs 2 and 3 can overlap on different sub-units if depth buffer regions don't conflict
→ Job 4 depends on Jobs 2 and 3
→ Hardware sprite layer composites cursor and HUD on top at scanline output — zero cost

Jobs 2 and 3 together replace everything that was software in the classic isometric games — the tile loop, the sprite sort, the depth ordering, the occlusion handling. On FireStorm the EE's role is building the sprite primitive list (world position → screen position, one transform per object) and dispatching the jobs. Game logic — pathfinding, AI, economy — runs on the SG2000 entirely unaffected.

20.9 Scale Assessment — Theme Hospital / RollerCoaster Tycoon

Parameter | Classic hardware | FireStorm
Visible tiles | Full redraw every frame, CPU cost per tile | Scanline mapper + blitter, EE not involved per tile
Dynamic objects | Software painter sort + blit every frame | Blitter primitive list, depth buffer arbitrates
Occlusion | Split-sprite, stencil, or incorrect | Per-pixel depth test, always correct
Scroll | Full repaint | Scroll register increment, only edge tiles update
CPU role | Rendering | Game logic only
Sprite budget | Limited by CPU blit speed | Several thousand blitter sprites per frame

A RollerCoaster Tycoon-scale scene — 50×50 visible tiles, 500 visible peeps and vehicles — fits comfortably within a single frame's blitter budget at 480×270 Standard resolution. The SG2000 spends its time on guest AI, ride physics, and economy rather than pixel pushing.


Feature | Amiga OCS/ECS | Atari ST | Ant64 FireStorm
Playfields / layers | 2 (dual playfield) | 1 | Multiple (configurable)
Hardware sprites | 8 per scanline (fixed width) | No hardware sprites | V_scale dependent, hundreds to thousands per scanline
Copper coprocessor | Yes, fixed function | No | Yes, general register write engine
Blitter | Yes, 3-source | No | Yes, hardware blitter with full primitive set
Ray casting / BSP / voxel | No | No | Yes: DDA units (2D/3D), BSP engine, SVO engine, height map sampler, reciprocal, slab test
Shadow / ray trace | No | No | Shadow maps, SSAO, contact shadows, BVH traversal, ray-triangle intersection
Isometric rendering | Software only | Software only | Scanline mapper, diamond span unit, depth buffer, blitter sprites, no CPU render cost
Hardware depth buffer | No | No | Per-pixel depth test/write on all blitter primitives
Scanline effects | Via Copper | No | Via Copper, any register
Mixed H resolutions | Yes (lo/hi/HAM) | No | Yes, per layer independently
Native palette | 32 colours (OCS), 64 (ECS) | 512 colours, 16 on screen | 16,384 entries flat RAM, 256 palette descriptors, dual RGB/HSV access
Tilemap scroll | Per-tile-row H / per-2-col V | No | Per-tile-row H / per-tile-col V / per-line H, full independent axes
HAM mode | HAM6 (4-bit/channel); HAM8 (6-bit/channel) arrived only with AGA | No | HAM24 (8-bit/channel, ~invisible fringing) per layer, Copper switchable
Max output res | 1280×512 (interlaced) | 640×400 (mono) | 3840×2160 @ 60Hz (480×270 Standard / 960×540 Hires internally)
Audio integration | Yes (Paula 4-channel) | Yes (YM2149) | Yes, FireStorm DSP, 128+ voices
Programmable logic | No, fixed silicon | No | Yes, full FPGA, reconfigurable per cartridge

The key difference: Amiga and Atari programmers had to work around the fixed capabilities of their custom chips. On the Ant64, the custom chip is the FPGA itself — personality cartridges can reconfigure the entire display pipeline for a completely different architecture if desired.

The Amiga parallel is intentional: Standard (480×270) closely matches Amiga low res pixel size; Hires (960×540) doubles both axes exactly as Amiga hires did. The difference is that Amiga hires needed interlace to hit 512 lines and flickered doing it. The Ant64 does it cleanly at 4K/60.



21. Workstation App Rendering — ImGui Backend

FireStorm acts as the Dear ImGui rendering backend for the Music Workstation App running on the SG2000 big core. The CPU builds the ImGui draw list (a compact buffer of vertices, indices, and draw commands) and DMAs it to FireStorm, which rasterises it into the framebuffer in hardware. The CPU does zero pixel work.

┌─────────────────────────────────────────────────────────────────────┐
│              Application Processor (SG2000 big core, bare metal)    │
│                                                                     │
│  Music Workstation App (C++)                                        │
│  Page A·D·G·K·L·V·W·H·S·R·E·F·M                                     │
│                                                                     │
│  Dear ImGui::NewFrame() → ImGui::Render() → ImDrawData              │
│  (vertex buffer + index buffer + draw commands)                     │
│         │                                                           │
│         │  DMA — draw list  (not pixels — just geometry + colour)   │
└─────────┼───────────────────────────────────────────────────────────┘
          │
┌─────────▼────────────────────────────────────────────────────────────┐
│              FireStorm FPGA — Dual Role                              │
│                                                                      │
│  ┌─────────────────────────────┐  ┌───────────────────────────────┐  │
│  │    AUDIO DSP PIPELINE       │  │    2D RASTERIZER              │  │
│  │                             │  │                               │  │
│  │  128 voices time-multiplexed│  │  Triangle setup               │  │
│  │  VA · FM · Sample engines   │  │  Scanline rasterizer          │  │
│  │  Filters · BBD chorus       │  │  Gouraud colour interp        │  │
│  │  48kHz sample rate          │  │  Font texture sampler (BRAM)  │  │
│  │                             │  │  Framebuffer write (SRAM B)   │  │
│  │  Uses: DSP blocks + SRAM A  │  │  Uses: logic + BRAM + SRAM B  │  │
│  │  Orthogonal FPGA resources  │  │  Orthogonal FPGA resources    │  │
│  └─────────────────────────────┘  └──────────┬────────────────────┘  │
│                                              │                       │
│                                    ┌─────────▼──────────┐            │
│                                    │  Display output    │            │
│                                    │  (DP / HDMI / VGA) │            │
│                                    └────────────────────┘            │
└──────────────────────────────────────────────────────────────────────┘

The two roles are orthogonal — audio DSP uses DSP multiply blocks and SRAM A; the rasteriser uses logic cells, BRAM, and SRAM B. They run simultaneously on independent fabric resources and never contend.

Rasterizer Performance

The rasteriser runs on completely independent fabric and SRAM B — it does not share any resources with the audio DSP and does not consume any of the audio cycle budget.

  Rasterizer clock:         200MHz (same fabric)
  Output pixel clock:       74.25MHz (720p) · 148.5MHz (1080p)
  Fabric cycles per pixel:  200 / 74.25 = 2.7 fabric cycles per pixel output

  Per-frame budget (60fps):  200,000,000 / 60 = 3,333,333 cycles

  Spectrogram render (512 × 256 pixels):
    131,072 pixels × 3 cycles (LUT + write) = 393,216 cycles = 11.8% of frame
    → runs in 1.97ms — well within 16.67ms frame budget

  ImGui UI (typical complex panel, avg triangle 50×50 px):
    Setup: 15 cycles · Fill: 2,500 px × 3 cycles = 7,515 cycles/triangle
    3,333,333 / 7,515 = 443 triangles per frame at 60fps
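    → typical ImGui draw list: 200–1000 triangles, comfortable as long as the average triangle is well under 50×50 px (1000 triangles at 50×50 would exceed this budget)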

  At 250MHz:
    Frame budget: 4,166,667 cycles → 554 triangles/frame

The spectrogram and a full ImGui UI panel render simultaneously within the same frame with substantial headroom.
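
The budget arithmetic above can be reproduced with a few lines of Python. The function name is hypothetical. The 3 cycles per pixel and 15 setup cycles are the figures quoted above, and the whole triangle bounding box is counted as filled, as in the 2,500 px figure.

```python
def triangles_per_frame(clock_hz, fps=60, tri_w=50, tri_h=50,
                        cycles_per_px=3, setup_cycles=15):
    """Return (cycles per frame, cycles per triangle, triangles per frame)."""
    frame_cycles = clock_hz // fps
    per_triangle = setup_cycles + tri_w * tri_h * cycles_per_px
    return frame_cycles, per_triangle, frame_cycles // per_triangle

print(triangles_per_frame(200_000_000))  # (3333333, 7515, 443)
print(triangles_per_frame(250_000_000))  # (4166666, 7515, 554)
# Smaller triangles change the picture considerably:
print(triangles_per_frame(200_000_000, tri_w=16, tri_h=16)[2])  # 4257
```

The last line illustrates how sensitive the triangle count is to the assumed average triangle size.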

The ImGui rasteriser pipeline (triangle fetch → setup → bounding box clip → scanline fill → Gouraud colour → font atlas BRAM sample → framebuffer write) is documented in Section 20 of this document.


3D Rendering Architecture — Two-Pass Split with LOD

The FireStorm rasteriser supports 3D world rendering using a two-pass technique that solves the fundamental z-buffer precision problem for large-scale scenes. The approach is well documented in James Lambert's N64 world renderer series — the same constraints apply here: limited z-buffer precision, constrained memory, and the need to render both a vast far scene and a geometrically precise near scene in the same frame.


The Problem with a Single Full-Range Z-Buffer

A perspective-projected z-buffer stores depth non-linearly. The precision distribution is heavily biased toward the near plane:

  Linear world depth:   0.1m ──────────────────────────────→ 1000m
  Z-buffer values:      ████████████████████░░░░░░░░░░░░░░░░░ (16-bit)
                        ▲ near: very precise     far: almost none ▲

  Most precision wasted on the range 0.1–2m.
  Objects at 200m vs 201m may map to the same z-buffer value → z-fighting.
  Objects at 1000m+ are all quantised to maximum depth — no distinction possible.

A full-range 16-bit z-buffer covering 0.1m to infinity cannot resolve distant geometry reliably, and extra bits only push the problem further out. The solution is to not use the z-buffer for far geometry at all.
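
To make the precision problem concrete, the sketch below quantises depth with a conventional perspective mapping to the range [0, 1]. The mapping formula, the unsigned 16-bit storage, and the function name are assumptions for illustration; the actual FireStorm depth encoding may differ.

```python
def depth_code(z, near, far, bits=16):
    """Quantise eye-space distance z using a standard [0, 1] perspective
    depth mapping: d = far / (far - near) * (1 - near / z)."""
    d = (far / (far - near)) * (1.0 - near / z)
    return int(d * ((1 << bits) - 1))

near, far = 0.1, 1000.0
# Fraction of all depth codes already consumed by the first 2 m:
print(depth_code(2.0, near, far) / 65535 > 0.95)                       # True
# 200 m and 201 m collapse onto the same stored value:
print(depth_code(200.0, near, far) == depth_code(201.0, near, far))    # True
```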


The Two-Pass Solution

The scene is divided at a split distance D into two zones. Each zone uses a completely different visibility technique:

  Camera
    │
    │◄──────── NEAR ZONE ───────────►│◄────────── FAR ZONE ───────────────►│
    │          [0 ... D]             │           [D ... ∞]                 │
    │                                │                                     │
    │  Full z-buffer precision       │  No z-buffer                        │
    │  All 16 bits used for [0..D]   │  Painter's algorithm (back to front)│
    │  Resolves cm-level depth       │  Pre-baked LOD chunks               │
    │  Dynamic geometry welcome      │  Baked meshes, sorted at load time  │
    │                                │                                     │
    │  Rendered SECOND               │  Rendered FIRST                     │

D is chosen per-scene — typically the distance at which individual polygon edges become sub-pixel and LOD simplification is invisible to the player. 200–500 world units is typical for a Mega Drive / N64-style game world.


Pass 1 — Far Zone (Back to Front, No Z-Buffer)

Rendered first, farthest to nearest, using the painter's algorithm. No z-buffer reads or writes. Each element simply overwrites whatever was drawn before it. This is correct because the elements are sorted by depth: a closer element is always drawn on top of a farther one.

  Order of rendering (far pass):
  ┌──────────────────────────────────────────────────────┐
  │  1. Skybox / sky dome                                │  ← always furthest
  │     Single fullscreen quad or cube faces             │
  │     Fills every pixel — clears the framebuffer       │
  │                                                      │
  │  2. Very distant terrain / world chunks              │
  │     Pre-baked to billboard textures at horizon       │
  │     Essentially flat — no z needed                   │
  │                                                      │
  │  3. Far LOD chunks (distance 2D to D)                │
  │     Low-polygon baked meshes, sorted far→near        │
  │     Each chunk drawn as a simple triangle list       │
  │                                                      │
  │  4. Mid LOD chunks (distance D/2 to D)               │
  │     Medium detail, still painter's algorithm         │
  │     Sorted far→near within chunk grid                │
  └──────────────────────────────────────────────────────┘

At the end of pass 1, the framebuffer contains the entire far scene — sky, distant terrain, far geometry — without a single z-buffer comparison having been made. The z-buffer is completely unused and memory is not allocated for it during this pass.


Pass 2 — Near Zone (Z-Buffer Active, Distance 0 to D)

Rendered second, on top of the far pass result. The z-buffer is now active for depth testing and writing. Crucially: the z-buffer only needs to represent depths in [0, D]. All 15 or 16 bits of precision are concentrated in this near range.

  Z-buffer precision in near zone only:

  Signed 15-bit (S15, range −16384 to +16383 in clip space):
  Near plane 0.1m → D (say, 300m):
  Precision: 32768 levels across 300m → ~9mm per z-level, assuming linearly spaced depth

  For a game where near geometry is furniture, characters, walls:
  9mm precision is completely invisible. No z-fighting anywhere.

The near pass draws all dynamic and high-detail geometry: characters, objects, foreground tiles, particle effects, UI elements in world space.

  Near pass render order:
  ┌──────────────────────────────────────────────────────┐
  │  Near world geometry (distance < D)                  │
  │  → z-test against buffer, write on pass              │
  │                                                      │
  │  Dynamic objects (characters, enemies, projectiles)  │
  │  → z-test, write                                     │
  │                                                      │
  │  Transparent/alpha geometry (sort back-to-front      │
  │  within near zone, no z-write but z-test)            │
  │                                                      │
  │  Particle effects, billboards                        │
  │  → z-test only, no z-write (additive blend)          │
  │                                                      │
  │  HUD / UI (screen-space, drawn last, no z)           │
  └──────────────────────────────────────────────────────┘

Z-Buffer Format and Memory

15-bit signed (S15) is the natural choice — clip-space depth is naturally signed (in front of / behind the near plane), and 15 bits leaves 1 bit for a stencil flag or can be rounded to 16-bit aligned storage with the top bit unused.

16-bit (U16) is the practical storage format — 16-bit aligned reads and writes are native to SRAM, and the top bit is either the sign or a stencil/flag bit.

Memory cost:

  Resolution    Pixels      Z-buffer (16-bit)   Framebuffer RGB12 (36-bit/px)
  640×480       307,200     600 KB              1.32 MB
  1280×720      921,600     1.8 MB              3.96 MB
  1280×720      921,600     1.8 MB packed       → SRAM B can hold both at 36-bit

  SRAM B: 36-bit wide. At 1280×720:
  ├─ Framebuffer: 921,600 × 36 bits ≈ 3.96 MB, which only just fits in a typical 4MB 36-bit SRAM
  └─ Z-buffer:    921,600 × 16 bits = 1.8 MB — separate region or DDR3

Option A — Z-buffer in DDR3: Framebuffer RGB12 in SRAM B (dedicated, fast, zero contention). Z-buffer in DDR3 — accessed in a tile-coherent pattern by the rasteriser, so DDR3 latency is amortised. The rasteriser processes tiles of (e.g.) 8×8 pixels; z-reads within a tile are spatially coherent and burst well from DDR3.

Option B — Packed 36-bit word (near pass only):

The 36-bit SRAM word can pack colour and depth together for the near pass:

  Bits [35:21] — 15-bit signed z-value (S15)
  Bit  [20]    — stencil / flag bit
  Bits [19:16] — alpha (4-bit)
  Bits [15:12] — R (4-bit)
  Bits [11:8]  — G (4-bit)
  Bits  [7:4]  — B (4-bit)
  Bits  [3:0]  — spare / extended alpha

This gives RGB4 colour + 15-bit z + stencil in a single 36-bit SRAM word — one read or write per pixel per pass, zero bus overhead. RGB4 is coarser than RGB12, but for depth-tested near geometry where colour precision matters, the colour can be stored in the framebuffer separately and the 36-bit word used purely as a z+stencil+alpha buffer alongside a full RGB12 framebuffer in a second SRAM region.
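
The bit layout above can be checked with a pack/unpack round trip. This is a sketch only: the function names are hypothetical, and two's-complement encoding of the S15 depth field is an assumption.

```python
def pack_word(z, stencil, a, r, g, b, spare=0):
    """Pack one pixel into the 36-bit layout listed above."""
    if not -16384 <= z <= 16383:
        raise ValueError("z outside S15 range")
    if stencil not in (0, 1) or any(not 0 <= c <= 15 for c in (a, r, g, b, spare)):
        raise ValueError("field out of range")
    z15 = z & 0x7FFF  # assumed two's-complement S15
    return ((z15 << 21) | (stencil << 20) | (a << 16) |
            (r << 12) | (g << 8) | (b << 4) | spare)

def unpack_word(w):
    """Inverse of pack_word: returns (z, stencil, a, r, g, b, spare)."""
    z15 = (w >> 21) & 0x7FFF
    z = z15 - 0x8000 if z15 & 0x4000 else z15  # sign-extend 15 bits
    return (z, (w >> 20) & 1, (w >> 16) & 0xF, (w >> 12) & 0xF,
            (w >> 8) & 0xF, (w >> 4) & 0xF, w & 0xF)

w = pack_word(-1234, 1, 15, 3, 7, 12)
print(w < (1 << 36), unpack_word(w))  # True (-1234, 1, 15, 3, 7, 12, 0)
```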

Option C — RGB12 framebuffer in SRAM B, Z16 in second half of SRAM B:

A 4MB 36-bit SRAM holds:

  • Framebuffer at RGB12: 640×480 = 1.32MB, or 1280×720 = 3.96MB
  • Z-buffer at 16-bit: 640×480 = 0.6MB, or 1280×720 = 1.8MB

At 640×480: both fit in under 2MB of a 4MB SRAM, with room to spare. At 1280×720: 3.96 + 1.76 ≈ 5.7MB, which needs either a larger SRAM or DDR3 overflow.

The practical recommendation: 640×480 near z-buffer in SRAM B alongside the framebuffer; 1280×720 z-buffer in DDR3 with tile-burst access.


Pre-Baked LOD Chunks (Far Zone Geometry)

The world beyond distance D is divided into a regular chunk grid. Each chunk is pre-baked at multiple LOD levels — the geometry is simplified offline and stored as static triangle lists. At runtime, Pulse or the big core selects the appropriate LOD for each visible chunk and submits it to FireStorm's draw list.

  World chunk grid (top-down view):

  ┌─────┬─────┬─────┬─────┬─────┐
  │ 4,4 │ 4,3 │ 4,2 │ 4,3 │ 4,4 │  LOD 3 (lowest detail, horizon)
  ├─────┼─────┼─────┼─────┼─────┤
  │ 3,3 │ 3,2 │ 3,1 │ 3,2 │ 3,3 │  LOD 2
  ├─────┼─────┼─────┼─────┼─────┤
  │ 2,2 │ 2,1 │ CAM │ 2,1 │ 2,2 │  LOD 1 (highest far detail)
  ├─────┼─────┼─────┼─────┼─────┤
  │ 3,3 │ 3,2 │ 3,1 │ 3,2 │ 3,3 │  LOD 2
  └─────┴─────┴─────┴─────┴─────┘

  Distance ring → LOD level:
  D/4  to D/2:  LOD 0 (near pass — full detail, z-buffer)
  D/2  to D:    LOD 1 (far pass — medium detail, painter's)
  D    to 2D:   LOD 2 (far pass — low detail)
  2D   to ∞:    LOD 3 (far pass — very low detail or billboard)

LOD baking process (offline or load-time):

  • Full-resolution mesh → quadric error simplification (standard mesh reduction)
  • Each LOD stored as a compact triangle list in DDR3 (indices + quantised vertices)
  • Normals and UVs quantised to fit in a compact vertex format
  • Sky-distant chunks can be baked to a single billboard texture — a flat quad at the horizon that looks correct from all viewing angles at that distance

Chunk streaming: only chunks within the view frustum are submitted to FireStorm. Frustum culling runs on the big core in C++, intersecting the chunk grid against the camera frustum planes. Hidden chunks generate no draw calls. The visible list is sorted back-to-front and submitted as a single DMA draw list batch.
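
The LOD selection and back-to-front ordering described above can be sketched as follows. Frustum culling is left out to keep the example short, the function names and the 2D chunk representation are hypothetical, and distances below D/4 (which the ring table does not list) are treated as LOD 0 here.

```python
import math

def lod_for_distance(dist, split_d):
    """LOD ring lookup following the distance table above."""
    if dist < split_d / 2:
        return 0   # near pass, full detail (assumed to include dist < D/4)
    if dist < split_d:
        return 1
    if dist < 2 * split_d:
        return 2
    return 3

def build_far_list(chunks, cam, split_d):
    """chunks: iterable of (name, x, y) chunk centres.
    Returns [(name, lod)] for far-pass chunks, farthest first."""
    far = []
    for name, x, y in chunks:
        dist = math.hypot(x - cam[0], y - cam[1])
        lod = lod_for_distance(dist, split_d)
        if lod >= 1:
            far.append((dist, name, lod))
    far.sort(reverse=True)  # painter's algorithm: draw farthest first
    return [(name, lod) for _, name, lod in far]

chunks = [("a", 100, 0), ("b", 700, 0), ("c", 400, 0), ("d", 200, 0)]
print(build_far_list(chunks, (0, 0), 300))  # [('b', 3), ('c', 2), ('d', 1)]
```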


The Complete Frame

  Frame N (16.67ms at 60fps):
  ┌───────────────────────────────────────────────────────────────┐
  │  Big core (C906 @ 1GHz, bare metal):                          │
  │  ├─ Frustum cull chunk grid → visible far chunks (sorted)     │
  │  ├─ Sort near objects by state (texture, shader)              │
  │  ├─ Build ImGui draw list (UI overlay)                        │
  │  └─ DMA draw list → FireStorm via QSPI                        │
  │                                                               │
  │  FireStorm rasteriser (200MHz, SRAM B):                       │
  │                                                               │
  │  PASS 1 — FAR ZONE (no z-buffer):                             │
  │  ├─ Draw skybox (fullscreen quad)                             │
  │  ├─ Draw far LOD chunks (sorted back-to-front)                │
  │  └─ Far pass complete — framebuffer contains full far scene   │
  │                                                               │
  │  PASS 2 — NEAR ZONE (z-buffer active):                        │
  │  ├─ Clear z-buffer (near region only)                         │
  │  ├─ Draw near world geometry (z-test + write)                 │
  │  ├─ Draw dynamic objects (z-test + write)                     │
  │  ├─ Draw alpha geometry (z-test, no write, back-to-front)     │
  │  ├─ Draw particles (z-test, additive blend)                   │
  │  └─ Draw UI / HUD (no z, screen-space, last)                  │
  │                                                               │
  │  DISPLAY READOUT:                                             │
  │  └─ HDMI timing generator scans completed framebuffer         │
  └───────────────────────────────────────────────────────────────┘

Audio DSP runs throughout — completely independent fabric and SRAM A. The rasteriser consumes SRAM B; the audio engine consumes SRAM A; they never contend.


FireStorm Vertex Format (3D)

A compact fixed-point vertex format for both passes:

  3D vertex (near pass — full precision):
  ├─ X: S16.8  (24-bit fixed-point clip-space x)
  ├─ Y: S16.8  (24-bit fixed-point clip-space y)
  ├─ Z: S15    (15-bit clip-space depth — maps to z-buffer)
  ├─ U: U12    (12-bit texture coordinate)
  ├─ V: U12    (12-bit texture coordinate)
  ├─ R, G, B:  4-bit each (vertex colour for Gouraud)
  └─ Total: ~13 bytes, padded to 16 bytes per vertex

  LOD chunk vertex (far pass — reduced precision):
  ├─ X, Y, Z:  S8.4 each (12-bit fixed-point — sufficient at far distance)
  ├─ U, V:     U8 each (8-bit UVs — low-res textures at LOD 2+)
  ├─ R, G, B:  4-bit each
  └─ Total: ~8 bytes per vertex — 2× more geometry in same bandwidth

The reduced far-pass vertex format means twice as many far-geometry triangles fit in the same DMA transfer budget — important since the far scene often has more surface area than the near scene even at lower polygon counts.
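
The far-pass field widths add up to exactly 64 bits (3 × 12 + 2 × 8 + 3 × 4), which the sketch below packs into 8 bytes. The field order, the little-endian byte order, and the function names are assumptions; only the field widths come from the format above.

```python
import struct

def to_s8_4(v):
    """Float -> 12-bit two's-complement fixed point with 4 fraction bits."""
    q = round(v * 16)
    if not -2048 <= q <= 2047:
        raise ValueError("out of S8.4 range")
    return q & 0xFFF

def pack_lod_vertex(x, y, z, u, v, r, g, b):
    """Pack one far-pass vertex into 8 bytes (assumed field order X..B)."""
    word = 0
    fields = ((to_s8_4(x), 12), (to_s8_4(y), 12), (to_s8_4(z), 12),
              (u, 8), (v, 8), (r, 4), (g, 4), (b, 4))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
        word = (word << bits) | value
    return struct.pack("<Q", word)

packed = pack_lod_vertex(1.5, -1.0, 0.0, 255, 0, 15, 0, 15)
print(len(packed), hex(int.from_bytes(packed, "little")))
# 8 0x18ff0000ff00f0f
```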


Reference

The two-pass painter's / z-buffer split technique, LOD chunk baking, and the N64-era reasoning behind these decisions are documented in detail in James Lambert's 3D world renderer series for the Nintendo 64. The constraints are analogous: limited z-buffer precision, fixed-point arithmetic throughout, tile-based memory access patterns, and the need to render a convincing large-scale world at 60fps on constrained silicon.

The Ant64 applies the same technique on more capable hardware: FireStorm has more rasteriser throughput and faster on-chip memory (BSRAM) than the N64's RCP, and the 36-bit SRAM architecture avoids the N64's shared framebuffer / z-buffer contention entirely. The technique still applies because the fundamental geometry of the problem, the perspective z-buffer precision distribution, is independent of hardware capability.



22. Light Synth — Audio-Reactive Video

The Ant64 integrates audio-reactive video output driven directly by the synthesis engine — a light synth in the tradition of the Atari ST demo scene, early VJ culture, and the Edirol CG-8 Visual Synthesizer. This is a completely empty market in hardware synthesis: no current production synthesizer has integrated video output.

All four modes render into FireStorm's compositor as standard layers, benefiting from the full layer stack — CRT simulation, Copper effects, alpha blending, and all output paths simultaneously.

Mode 1 — Audio-Reactive Visualiser

Waveform and spectrum displays driven by live audio output from FireStorm:

  • Classic oscilloscope view — audio waveform drawn in real time as a vector trace
  • FFT spectrum analyser — frequency domain magnitude as a bar or line display
  • Lissajous (XY) display — L/R audio channels on X/Y axes, generating complex Lissajous figures from stereo synthesis. Especially striking with chorus and FM — the phase relationships produce rotating, evolving geometric forms.
  • Spectrogram — the Page D STFT display rendered as a scrolling layer

Mode 2 — Synth Parameter Visualiser

Each playing voice rendered as a distinct visual element:

Audio parameter Visual mapping
Voice pitch Screen position (low = bottom, high = top) or hue
Amplitude / VCA envelope Element size or brightness
LFO modulation Oscillating motion / pulsing
Filter cutoff Colour temperature (cool = closed, warm = open)
Filter resonance Shape sharpness / saturation
Note velocity Impact size at note-on
Polyphony Up to 128 simultaneous visual elements

Mode 3 — Light Synth / Generative Video

Procedural visual synthesis driven by MIDI and sequencer events:

  • Each note trigger generates a visual event — flash, shape spawn, particle burst
  • Pattern and rhythm of the sequencer drives visual rhythm
  • Synthesis parameters modulate visual parameters in real time
  • Programmable via AntOS scripting bindings — audio data feeds (voice states, FFT bins, MIDI events, sequencer position) available as inputs to video scripts
  • The scripting API exposes the same audio data that Page D uses for spectral analysis — video scripts can respond to spectral content, not just note events

Mode 4 — VJ Tool

  • Pre-rendered or procedurally generated visual clips triggered by MIDI notes
  • BPM-synced transitions and cuts
  • Visual patterns stored in DBFS, loaded on demand
  • MIDI CC → visual parameter control (opacity, colour, zoom, position)
  • Drivable externally from any MIDI controller or sequencer

Synthesis Analogy

Audio concept Visual equivalent
Oscillator frequency Shape oscillation / rotation speed
Filter cutoff Colour saturation / blur radius
Envelope Visual event size and decay
LFO Periodic motion (wave, pulse, spin)
Reverb Trail / persistence / echo of visual events
Chorus Shape duplication with spatial offset
Waveform type Visual pattern morphology
Note velocity Brightness and impact scale
Polyphony Simultaneous independent visual voices

AntOS Scripting API

The audio system makes its live data available to video scripts running on the little core. Scripts receive callbacks on audio events and can read live audio state:

-- Register a callback for note events
video.on_note(function(voice, note, velocity)
    spawn_particle(note_to_y(note), velocity / 127)
end)

-- Read live FFT data from the audio engine
local bins = audio.get_fft_bins(512)
for i, magnitude in ipairs(bins) do
    draw_bar(i, magnitude)
end

-- Read voice state
local voices = audio.get_active_voices()
for _, v in ipairs(voices) do
    draw_voice_element(v.pitch, v.amplitude, v.filter_cutoff)
end

Real-time audio processing remains entirely on the big core and FireStorm. Scripts running on the little core read audio state asynchronously — they cannot affect audio timing or introduce latency into the audio pipeline.


Important: The Ant64 family of home computers is at an early design/prototype stage; everything you see here is subject to change.