FireStorm Blitter


1. Overview

The FireStorm Blitter is a dedicated hardware unit inside the FireStorm FPGA that performs bulk pixel and memory operations independently of the CPU cores. It is controlled by the FireStorm Execution Engine (EE), which dispatches blit jobs and can continue executing while the blitter runs in the background. When a job completes, the blitter raises an interrupt that the EE scheduler can use to launch the next task.

This separation is deliberate:

  • The EE is a programmable CPU optimised for control flow, arithmetic, and fine-grained pixel work via register blitting
  • The Blitter is dedicated hardware optimised for throughput — moving, transforming, and drawing large amounts of pixel data at maximum BSRAM bandwidth
  • Neither gets in the other's way

The blitter operates on bitmap layers — intermediate framebuffers (8bpp, 16bpp, or 32bpp) that the scanline mixer composites alongside the hardware tilemap and sprite layers. Blitter-drawn content is distinct from the hardware sprite and tilemap system, and the two coexist freely.


2. Execution Model

The blitter is a parallel job scheduler built around specialised sub-units. The rules are simple:

  • A job is a list of primitives that executes sequentially, in order. Each primitive is routed to the appropriate sub-unit for its type.
  • If the required sub-unit is busy with another job, the current job stalls at that primitive and waits until the sub-unit becomes free, then continues.
  • Multiple jobs run concurrently. When their primitives target different sub-units — or when enough instances of a sub-unit exist — they make progress simultaneously with no waiting.
  • A job that needs the output of another job declares that job as a dependency. Without a dependency, jobs dispatch immediately.

This is natural hardware backpressure. No job fails or errors when a sub-unit is busy — it simply waits its turn. The programmer writes jobs as straightforward primitive lists; the scheduler handles all concurrency and contention transparently.

Sub-Units

Sub-unit          Handles
Pixel fill        Sprites, tilemap fills, textured triangles, column fills, shapes, text glyphs
Line / particle   Bloom lines, polylines, particles, spark trails, vector primitives
Ray / DDA         Ray casting, 2D/3D DDA voxel stepping, height-field sampling
Memory / copy     Bulk transfers, texture prefetch, DMA, format conversion
Composite         Bitmap-to-layer compositing with clip table, colour space conversion, downscale

Each sub-unit type can have multiple instances. The count is a compile-time HDL parameter. If profiling shows many jobs stalling on the memory/copy sub-unit, the fix is to increment that parameter and rebuild the bitstream: one more copy unit, fewer stalls, and no software or API changes. The programmer never has to think about sub-unit counts; the hardware simply has more or less capacity.

A job can span multiple sub-unit types within a single primitive list. A job that copies a texture then blits sprites from it uses the memory unit for the copy primitive. When it reaches the sprite primitives it needs a pixel fill unit, and if none is free it stalls there until one becomes available. This is entirely correct behaviour.

Jobs Are Sequential Internally

All primitives within a job execute in order. Draw order, copy-before-blit sequencing, line-before-composite — all preserved by the sequential-within-job guarantee. There are no flags to set.

Jobs Run Concurrently

Multiple jobs run at the same time. Each job advances through its primitive list independently, stalling only when the sub-unit it needs is occupied. Jobs targeting disjoint sub-units make progress simultaneously without any interaction.

; Three jobs dispatched at the same time:

Job 1: sprite list      → uses Pixel fill unit
Job 2: text pass        → uses Pixel fill unit  (stalls if Job 1 is using it;
                                                  runs in parallel if 2 units exist)
Job 3: texture prefetch → uses Memory/copy unit  (always parallel with Jobs 1 and 2)

; Job 3 always makes progress regardless of pixel fill contention.
; Jobs 1 and 2 share or contend on pixel fill depending on unit count.

Job Dependencies

A job declares dependencies only when it genuinely needs another job's output:

Job 1: tilemap blit  → Bitmap A    (pixel fill)
Job 2: sprite list   → Bitmap A    DEPENDS_ON Job 1
                                   — reads bitmap A, must wait for Job 1 to finish
Job 3: text pass     → Bitmap B    (no dependency — different destination)
Job 4: composite     → Output      DEPENDS_ON Job 2, Job 3

→ Jobs 1 and 3 start immediately
→ Job 2 holds until Job 1 completes, then dispatches
→ Job 4 holds until both Jobs 2 and 3 complete

Dependency option          Meaning
(none)                     Dispatch immediately
DEPENDS_ON id[, id, ...]   Hold until up to 4 specific job IDs complete
BLT_SEQUENTIAL             Hold until all in-flight jobs complete; block new jobs until this one finishes (full memory fence)
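
The dependency rules can be modelled in a few lines. The following is a minimal sketch, not the scheduler's actual logic: `dispatch_waves` is a hypothetical helper that groups jobs into "waves", where wave 0 dispatches immediately and each later wave becomes eligible once the previous waves have completed. Real hardware releases each job as soon as its own predecessors finish, so waves are a simplification.

```python
def dispatch_waves(deps):
    """Group jobs into dispatch waves.

    deps maps job ID -> list of job IDs it DEPENDS_ON (at most 4 each).
    """
    for job, preds in deps.items():
        if len(preds) > 4:
            raise ValueError(f"job {job} declares more than 4 dependencies")
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        # A job is eligible once every declared predecessor has completed.
        wave = sorted(j for j, preds in remaining.items()
                      if all(p in done for p in preds))
        if not wave:
            raise ValueError("dependency cycle or unknown job ID")
        waves.append(wave)
        done.update(wave)
        for j in wave:
            del remaining[j]
    return waves


# The four-job example above: tilemap, sprites, text, composite.
example = {1: [], 2: [1], 3: [], 4: [2, 3]}
waves = dispatch_waves(example)
```

For the example, Jobs 1 and 3 land in the first wave, Job 2 in the second, and Job 4 in the third, matching the arrows listed above.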

Scaling Sub-Unit Counts

If a particular sub-unit is a recurring bottleneck, increase its instance count in the HDL parameters:

If this is a bottleneck                    Solution
Sprite/text jobs stalling on each other    More pixel fill units
Memory copies slowing blit setup           More memory/copy units
Voxel/ray jobs queuing                     More ray/DDA units
Vector/particle jobs waiting               More line/particle units

No software changes needed. The programmer's job descriptors are identical regardless of how many units exist.
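
To see why adding an instance removes stalls without touching the job lists, here is a toy model of the backpressure behaviour. It is a sketch under stated assumptions: the cycle costs are invented, the names `simulate`, `pixel_fill` and `memory_copy` are hypothetical, and arbitration is approximated as "the primitive that can start earliest goes first".

```python
def simulate(jobs, unit_counts):
    """Return the finish time (in cycles) of each job.

    jobs: list of primitive lists; each primitive is (unit_type, cycles).
    unit_counts: number of instances of each sub-unit type.
    """
    free_at = {unit: [0] * n for unit, n in unit_counts.items()}
    pos = [0] * len(jobs)      # index of each job's next primitive
    ready = [0] * len(jobs)    # time each job's previous primitive finished
    pending = [j for j, prims in enumerate(jobs) if prims]

    def start_time(j):
        unit, _ = jobs[j][pos[j]]
        # A primitive starts when its job is ready AND a unit instance is free.
        return max(ready[j], min(free_at[unit]))

    while pending:
        j = min(pending, key=lambda k: (start_time(k), k))
        unit, cycles = jobs[j][pos[j]]
        start = start_time(j)
        inst = free_at[unit].index(min(free_at[unit]))
        free_at[unit][inst] = ready[j] = start + cycles
        pos[j] += 1
        if pos[j] == len(jobs[j]):
            pending.remove(j)
    return ready


jobs = [
    [("pixel_fill", 100)],   # Job 1: sprite list
    [("pixel_fill", 50)],    # Job 2: text pass
    [("memory_copy", 30)],   # Job 3: texture prefetch
]
one_unit = simulate(jobs, {"pixel_fill": 1, "memory_copy": 1})
two_units = simulate(jobs, {"pixel_fill": 2, "memory_copy": 1})
```

With one pixel fill unit, Job 2 waits behind Job 1; with two, it finishes as soon as its own work is done. The job lists passed to `simulate` are identical in both runs.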

Doom-Style Frame Example

V-blank — all jobs dispatched simultaneously:

Job 1: BSP traversal                         (ray/DDA)
Job 2: DDA column rays      DEPENDS_ON Job 1 (ray/DDA — queues if busy)
Job 3: sprite list          independent      (pixel fill)
Job 4: texture prefetch     independent      (memory/copy — always parallel)
Job 5: particles            independent      (line/particle — always parallel)
Job 6: textured span fill   DEPENDS_ON Job 2 (pixel fill)
Job 7: composite            DEPENDS_ON Job 3, Job 6  (composite)

→ Jobs 1, 3, 4, 5 start immediately
→ Job 4 always runs in parallel — different sub-unit from everything else
→ Job 2 starts when Job 1 done
→ Job 6 starts when Job 2 done
→ Job 7 starts when Jobs 3 and 6 done
→ Completion interrupt after Job 7

Double and Triple Buffering

With double buffering, each layer has two bitmap buffers. The display reads buffer A while the blitter writes buffer B. At V-blank the EE swaps the pointers, so no tearing or partial frames are ever visible.

V-blank N:    display ← A,  blitter → B
V-blank N+1:  swap — display ← B, blitter → A

Triple buffering adds a third buffer so the blitter can start the next frame before the current one has been displayed — useful when the job graph takes longer than one frame period.

Dirty Tracking

The EE maintains a dirty flag per bitmap layer. If nothing changed in the inputs to a layer since the last frame, that layer's job is skipped entirely and the front buffer retains the previous frame's content. A static HUD that only changes when the score updates generates zero blitter work between score changes.
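
A minimal sketch of the bookkeeping on the EE side, written in Python for readability. The class name `LayerJobs` and its methods are hypothetical, and the interaction with buffer swapping is deliberately left out.

```python
class LayerJobs:
    """Tracks which bitmap layers need a blit job this frame."""

    def __init__(self, layer_names):
        # Everything is dirty on the first frame so each layer is drawn once.
        self.dirty = {name: True for name in layer_names}

    def mark_dirty(self, name):
        """Call whenever an input to the layer changes (e.g. the score)."""
        self.dirty[name] = True

    def jobs_for_frame(self):
        """Return the layers to re-render and clear their dirty flags."""
        to_render = [name for name, flag in self.dirty.items() if flag]
        for name in to_render:
            self.dirty[name] = False
        return to_render


layers = LayerJobs(["hud", "playfield"])
frame0 = layers.jobs_for_frame()      # both layers drawn
layers.mark_dirty("playfield")
frame1 = layers.jobs_for_frame()      # HUD untouched, so only playfield
frame2 = layers.jobs_for_frame()      # nothing changed: no blitter work
```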


3. Control Model

Dispatching Blit Jobs

The EE dispatches blit jobs by writing a job descriptor to the blitter's command queue. The descriptor specifies the operation type, source and destination addresses, dimensions, transform parameters, dependency declarations, and completion behaviour.

; Example: dispatch a sprite blit
blit.sprite  A0,        ; source sprite data
             A1,        ; destination bitmap
             D0, D1,    ; destination X, Y
             #16, #16,  ; width, height
             #BLT_MASKED | BLT_4BPP  ; flags

The blitter accepts the job, assigns it a job ID, and dispatches it immediately to an available sub-unit (or holds it pending dependency completion). The EE continues immediately.

Completion Interrupt → EE Task

When a blit job completes:

Blitter job done
    ↓
Blitter raises interrupt
    ↓
EE hardware scheduler launches registered task
    (~2 cycle context switch, no OS overhead)

Synchronisation

EE instruction   Effect
WAIT Rd          Yield until job ID in Rd is complete
STATUS Rd        Test if job ID is still running; sets/clears Z flag, no yield
YIELD            Voluntarily yield; scheduler may run blitter-triggered tasks

3. Texture Source System

Any blitter primitive that reads pixel data from a source (sprites, textured triangles, tilemaps, bitmap composites, pattern fills) uses the same texture source mechanism. The source is not restricted to BSRAM; it can come from any level of the memory hierarchy. Full memory specifications are in the Memory Architecture reference.

Memory Hierarchy

FireStorm DDR3                   ← full art library
    ↑
Graphics SRAM (Ant64/Ant64C)     ← active texture pool, intermediate buffers
    ↑
BSRAM texture cache              ← hottest working set, 380MHz
    ↑
Blitter sampler
    ↓
Destination bitmap

FireStorm DDR3 — bulk art library. Too large for on-chip memory; the working set for any scene is a fraction of the total.

Graphics SRAM (Ant64/Ant64C, 4.5MB) — fast pipeline SRAM dedicated to the blitter and audio DSP. 1-cycle latency, burst mode for sequential streaming. Ideal for texture atlases and high-colour intermediate buffers. Shared with the audio DSP for wavetables and FM voice data.

BSRAM texture cache (~128–256KB) — hottest working set. Cache hits at full 380MHz bandwidth.

Permanent BSRAM residency — frequently used assets at fixed BSRAM addresses, bypassing all caching.

Texture Source Field

Every blit job descriptor includes a texture source descriptor:

Source type        Location                                Use case
TEX_BSRAM_DIRECT   Fixed BSRAM address                     Permanent hot assets
TEX_BSRAM_CACHE    BSRAM cache → Graphics SRAM             Normal texture sampling
TEX_GFXSRAM        Graphics SRAM direct                    Large atlases, intermediate buffers
TEX_FLASH          FPGA flash                              Built-in fonts, default palettes
TEX_DDR3           DDR3                                    Full art library
TEX_INLINE         In job descriptor                       Tiny patterns, solid colours

Cache Architecture

  • Size: 128–256KB of BSRAM (configurable at HDL build time)
  • Backing store: Graphics SRAM (primary), DDR3 (overflow)
  • Organisation: Direct-mapped or 2-way set-associative
  • Cache line: One tile (e.g. 16×16 × 4bpp = 128 bytes)
  • Eviction: LRU or pseudo-LRU
  • EE prefetch: blit.prefetch addr, size — warm cache from Graphics SRAM before the job list runs

Memory Summary

For capacity figures, bandwidth, and bus isolation details see the Memory Architecture reference. In brief: the EE Code SRAM bus is exclusively the EE's; Graphics SRAM is shared between the blitter and audio DSP; FireStorm's DDR3 serves the blitter and audio DSP only — the SG2000 has its own separate memory.

All Sampling Primitives Use This System

Primitive                      Texture source usage
Sprite blit                    Sprite sheet: 4bpp/8bpp indexed, palette from palette RAM
Textured triangle              Texture atlas: UV interpolated across triangle, any pixel format
Affine tilemap                 Tile pixel data: fetched by tile index from texture pool
Mode 7 / perspective tilemap   Floor/ceiling texture: sampled via inverse affine
Bitmap composite               Source bitmap: used as texture for the composite operation
Pattern fill                   Repeating pattern tile: any size up to cache line
Text glyph                     Font bitmap: typically permanent BSRAM resident

3.1 Memory Operations

Linear copy

Simple source → destination copy. Optional format conversion at copy time (e.g. 4bpp → 8bpp expansion, ARGB → BGRA swizzle). The fastest blit, limited only by BSRAM bandwidth.

2D copy (rectangular blit)

Copy a rectangle from source bitmap to destination bitmap with independent source and destination strides. The workhorse operation for sprite and background rendering.

3-source blit (Amiga-style)

Three inputs: A (source), B (mask), C (destination). Result = boolean function of A, B, C, selected by an 8-bit minterm code that can express any of the 256 possible 3-input logic functions. Covers: copy, masked copy, XOR, NOT, AND, OR, and every combination. The full Amiga blitter operation set is a subset of this.
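
As an illustration of how one 8-bit code selects any 3-input function, the sketch below evaluates a minterm per bit position. The bit ordering is an assumption (the Amiga convention, where the truth-table index is A·4 + B·2 + C); this document does not specify the encoding, and `blit3` and `minterm_of` are hypothetical names.

```python
def blit3(a, b, c, minterm, width=8):
    """Apply an 8-bit minterm code to each bit position of words a, b, c."""
    out = 0
    for bit in range(width):
        # Truth-table index for this bit position (assumed order: A, B, C).
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((minterm >> idx) & 1) << bit
    return out


def minterm_of(fn):
    """Build the 8-bit minterm code for a 1-bit function fn(a, b, c)."""
    code = 0
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        code |= (fn(a, b, c) & 1) << idx
    return code


COPY_A = minterm_of(lambda a, b, c: a)                    # plain copy
MASKED_COPY = minterm_of(lambda a, b, c: a if b else c)   # A where mask B is set, else C
XOR_AC = minterm_of(lambda a, b, c: a ^ c)                # XOR source into destination

# High nibble of the mask is set, so the high nibble comes from A, the low from C.
result = blit3(0b10101010, 0b11110000, 0b00001111, MASKED_COPY)
```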

Fill

Fill a rectangle with a solid colour or a repeating pattern. Pattern can be up to 16×16 pixels, tiled across the destination rectangle.


3.2 Sprite Blitting

Masked sprite blit

Blit a sprite rectangle to a destination bitmap, treating one colour index (typically index 0) as transparent. The fundamental sprite draw operation.

Flags: BLT_MASKED | BLT_4BPP | BLT_16BPP_DEST

Alpha sprite blit

Blit with per-pixel or per-palette-entry alpha blend. The source must either carry alpha information (RGBA) or rely on the palette entry's alpha byte.

Scaled sprite blit

Blit with nearest-neighbour or bilinear scaling. Scale factors are independent in X and Y: stretch, squash, zoom.

Affine sprite blit

Full affine transform: scale, rotate, shear. Specified as a 2×2 matrix plus translation. Used for Mode 7-style effects, rotating game objects, perspective sprites.

Software Sprite Throughput

The blitter is designed to draw large numbers of software sprites efficiently. Unlike the hardware sprite layer (which uses a dedicated sort/fetch pipeline operating on native-resolution pixel data), blitter sprites are drawn to an intermediate bitmap layer at blit time.

Sprite pixel data can come from any level of the texture source hierarchy — permanent BSRAM for the most frequently used sprites, DDR3-backed texture cache for the full sprite library. With the EE pre-warming the cache at the start of the frame, cache hits during blitting are the normal case.

Example: 16-colour (4bpp), 16×16 sprites from BSRAM cache

Each 16×16 4bpp sprite is 128 bytes. At ~8 pixels/cycle at 380MHz:

16×16 sprite = 256 output pixels
At ~8 pixels/cycle = 32 cycles per sprite
At 380MHz = ~11.875 million sprites/second
At 60fps = ~198,000 sprites per frame

In practice the limits are blitter job overhead, destination write bandwidth, and available V-blank/H-blank time rather than raw pixel throughput. A practical budget of several thousand 16×16 sprites per frame is realistic and well beyond what most games need.

These are in addition to the hardware sprite layer's own budget of hundreds of sprites per scanline — the two systems operate independently and can both be active simultaneously.
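
The arithmetic above can be reproduced for other sprite sizes with a small helper. This is only a back-of-envelope sketch: `sprite_budget` is a hypothetical name, and the ~8 pixels/cycle and 380MHz figures are the approximate values quoted in this section, ignoring job overhead and write bandwidth.

```python
def sprite_budget(width, height, pixels_per_cycle=8,
                  clock_hz=380_000_000, fps=60):
    """Return (cycles per sprite, sprites per second, sprites per frame)."""
    pixels = width * height
    cycles = -(-pixels // pixels_per_cycle)   # ceiling division
    per_second = clock_hz / cycles
    per_frame = per_second / fps
    return cycles, per_second, per_frame


cycles, per_second, per_frame = sprite_budget(16, 16)

# Time for a 500-sprite list at the same raw rate, in microseconds.
batch_us = 500 * cycles / 380_000_000 * 1e6
```

For 16×16 sprites this gives 32 cycles per sprite, 11.875 million sprites per second, and roughly 198,000 per frame; a 500-sprite list takes about 42 microseconds of raw fill time.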


3.3 Tilemap Blitting

Flat tilemap blit

Render a tilemap into a bitmap layer: the same as the hardware tilemap layer but software-rendered to an intermediate bitmap. Useful when the tilemap needs post-processing before display.

Affine tilemap blit (Mode 7)

Render a tilemap with a full affine transform: floor/ceiling planes, rotating/scaling game boards, SNES Mode 7-style backgrounds. The transform matrix maps output pixel positions back to tilemap coordinates via inverse affine. Perspective can be approximated by varying the scale per scanline (driven by the EE computing a new matrix each line).

; Mode 7 floor plane
; A0 = tilemap data, A1 = destination bitmap
; D0-D3 = affine matrix (fixed-point 16.16)
; D4, D5 = translation (screen centre)
blit.tilemap_affine  A0, A1, D0, D1, D2, D3, D4, D5, #BLT_WRAP
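
The inverse mapping can be sketched as follows. This is an illustration under assumptions, not the hardware datapath: the matrix is taken as four 16.16 fixed-point values (a, b, c, d), the translation is treated as the screen-space centre subtracted before the transform, and wrapping stands in for BLT_WRAP. The names `to_fp` and `map_pixel` are hypothetical.

```python
FP_ONE = 1 << 16   # 1.0 in 16.16 fixed point


def to_fp(value):
    """Convert a float to 16.16 fixed point."""
    return int(round(value * FP_ONE))


def map_pixel(x, y, matrix, centre, map_w, map_h):
    """Map an output pixel back to tilemap coordinates (inverse affine)."""
    a, b, c, d = matrix
    cx, cy = centre
    dx, dy = x - cx, y - cy
    # Multiply in fixed point, then drop the 16 fractional bits.
    u = (a * dx + b * dy) >> 16
    v = (c * dx + d * dy) >> 16
    # Wrap so the map repeats in every direction.
    return u % map_w, v % map_h


identity = (FP_ONE, 0, 0, FP_ONE)
zoom_2x = (to_fp(0.5), 0, 0, to_fp(0.5))   # half-step through the map = 2x zoom

p_identity = map_pixel(5, 7, identity, (0, 0), 256, 256)
p_zoom = map_pixel(10, 4, zoom_2x, (0, 0), 256, 256)
p_wrap = map_pixel(-1, 0, identity, (0, 0), 256, 256)
```

A per-scanline perspective effect would call the same mapping with a different matrix for each output line.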

Perspective tilemap blit

A per-scanline affine blit where the EE provides a new matrix for each output line, creating a perspective projection. The blitter renders one scanline per matrix update. The EE computes the matrix sequence during the previous frame and hands it to the blitter as a matrix list.


3.4 Line and Shape Primitives

Line draw (Bresenham)

Draw a line between two points with a given colour. The fastest line: integer only, no antialiasing.

blit.line  #x0, #y0, #x1, #y1, #colour
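
For reference, the integer-only stepping that Bresenham's algorithm performs looks like this in software. It is the standard all-octant formulation, shown as a sketch; the hardware implementation is not documented here and may differ in detail.

```python
def bresenham(x0, y0, x1, y1):
    """Return the list of integer pixels on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                      # combined error term, integers only
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:                   # step in x
            err += dy
            x0 += sx
        if e2 <= dx:                   # step in y
            err += dx
            y0 += sy
    return points


horizontal = bresenham(0, 0, 3, 0)
diagonal = bresenham(0, 0, 2, 2)
shallow = bresenham(0, 0, -2, 1)
```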

Antialiased line (Wu)

Draw a line using Wu's algorithm: fractional pixel coverage at the endpoints and along diagonal edges, giving smooth sub-pixel accuracy. Single pixel wide with soft edges.

blit.line  #x0, #y0, #x1, #y1, #colour, #BLT_ANTIALIAS

Bloom line — Vector CRT simulation

Draw an antialiased line with a phosphor bloom glow effect, simulating the characteristic aesthetic of vector arcade CRTs (Tempest, Asteroids, Star Wars, Battlezone). The bloom adds a coloured halo around the line core — bright and saturated at the centre, falling off with distance.

The bloom effect has two parameters beyond the line colour:

  • Intensity (0–255) — drives the brightness of the line core and the width of the glow. Low intensity gives a dim line with a narrow soft edge. High intensity gives a saturated core with a wide bright halo, exactly as a vector CRT looks when the beam is driven hard
  • Falloff (0–255) — controls how quickly the glow fades with distance from the line. High falloff gives a tight bright line. Low falloff gives a wide diffuse glow

The falloff model per pixel:

brightness(d) = intensity × falloff_factor^d

where d = perpendicular distance from line centre in pixels
      falloff_factor = 1 - falloff / 255   (falloff=255 gives 0.0 = instant falloff, falloff=0 gives 1.0 = no falloff)

The glow is rendered in additional passes at d=1, 2, 3... pixels from the line centre, each at reduced intensity, stopping when brightness drops below a threshold (~4/255). At high intensity and low falloff the glow radius can extend 4–8 pixels from the line centre, matching real vector phosphor behaviour.

blit.line_bloom  #x0, #y0, #x1, #y1, #colour, #intensity, #falloff
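
The falloff model can be explored numerically. The sketch below assumes the per-pixel factor is 1 - falloff/255, so that a high falloff value gives a tight line as the parameter description states, and it uses the ~4/255 cut-off mentioned above. The function name `bloom_profile` and the 8-pixel cap are illustrative choices, not documented behaviour.

```python
def bloom_profile(intensity, falloff, threshold=4, max_radius=8):
    """Return brightness at d = 0, 1, 2, ... until it drops below threshold."""
    factor = 1.0 - falloff / 255.0     # assumed mapping: high falloff = fast decay
    profile = []
    for d in range(max_radius + 1):
        brightness = intensity * factor ** d
        if brightness < threshold:
            break
        profile.append(brightness)
    return profile


def glow_radius(intensity, falloff):
    """Number of glow pixels beyond the line core."""
    return len(bloom_profile(intensity, falloff)) - 1


star_wars = glow_radius(220, 140)   # wide bloom
tempest = glow_radius(200, 180)
sharp = glow_radius(255, 240)       # minimal bloom
hard_edge = glow_radius(255, 255)   # falloff=255: core only
```

Under these assumptions a 220/140 line glows out to 5 pixels, a 200/180 line to 3, and a 255/240 line to 1, which lines up with the 4–8 pixel upper range quoted for high intensity and low falloff.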

Destination bitmap must use additive blend mode for vector rendering. The bitmap layer is initialised to black. Each bloom line call uses BLT_ADDITIVE blend mode:

dest = clamp(dest + src, 0, max)

Colour values add together where lines overlap or cross — crossing points become brighter, exactly as they do on a real vector CRT where the electron beam passes twice. The destination saturates to maximum brightness at hot spots. This is the correct and authentic behaviour, not a workaround.
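
In software terms the additive blend is a saturating add per channel. A minimal sketch, assuming 8-bit channels held in tuples (`blend_additive` is a hypothetical name):

```python
def blend_additive(dest, src, max_value=255):
    """dest = clamp(dest + src, 0, max) applied to each colour channel."""
    return tuple(min(d + s, max_value) for d, s in zip(dest, src))


# Two lines crossing: the red channel saturates, green simply adds.
crossing = blend_additive((200, 100, 0), (100, 100, 0))
```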

A complete vector frame is rendered by:

  1. Clear bitmap layer to black (one fast fill job)
  2. Dispatch bloom line list (one primitive list job, BLT_ADDITIVE)
  3. EE flips front/back buffer at V-blank

For a game with 200 lines per frame (Tempest has roughly this many), the entire frame renders in a few hundred microseconds, within the V-blank budget. The bitmap layer is then composited by the scanline mixer alongside hardware sprite layers for UI elements.

Vector line list

Batch dispatch of many bloom lines in one job, using the same primitive list mechanism as sprite lists. Each entry specifies start point, end point, colour, intensity, and falloff. The blitter processes the list sequentially with a single completion interrupt at the end.

blit.linelist  A0,          ; list base — array of (x0,y0,x1,y1,colour,intensity,falloff)
               #200,        ; line count
               A1,          ; destination bitmap
               #BLT_ADDITIVE | BLT_ANTIALIAS | BLT_BLOOM

Named vector presets — analogous to the display system's named CRT monitor presets:

Preset       Intensity   Falloff   Character
Tempest      200         180       Wide saturated glow, Atari colour vector
Asteroids    160         220       Tighter glow, green-white phosphor feel
Star Wars    220         140       Very wide bloom, high-intensity battle look
Battlezone   150         200       Crisp wireframe, narrow green glow
Sharp        255         240       Maximum brightness, minimal bloom

Particle (point sprite)

Draw a single dot at a given position. The simplest primitive, and much cheaper than a line since no angular rasterisation is needed.

Parameters:

  • Position (x, y) — centre of the particle
  • Colour — RGBA
  • Size — radius in pixels (1 = single pixel, 2 = 2px radius disc, etc.)
  • Intensity (0–255) — brightness, same scale as bloom lines
  • Falloff (0–255) — phosphor glow falloff, same model as bloom lines. At falloff=255 the particle is a hard-edged disc. At lower values it has a soft glowing halo

With BLT_ADDITIVE | BLT_BLOOM, a particle glows exactly like a point light on a vector CRT — bright saturated core, soft halo, additive mixing with nearby particles and lines. Overlapping particles become brighter. This is the correct and authentic vector CRT behaviour.

A size-4 particle with bloom is visually similar to a filled circle but renders significantly faster — useful for explosion effects where exact circularity doesn't matter.

blit.particle  #x, #y, #colour, #size, #intensity, #falloff, #BLT_ADDITIVE | BLT_BLOOM

Particle list

Batch dispatch of many particles as a single blit job, the natural companion to the vector line list. Each entry specifies position, colour, size, intensity, and falloff. The blitter renders all particles sequentially with one completion interrupt at the end.

blit.particlelist  A0,          ; list base — array of (x, y, colour, size, intensity, falloff)
                   #500,        ; particle count
                   A1,          ; destination bitmap
                   #BLT_ADDITIVE | BLT_BLOOM

A complete vector explosion frame — line list plus particle list — both writing additively to the same bitmap layer. Lines and particles mix naturally: a bright particle near a bright line creates a hot spot where they overlap, exactly as on a real vector display.

Spark trail

A particle list with a linearly or exponentially decreasing intensity along a path. Simulates the glowing tail of a missile, the decay arc of an explosion fragment, or the fading trace of a fast-moving object. The EE computes the intensity ramp and position array; the blitter renders the trail as a standard particle list.

; Spark trail: 20 particles along a path, intensity fading from 200 to 0
; EE pre-computes positions and intensities into a list at A0
blit.particlelist  A0, #20, A1, #BLT_ADDITIVE | BLT_BLOOM
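
The list the EE pre-computes might be built along these lines. This is a sketch in Python rather than EE code: `spark_trail` is a hypothetical helper, the entry layout follows the (x, y, colour, size, intensity, falloff) order given for particle lists, and the default colour, size and falloff values are arbitrary.

```python
def spark_trail(path, start_intensity=200, end_intensity=0,
                colour=0xFFFFFFFF, size=1, falloff=180):
    """Build particle-list entries with a linear intensity ramp along path."""
    count = len(path)
    entries = []
    for i, (x, y) in enumerate(path):
        t = i / (count - 1) if count > 1 else 0.0   # 0.0 at head, 1.0 at tail
        intensity = round(start_intensity + (end_intensity - start_intensity) * t)
        entries.append((x, y, colour, size, intensity, falloff))
    return entries


# 20 particles along a straight path, fading from 200 to 0.
path = [(100 + 2 * i, 80 - i) for i in range(20)]
trail = spark_trail(path)
```

An exponential ramp would replace the linear interpolation with a repeated multiply; the list format handed to the blitter stays the same.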

Combined vector frame example

A Tempest-style frame — web wireframe + enemy shots + explosions + particles — all rendered additively to one bitmap layer:

; Frame job queue:
Job 1: blit.linelist     web_lines,     #180, bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; web
Job 2: blit.linelist     enemy_lines,   #40,  bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; enemies
Job 3: blit.particlelist shot_particles, #12, bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; shots
Job 4: blit.particlelist explosion,     #80,  bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; explosion
; One completion interrupt after Job 4. EE flips front buffer at V-blank.

Total for ~310 primitives at typical complexity: well under 500 microseconds. V-blank at 60fps on 1080p with standard CEA-861 timing (45 blanking lines out of 1125) is approximately 0.67ms, so the frame fits within a single V-blank period with some time left for game logic and audio.

Filled rectangle

Fill a rectangle with solid colour or pattern. Faster than a general fill for rectangular regions.

Filled circle / ellipse

Draw a filled circle or ellipse. Antialiased outline variant. Used for particles, explosions, simple UI elements.

Filled triangle

Draw a solid filled triangle with a flat colour. The fundamental primitive for 2D polygon rendering. Lists of triangles can be dispatched as a single blit job for polygon meshes.

Textured triangle

Draw a filled triangle sampling colour from a texture source (see Section 3, Texture Source System). Each vertex carries UV texture coordinates; the blitter interpolates U and V linearly across the triangle and samples texture[u][v] for each output pixel.

Texture sampling modes:

  • Nearest-neighbour — pixel art, no softening, fastest
  • Bilinear — smooth scaling, 4 samples per pixel
  • Perspective-correct — divide U/V by W per pixel, eliminating affine warping on 3D geometry

The texture source can be any level of the hierarchy — permanent BSRAM resident for small textures, DDR3-backed cache for large texture atlases. Triangle lists with a shared texture source are dispatched as a single blit job; the blitter rasterises each triangle in sequence without re-loading the texture.

; Textured triangle
blit.triangle_tex  #x0,#y0,#u0,#v0,  ; vertex 0: position + UV
                   #x1,#y1,#u1,#v1,  ; vertex 1
                   #x2,#y2,#u2,#v2,  ; vertex 2
                   A0,                ; texture source descriptor
                   #BLT_BILINEAR | BLT_PERSP_CORRECT
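
The difference between affine and perspective-correct interpolation can be shown with barycentric weights. This is a sketch of the general technique only: the per-vertex W value is an assumption (the example above lists just position and UV), and `edge` and `sample_uv` are hypothetical names. A degenerate (zero-area) triangle is not handled.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term of point P relative to edge A→B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def sample_uv(p, verts, perspective=True):
    """Interpolate (u, v) at pixel p inside a triangle.

    verts: three (x, y, u, v, w) tuples; w is the assumed per-vertex depth term.
    """
    (x0, y0, u0, v0, w0), (x1, y1, u1, v1, w1), (x2, y2, u2, v2, w2) = verts
    area = edge(x0, y0, x1, y1, x2, y2)
    b0 = edge(x1, y1, x2, y2, *p) / area
    b1 = edge(x2, y2, x0, y0, *p) / area
    b2 = edge(x0, y0, x1, y1, *p) / area
    if not perspective:
        return (b0 * u0 + b1 * u1 + b2 * u2,
                b0 * v0 + b1 * v1 + b2 * v2)
    # Interpolate u/w, v/w and 1/w linearly, then divide per pixel.
    inv_w = b0 / w0 + b1 / w1 + b2 / w2
    u = (b0 * u0 / w0 + b1 * u1 / w1 + b2 * u2 / w2) / inv_w
    v = (b0 * v0 / w0 + b1 * v1 / w1 + b2 * v2 / w2) / inv_w
    return u, v


flat = [(0, 0, 0.0, 0.0, 1.0), (10, 0, 1.0, 0.0, 1.0), (0, 10, 0.0, 1.0, 1.0)]
receding = [(0, 0, 0.0, 0.0, 1.0), (10, 0, 1.0, 0.0, 4.0), (0, 10, 0.0, 1.0, 1.0)]

affine_mid = sample_uv((5, 0), receding, perspective=False)
persp_mid = sample_uv((5, 0), receding, perspective=True)
```

Halfway along the edge towards the distant vertex, affine interpolation reports u = 0.5 while the perspective-correct result is u = 0.2: the texture compresses towards the far end, which is what removes the affine warping.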

Rounded rectangle

Rectangle with configurable corner radius. Common UI element.

Arc / sector

Circular arc or filled sector (pie slice). Useful for health/progress indicators.


3.5 Flood Fill

Flood fill (boundary fill)

Fill a connected region bounded by a specific colour, starting from a seed pixel. Uses a scanline-coherent algorithm for efficiency: it processes horizontal spans rather than individual pixels, keeping the work queue small.

blit.flood_fill  #seed_x, #seed_y, #fill_colour, #boundary_colour
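
A span-based boundary fill of this kind can be sketched in software as below. It is one common formulation, not necessarily the one the blitter uses: each work item is a seed inside an unfilled span, the whole span is filled at once, and only one seed per run of open pixels is queued for the rows above and below.

```python
def flood_fill(bitmap, seed_x, seed_y, fill, boundary):
    """Boundary fill on a list-of-rows bitmap; returns the pixel count filled."""
    height, width = len(bitmap), len(bitmap[0])

    def is_open(x, y):
        return bitmap[y][x] != boundary and bitmap[y][x] != fill

    filled = 0
    stack = [(seed_x, seed_y)]
    while stack:
        x, y = stack.pop()
        if not is_open(x, y):
            continue                       # already filled via another span
        left = right = x
        while left > 0 and is_open(left - 1, y):
            left -= 1
        while right < width - 1 and is_open(right + 1, y):
            right += 1
        for i in range(left, right + 1):   # fill the whole span at once
            bitmap[y][i] = fill
        filled += right - left + 1
        for ny in (y - 1, y + 1):          # queue one seed per open run
            if 0 <= ny < height:
                i = left
                while i <= right:
                    if is_open(i, ny):
                        stack.append((i, ny))
                        while i <= right and is_open(i, ny):
                            i += 1
                    else:
                        i += 1
    return filled


B = 1  # boundary colour
grid = [
    [B, B, B, B, B],
    [B, 0, 0, 0, B],
    [B, 0, B, 0, B],
    [B, 0, 0, 0, B],
    [B, B, B, B, B],
]
count = flood_fill(grid, 1, 1, fill=2, boundary=B)
```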

Span fill (non-boundary)

Fill a region defined by a left/right boundary table (same format as the layer clip table). More predictable performance than flood fill for programmatically-defined regions.


3.6 Bitmap Layer Operations

Bitmap-to-bitmap blit

Copy one intermediate bitmap to another with optional transform. Used for double-buffering, compositing multiple blit passes, and applying the Stage 2 affine transform to a completed intermediate bitmap before it goes to the scanline mixer.

Clip table application

Blit an intermediate bitmap to an output layer with a clip table applied. Pixels outside the clip table's left/right boundaries are not written. This is the mechanism for split-screen and portal effects; see the display system document for clip table format.

Colour space conversion

Convert a bitmap from one format to another: RGB→YCbCr, RGBA→BGRA, 8bpp indexed→16bpp direct colour, 4bpp→8bpp expansion, etc.

Downscale

Reduce a bitmap's resolution with box or bilinear filter. Used for the mixdown pipeline from 4K native content to HDMI/VGA output resolution.


3.7 Text Rendering

Glyph blit

Blit a single glyph from a font bitmap (1bpp, 4bpp, or 8bpp) to a destination bitmap with optional colour tint and background transparency.

String blit

Render a complete string: the blitter walks the character table, blits each glyph with the appropriate advance width, and handles line breaks. Returns the bounding box of the rendered text.

Proportional and monospace fonts

Both supported. Font metrics (advance widths, kerning pairs) are stored alongside the font bitmap.


3.8 Image Processing

Horizontal flip / vertical flip

Mirror a bitmap region. Also available as flags on any blit operation.

Rotation by 90° / 180° / 270°

Fast integer rotation. No interpolation is needed for exact quarter-turns.

Arbitrary rotation

Affine blit with rotation matrix. Uses bilinear sampling to reduce aliasing.

Scale with filter

Nearest-neighbour (pixel art, no softening), bilinear (smooth scaling), or Lanczos (highest quality, sharpest).

Colour key extraction

Create a 1bpp mask bitmap from a source by testing each pixel against a colour key. Used to pre-compute masks for masked sprite blits.


5. Blit Job Descriptor

Every blit operation is described by a job descriptor written to the blitter's command queue by the EE. The descriptor format varies by operation type but always includes:

Field                             Width            Notes
Operation                         8 bits           Operation type code; also determines which sub-unit handles the job
Flags                             16 bits          BLT_MASKED, BLT_ALPHA, BLT_ANTIALIAS, BLT_WRAP, BLT_BILINEAR, BLT_PERSP_CORRECT, BLT_ADDITIVE, BLT_BLOOM, BLT_SEQUENTIAL, BLT_DEPTH_TEST, BLT_DEPTH_WRITE, BLT_DEPTH_CLEAR, etc.
Dependency count                  4 bits           Number of predecessor job IDs declared (0–4)
Predecessor IDs                   0–4 × 16 bits    Job IDs that must complete before this job dispatches
Texture source type               4 bits           TEX_BSRAM_DIRECT, TEX_BSRAM_CACHE, TEX_GFXSRAM, TEX_FLASH, TEX_DDR3, TEX_INLINE
Texture source address            32 bits          Address or inline data depending on source type
Texture width / height / stride   16+16+16 bits    Source texture dimensions
Texture pixel format              8 bits           1bpp, 4bpp, 8bpp, 16bpp, 32bpp
Destination address               32 bits          Target bitmap pointer
Destination stride                16 bits          Bytes per row in destination
Dest X, Y                         16+16 bits       Top-left of destination rectangle
Width / Height                    16+16 bits       Operation dimensions in output pixels
Completion task                   16 bits          EE task address to launch on completion (0 = none)
Job ID (out)                      16 bits          Assigned by blitter scheduler, returned to EE

Transform-capable operations (affine blit, Mode 7, textured triangles, etc.) add a transform/UV parameter block after the base descriptor.

BLT_SEQUENTIAL in the flags field is equivalent to declaring a dependency on every currently in-flight job — it provides a full memory fence without requiring the programmer to enumerate specific job IDs.
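
A host-side model of the descriptor can make the field limits and the BLT_SEQUENTIAL rule concrete. Everything below is a sketch under assumptions: the class `BlitJob`, the helper `effective_dependencies`, and in particular the flag bit positions are hypothetical, since this document names the flags but does not give their encoding or the binary layout. Only the dependency half of BLT_SEQUENTIAL is modelled, not the blocking of later jobs.

```python
from dataclasses import dataclass, field

# Hypothetical bit positions, for illustration only.
BLT_MASKED = 1 << 0
BLT_SEQUENTIAL = 1 << 8


@dataclass
class BlitJob:
    operation: int                      # 8-bit operation type code
    flags: int = 0                      # 16-bit flag field
    depends_on: list = field(default_factory=list)   # up to 4 job IDs
    dest_address: int = 0               # 32-bit target bitmap pointer
    dest_x: int = 0
    dest_y: int = 0
    width: int = 0
    height: int = 0
    completion_task: int = 0            # 0 = no task on completion

    def validate(self):
        """Check the values against the field widths in the table above."""
        if not 0 <= self.operation < (1 << 8):
            raise ValueError("operation must fit in 8 bits")
        if not 0 <= self.flags < (1 << 16):
            raise ValueError("flags must fit in 16 bits")
        if len(self.depends_on) > 4:
            raise ValueError("at most 4 predecessor job IDs")
        if any(not 0 <= j < (1 << 16) for j in self.depends_on):
            raise ValueError("job IDs must fit in 16 bits")


def effective_dependencies(job, in_flight_ids):
    """BLT_SEQUENTIAL behaves like a dependency on every in-flight job."""
    if job.flags & BLT_SEQUENTIAL:
        return sorted(in_flight_ids)
    return list(job.depends_on)


fence = BlitJob(operation=1, flags=BLT_SEQUENTIAL)
fence_deps = effective_dependencies(fence, {7, 3, 5})
normal_deps = effective_dependencies(BlitJob(operation=1, depends_on=[3]), {7, 3, 5})
```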


5. Blitter vs Hardware Layer System

The blitter and the hardware layer system (tilemaps, hardware sprites) are independent and complementary:

Aspect           Hardware layer system                        Blitter to bitmap layer
Timing           Real-time, per-scanline during display       Pre-rendered during V-blank/H-blank
Sprite count     Hundreds per scanline (hardware pipeline)    Thousands per frame (throughput limited)
Transforms       None (pixel replication only)                Full affine per sprite
Tilemap scroll   Per-tile-row H / per-tile-col V (hardware)   Any affine including Mode 7
Setup cost       Register writes only                         Job descriptor queue
Latency          Zero (always live)                           One frame (rendered previous V-blank)
Colour depth     Indexed palette                              Any (8bpp, 16bpp, 32bpp)
Clip             Layer clip table                             Per-blit or clip table at composite time

A typical game uses both: hardware sprites for the player character and key game objects (zero latency, always live), blitter sprites for large numbers of background objects, particles, or effects where affine transforms are needed or the count exceeds the hardware sprite budget.


6. Blitter in the Display Pipeline

The blitter's output is an intermediate bitmap layer. That layer then enters the scanline mixer at the same priority level as any other layer — hardware tilemap, hardware sprites, or framebuffer layers. The scanline mixer does not know or care whether a layer was rendered by the hardware tilemap engine or the blitter.

Previous frame V-blank:
    EE dispatches blit jobs to blitter
    Blitter renders primitive list → bitmap layer A
    Blitter raises completion interrupt → EE launches "blit complete" task
    EE dispatches Stage 2 blit: bitmap A → output layer, with clip table

Current frame display:
    Scanline mixer composites all layers:
        Hardware tilemap layer(s)
        Hardware sprite layer
        Blitter bitmap layer  ← appears here
        UI overlay
        etc.

7. Clip Table Integration

The clip table (see display system document, Section 10) is applied when the blitter composites the intermediate bitmap to the output layer. This is the mechanism for split-screen, portals, and shaped display regions.

The EE generates the clip table during V-blank, then dispatches:

; Stage 2: blit intermediate bitmap to output layer with clip table
blit.composite  A0,          ; source: intermediate bitmap
                A1,          ; destination: output layer
                A2,          ; clip table pointer
                #BLT_CLIPPED

The blitter walks the clip table scanline by scanline as it writes to the output layer, skipping pixels outside the left/right boundaries. The operation is otherwise identical to a 2D blit — the clip table adds no significant cost per pixel.
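
The per-scanline walk can be modelled as follows. The clip table format is defined in the display system document, so the layout used here (one inclusive (left, right) pair per scanline, with left > right meaning "nothing visible") is an assumption, and `composite_clipped` is a hypothetical name.

```python
def composite_clipped(src, dest, clip_table):
    """Copy src rows into dest, writing only pixels inside each row's clip span."""
    for y, (left, right) in enumerate(clip_table):
        lo = max(left, 0)
        hi = min(right, len(src[y]) - 1)
        if lo <= hi:
            # One contiguous span per scanline; pixels outside are left untouched.
            dest[y][lo:hi + 1] = src[y][lo:hi + 1]
    return dest


src = [[1] * 4 for _ in range(3)]
dest = [[0] * 4 for _ in range(3)]
clip = [(0, 1), (2, 3), (3, 0)]     # last row: empty span
composite_clipped(src, dest, clip)
```

Because each scanline reduces to a single span copy, the clip adds only a bounds lookup per row rather than a test per pixel.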


8. FireStorm EE and Blitter — Division of Labour

Task                                         EE                  Blitter
Compute sprite positions, animation          ✓
Dispatch blit job descriptors                ✓
Generate clip tables                         ✓
Generate Mode 7 matrix lists                 ✓
Fine-grained pixel ops (register blitting)   ✓
Bulk pixel copy                                                  ✓
Affine transform rendering                                       ✓
Line/shape/fill primitives                                       ✓
Flood fill                                                       ✓
Text rendering                                                   ✓
Completion interrupt → EE task               ✓ (launches task)   ✓ (raises interrupt)
Job queue management                         shared              shared

The EE is the programmer-facing control surface. The blitter is the pixel-pushing engine. The EE tells the blitter what to draw; the blitter draws it and tells the EE when it's done.


9. Primitive List — Batch Dispatch

For rendering many sprites, lines, or particles in one frame, the EE builds a primitive list — a compact array of blit descriptors — and dispatches the entire list as a single job. The job's sub-unit processes the list sequentially from start to finish, in order, raising one completion interrupt when done.

; Sprite list: 500 sprites, 16x16, 4bpp, masked
; Executes in list order — correct Z ordering guaranteed
blit.primlist  A0,          ; list base address
               #500,        ; entry count
               A1,          ; destination bitmap
               #BLT_4BPP | BLT_MASKED

; Particle list: additive blend — order irrelevant, but still in-order within job
blit.primlist  A2,          ; particle list
               #800,
               A3,          ; destination bitmap
               #BLT_ADDITIVE | BLT_BLOOM

The EE is free to dispatch other jobs — to different sub-units — while this job runs. A sprite list and a text pass and a memory copy dispatched at the same time all run in parallel on their respective sub-units. The sprite list's internal sequencing is preserved regardless.
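
To show why in-order execution gives correct Z ordering, here is a small software model in Python. It is only a sketch: the sprite records, the convention that a larger z means farther away, and the function names are all hypothetical, and the real descriptor format is not specified here.

```python
def build_sprite_list(sprites):
    """Return sprites ordered back-to-front (largest z first)."""
    return sorted(sprites, key=lambda s: s["z"], reverse=True)


def run_primlist(prim_list, width, colour_key=0):
    """Software model of an in-order masked blit into a 1-pixel-high bitmap."""
    dest = [0] * width
    for sprite in prim_list:                  # strict list order
        for i, pixel in enumerate(sprite["pixels"]):
            x = sprite["x"] + i
            if pixel != colour_key and 0 <= x < width:
                dest[x] = pixel               # later entries overwrite earlier ones
    return dest


sprites = [
    {"x": 2, "z": 1, "pixels": [7, 7, 7]},     # nearest
    {"x": 0, "z": 9, "pixels": [3, 3, 3, 3]},  # farthest
    {"x": 3, "z": 5, "pixels": [5, 0, 5]},     # middle, one transparent pixel
]
row = run_primlist(build_sprite_list(sprites), width=8)
```

Because the list is processed strictly in order, the EE only has to sort once when it builds the list; the nearest sprite is drawn last and wins wherever sprites overlap.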

Throughput: 500 × 16×16 4bpp masked sprites

500 sprites × 256 pixels = 128,000 pixels
At ~8 pixels/cycle: 128,000 / 8 = 16,000 cycles; at 380MHz ≈ 42 microseconds
Several thousand sprites per frame readily achievable,
in addition to the hardware sprite layer's hundreds per scanline.
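
The estimate above can be reproduced with a few lines of Python. This is a back-of-envelope sketch only: it ignores per-sprite setup, memory stalls, and sub-unit contention, and the 60 Hz frame rate is an assumption.

```python
def blit_time_us(count, width, height, pixels_per_cycle=8, clock_hz=380e6):
    """Estimated time in microseconds to blit `count` sprites of width x height."""
    pixels = count * width * height
    cycles = pixels / pixels_per_cycle
    return cycles / clock_hz * 1e6


t = blit_time_us(500, 16, 16)      # about 42.1 microseconds
frame_us = 1e6 / 60                # assumed 60 Hz frame
frame_fraction = t / frame_us      # roughly 0.25% of the frame
```

Even allowing a generous margin for the overheads this model leaves out, the 500-sprite list uses only a small fraction of a frame.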

11. Primitive Types — Complete List

Category    Primitive             Texture source     Notes
Copy        Linear copy           Any                With optional format conversion
            2D rectangular blit   Any                With stride
            3-source blit         Any                Any 3-input boolean function
Sprites     Masked sprite         Any                Colour-key transparency
            Alpha sprite          Any                Per-pixel or per-palette alpha
            Scaled sprite         Any                Nearest / bilinear
            Affine sprite         Any                Full scale/rotate/shear
Triangles   Filled triangle                          Flat colour
            Textured triangle     Any                UV interpolated, nearest/bilinear/perspective-correct
            Triangle list         Any                Batch of triangles sharing one texture
Tilemaps    Flat tilemap          Any                Standard scrolling tilemap to bitmap
            Affine tilemap        Any                Mode 7-style with transform matrix
            Perspective tilemap   Any                Per-scanline matrix list
Lines       Line                                     Bresenham integer
            Antialiased line                         Wu's algorithm
            Bloom line                               Antialiased + phosphor glow, intensity + falloff, additive blend
            Polyline                                 Connected segment list, Bresenham/AA/bloom variants
            Vector line list                         Batch bloom lines, additive blend, single completion interrupt
Particles   Particle                                 Point sprite, size + colour + bloom falloff, additive blend
            Particle list                            Batch particles, additive blend, single completion interrupt
            Spark trail                              Particle list with intensity ramp along path
Shapes      Filled rectangle      Any (pattern)      Solid or texture-pattern fill
            Filled circle         Any (pattern)      With optional antialiased outline
            Filled ellipse        Any (pattern)      With optional antialiased outline
            Filled triangle                          Solid colour
            Rounded rectangle     Any (pattern)      Configurable corner radius
            Arc / sector                             Circular arc or pie slice
Fill        Flood fill                               Boundary fill, scanline-coherent
            Span fill             Any (pattern)      Left/right boundary table
Bitmap ops  Bitmap composite      Any                With clip table and optional affine
            Colour space convert                     RGB↔YCbCr, format conversions
            Downscale                                Box / bilinear filter
            Flip                                     Horizontal, vertical
            Rotate                                   90°/180°/270° integer, arbitrary affine
Text        Glyph blit            BSRAM (permanent)  Single character from font bitmap
            String blit           BSRAM (permanent)  Full string with metrics
Prefetch    Texture prefetch      DDR3 → cache       Warm cache before blit list runs
Misc        Colour key mask                          Generate 1bpp mask from colour key

12. Performance Counters and Profiler Overlay

FireStorm includes hardware performance counters for the blitter sub-system. All counters are memory-mapped registers readable via the FRAM bus by the SG2000, the EE, Pulse, or DeMon. They are designed to answer the practical question: which sub-unit is the bottleneck, and how much would adding another instance help?

The Hard RISC-V Core

The GoWin GW5AT-138 contains a hardened RISC-V core built into the FPGA silicon — not implemented in LUTs, so it consumes zero fabric resources. It exists specifically for internal hardware monitoring and debugging, with direct access to all FPGA-internal signals and registers. This core is the natural home for blitter performance counter collection.

The hard RISC-V:

  • Runs entirely independently of the FireStorm EE, SG2000, Pulse, and DeMon
  • Has zero impact on the system it is measuring — it does not share any execution resource with the EE or blitter sub-units
  • Reads all internal blitter counters directly, without going through the FRAM bus
  • Formats and renders the profiler UI into the ImGui overlay layer's backing BSRAM
  • Can be activated or deactivated by a single register write from any chip in the system

Because it is hardened silicon rather than LUT logic, enabling the profiler has no effect on FPGA timing closure or resource utilisation. It is always present and always collecting — the only choice is whether to display the data.

Counter Snapshot Model

All counters run continuously during blitter operation. At each V-blank, the hardware latches a snapshot of all counters into a separate read register bank. The live counters continue accumulating; the snapshot holds the previous frame's values undisturbed until the next V-blank. The hard RISC-V reads the snapshot bank during the frame with no race conditions.

A reset-on-snapshot option clears the live counters at each V-blank latch, giving per-frame deltas rather than cumulative totals. Either mode is selectable via a control register.
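
The latch behaviour can be modelled in a few lines of Python. This is a software sketch of the idea, not the register layout; the class and method names are hypothetical.

```python
class CounterBank:
    """Software model of live counters plus a V-blank snapshot bank."""

    def __init__(self, names, reset_on_snapshot=False):
        self.live = {name: 0 for name in names}
        self.snapshot = dict(self.live)
        self.reset_on_snapshot = reset_on_snapshot

    def count(self, name, cycles=1):
        self.live[name] += cycles          # live counters keep accumulating

    def vblank(self):
        self.snapshot = dict(self.live)    # latch a stable copy for readers
        if self.reset_on_snapshot:
            for name in self.live:
                self.live[name] = 0        # next snapshot holds a per-frame delta


bank = CounterBank(["ACTIVE_CYCLES"], reset_on_snapshot=True)
bank.count("ACTIVE_CYCLES", 100)
bank.vblank()
bank.count("ACTIVE_CYCLES", 30)            # next frame; snapshot is unaffected
first = bank.snapshot["ACTIVE_CYCLES"]     # 100
bank.vblank()
second = bank.snapshot["ACTIVE_CYCLES"]    # 30
```

With reset_on_snapshot left off, the second snapshot would instead hold the cumulative total of 130.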


Per-Sub-Unit Counters

For each sub-unit instance (pixel fill 0, pixel fill 1, memory/copy 0, etc.):

Counter               Meaning
ACTIVE_CYCLES         Cycles this unit spent executing a primitive
STALL_CYCLES          Cycles a job was waiting for this unit to become free
IDLE_CYCLES           Cycles this unit had no work queued
JOBS_COMPLETED        Number of jobs (or job segments) completed by this unit
PRIMITIVES_COMPLETED  Number of individual primitives processed

ACTIVE_CYCLES + STALL_CYCLES + IDLE_CYCLES = total frame cycles (sanity check).

Derived metrics:

Stall ratio    = STALL_CYCLES / (STALL_CYCLES + ACTIVE_CYCLES)
Utilisation    = ACTIVE_CYCLES / total_frame_cycles
Idle fraction  = IDLE_CYCLES   / total_frame_cycles

Stall ratio  Interpretation
< 10%        Sub-unit not a bottleneck
10–30%       Moderate contention: monitor
30–60%       Significant bottleneck: second unit recommended
> 60%        Severe bottleneck: adding a unit would substantially reduce frame time

A high idle fraction alongside a high stall ratio means jobs are queuing for this unit while it is actually doing nothing — a dependency ordering problem rather than a capacity problem. The fix is reviewing job dependency declarations, not adding units.
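
A minimal Python sketch of the derived metrics and the interpretation bands follows. The function names are hypothetical. Two further assumptions: which band an exact boundary value (10%, 30%, 60%) falls into, since the text does not say, and taking the total as the sum of the three counters, which relies on the sanity-check identity above.

```python
def sub_unit_metrics(active, stall, idle):
    """Derived metrics from one sub-unit's counter snapshot (all in cycles)."""
    total = active + stall + idle
    busy = active + stall
    return {
        "stall_ratio": stall / busy if busy else 0.0,
        "utilisation": active / total if total else 0.0,
        "idle_fraction": idle / total if total else 0.0,
    }


def classify_stall(ratio):
    """Map a stall ratio onto the interpretation bands."""
    if ratio < 0.10:
        return "not a bottleneck"
    if ratio < 0.30:
        return "moderate contention"
    if ratio <= 0.60:
        return "significant bottleneck"
    return "severe bottleneck"


m = sub_unit_metrics(active=9_800, stall=3_200, idle=3_700)
label = classify_stall(m["stall_ratio"])   # "moderate contention"
```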


Per-Job Counters

For each job ID:

Counter        Meaning
DISPATCH_TIME  Cycle timestamp when the job was submitted
START_TIME     Cycle timestamp when the job first acquired a sub-unit
COMPLETE_TIME  Cycle timestamp when the job's last primitive completed
STALL_TIME     Total cycles this job spent waiting for a sub-unit
EXECUTE_TIME   Total cycles this job spent actively executing primitives

Queue latency  = START_TIME - DISPATCH_TIME
Total latency  = COMPLETE_TIME - DISPATCH_TIME
Stall fraction = STALL_TIME / EXECUTE_TIME
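
The per-job formulas translate directly into code. The sketch below is illustrative Python with hypothetical field names mirroring the counters; it assumes the cycle timestamps do not wrap within a job's lifetime.

```python
from dataclasses import dataclass


@dataclass
class JobCounters:
    dispatch_time: int
    start_time: int
    complete_time: int
    stall_time: int
    execute_time: int

    def queue_latency(self):
        return self.start_time - self.dispatch_time

    def total_latency(self):
        return self.complete_time - self.dispatch_time

    def stall_fraction(self):
        # As defined in the text: stall cycles relative to executing cycles.
        return self.stall_time / self.execute_time if self.execute_time else 0.0


job = JobCounters(dispatch_time=1_000, start_time=1_250, complete_time=9_250,
                  stall_time=2_000, execute_time=6_000)
```

For this example job the queue latency is 250 cycles and the total latency is 8,250 cycles. Note that the stall fraction as defined can exceed 1.0 when a job waits longer than it executes.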

Global Blitter Counters

Counter                  Meaning
FRAME_CYCLES             Total cycles in the last frame
BLITTER_BUSY_CYCLES      Cycles at least one sub-unit was active
ALL_IDLE_CYCLES          Cycles all sub-units were idle simultaneously
JOBS_DISPATCHED          Jobs submitted this frame
JOBS_COMPLETED           Jobs completed this frame
DEPENDENCY_STALL_CYCLES  Cycles jobs spent held on dependency (not sub-unit contention)
PEAK_QUEUE_DEPTH         Maximum jobs simultaneously queued or in-flight

DEPENDENCY_STALL_CYCLES separates programmer-imposed ordering overhead from genuine sub-unit contention. Only the latter is improved by adding units.


The ImGui Profiler Overlay

The hard RISC-V renders blitter profiling data into the ImGui overlay layer — a system-reserved layer in the FireStorm scanline mixer that sits above all application content. The layer is owned by the monitoring subsystem, composited in hardware, and visible on all display outputs simultaneously.

Activation: A single write to the profiler enable register, accessible via FRAM from the SG2000, EE, DeMon, or Copper. A keyboard shortcut, a debug menu option, a Copper trap on a specific raster line — any mechanism that can write a register can toggle the overlay. The hard RISC-V continues collecting data regardless of whether the overlay is visible.

Overlay contents (suggested layout):

┌─ FireStorm Blitter Profiler ────────────────────────────────┐
│ Frame: 16.7ms  Busy: 12.3ms (74%)  Idle: 4.4ms (26%)        │
│                                                             │
│ Sub-unit        Active    Stall    Idle   Stall%  Jobs      │
│ Pixel fill  0   8.2ms     1.1ms    7.4ms   12%    47        │
│ Pixel fill  1   7.9ms     0.8ms    8.0ms   9%     44        │
│ Line/part   0   2.1ms     0.0ms   14.6ms   0%     3         │
│ Memory/copy 0   9.8ms     3.2ms    3.7ms   25%  ← watch     │
│ Composite   0   1.2ms     0.0ms   15.5ms   0%     8         │
│ Ray/DDA     0   4.4ms     0.0ms   12.3ms   0%     12        │
│                                                             │
│ Dependency stalls: 0.4ms    Peak queue depth: 7             │
│                                                             │
│ Last 8 frames: ▁▂▃▂▂▃▂▂  (frame time sparkline)         │
└─────────────────────────────────────────────────────────────┘

The ← watch annotation is generated automatically when a unit's stall ratio exceeds the warning threshold. The hard RISC-V computes the stall ratios and derived metrics directly from the counter snapshot, applies the threshold logic, and renders the formatted text into the overlay BSRAM each frame — typically completing well within V-blank.

Impact on the system being profiled: zero. The hard RISC-V runs on its own clock domain. The ImGui layer sits above all application layers in the compositor priority stack — it never overwrites application BSRAM. The blitter sub-units are unaffected by the overlay rendering because the hard RISC-V writes to BSRAM regions not used by the blitter. The frame time numbers shown are the frame times of the system running normally, not the system running with profiling overhead.

Persistence: The profiler can also write each counter snapshot to a circular buffer in FireStorm DDR3. This gives a rolling history that can be read out post-session by the SG2000, saved to disk, or streamed over the network via AntOS, which makes long-running sessions that capture rare frame spikes feasible.
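
A circular buffer of snapshots can be sketched as follows. This Python model is purely illustrative: the class name, the capacity, and the snapshot record format are assumptions, and the real buffer would live in DDR3 rather than in a Python list.

```python
class SnapshotRing:
    """Fixed-capacity circular buffer that overwrites the oldest snapshot."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.write_index = 0
        self.count = 0

    def push(self, snapshot):
        self.slots[self.write_index] = snapshot
        self.write_index = (self.write_index + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def history(self):
        """Return stored snapshots, oldest first."""
        start = (self.write_index - self.count) % len(self.slots)
        return [self.slots[(start + i) % len(self.slots)]
                for i in range(self.count)]


ring = SnapshotRing(capacity=3)
for frame in range(5):
    ring.push({"frame": frame, "FRAME_CYCLES": 1000 + frame})
recent = [s["frame"] for s in ring.history()]   # [2, 3, 4]
```

Only the most recent frames survive, so the capacity decides how far back a rare spike can still be inspected after the fact.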


Workflow: Identifying and Fixing a Bottleneck

  1. Enable the overlay — write the profiler enable register. The UI appears immediately.

  2. Find the bottleneck — look for the sub-unit with the highest stall% in the overlay. The ← watch annotation appears automatically at the threshold.

  3. Check dependency stalls — if DEPENDENCY_STALL_CYCLES is large relative to sub-unit stall cycles, the problem is job ordering rather than capacity. Review dependency declarations.

  4. If it is a capacity problem — increment the HDL instance count for the bottleneck sub-unit, rebuild the bitstream. No code changes. Re-enable the overlay and confirm the stall% has dropped.

  5. Check idle fractions — a unit with high idle% is using LUTs unnecessarily. Reducing its count recovers LUT budget for units that are actually busy.

  6. Disable the overlay — write the profiler enable register. The layer disappears. The hard RISC-V continues collecting in the background.
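
Steps 2 to 5 of the workflow above can be expressed as a small decision routine. The following Python is a sketch under stated assumptions: the function name, the input format, and both thresholds (stall_warn, idle_warn) are hypothetical values chosen for illustration, not documented defaults.

```python
def diagnose(units, dependency_stall_cycles, stall_warn=0.30, idle_warn=0.80):
    """Suggest actions from per-unit counters (thresholds are assumptions).

    units: {name: {"active": cycles, "stall": cycles, "idle": cycles}}
    """
    def stall_ratio(c):
        busy = c["active"] + c["stall"]
        return c["stall"] / busy if busy else 0.0

    findings = []
    worst = max(units, key=lambda n: stall_ratio(units[n]))
    contention = sum(c["stall"] for c in units.values())

    if dependency_stall_cycles > contention:
        # Ordering problem, not capacity: adding units will not help.
        findings.append("review job dependency declarations")
    elif stall_ratio(units[worst]) >= stall_warn:
        findings.append(f"add an instance of {worst}")

    for name, c in units.items():
        total = c["active"] + c["stall"] + c["idle"]
        if total and c["idle"] / total >= idle_warn:
            findings.append(f"consider reducing the instance count of {name}")
    return findings


units = {
    "memory_copy": {"active": 9_000, "stall": 6_000, "idle": 1_700},
    "line_part":   {"active": 2_100, "stall": 0,     "idle": 14_600},
}
advice = diagnose(units, dependency_stall_cycles=400)
```

With these example numbers the routine recommends another memory/copy instance and flags the line/particle unit as mostly idle.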


Important: The Ant64 family of home computers is at an early design/prototype stage; everything you see here is subject to change.