FireStorm Blitter

1. Overview

The FireStorm Blitter is a dedicated hardware unit inside the FireStorm FPGA that performs bulk pixel and memory operations independently of the CPU cores. It is controlled by the FireStorm Execution Engine (EE), which dispatches blit jobs and can continue executing while the blitter runs in the background. When a job completes, the blitter raises an interrupt that the EE scheduler can use to launch the next task.

This separation is deliberate:

The EE is a programmable CPU optimised for control flow, arithmetic, and fine-grained pixel work via register blitting
The Blitter is dedicated hardware optimised for throughput — moving, transforming, and drawing large amounts of pixel data at maximum BSRAM bandwidth
Neither gets in the other's way

The blitter operates on bitmap layers — intermediate framebuffers (8bpp, 16bpp, or 32bpp) that the scanline mixer composites alongside the hardware tilemap and sprite layers. Blitter-drawn content is distinct from the hardware sprite and tilemap system, and the two coexist freely.

2. Execution Model

The blitter is a parallel job scheduler built around specialised sub-units. The rules are simple:

A job is a list of primitives that executes sequentially, in order. Each primitive is routed to the appropriate sub-unit for its type.
If the required sub-unit is busy with another job, the current job stalls at that primitive and waits until the sub-unit becomes free, then continues.
Multiple jobs run concurrently. When their primitives target different sub-units — or when enough instances of a sub-unit exist — they make progress simultaneously with no waiting.
A job that needs the output of another job declares that job as a dependency. Without a dependency, jobs dispatch immediately.

This is natural hardware backpressure. No job fails or errors when a sub-unit is busy — it simply waits its turn. The programmer writes jobs as straightforward primitive lists; the scheduler handles all concurrency and contention transparently.

Sub-Units

Sub-unit	Handles
Pixel fill	Sprites, tilemap fills, textured triangles, column fills, shapes, text glyphs
Line / particle	Bloom lines, polylines, particles, spark trails, vector primitives
Ray / DDA	Ray casting, 2D/3D DDA voxel stepping, height-field sampling
Memory / copy	Bulk transfers, texture prefetch, DMA, format conversion
Composite	Bitmap-to-layer compositing with clip table, colour space conversion, downscale

Each sub-unit type can have multiple instances. The count is a compile-time HDL parameter. If profiling shows that many jobs stall waiting for the memory/copy sub-unit, the fix is to increment that parameter and rebuild the bitstream: the design gains one more copy unit and stalls reduce, with no software or API changes. The programmer never has to think about sub-unit counts; the hardware just has more or less capacity.

A job can span multiple sub-unit types within a single primitive list. A job that copies a texture then blits sprites from it will use the memory unit for the copy primitive, then stall waiting for a pixel fill unit when it reaches the sprite primitives — entirely correct behaviour.

Jobs Are Sequential Internally

All primitives within a job execute in order. Draw order, copy-before-blit sequencing, line-before-composite — all preserved by the sequential-within-job guarantee. There are no flags to set.

Jobs Run Concurrently

Multiple jobs run at the same time. Each job advances through its primitive list independently, stalling only when the sub-unit it needs is occupied. Jobs targeting disjoint sub-units make progress simultaneously without any interaction.

; Three jobs dispatched at the same time:

Job 1: sprite list      → uses Pixel fill unit
Job 2: text pass        → uses Pixel fill unit  (stalls if Job 1 is using it;
                                                  runs in parallel if 2 units exist)
Job 3: texture prefetch → uses Memory/copy unit  (always parallel with Jobs 1 and 2)

; Job 3 always makes progress regardless of pixel fill contention.
; Jobs 1 and 2 share or contend on pixel fill depending on unit count.

Job Dependencies

A job declares dependencies only when it genuinely needs another job's output:

Job 1: tilemap blit  → Bitmap A    (pixel fill)
Job 2: sprite list   → Bitmap A    DEPENDS_ON Job 1
                                   — reads bitmap A, must wait for Job 1 to finish
Job 3: text pass     → Bitmap B    (no dependency — different destination)
Job 4: composite     → Output      DEPENDS_ON Job 2, Job 3

→ Jobs 1 and 3 start immediately
→ Job 2 holds until Job 1 completes, then dispatches
→ Job 4 holds until both Jobs 2 and 3 complete

Dependency option	Meaning
(none)	Dispatch immediately
`DEPENDS_ON id[, id, ...]`	Hold until every listed job ID (up to 4) completes
`BLT_SEQUENTIAL`	Hold until all in-flight jobs complete; block new jobs until this one finishes — full memory fence
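
The dependency rules in the table can be modelled in a few lines. The sketch below is illustrative only: the function name `dispatch_waves` and the dictionary layout are assumptions, and it groups jobs into "waves" as if every job in a wave finished at the same moment, which ignores real job durations and sub-unit contention.

```python
from typing import Dict, List, Set

MAX_DEPS = 4  # DEPENDS_ON accepts at most four job IDs


def dispatch_waves(deps: Dict[int, List[int]]) -> List[Set[int]]:
    """Group job IDs into waves. A job joins a wave once every job it
    depends on has completed in an earlier wave."""
    for job, needed in deps.items():
        if len(needed) > MAX_DEPS:
            raise ValueError(f"job {job} declares more than {MAX_DEPS} dependencies")
    done: Set[int] = set()
    pending = set(deps)
    waves: List[Set[int]] = []
    while pending:
        ready = {j for j in pending if all(d in done for d in deps[j])}
        if not ready:
            raise ValueError("dependency cycle or unknown job ID")
        waves.append(ready)
        done |= ready
        pending -= ready
    return waves


# The four-job example above: tilemap, sprites, text, composite.
example = {1: [], 2: [1], 3: [], 4: [2, 3]}
waves = dispatch_waves(example)
print(waves)
```

Jobs 1 and 3 land in the first wave, Job 2 in the second, and Job 4 in the third, matching the arrows in the example.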

Scaling Sub-Unit Counts

If a particular sub-unit is a recurring bottleneck, increase its instance count in the HDL parameters:

If this is a bottleneck	Solution
Sprite/text jobs stalling on each other	More pixel fill units
Memory copies slowing blit setup	More memory/copy units
Voxel/ray jobs queuing	More ray/DDA units
Vector/particle jobs waiting	More line/particle units

No software changes needed. The programmer's job descriptors are identical regardless of how many units exist.
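
To see why adding an instance helps without touching the job list, here is a toy model. It assumes every job is a single primitive, all jobs are dispatched at time zero, and the cycle counts are invented for illustration; `makespan` and the unit names are hypothetical, not part of the blitter API.

```python
import heapq
from typing import Dict, List, Tuple


def makespan(jobs: List[Tuple[str, int]], unit_counts: Dict[str, int]) -> int:
    """Return the cycle at which the last job finishes.

    Each job is (sub_unit_name, cycles). A job takes the earliest-free
    instance of its sub-unit, waiting if all instances are busy."""
    # One min-heap of "free at cycle N" times per sub-unit type.
    free = {unit: [0] * count for unit, count in unit_counts.items()}
    end = 0
    for unit, cycles in jobs:
        start = heapq.heappop(free[unit])   # earliest available instance
        finish = start + cycles
        heapq.heappush(free[unit], finish)
        end = max(end, finish)
    return end


jobs = [("pixel_fill", 100), ("pixel_fill", 80), ("memory_copy", 50)]
one_unit = makespan(jobs, {"pixel_fill": 1, "memory_copy": 1})
two_units = makespan(jobs, {"pixel_fill": 2, "memory_copy": 1})
print(one_unit, two_units)
```

The `jobs` list is identical in both calls; only the instance count changes, which is the point the table makes.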

Doom-Style Frame Example

V-blank — all jobs dispatched simultaneously:

Job 1: BSP traversal                         (ray/DDA)
Job 2: DDA column rays      DEPENDS_ON Job 1 (ray/DDA — queues if busy)
Job 3: sprite list          independent      (pixel fill)
Job 4: texture prefetch     independent      (memory/copy — always parallel)
Job 5: particles            independent      (line/particle — always parallel)
Job 6: textured span fill   DEPENDS_ON Job 2 (pixel fill)
Job 7: composite            DEPENDS_ON Job 3, Job 6  (composite)

→ Jobs 1, 3, 4, 5 start immediately
→ Job 4 always runs in parallel — different sub-unit from everything else
→ Job 2 starts when Job 1 done
→ Job 6 starts when Job 2 done
→ Job 7 starts when Jobs 3 and 6 done
→ Completion interrupt after Job 7

Double and Triple Buffering

Double buffering uses two bitmap buffers per layer. The display reads buffer A while the blitter writes buffer B. At V-blank the EE swaps the pointers, so no tearing or partial frames are ever visible.

V-blank N:    display ← A,  blitter → B
V-blank N+1:  swap — display ← B, blitter → A

Triple buffering adds a third buffer so the blitter can start the next frame before the current one has been displayed — useful when the job graph takes longer than one frame period.

Dirty Tracking

The EE maintains a dirty flag per bitmap layer. If none of a layer's inputs changed since the last frame, that layer's job is skipped entirely and the front buffer retains the previous frame's content. A static HUD that only changes when the score updates generates zero blitter work between score changes.

3. Control Model

Dispatching Blit Jobs

The EE dispatches blit jobs by writing a job descriptor to the blitter's command queue. The descriptor specifies the operation type, source and destination addresses, dimensions, transform parameters, dependency declarations, and completion behaviour.

; Example: dispatch a sprite blit
blit.sprite  A0,        ; source sprite data
             A1,        ; destination bitmap
             D0, D1,    ; destination X, Y
             #16, #16,  ; width, height
             #BLT_MASKED | BLT_4BPP  ; flags

The blitter accepts the job, assigns it a job ID, and either dispatches it to an available sub-unit or holds it until its dependencies complete. The EE continues executing without waiting.

Completion Interrupt → EE Task

When a blit job completes:

Blitter job done
    ↓
Blitter raises interrupt
    ↓
EE hardware scheduler launches registered task
    (~2 cycle context switch, no OS overhead)

Synchronisation

EE instruction	Effect
`WAIT Rd`	Yield until job ID in Rd is complete
`STATUS Rd`	Test if job ID is still running — sets/clears Z flag, no yield
`YIELD`	Voluntarily yield, scheduler may run blitter-triggered tasks

3. Texture Source System

Any blitter primitive that reads pixel data from a source — sprites, filled triangles, tilemaps, bitmap composites, pattern fills — uses the same texture source mechanism. The source is not restricted to BSRAM; it can come from any level of the memory hierarchy. Full memory specifications are in the Memory Architecture reference.

Memory Hierarchy

FireStorm DDR3                   ← full art library
    ↑
Graphics SRAM (Ant64/Ant64C)     ← active texture pool, intermediate buffers
    ↑
BSRAM texture cache              ← hottest working set, 380MHz
    ↑
Blitter sampler
    ↓
Destination bitmap

FireStorm DDR3 — bulk art library. Too large for on-chip memory; the working set for any scene is a fraction of the total.

Graphics SRAM (Ant64/Ant64C, 4.5MB) — the fast 36-bit pipeline SRAM bus, used by the blitter and audio DSP for working data and shared (arbitrated) with the EE's wide-mode code. 1-cycle latency, burst mode for sequential streaming. Ideal for texture atlases and high-colour intermediate buffers, wavetables, and FM voice data.

BSRAM texture cache (~128–256KB) — hottest working set. Cache hits at full 380MHz bandwidth.

Permanent BSRAM residency — frequently used assets at fixed BSRAM addresses, bypassing all caching.

Texture Source Field

Every blit job descriptor includes a texture source descriptor:

Source type	Location	Use case
`TEX_BSRAM_DIRECT`	Fixed BSRAM address	Permanent hot assets
`TEX_BSRAM_CACHE`	BSRAM cache → Graphics SRAM	Normal texture sampling
`TEX_GFXSRAM`	Graphics SRAM direct	Large atlases, intermediate buffers
`TEX_FLASH`	FPGA flash	Built-in fonts, default palettes
`TEX_DDR3`	DDR3	Full art library
`TEX_INLINE`	In job descriptor	Tiny patterns, solid colours

Cache Architecture

Size: 128–256KB of BSRAM (configurable at HDL build time)
Backing store: Graphics SRAM (primary), DDR3 (overflow)
Organisation: Direct-mapped or 2-way set-associative
Cache line: One tile (e.g. 16×16 × 4bpp = 128 bytes)
Eviction: LRU or pseudo-LRU
EE prefetch: blit.prefetch addr, size — warm cache from Graphics SRAM before the job list runs
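
As a rough illustration of how a tile address would map onto cache lines, the sketch below models the direct-mapped option with the smallest size listed (128KB) and 128-byte lines. The class, the tag/index split, and the prefetch behaviour are assumptions made for the example; the real organisation is set at HDL build time.

```python
LINE_BYTES = 128                        # one 16x16 4bpp tile
CACHE_BYTES = 128 * 1024                # smallest configuration listed above
NUM_LINES = CACHE_BYTES // LINE_BYTES   # 1024 lines


def split_address(addr: int):
    """Split a byte address into (tag, line index, offset within line)."""
    line_number = addr // LINE_BYTES
    return line_number // NUM_LINES, line_number % NUM_LINES, addr % LINE_BYTES


class DirectMappedCache:
    def __init__(self) -> None:
        self.tags = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False

    def prefetch(self, addr: int, size: int) -> None:
        """Model of blit.prefetch: load every line touched by [addr, addr+size)."""
        first = addr // LINE_BYTES
        last = (addr + size - 1) // LINE_BYTES
        for line in range(first, last + 1):
            tag, index, _ = split_address(line * LINE_BYTES)
            self.tags[index] = tag


cache = DirectMappedCache()
cache.prefetch(0x1000, 512)             # warm four tiles
print(cache.access(0x1000), cache.access(0x1200))
```

In this model two addresses exactly 128KB apart share a line index and evict each other, which is the case a 2-way set-associative build would soften.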

Memory Summary

For capacity figures, bandwidth, and arbitration details see the Memory Architecture reference. In brief: the single 36-bit SRAM bus carries the EE's wide-mode code and can also hold blitter/audio working data (arbitrated when shared); the blitter and audio DSP otherwise use on-chip BSRAM and FireStorm's 64-bit DDR3, placed by best fit.

All Sampling Primitives Use This System

Primitive	Texture source usage
Sprite blit	Sprite sheet — 4bpp/8bpp indexed, palette from palette RAM
Textured triangle	Texture atlas — UV interpolated across triangle, any pixel format
Affine tilemap	Tile pixel data — fetched by tile index from texture pool
Mode 7 / perspective tilemap	Floor/ceiling texture — sampled via inverse affine
Bitmap composite	Source bitmap — used as texture for the composite operation
Pattern fill	Repeating pattern tile — any size up to cache line
Text glyph	Font bitmap — typically permanent BSRAM resident

3.1 Memory Operations

Linear copy Simple source → destination copy. Optional format conversion at copy time (e.g. 4bpp → 8bpp expansion, ARGB → BGRA swizzle). The fastest blit — limited only by BSRAM bandwidth.
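
The two conversions named above are simple enough to show in software. This is a sketch of what the copy unit would compute, not its implementation; the nibble order (high nibble is the left pixel) is an assumption, as are the function names.

```python
def expand_4bpp_to_8bpp(src: bytes, pixel_count: int) -> bytes:
    """Unpack two 4-bit pixels per byte into one byte per pixel.
    Assumes the high nibble holds the first (left) pixel."""
    out = bytearray(pixel_count)
    for i in range(pixel_count):
        packed = src[i // 2]
        out[i] = (packed >> 4) if i % 2 == 0 else (packed & 0x0F)
    return bytes(out)


def swizzle_argb_to_bgra(src: bytes) -> bytes:
    """Reverse the channel order of each 4-byte pixel.
    The input length must be a multiple of 4."""
    out = bytearray(len(src))
    for i in range(0, len(src), 4):
        a, r, g, b = src[i:i + 4]
        out[i:i + 4] = bytes((b, g, r, a))
    return bytes(out)


print(expand_4bpp_to_8bpp(bytes([0x12, 0x3F]), 4).hex())
print(swizzle_argb_to_bgra(bytes([0xFF, 0x10, 0x20, 0x30])).hex())
```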

2D copy (rectangular blit) Copy a rectangle from source bitmap to destination bitmap with independent source and destination strides. The workhorse operation for sprite and background rendering.

3-source blit (Amiga-style) Three inputs: A (source), B (minterm mask), C (destination). Result = boolean function of A, B, C using any of the 256 possible 3-input logic functions. Covers: copy, masked copy, XOR, NOT, AND, OR, and every combination. The full Amiga blitter operation set is a subset of this.
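
One way to picture the 256 functions: an 8-bit code acts as a truth table, with one result bit for each of the eight possible (A, B, C) input combinations. The sketch below assumes the Amiga-style encoding in which bit (A<<2 | B<<1 | C) of the code gives the result; the descriptor field that carries this code is not specified here, so treat the names as hypothetical.

```python
def minterm_blit(a: int, b: int, c: int, minterm: int, width: int = 8) -> int:
    """Apply a 3-input boolean function to each bit position of a, b, c.

    Bit k of `minterm` is the output for the input combination
    k = (A << 2) | (B << 1) | C."""
    result = 0
    for bit in range(width):
        combo = ((((a >> bit) & 1) << 2)
                 | (((b >> bit) & 1) << 1)
                 | ((c >> bit) & 1))
        result |= ((minterm >> combo) & 1) << bit
    return result


COPY_A = 0xF0        # result = A
A_XOR_C = 0x5A       # result = A ^ C
MASKED_COPY = 0xE2   # result = (A & B) | (C & ~B): A where mask B is set, else C

a, b, c = 0b11001100, 0b11110000, 0b10101010
print(bin(minterm_blit(a, b, c, MASKED_COPY)))
```

With this encoding, copy, XOR, and masked copy differ only in the 8-bit code passed to the same hardware path.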

Fill Fill a rectangle with a solid colour or a repeating pattern. Pattern can be up to 16×16 pixels, tiled across the destination rectangle.

3.2 Sprite Blitting

Masked sprite blit Blit a sprite rectangle to a destination bitmap, treating one colour index (typically index 0) as transparent. The fundamental sprite draw operation.

Flags: BLT_MASKED | BLT_4BPP | BLT_16BPP_DEST
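
In software terms a masked blit is a copy that skips one palette index. The sketch below works on lists of palette indices and clips against the destination edges; the palette lookup that BLT_16BPP_DEST implies is left out, and the function name is made up for the example.

```python
def blit_masked(dest, sprite, dx, dy, transparent=0):
    """Copy `sprite` into `dest` at (dx, dy), skipping pixels whose
    palette index equals `transparent`. Both are lists of rows."""
    dest_h, dest_w = len(dest), len(dest[0])
    for sy, row in enumerate(sprite):
        y = dy + sy
        if not 0 <= y < dest_h:
            continue                      # row falls outside the bitmap
        for sx, index in enumerate(row):
            x = dx + sx
            if index != transparent and 0 <= x < dest_w:
                dest[y][x] = index


background = [[9] * 4 for _ in range(4)]
blit_masked(background, [[0, 5], [6, 0]], 1, 1)
for row in background:
    print(row)
```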

Alpha sprite blit Blit with per-pixel or per-palette-entry alpha blend. Requires the source to carry alpha information (RGBA) or uses the palette entry's alpha byte.

Scaled sprite blit Blit with nearest-neighbour or bilinear scaling. Scale factors are independent in X and Y — stretch, squash, zoom.

Affine sprite blit Full affine transform: scale, rotate, shear. Specified as a 2×2 matrix plus translation. Used for Mode 7-style effects, rotating game objects, perspective sprites.

Software Sprite Throughput

The blitter is designed to draw large numbers of software sprites efficiently. Unlike the hardware sprite layer (which uses a dedicated sort/fetch pipeline operating on native-resolution pixel data), blitter sprites are drawn to an intermediate bitmap layer at blit time.

Sprite pixel data can come from any level of the texture source hierarchy — permanent BSRAM for the most frequently used sprites, DDR3-backed texture cache for the full sprite library. With the EE pre-warming the cache at the start of the frame, cache hits during blitting are the normal case.

Example: 16-colour (4bpp), 16×16 sprites from BSRAM cache

Each 16×16 4bpp sprite is 128 bytes. At ~8 pixels/cycle at 380MHz:

16×16 sprite = 256 output pixels
At ~8 pixels/cycle = 32 cycles per sprite
At 380MHz = ~11.875 million sprites/second
At 60fps = ~198,000 sprites per frame

In practice the limit is blitter job overhead, destination write bandwidth, and available V-blank/H-blank time rather than raw pixel throughput. A practical budget of several thousand 16×16 sprites per frame is realistic and well beyond what most games require.

These are in addition to the hardware sprite layer's own budget of hundreds of sprites per scanline — the two systems operate independently and can both be active simultaneously.
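
The raw-throughput arithmetic above can be reproduced for other sprite sizes. The sketch uses only the figures quoted in this section (380MHz, roughly 8 pixels per cycle) and ignores job overhead and destination bandwidth, so it gives an upper bound, not a budget.

```python
CLOCK_HZ = 380_000_000
PIXELS_PER_CYCLE = 8      # approximate figure quoted above


def sprites_per_frame(width: int, height: int, fps: int = 60) -> float:
    """Upper bound on sprites per frame from raw pixel throughput."""
    cycles_per_sprite = (width * height) / PIXELS_PER_CYCLE
    return CLOCK_HZ / cycles_per_sprite / fps


def list_time_us(count: int, width: int, height: int) -> float:
    """Raw fill time in microseconds for `count` sprites of one size."""
    cycles = count * width * height / PIXELS_PER_CYCLE
    return cycles / CLOCK_HZ * 1e6


print(round(sprites_per_frame(16, 16)))      # 16x16 sprites at 60fps
print(round(list_time_us(500, 16, 16), 1))   # a 500-sprite list
```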

3.3 Tilemap Blitting

Flat tilemap blit Render a tilemap into a bitmap layer — the same as the hardware tilemap layer but software-rendered to an intermediate bitmap. Useful when the tilemap needs post-processing before display.

Affine tilemap blit (Mode 7) Render a tilemap with a full affine transform — floor/ceiling planes, rotating/scaling game boards, SNES Mode 7-style backgrounds. The transform matrix maps output pixel positions back to tilemap coordinates via inverse affine. Perspective can be approximated by varying the scale per scanline (driven by the EE computing a new matrix each line).

; Mode 7 floor plane
; A0 = tilemap data, A1 = destination bitmap
; D0-D3 = affine matrix (fixed-point 16.16)
; D4, D5 = translation (screen centre)
blit.tilemap_affine  A0, A1, D0, D1, D2, D3, D4, D5, #BLT_WRAP
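
A software model of the inverse mapping may help when choosing matrix values. It assumes the matrix is applied to output coordinates taken relative to the translation point, that results are truncated from 16.16 to integers, and that BLT_WRAP means modulo addressing; it samples a flat texture rather than walking a real tilemap, and all names are illustrative.

```python
FP_SHIFT = 16
FP_ONE = 1 << FP_SHIFT


def to_fixed(value: float) -> int:
    """Convert a float to 16.16 fixed point."""
    return int(round(value * FP_ONE))


def affine_sample(texture, out_w, out_h, a, b, c, d, cx, cy):
    """For each output pixel, map back into the texture with the 2x2
    matrix [[a, b], [c, d]] (16.16 fixed point) and wrap at the edges."""
    tex_h, tex_w = len(texture), len(texture[0])
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            rx, ry = x - cx, y - cy          # relative to the centre point
            u = (a * rx + b * ry) >> FP_SHIFT
            v = (c * rx + d * ry) >> FP_SHIFT
            row.append(texture[v % tex_h][u % tex_w])
        out.append(row)
    return out


tiles = [[1, 2], [3, 4]]
half = to_fixed(0.5)       # stepping 0.5 texels per output pixel = 2x zoom
for row in affine_sample(tiles, 4, 4, half, 0, 0, half, 0, 0):
    print(row)
```

Because the matrix maps output to source, values below 1.0 magnify and values above 1.0 shrink, which is easy to get backwards.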

Perspective tilemap blit A per-scanline affine blit where the EE provides a new matrix for each output line, creating a perspective projection. The blitter renders one scanline per matrix update. The EE computes the matrix sequence during the previous frame and hands it to the blitter as a matrix list.

3.4 Line and Shape Primitives

Line draw (Bresenham) Draw a line between two points with a given colour. The fastest line — integer only, no antialiasing.

blit.line  #x0, #y0, #x1, #y1, #colour
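
For reference, the integer-only stepping that Bresenham's algorithm performs looks like this in software. This is the standard all-octant form of the algorithm, shown as a sketch; whether the line unit uses exactly this variant is not stated here.

```python
def bresenham(x0: int, y0: int, x1: int, y1: int):
    """Return the pixels on a line from (x0, y0) to (x1, y1), using
    only integer additions and comparisons."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:        # step in x
            err += dy
            x0 += sx
        if e2 <= dx:        # step in y
            err += dx
            y0 += sy
    return points


print(bresenham(0, 0, 3, 0))
print(bresenham(0, 0, 3, 3))
```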

Antialiased line (Wu) Draw a line using Wu's algorithm — fractional pixel coverage at the endpoints and along diagonal edges, giving smooth sub-pixel accuracy. Single pixel wide with soft edges.

blit.line  #x0, #y0, #x1, #y1, #colour, #BLT_ANTIALIAS

Bloom line — Vector CRT simulation

Draw an antialiased line with a phosphor bloom glow effect, simulating the characteristic aesthetic of vector arcade CRTs (Tempest, Asteroids, Star Wars, Battlezone). The bloom adds a coloured halo around the line core — bright and saturated at the centre, falling off with distance.

The bloom effect has two parameters beyond the line colour:

Intensity (0–255) — drives the brightness of the line core and the width of the glow. Low intensity gives a dim line with a narrow soft edge. High intensity gives a saturated core with a wide bright halo, exactly as a vector CRT looks when the beam is driven hard
Falloff (0–255) — controls how quickly the glow fades with distance from the line. High falloff gives a tight bright line. Low falloff gives a wide diffuse glow

The falloff model per pixel:

brightness(d) = intensity × falloff_factor^d

where d = perpendicular distance from line centre in pixels
      falloff_factor = 1 − falloff / 255   (falloff 255 → 0.0 = instant falloff, falloff 0 → 1.0 = no falloff)

The glow is rendered in additional passes at d=1, 2, 3... pixels from the line centre, each at reduced intensity, stopping when brightness drops below a threshold (~4/255). At high intensity and low falloff the glow radius can extend 4–8 pixels from the line centre, matching real vector phosphor behaviour.

blit.line_bloom  #x0, #y0, #x1, #y1, #colour, #intensity, #falloff
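
To get a feel for how the two parameters interact, the glow radius can be estimated in software. The sketch assumes the per-pixel factor is 1 − falloff/255, which is the reading consistent with the parameter descriptions (high falloff gives a tight line, low falloff a wide glow), and uses the ~4/255 cut-off mentioned above. Function names are hypothetical.

```python
THRESHOLD = 4   # brightness cut-off on the 0-255 scale


def glow_profile(intensity: int, falloff: int, max_radius: int = 64):
    """Brightness at d = 0, 1, 2, ... pixels from the line centre,
    stopping once it drops below THRESHOLD."""
    factor = 1.0 - falloff / 255.0     # assumed mapping, see lead-in
    profile = []
    for d in range(max_radius + 1):
        brightness = intensity * factor ** d
        if d > 0 and brightness < THRESHOLD:
            break
        profile.append(brightness)
    return profile


def glow_radius(intensity: int, falloff: int) -> int:
    """Number of glow pixels drawn on each side of the line core."""
    return len(glow_profile(intensity, falloff)) - 1


print(glow_radius(220, 140))   # high intensity, low falloff
print(glow_radius(255, 240))   # high intensity, high falloff
```

Under these assumptions a bright line with low falloff reaches several pixels on either side, while a high-falloff line keeps its glow to about one pixel.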

Destination bitmap must use additive blend mode for vector rendering. The bitmap layer is initialised to black. Each bloom line call uses BLT_ADDITIVE blend mode:

dest = clamp(dest + src, 0, max)

Colour values add together where lines overlap or cross — crossing points become brighter, exactly as they do on a real vector CRT where the electron beam passes twice. The destination saturates to maximum brightness at hot spots. This is the correct and authentic behaviour, not a workaround.
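
The clamp formula applies per colour channel. A minimal sketch, assuming 8-bit channels and colours held as tuples (the function name is illustrative):

```python
def add_blend(dest, src, max_value=255):
    """Saturating add per channel: dest = clamp(dest + src, 0, max)."""
    return tuple(max(0, min(d + s, max_value)) for d, s in zip(dest, src))


# Two lines crossing: the overlap is brighter and the red channel saturates.
crossing = add_blend((200, 10, 0), (100, 20, 0))
print(crossing)
```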

A complete vector frame is rendered by:

Clear bitmap layer to black (one fast fill job)
Dispatch bloom line list (one primitive list job, BLT_ADDITIVE)
EE flips front/back buffer at V-blank

For a game with 200 lines per frame (Tempest has roughly this many), the entire frame renders in a few hundred microseconds, well within the V-blank budget. The bitmap layer is then composited by the scanline mixer alongside hardware sprite layers for UI elements.

Vector line list Batch dispatch of many bloom lines in one job — the same primitive list mechanism as sprite lists. Each entry specifies start point, end point, colour, intensity, and falloff. The blitter processes the list sequentially with a single completion interrupt at the end.

blit.linelist  A0,          ; list base — array of (x0,y0,x1,y1,colour,intensity,falloff)
               #200,        ; line count
               A1,          ; destination bitmap
               #BLT_ADDITIVE | BLT_ANTIALIAS | BLT_BLOOM

Named vector presets — analogous to the display system's named CRT monitor presets:

Preset	Intensity	Falloff	Character
Tempest	200	180	Wide saturated glow — Atari colour vector
Asteroids	160	220	Tighter glow, green-white phosphor feel
Star Wars	220	140	Very wide bloom, high-intensity battle look
Battlezone	150	200	Crisp wireframe, narrow green glow
Sharp	255	240	Maximum brightness, minimal bloom

Particle (point sprite) Draw a single dot at a given position. The simplest primitive — much cheaper than a line since no angular rasterisation is needed.

Parameters:

Position (x, y) — centre of the particle
Colour — RGBA
Size — radius in pixels (1 = single pixel, 2 = 2px radius disc, etc.)
Intensity (0–255) — brightness, same scale as bloom lines
Falloff (0–255) — phosphor glow falloff, same model as bloom lines. At falloff=255 the particle is a hard-edged disc. At lower values it has a soft glowing halo

With BLT_ADDITIVE | BLT_BLOOM, a particle glows exactly like a point light on a vector CRT — bright saturated core, soft halo, additive mixing with nearby particles and lines. Overlapping particles become brighter. This is the correct and authentic vector CRT behaviour.

A size-4 particle with bloom is visually similar to a filled circle but renders significantly faster — useful for explosion effects where exact circularity doesn't matter.

blit.particle  #x, #y, #colour, #size, #intensity, #falloff, #BLT_ADDITIVE | BLT_BLOOM

Particle list Batch dispatch of many particles as a single blit job — the natural companion to the vector line list. Each entry specifies position, colour, size, intensity, and falloff. The blitter renders all particles sequentially with one completion interrupt at the end.

blit.particlelist  A0,          ; list base — array of (x, y, colour, size, intensity, falloff)
                   #500,        ; particle count
                   A1,          ; destination bitmap
                   #BLT_ADDITIVE | BLT_BLOOM

A complete vector explosion frame — line list plus particle list — both writing additively to the same bitmap layer. Lines and particles mix naturally: a bright particle near a bright line creates a hot spot where they overlap, exactly as on a real vector display.

Spark trail A particle list with a linearly or exponentially decreasing intensity along a path. Simulates the glowing tail of a missile, the decay arc of an explosion fragment, or the fading trace of a fast-moving object. The EE computes the intensity ramp and position array; the blitter renders the trail as a standard particle list.

; Spark trail: 20 particles along a path, intensity fading from 200 to 0
; EE pre-computes positions and intensities into a list at A0
blit.particlelist  A0, #20, A1, #BLT_ADDITIVE | BLT_BLOOM

Combined vector frame example

A Tempest-style frame — web wireframe + enemy shots + explosions + particles — all rendered additively to one bitmap layer:

; Frame job queue:
Job 1: blit.linelist     web_lines,     #180, bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; web
Job 2: blit.linelist     enemy_lines,   #40,  bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; enemies
Job 3: blit.particlelist shot_particles, #12, bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; shots
Job 4: blit.particlelist explosion,     #80,  bitmap, #BLT_ADDITIVE|BLT_BLOOM  ; explosion
; One completion interrupt after Job 4. EE flips front buffer at V-blank.

Total for ~310 primitives at typical complexity: well under 500 microseconds. V-blank at 60fps on 1080p is approximately 1.3ms — comfortable headroom for game logic and audio in the same V-blank period.

Filled rectangle Fill a rectangle with solid colour or pattern. Faster than a general fill for rectangular regions.

Filled circle / ellipse Draw a filled circle or ellipse. Antialiased outline variant. Used for particles, explosions, simple UI elements.

Filled triangle Draw a solid filled triangle with a flat colour. The fundamental primitive for 2D polygon rendering. Lists of triangles can be dispatched as a single blit job for polygon meshes.

Textured triangle Draw a filled triangle sampling colour from a texture source (see Section 3 — Texture Source System). Each vertex carries UV texture coordinates; the blitter interpolates U and V linearly across the triangle and samples texture[u][v] for each output pixel.

Texture sampling modes:

Nearest-neighbour — pixel art, no softening, fastest
Bilinear — smooth scaling, 4 samples per pixel
Perspective-correct — divide U/V by W per pixel, eliminating affine warping on 3D geometry

The texture source can be any level of the hierarchy — permanent BSRAM resident for small textures, DDR3-backed cache for large texture atlases. Triangle lists with a shared texture source are dispatched as a single blit job; the blitter rasterises each triangle in sequence without re-loading the texture.

; Textured triangle
blit.triangle_tex  #x0,#y0,#u0,#v0,  ; vertex 0: position + UV
                   #x1,#y1,#u1,#v1,  ; vertex 1
                   #x2,#y2,#u2,#v2,  ; vertex 2
                   A0,                ; texture source descriptor
                   #BLT_BILINEAR | BLT_PERSP_CORRECT
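
The difference between affine and perspective-correct sampling shows up even along a single edge. The sketch below uses the standard technique of interpolating U/W and 1/W linearly and dividing per pixel; it is a one-dimensional illustration with made-up values, not a description of the rasteriser's internals.

```python
def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def affine_u(u0: float, u1: float, t: float) -> float:
    """Plain linear interpolation of U in screen space."""
    return lerp(u0, u1, t)


def perspective_u(u0: float, w0: float, u1: float, w1: float, t: float) -> float:
    """Interpolate U/W and 1/W linearly, then divide per pixel."""
    u_over_w = lerp(u0 / w0, u1 / w1, t)
    one_over_w = lerp(1.0 / w0, 1.0 / w1, t)
    return u_over_w / one_over_w


# Edge running from a near vertex (W=1) to one three times further away (W=3).
print(affine_u(0.0, 1.0, 0.5))
print(perspective_u(0.0, 1.0, 1.0, 3.0, 0.5))
```

At the screen-space midpoint the affine result is halfway through the texture, while the perspective-correct result is only a quarter of the way, because the far half of the edge is compressed on screen.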

Rounded rectangle Rectangle with configurable corner radius. Common UI element.

Arc / sector Circular arc or filled sector (pie slice). Useful for health/progress indicators.

3.5 Flood Fill

Flood fill (boundary fill) Fill a connected region bounded by a specific colour, starting from a seed pixel. Uses a scanline-coherent algorithm for efficiency — processes horizontal spans rather than individual pixels, keeping the work queue small.

blit.flood_fill  #seed_x, #seed_y, #fill_colour, #boundary_colour
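
A span-based boundary fill can be sketched as follows. It fills a whole horizontal run at a time and queues one seed per open run on the rows above and below, which is what keeps the work queue small. This is a generic scanline fill under the assumption that the fill colour differs from the boundary colour; it is not a description of the hardware's exact traversal order.

```python
def flood_fill(bitmap, seed_x, seed_y, fill, boundary):
    """Fill the region around the seed, stopping at `boundary` pixels.
    Returns the number of horizontal spans written."""
    height, width = len(bitmap), len(bitmap[0])

    def is_open(x, y):
        return bitmap[y][x] != boundary and bitmap[y][x] != fill

    if not (0 <= seed_x < width and 0 <= seed_y < height):
        return 0
    spans = 0
    stack = [(seed_x, seed_y)]
    while stack:
        x, y = stack.pop()
        if not is_open(x, y):
            continue
        left = right = x
        while left > 0 and is_open(left - 1, y):
            left -= 1
        while right < width - 1 and is_open(right + 1, y):
            right += 1
        for i in range(left, right + 1):
            bitmap[y][i] = fill
        spans += 1
        for ny in (y - 1, y + 1):           # queue one seed per open run
            if 0 <= ny < height:
                i = left
                while i <= right:
                    if is_open(i, ny):
                        stack.append((i, ny))
                        while i <= right and is_open(i, ny):
                            i += 1
                    else:
                        i += 1
    return spans


grid = [[1] * 5] + [[1, 0, 0, 0, 1] for _ in range(3)] + [[1] * 5]
print(flood_fill(grid, 2, 2, 7, 1))
```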

Span fill (non-boundary) Fill a region defined by a left/right boundary table (same format as the layer clip table). More predictable performance than flood fill for programmatically-defined regions.

3.6 Bitmap Layer Operations

Bitmap-to-bitmap blit Copy one intermediate bitmap to another with optional transform. Used for double-buffering, compositing multiple blit passes, and applying the Stage 2 affine transform to a completed intermediate bitmap before it goes to the scanline mixer.

Clip table application Blit an intermediate bitmap to an output layer with a clip table applied. Pixels outside the clip table's left/right boundaries are not written. This is the mechanism for split-screen and portal effects — see the display system document for clip table format.

Colour space conversion Convert a bitmap from one format to another: RGB→YCbCr, RGBA→BGRA, 8bpp indexed→16bpp direct colour, 4bpp→8bpp expansion, etc.

Downscale Reduce a bitmap's resolution with box or bilinear filter. Used for the mixdown pipeline from 4K native content to HDMI/VGA output resolution.

3.7 Text Rendering

Glyph blit Blit a single glyph from a font bitmap (1bpp, 4bpp, or 8bpp) to a destination bitmap with optional colour tint and background transparency.

String blit Render a complete string — the blitter walks the character table, blits each glyph with appropriate advance width, handles line breaks. Returns the bounding box of the rendered text.

Proportional and monospace fonts Both supported. Font metrics (advance widths, kerning pairs) stored alongside the font bitmap.

3.8 Image Processing

Horizontal flip / vertical flip Mirror a bitmap region. Also available as flags on any blit operation.

Rotation by 90° / 180° / 270° Fast integer rotation — no interpolation needed for exact quarter-turns.

Arbitrary rotation Affine blit with rotation matrix. Uses bilinear sampling to reduce aliasing.

Scale with filter Nearest-neighbour (pixel art, no softening), bilinear (smooth scaling), or Lanczos (highest quality, sharpest).

Colour key extraction Create a 1bpp mask bitmap from a source by testing each pixel against a colour key. Used to pre-compute masks for masked sprite blits.

4. Blit Job Descriptor

Every blit operation is described by a job descriptor written to the blitter's command queue by the EE. The descriptor format varies by operation type but always includes:

Field	Width	Notes
Operation	8 bits	Operation type code — also determines which sub-unit handles the job
Flags	16 bits	BLT_MASKED, BLT_ALPHA, BLT_ANTIALIAS, BLT_WRAP, BLT_BILINEAR, BLT_PERSP_CORRECT, BLT_ADDITIVE, BLT_BLOOM, BLT_SEQUENTIAL, BLT_DEPTH_TEST, BLT_DEPTH_WRITE, BLT_DEPTH_CLEAR, etc.
Dependency count	4 bits	Number of predecessor job IDs declared (0–4)
Predecessor IDs	0–4 × 16 bits	Job IDs that must complete before this job dispatches
Texture source type	4 bits	TEX_BSRAM_DIRECT, TEX_BSRAM_CACHE, TEX_GFXSRAM, TEX_FLASH, TEX_DDR3, TEX_INLINE
Texture source address	32 bits	Address or inline data depending on source type
Texture width / height / stride	16+16+16 bits	Source texture dimensions
Texture pixel format	8 bits	1bpp, 4bpp, 8bpp, 16bpp, 32bpp
Destination address	32 bits	Target bitmap pointer
Destination stride	16 bits	Bytes per row in destination
Dest X, Y	16+16 bits	Top-left of destination rectangle
Width / Height	16+16 bits	Operation dimensions in output pixels
Completion task	16 bits	EE task address to launch on completion (0 = none)
Job ID (out)	16 bits	Assigned by blitter scheduler, returned to EE

Transform-capable operations (affine blit, Mode 7, textured triangles, etc.) add a transform/UV parameter block after the base descriptor.

BLT_SEQUENTIAL in the flags field is equivalent to declaring a dependency on every currently in-flight job — it provides a full memory fence without requiring the programmer to enumerate specific job IDs.
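
The field widths in the table, and the fence equivalence just described, can be captured in a small host-side model. Everything below is illustrative: the class, the bit position chosen for BLT_SEQUENTIAL, and the helper name are assumptions, and only a few of the descriptor fields are modelled. It covers the "wait for all in-flight jobs" half of the fence, not the blocking of later jobs.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

BLT_SEQUENTIAL = 1 << 8    # hypothetical bit position in the 16-bit flags field
MAX_DEPENDENCIES = 4


@dataclass
class BlitJob:
    operation: int                                         # 8-bit type code
    flags: int = 0                                         # 16-bit flags
    depends_on: List[int] = field(default_factory=list)    # 16-bit job IDs
    completion_task: int = 0                               # 16-bit, 0 = none

    def validate(self) -> None:
        if not 0 <= self.operation < (1 << 8):
            raise ValueError("operation must fit in 8 bits")
        if not 0 <= self.flags < (1 << 16):
            raise ValueError("flags must fit in 16 bits")
        if len(self.depends_on) > MAX_DEPENDENCIES:
            raise ValueError("at most 4 predecessor job IDs")
        if any(not 0 <= job_id < (1 << 16) for job_id in self.depends_on):
            raise ValueError("job IDs must fit in 16 bits")
        if not 0 <= self.completion_task < (1 << 16):
            raise ValueError("completion task must fit in 16 bits")


def effective_dependencies(job: BlitJob, in_flight: Iterable[int]) -> Set[int]:
    """Jobs that must finish before `job` may dispatch."""
    job.validate()
    if job.flags & BLT_SEQUENTIAL:
        return set(in_flight)          # fence: wait for everything running
    return set(job.depends_on)


fence = BlitJob(operation=1, flags=BLT_SEQUENTIAL)
print(effective_dependencies(fence, [3, 4, 5]))
```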

5. Blitter vs Hardware Layer System

The blitter and the hardware layer system (tilemaps, hardware sprites) are independent and complementary:

	Hardware layer system	Blitter to bitmap layer
Timing	Real-time, per-scanline during display	Pre-rendered during V-blank/H-blank
Sprite count	Hundreds per scanline (hardware pipeline)	Thousands per frame (throughput limited)
Transforms	None (pixel replication only)	Full affine per sprite
Tilemap scroll	Per-tile-row H / per-tile-col V (hardware)	Any affine including Mode 7
Setup cost	Register writes only	Job descriptor queue
Latency	Zero — always live	One frame (rendered previous V-blank)
Colour depth	Indexed palette	Any (8bpp, 16bpp, 32bpp)
Clip	Layer clip table	Per-blit or clip table at composite time

A typical game uses both: hardware sprites for the player character and key game objects (zero latency, always live), blitter sprites for large numbers of background objects, particles, or effects where affine transforms are needed or the count exceeds the hardware sprite budget.

6. Blitter in the Display Pipeline

The blitter's output is an intermediate bitmap layer. That layer then enters the scanline mixer at the same priority level as any other layer — hardware tilemap, hardware sprites, or framebuffer layers. The scanline mixer does not know or care whether a layer was rendered by the hardware tilemap engine or the blitter.

Previous frame V-blank:
    EE dispatches blit jobs to blitter
    Blitter renders primitive list → bitmap layer A
    Blitter raises completion interrupt → EE launches "blit complete" task
    EE dispatches Stage 2 blit: bitmap A → output layer, with clip table

Current frame display:
    Scanline mixer composites all layers:
        Hardware tilemap layer(s)
        Hardware sprite layer
        Blitter bitmap layer  ← appears here
        UI overlay
        etc.

7. Clip Table Integration

The clip table (see display system document, Section 10) is applied when the blitter composites the intermediate bitmap to the output layer. This is the mechanism for split-screen, portals, and shaped display regions.

The EE generates the clip table during V-blank, then dispatches:

; Stage 2: blit intermediate bitmap to output layer with clip table
blit.composite  A0,          ; source: intermediate bitmap
                A1,          ; destination: output layer
                A2,          ; clip table pointer
                #BLT_CLIPPED

The blitter walks the clip table scanline by scanline as it writes to the output layer, skipping pixels outside the left/right boundaries. The operation is otherwise identical to a 2D blit — the clip table adds no significant cost per pixel.
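
In software terms the clipped composite is a row-by-row copy bounded by the table. The sketch assumes one inclusive (left, right) pair per scanline and treats left > right as an empty row; the actual clip table format is defined in the display system document, so this layout is a stand-in.

```python
def composite_clipped(src, dest, clip_table):
    """Copy src to dest, writing only pixels inside each scanline's
    inclusive [left, right] range. Returns the number of pixels written."""
    written = 0
    for y, (left, right) in enumerate(clip_table):
        first = max(left, 0)
        last = min(right, len(dest[y]) - 1)
        for x in range(first, last + 1):
            dest[y][x] = src[y][x]
            written += 1
    return written


source = [[5] * 4 for _ in range(3)]
output = [[0] * 4 for _ in range(3)]
clip = [(0, 3), (1, 2), (3, 0)]      # full row, middle two pixels, empty row
print(composite_clipped(source, output, clip))
for row in output:
    print(row)
```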

8. FireStorm EE and Blitter — Division of Labour

Task	EE	Blitter
Compute sprite positions, animation	✓	—
Dispatch blit job descriptors	✓	—
Generate clip tables	✓	—
Generate Mode 7 matrix lists	✓	—
Fine-grained pixel ops (register blitting)	✓	—
Bulk pixel copy	—	✓
Affine transform rendering	—	✓
Line/shape/fill primitives	—	✓
Flood fill	—	✓
Text rendering	—	✓
Completion interrupt → EE task	—	✓ (raises interrupt)
Job queue management	shared	shared

The EE is the programmer-facing control surface. The blitter is the pixel-pushing engine. The EE tells the blitter what to draw; the blitter draws it and tells the EE when it's done.

9. Primitive List — Batch Dispatch

For rendering many sprites, lines, or particles in one frame, the EE builds a primitive list (a compact array of blit descriptors) and dispatches the entire list as a single job. The job's sub-unit processes the list in order from start to finish, raising one completion interrupt when done.

; Sprite list: 500 sprites, 16x16, 4bpp, masked
; Executes in list order — correct Z ordering guaranteed
blit.primlist  A0,          ; list base address
               #500,        ; entry count
               A1,          ; destination bitmap
               #BLT_4BPP | BLT_MASKED

; Particle list: additive blend — order irrelevant, but still in-order within job
blit.primlist  A2,          ; particle list
               #800,
               A3,          ; destination bitmap
               #BLT_ADDITIVE | BLT_BLOOM

The EE is free to dispatch other jobs — to different sub-units — while this job runs. A sprite list and a text pass and a memory copy dispatched at the same time all run in parallel on their respective sub-units. The sprite list's internal sequencing is preserved regardless.

Throughput: 500 × 16×16 4bpp masked sprites

500 sprites × 256 pixels = 128,000 pixels
At ~8 pixels/cycle at 380MHz ≈ 42 microseconds
Several thousand sprites per frame readily achievable,
in addition to the hardware sprite layer's hundreds per scanline.
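
The arithmetic behind the 42 microsecond figure can be reproduced directly. The sketch below uses only the numbers quoted above (8 pixels per cycle, 380MHz) and ignores descriptor fetch, mask reads, and memory stalls, so it is a best-case estimate rather than a measured result.

```python
SPRITES = 500
PIXELS_PER_SPRITE = 16 * 16
PIXELS_PER_CYCLE = 8          # approximate figure from the text
CLOCK_HZ = 380_000_000

total_pixels = SPRITES * PIXELS_PER_SPRITE        # 128,000 pixels
cycles = total_pixels / PIXELS_PER_CYCLE          # 16,000 cycles
microseconds = cycles / CLOCK_HZ * 1e6

print(f"{total_pixels} pixels, {cycles:.0f} cycles, {microseconds:.1f} us")
```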

11. Primitive Types — Complete List

Category	Primitive	Texture source	Notes
Copy	Linear copy	Any	With optional format conversion
	2D rectangular blit	Any	With stride
	3-source blit	Any	Any 3-input boolean function
Sprites	Masked sprite	Any	Colour-key transparency
	Alpha sprite	Any	Per-pixel or per-palette alpha
	Scaled sprite	Any	Nearest / bilinear
	Affine sprite	Any	Full scale/rotate/shear
Triangles	Filled triangle	—	Flat colour
	Textured triangle	Any	UV interpolated, nearest/bilinear/perspective-correct
	Triangle list	Any	Batch of triangles sharing one texture
Tilemaps	Flat tilemap	Any	Standard scrolling tilemap to bitmap
	Affine tilemap	Any	Mode 7-style with transform matrix
	Perspective tilemap	Any	Per-scanline matrix list
Lines	Line	—	Bresenham integer
	Antialiased line	—	Wu's algorithm
	Bloom line	—	Antialiased + phosphor glow, intensity + falloff, additive blend
	Polyline	—	Connected segment list, Bresenham/AA/bloom variants
	Vector line list	—	Batch bloom lines, additive blend, single completion interrupt
Particles	Particle	—	Point sprite, size + colour + bloom falloff, additive blend
	Particle list	—	Batch particles, additive blend, single completion interrupt
	Spark trail	—	Particle list with intensity ramp along path
Shapes	Filled rectangle	Any (pattern)	Solid or texture-pattern fill
	Filled circle	Any (pattern)	With optional antialiased outline
	Filled ellipse	Any (pattern)	With optional antialiased outline
	Filled triangle	—	Solid colour
	Rounded rectangle	Any (pattern)	Configurable corner radius
	Arc / sector	—	Circular arc or pie slice
Fill	Flood fill	—	Boundary fill, scanline-coherent
	Span fill	Any (pattern)	Left/right boundary table
Bitmap ops	Bitmap composite	Any	With clip table and optional affine
	Colour space convert	—	RGB↔YCbCr, format conversions
	Downscale	—	Box / bilinear filter
	Flip	—	Horizontal, vertical
	Rotate	—	90°/180°/270° integer, arbitrary affine
Text	Glyph blit	BSRAM (permanent)	Single character from font bitmap
	String blit	BSRAM (permanent)	Full string with metrics
Prefetch	Texture prefetch	DDR3 → cache	Warm cache before blit list runs
Misc	Colour key mask	—	Generate 1bpp mask from colour key

12. Performance Counters and Profiler Overlay

FireStorm includes hardware performance counters for the blitter sub-system. All counters are memory-mapped registers readable via the FRAM bus by the EE, Pulse, or DeMon. They are designed to answer the practical question: which sub-unit is the bottleneck, and how much would adding another instance help?

The Hard RISC-V Core

The GoWin GW5AST-138 contains a hardened RISC-V core built into the FPGA silicon — not implemented in LUTs, so it consumes zero fabric resources. It exists specifically for internal hardware monitoring and debugging, with direct access to all FPGA-internal signals and registers. This core is the natural home for blitter performance counter collection.

The hard RISC-V:

Runs entirely independently of the FireStorm EE, Pulse, and DeMon
Has zero impact on the system it is measuring — it does not share any execution resource with the EE or blitter sub-units
Reads all internal blitter counters directly, without going through the FRAM bus
Formats and renders the profiler UI into the ImGui overlay layer's backing BSRAM
Can be activated or deactivated by a single register write from any chip in the system

Because it is hardened silicon rather than LUT logic, enabling the profiler has no effect on FPGA timing closure or resource utilisation. It is always present and always collecting — the only choice is whether to display the data.

Counter Snapshot Model

All counters run continuously during blitter operation. At each V-blank, the hardware latches a snapshot of all counters into a separate read register bank. The live counters continue accumulating; the snapshot holds the previous frame's values undisturbed until the next V-blank. The hard RISC-V reads the snapshot bank during the frame with no race conditions.

A reset-on-snapshot option clears the live counters at each V-blank latch, giving per-frame deltas rather than cumulative totals. Either mode is selectable via a control register.
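
The latch behaviour can be illustrated with a small software model. This is a sketch of the snapshot semantics only; the class and method names are hypothetical and do not correspond to real register names.

```python
class CounterBank:
    """Software model of live counters with a V-blank snapshot latch."""

    def __init__(self, names, reset_on_snapshot=False):
        self.live = {name: 0 for name in names}
        self.snapshot = dict(self.live)
        self.reset_on_snapshot = reset_on_snapshot

    def count(self, name, cycles=1):
        # Live counters accumulate continuously during blitter operation
        self.live[name] += cycles

    def vblank(self):
        # Latch every live value at once; readers only ever see the snapshot
        self.snapshot = dict(self.live)
        if self.reset_on_snapshot:
            for name in self.live:
                self.live[name] = 0


bank = CounterBank(["ACTIVE_CYCLES", "STALL_CYCLES"], reset_on_snapshot=True)
bank.count("ACTIVE_CYCLES", 100)
bank.vblank()                      # frame 1 latched
bank.count("ACTIVE_CYCLES", 40)    # frame 2 in progress, snapshot unchanged
frame1 = bank.snapshot["ACTIVE_CYCLES"]
bank.vblank()
frame2 = bank.snapshot["ACTIVE_CYCLES"]
```

With reset-on-snapshot enabled, frame1 is 100 and frame2 is 40 (per-frame deltas). With it disabled, frame2 would read 140 (cumulative total).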

Per-Sub-Unit Counters

For each sub-unit instance (pixel fill 0, pixel fill 1, memory/copy 0, etc.):

Counter	Meaning
`ACTIVE_CYCLES`	Cycles this unit spent executing a primitive
`STALL_CYCLES`	Cycles a job was waiting for this unit to become free
`IDLE_CYCLES`	Cycles this unit had no work queued
`JOBS_COMPLETED`	Number of jobs (or job segments) completed by this unit
`PRIMITIVES_COMPLETED`	Number of individual primitives processed

ACTIVE_CYCLES + STALL_CYCLES + IDLE_CYCLES = total frame cycles (sanity check).

Derived metrics:

Stall ratio    = STALL_CYCLES / (STALL_CYCLES + ACTIVE_CYCLES)
Utilisation    = ACTIVE_CYCLES / total_frame_cycles
Idle fraction  = IDLE_CYCLES   / total_frame_cycles

Stall ratio	Interpretation
< 10%	Sub-unit not a bottleneck
10–30%	Moderate contention — monitor
30–60%	Significant bottleneck — second unit recommended
> 60%	Severe bottleneck — adding a unit would substantially reduce frame time

A high idle fraction alongside a high stall ratio means jobs are queuing for this unit while it is actually doing nothing — a dependency ordering problem rather than a capacity problem. The fix is reviewing job dependency declarations, not adding units.
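
The derived metrics and interpretation bands above translate into a short helper. This is a sketch under one stated assumption: a value exactly on a band boundary is assigned to the higher band, since the table does not say which side 10%, 30%, and 60% belong to. The function names are illustrative.

```python
def unit_metrics(active, stall, idle):
    """Derived metrics for one sub-unit from its snapshot counters."""
    total = active + stall + idle      # total frame cycles, per the sanity check
    busy = active + stall
    return {
        "stall_ratio": stall / busy if busy else 0.0,
        "utilisation": active / total if total else 0.0,
        "idle_fraction": idle / total if total else 0.0,
    }


def interpret(stall_ratio):
    """Map a stall ratio onto the interpretation bands (boundaries assumed)."""
    if stall_ratio < 0.10:
        return "not a bottleneck"
    if stall_ratio < 0.30:
        return "moderate contention"
    if stall_ratio < 0.60:
        return "significant bottleneck"
    return "severe bottleneck"


# Example values: 9.8 active, 3.2 stalled, 3.7 idle (any consistent unit)
m = unit_metrics(9.8, 3.2, 3.7)
band = interpret(m["stall_ratio"])
```

For these example values the stall ratio is about 0.246, which falls in the moderate contention band.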

Per-Job Counters

For each job ID:

Counter	Meaning
`DISPATCH_TIME`	Cycle timestamp when the job was submitted
`START_TIME`	Cycle timestamp when the job first acquired a sub-unit
`COMPLETE_TIME`	Cycle timestamp when the job's last primitive completed
`STALL_TIME`	Total cycles this job spent waiting for a sub-unit
`EXECUTE_TIME`	Total cycles this job spent actively executing primitives

Queue latency  = START_TIME - DISPATCH_TIME
Total latency  = COMPLETE_TIME - DISPATCH_TIME
Stall fraction = STALL_TIME / (STALL_TIME + EXECUTE_TIME)
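
The two latency figures are plain timestamp differences, but a free-running cycle counter will eventually wrap. The sketch below computes them with modular subtraction. The 32-bit timestamp width is a hypothetical assumption (the counter width is not stated here), as are the function names.

```python
def elapsed(later, earlier, bits=32):
    """Cycles from earlier to later, tolerating one counter wraparound.

    bits is a hypothetical timestamp width, not a documented value.
    """
    return (later - earlier) % (1 << bits)


def job_latencies(dispatch_time, start_time, complete_time, bits=32):
    return {
        "queue_latency": elapsed(start_time, dispatch_time, bits),
        "total_latency": elapsed(complete_time, dispatch_time, bits),
    }


# Job dispatched just before the (assumed 32-bit) counter wraps
lat = job_latencies(0xFFFFFF00, 0x00000010, 0x00000410)
print(lat)
```

Here the job waited 272 cycles for a sub-unit and finished 1296 cycles after dispatch, even though START_TIME and COMPLETE_TIME are numerically smaller than DISPATCH_TIME.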

Global Blitter Counters

Counter	Meaning
`FRAME_CYCLES`	Total cycles in the last frame
`BLITTER_BUSY_CYCLES`	Cycles at least one sub-unit was active
`ALL_IDLE_CYCLES`	Cycles all sub-units were idle simultaneously
`JOBS_DISPATCHED`	Jobs submitted this frame
`JOBS_COMPLETED`	Jobs completed this frame
`DEPENDENCY_STALL_CYCLES`	Cycles jobs spent held on dependency (not sub-unit contention)
`PEAK_QUEUE_DEPTH`	Maximum jobs simultaneously queued or in-flight

DEPENDENCY_STALL_CYCLES separates programmer-imposed ordering overhead from genuine sub-unit contention. Only the latter is improved by adding units.

The ImGui Profiler Overlay

The hard RISC-V renders blitter profiling data into the ImGui overlay layer — a system-reserved layer in the FireStorm scanline mixer that sits above all application content. The layer is owned by the monitoring subsystem, composited in hardware, and visible on all display outputs simultaneously.

Activation: A single write to the profiler enable register, accessible via FRAM from the EE, DeMon, or Copper. A keyboard shortcut, a debug menu option, a Copper trap on a specific raster line — any mechanism that can write a register can toggle the overlay. The hard RISC-V continues collecting data regardless of whether the overlay is visible.

Overlay contents (suggested layout):

┌─ FireStorm Blitter Profiler ────────────────────────────────┐
│ Frame: 16.7ms  Busy: 12.3ms (74%)  Idle: 4.4ms (26%)        │
│                                                             │
│ Sub-unit        Active    Stall    Idle   Stall%  Jobs      │
│ Pixel fill  0   8.2ms     1.1ms    7.4ms   12%    47        │
│ Pixel fill  1   7.9ms     0.8ms    8.0ms   9%     44        │
│ Line/part   0   2.1ms     0.0ms   14.6ms   0%     3         │
│ Memory/copy 0   9.8ms     3.2ms    3.7ms   25%  ← watch     │
│ Composite   0   1.2ms     0.0ms   15.5ms   0%     8         │
│ Ray/DDA     0   4.4ms     0.0ms   12.3ms   0%     12        │
│                                                             │
│ Dependency stalls: 0.4ms    Peak queue depth: 7             │
│                                                             │
│ Last 8 frames: ▁▂▃▂▂▃▂▂  (frame time sparkline)             │
└─────────────────────────────────────────────────────────────┘

The ← watch annotation is generated automatically when a unit's stall ratio exceeds the warning threshold. The hard RISC-V computes the stall ratios and derived metrics directly from the counter snapshot, applies the threshold logic, and renders the formatted text into the overlay BSRAM each frame — typically completing well within V-blank.

Impact on the system being profiled: zero. The hard RISC-V runs on its own clock domain. The ImGui layer sits above all application layers in the compositor priority stack — it never overwrites application BSRAM. The blitter sub-units are unaffected by the overlay rendering because the hard RISC-V writes to BSRAM regions not used by the blitter. The frame time numbers shown are the frame times of the system running normally, not the system running with profiling overhead.

Persistence: The profiler can also write counter snapshots to a circular buffer in FireStorm DDR3, giving a rolling history that can be read out post-session by the EE, saved to disk, or streamed over the network via AntOS. Long-running profiling sessions that capture rare frame spikes are feasible this way.
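
The rolling history behaves like a fixed-size ring buffer: once it is full, each new snapshot displaces the oldest. A minimal sketch of that behaviour follows, using Python's deque with a maximum length. The capacity, record layout, and function name are hypothetical; the real buffer size would depend on the DDR3 budget.

```python
from collections import deque

HISTORY_FRAMES = 4  # hypothetical capacity

history = deque(maxlen=HISTORY_FRAMES)


def record(frame_number, snapshot):
    # The oldest entry is dropped automatically once the buffer is full
    history.append((frame_number, dict(snapshot)))


# Six frames of FRAME_CYCLES, with a spike in frame 2
for frame, cycles in enumerate([1000, 1000, 2500, 1000, 1000, 1000]):
    record(frame, {"FRAME_CYCLES": cycles})

frames_kept = [frame for frame, _ in history]
spike_frame, spike = max(history, key=lambda entry: entry[1]["FRAME_CYCLES"])
```

After six frames only frames 2 to 5 remain, and a post-session scan still finds the spike in frame 2 because it has not yet been overwritten.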

Workflow: Identifying and Fixing a Bottleneck

Enable the overlay — write the profiler enable register. The UI appears immediately.
Find the bottleneck — look for the sub-unit with the highest stall% in the overlay. The ← watch annotation appears automatically at the threshold.
Check dependency stalls — if DEPENDENCY_STALL_CYCLES is large relative to sub-unit stall cycles, the problem is job ordering rather than capacity. Review dependency declarations.
If it is a capacity problem — increment the HDL instance count for the bottleneck sub-unit, rebuild the bitstream. No code changes. Re-enable the overlay and confirm the stall% has dropped.
Check idle fractions — a unit with high idle% is using LUTs unnecessarily. Reducing its count recovers LUT budget for units that are actually busy.
Disable the overlay — write the profiler enable register. The layer disappears. The hard RISC-V continues collecting in the background.
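
The decision steps in this workflow can be sketched as a small function. Several details are assumptions made for illustration: "large relative to sub-unit stall cycles" is read as dependency stalls exceeding the sum of all sub-unit stalls, the 30% stall limit is taken from the interpretation table, and the 80% idle limit is arbitrary. All names are hypothetical.

```python
def stall_ratio(active, stall):
    busy = active + stall
    return stall / busy if busy else 0.0


def diagnose(units, dependency_stall, stall_limit=0.30, idle_limit=0.80):
    """Suggest a next step from one frame of counters.

    units            : {name: (active, stall, idle)} in cycles
    dependency_stall : DEPENDENCY_STALL_CYCLES for the same frame
    Both limits and the ordering test are illustrative assumptions.
    """
    worst = max(units, key=lambda n: stall_ratio(units[n][0], units[n][1]))
    ratio = stall_ratio(units[worst][0], units[worst][1])
    unit_stall_total = sum(stall for _, stall, _ in units.values())

    if dependency_stall > unit_stall_total:
        advice = "ordering problem: review dependency declarations"
    elif ratio >= stall_limit:
        advice = f"capacity problem: add an instance of {worst}"
    else:
        advice = "no sub-unit above the stall limit"

    # Units that are mostly idle are candidates for a lower instance count
    underused = [
        name for name, (active, stall, idle) in units.items()
        if (active + stall + idle)
        and idle / (active + stall + idle) >= idle_limit
    ]
    return worst, advice, underused


frame = {
    "pixel_fill_0": (8200, 1100, 7400),
    "memory_copy_0": (9800, 3200, 3700),
    "composite_0": (1200, 0, 15500),
}
worst, advice, underused = diagnose(frame, dependency_stall=400)
```

For this sample frame the most contended unit is memory_copy_0 at roughly 25% stall, which is below the assumed limit, so no extra instance is suggested; composite_0 is flagged as underused.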