FireStorm Blitter
1. Overview
The FireStorm Blitter is a dedicated hardware unit inside the FireStorm FPGA that performs bulk pixel and memory operations independently of the CPU cores. It is controlled by the FireStorm Execution Engine (EE), which dispatches blit jobs and can continue executing while the blitter runs in the background. When a job completes, the blitter raises an interrupt that the EE scheduler can use to launch the next task.
This separation is deliberate:
- The EE is a programmable CPU optimised for control flow, arithmetic, and fine-grained pixel work via register blitting
- The Blitter is dedicated hardware optimised for throughput — moving, transforming, and drawing large amounts of pixel data at maximum BSRAM bandwidth
- Neither gets in the other's way
The blitter operates on bitmap layers — intermediate framebuffers (8bpp, 16bpp, or 32bpp) that the scanline mixer composites alongside the hardware tilemap and sprite layers. Blitter-drawn content is distinct from the hardware sprite and tilemap system, and the two coexist freely.
2. Execution Model
The blitter is a parallel job scheduler built around specialised sub-units. The rules are simple:
- A job is a list of primitives that executes sequentially, in order. Each primitive is routed to the appropriate sub-unit for its type.
- If the required sub-unit is busy with another job, the current job stalls at that primitive and waits until the sub-unit becomes free, then continues.
- Multiple jobs run concurrently. When their primitives target different sub-units — or when enough instances of a sub-unit exist — they make progress simultaneously with no waiting.
- A job that needs the output of another job declares that job as a dependency. Without a dependency, jobs dispatch immediately.
This is natural hardware backpressure. No job fails or errors when a sub-unit is busy — it simply waits its turn. The programmer writes jobs as straightforward primitive lists; the scheduler handles all concurrency and contention transparently.
Sub-Units
| Sub-unit | Handles |
|---|---|
| Pixel fill | Sprites, tilemap fills, textured triangles, column fills, shapes, text glyphs |
| Line / particle | Bloom lines, polylines, particles, spark trails, vector primitives |
| Ray / DDA | Ray casting, 2D/3D DDA voxel stepping, height-field sampling |
| Memory / copy | Bulk transfers, texture prefetch, DMA, format conversion |
| Composite | Bitmap-to-layer compositing with clip table, colour space conversion, downscale |
Each sub-unit type can have multiple instances; the count is a compile-time HDL parameter. If profiling shows that many jobs are stalling on the memory/copy sub-unit, the fix is to increment that parameter and rebuild the bitstream: one more copy unit, fewer stalls, and no software or API changes. The programmer never has to think about sub-unit counts; the hardware simply has more or less capacity.
A job can span multiple sub-unit types within a single primitive list. A job that copies a texture and then blits sprites from it uses the memory unit for the copy primitive. When it reaches the sprite primitives it moves to a pixel fill unit, stalling first if none is free. This is entirely correct behaviour.
Jobs Are Sequential Internally
All primitives within a job execute in order. Draw order, copy-before-blit sequencing, line-before-composite — all preserved by the sequential-within-job guarantee. There are no flags to set.
Jobs Run Concurrently
Multiple jobs run at the same time. Each job advances through its primitive list independently, stalling only when the sub-unit it needs is occupied. Jobs targeting disjoint sub-units make progress simultaneously without any interaction.
; Three jobs dispatched at the same time:
Job 1: sprite list → uses Pixel fill unit
Job 2: text pass → uses Pixel fill unit (stalls if Job 1 is using it;
runs in parallel if 2 units exist)
Job 3: texture prefetch → uses Memory/copy unit (always parallel with Jobs 1 and 2)
; Job 3 always makes progress regardless of pixel fill contention.
; Jobs 1 and 2 share or contend on pixel fill depending on unit count.
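The contention behaviour above can be sketched as a small simulation. This is an illustrative model rather than the hardware arbiter: the function name `simulate`, the first-come arbitration with ties going to the lowest job ID, and the cycle counts are all assumptions made for the example.

```python
def simulate(jobs, unit_counts):
    """Return the finish time of each job under simple first-come arbitration.

    jobs:        {job_id: [(unit_type, cycles), ...]}  primitives run in order
    unit_counts: {unit_type: number_of_instances}      hypothetical HDL parameters
    """
    free_at = {unit: [0] * n for unit, n in unit_counts.items()}
    ready = {jid: 0 for jid in jobs}       # time each job can issue its next primitive
    cursor = {jid: 0 for jid in jobs}      # index of each job's next primitive
    finish = {jid: 0 for jid in jobs if not jobs[jid]}
    active = {jid for jid in jobs if jobs[jid]}

    while active:
        # Serve the earliest request first; ties go to the lowest job ID (assumption).
        jid = min(active, key=lambda j: (ready[j], j))
        unit, cycles = jobs[jid][cursor[jid]]
        slots = free_at[unit]
        i = min(range(len(slots)), key=slots.__getitem__)   # earliest-free instance
        start = max(ready[jid], slots[i])                    # stall while the unit is busy
        slots[i] = ready[jid] = start + cycles
        cursor[jid] += 1
        if cursor[jid] == len(jobs[jid]):
            finish[jid] = ready[jid]
            active.remove(jid)
    return finish


jobs = {
    1: [("pixel", 100)],   # sprite list
    2: [("pixel", 50)],    # text pass
    3: [("copy", 80)],     # texture prefetch
}
one_unit = simulate(jobs, {"pixel": 1, "copy": 1})
two_units = simulate(jobs, {"pixel": 2, "copy": 1})
```

With one pixel fill unit the text pass finishes at cycle 150; with two it finishes at cycle 50. The prefetch finishes at cycle 80 in both cases because it never competes for pixel fill.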
Job Dependencies
A job declares dependencies only when it genuinely needs another job's output:
Job 1: tilemap blit → Bitmap A (pixel fill)
Job 2: sprite list → Bitmap A DEPENDS_ON Job 1
— reads bitmap A, must wait for Job 1 to finish
Job 3: text pass → Bitmap B (no dependency — different destination)
Job 4: composite → Output DEPENDS_ON Job 2, Job 3
→ Jobs 1 and 3 start immediately
→ Job 2 holds until Job 1 completes, then dispatches
→ Job 4 holds until both Jobs 2 and 3 complete
| Dependency option | Meaning |
|---|---|
| (none) | Dispatch immediately |
| DEPENDS_ON id[, id, ...] | Hold until up to 4 specific job IDs complete |
| BLT_SEQUENTIAL | Hold until all in-flight jobs complete; block new jobs until this one finishes (full memory fence) |
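The hold-until-complete rule can be illustrated with a short sketch that computes when each job in the four-job example starts and finishes. The function name `schedule` and the durations are hypothetical, and sub-unit contention is deliberately ignored so that only the dependency holds are visible.

```python
def schedule(jobs, max_deps=4):
    """Compute (start, finish) for each job from durations and dependencies.

    jobs: {job_id: (duration, [predecessor job IDs])}
    """
    times = {}
    pending = dict(jobs)
    while pending:
        progressed = False
        for jid, (duration, deps) in list(pending.items()):
            if len(deps) > max_deps:
                raise ValueError(f"job {jid}: at most {max_deps} dependencies")
            if all(d in times for d in deps):
                # No dependencies: dispatch at time 0.
                # Otherwise hold until the last predecessor completes.
                start = max((times[d][1] for d in deps), default=0)
                times[jid] = (start, start + duration)
                del pending[jid]
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle or unknown job ID")
    return times


frame = schedule({
    1: (10, []),       # tilemap blit -> Bitmap A
    2: (5, [1]),       # sprite list  -> Bitmap A
    3: (4, []),        # text pass    -> Bitmap B
    4: (3, [2, 3]),    # composite    -> Output
})
```

Jobs 1 and 3 start at time 0, Job 2 starts when Job 1 finishes, and Job 4 starts only after both Job 2 and Job 3 are done.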
Scaling Sub-Unit Counts
If a particular sub-unit is a recurring bottleneck, increase its instance count in the HDL parameters:
| If this is a bottleneck | Solution |
|---|---|
| Sprite/text jobs stalling on each other | More pixel fill units |
| Memory copies slowing blit setup | More memory/copy units |
| Voxel/ray jobs queuing | More ray/DDA units |
| Vector/particle jobs waiting | More line/particle units |
No software changes needed. The programmer's job descriptors are identical regardless of how many units exist.
Doom-Style Frame Example
V-blank — all jobs dispatched simultaneously:
Job 1: BSP traversal (ray/DDA)
Job 2: DDA column rays DEPENDS_ON Job 1 (ray/DDA — queues if busy)
Job 3: sprite list independent (pixel fill)
Job 4: texture prefetch independent (memory/copy — always parallel)
Job 5: particles independent (line/particle — always parallel)
Job 6: textured span fill DEPENDS_ON Job 2 (pixel fill)
Job 7: composite DEPENDS_ON Job 3, Job 6 (composite)
→ Jobs 1, 3, 4, 5 start immediately
→ Job 4 always runs in parallel — different sub-unit from everything else
→ Job 2 starts when Job 1 done
→ Job 6 starts when Job 2 done
→ Job 7 starts when Jobs 3 and 6 done
→ Completion interrupt after Job 7
Double and Triple Buffering
Two bitmap buffers per layer. The display reads buffer A while the blitter writes buffer B. At V-blank the EE swaps the pointers — no tearing, no partial frames visible.
V-blank N: display ← A, blitter → B
V-blank N+1: swap — display ← B, blitter → A
Triple buffering adds a third buffer so the blitter can start the next frame before the current one has been displayed — useful when the job graph takes longer than one frame period.
Dirty Tracking
The EE maintains a dirty flag per bitmap layer. If nothing has changed in the inputs to a layer since the last frame, the render job for that layer is skipped entirely and the front buffer retains the previous frame's content. A static HUD that only changes when the score updates generates zero blitter work between score changes.
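A minimal sketch of the dirty-flag idea, assuming one flag per layer that the game logic sets whenever a layer's inputs change. The `Layer` class and `jobs_for_frame` function are hypothetical names, and buffer management is left out.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    dirty: bool = True   # a new layer has never been rendered


def jobs_for_frame(layers):
    """Return the names of layers to re-render, then clear their dirty flags."""
    to_render = [layer.name for layer in layers if layer.dirty]
    for layer in layers:
        layer.dirty = False
    return to_render


hud = Layer("hud")
playfield = Layer("playfield")
layers = [hud, playfield]

first = jobs_for_frame(layers)     # both layers render once
playfield.dirty = True             # scrolling changed the playfield inputs
second = jobs_for_frame(layers)    # the HUD is skipped
hud.dirty = True                   # the score changed
third = jobs_for_frame(layers)
```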
3. Control Model
Dispatching Blit Jobs
The EE dispatches blit jobs by writing a job descriptor to the blitter's command queue. The descriptor specifies the operation type, source and destination addresses, dimensions, transform parameters, dependency declarations, and completion behaviour.
; Example: dispatch a sprite blit
blit.sprite A0, ; source sprite data
A1, ; destination bitmap
D0, D1, ; destination X, Y
#16, #16, ; width, height
#BLT_MASKED | BLT_4BPP ; flags
The blitter accepts the job, assigns it a job ID, and dispatches it immediately to an available sub-unit (or holds it pending dependency completion). The EE continues immediately.
Completion Interrupt → EE Task
When a blit job completes:
Blitter job done
↓
Blitter raises interrupt
↓
EE hardware scheduler launches registered task
(~2 cycle context switch, no OS overhead)
Synchronisation
| EE instruction | Effect |
|---|---|
| WAIT Rd | Yield until job ID in Rd is complete |
| STATUS Rd | Test if job ID is still running; sets/clears Z flag, no yield |
| YIELD | Voluntarily yield, scheduler may run blitter-triggered tasks |
3. Texture Source System
Any blitter primitive that reads pixel data from a source — sprites, filled triangles, tilemaps, bitmap composites, pattern fills — uses the same texture source mechanism. The source is not restricted to BSRAM; it can come from any level of the memory hierarchy. Full memory specifications are in the Memory Architecture reference.
Memory Hierarchy
FireStorm DDR3 ← full art library
↑
Graphics SRAM (Ant64/Ant64C) ← active texture pool, intermediate buffers
↑
BSRAM texture cache ← hottest working set, 380MHz
↑
Blitter sampler
↓
Destination bitmap
FireStorm DDR3 — bulk art library. Too large for on-chip memory; the working set for any scene is a fraction of the total.
Graphics SRAM (Ant64/Ant64C, 4.5MB) — fast pipeline SRAM dedicated to the blitter and audio DSP. 1-cycle latency, burst mode for sequential streaming. Ideal for texture atlases and high-colour intermediate buffers. Shared with the audio DSP for wavetables and FM voice data.
BSRAM texture cache (~128–256KB) — hottest working set. Cache hits at full 380MHz bandwidth.
Permanent BSRAM residency — frequently used assets at fixed BSRAM addresses, bypassing all caching.
Texture Source Field
Every blit job descriptor includes a texture source descriptor:
| Source type | Location | Use case |
|---|---|---|
| TEX_BSRAM_DIRECT | Fixed BSRAM address | Permanent hot assets |
| TEX_BSRAM_CACHE | BSRAM cache → Graphics SRAM | Normal texture sampling |
| TEX_GFXSRAM | Graphics SRAM direct | Large atlases, intermediate buffers |
| TEX_FLASH | FPGA flash | Built-in fonts, default palettes |
| TEX_DDR3 | DDR3 | Full art library |
| TEX_INLINE | In job descriptor | Tiny patterns, solid colours |
Cache Architecture
- Size: 128–256KB of BSRAM (configurable at HDL build time)
- Backing store: Graphics SRAM (primary), DDR3 (overflow)
- Organisation: Direct-mapped or 2-way set-associative
- Cache line: One tile (e.g. 16×16 × 4bpp = 128 bytes)
- Eviction: LRU or pseudo-LRU
- EE prefetch: blit.prefetch addr, size (warms the cache from Graphics SRAM before the job list runs)
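As an illustration of the direct-mapped option with tile-sized lines, the sketch below splits a backing-store address into tag, line index, and byte offset, and models a prefetch as touching every line in a range. It assumes the 128KB configuration and 128-byte lines; the class and function names are hypothetical, and the real tag layout is not specified here.

```python
LINE_BYTES = 128                  # one 16x16 tile at 4bpp
CACHE_BYTES = 128 * 1024          # smallest configuration listed above
NUM_LINES = CACHE_BYTES // LINE_BYTES


def split_address(addr):
    """Split a backing-store byte address into (tag, line index, byte offset)."""
    offset = addr % LINE_BYTES
    line_number = addr // LINE_BYTES
    return line_number // NUM_LINES, line_number % NUM_LINES, offset


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.hits = self.misses = 0

    def access(self, addr):
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag        # fill the line, replacing whatever was there
        self.misses += 1
        return False

    def prefetch(self, addr, size):
        """Warm the cache by loading every line that overlaps [addr, addr + size)."""
        first_line = addr - addr % LINE_BYTES
        for line_addr in range(first_line, addr + size, LINE_BYTES):
            tag, index, _ = split_address(line_addr)
            self.tags[index] = tag


cache = DirectMappedCache()
cache.prefetch(0x4000, 4 * LINE_BYTES)          # warm four tiles
warm = cache.access(0x4000 + 3 * LINE_BYTES)    # inside the prefetched range
cold = cache.access(0x4000 + 4 * LINE_BYTES)    # just beyond it
```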
Memory Summary
For capacity figures, bandwidth, and bus isolation details see the Memory Architecture reference. In brief: the EE Code SRAM bus is exclusively the EE's; Graphics SRAM is shared between the blitter and audio DSP; FireStorm's DDR3 serves the blitter and audio DSP only — the SG2000 has its own separate memory.
All Sampling Primitives Use This System
| Primitive | Texture source usage |
|---|---|
| Sprite blit | Sprite sheet — 4bpp/8bpp indexed, palette from palette RAM |
| Textured triangle | Texture atlas — UV interpolated across triangle, any pixel format |
| Affine tilemap | Tile pixel data — fetched by tile index from texture pool |
| Mode 7 / perspective tilemap | Floor/ceiling texture — sampled via inverse affine |
| Bitmap composite | Source bitmap — used as texture for the composite operation |
| Pattern fill | Repeating pattern tile — any size up to cache line |
| Text glyph | Font bitmap — typically permanent BSRAM resident |
3.1 Memory Operations
Linear copy
Simple source → destination copy. Optional format conversion at copy time (e.g. 4bpp → 8bpp expansion, ARGB → BGRA swizzle). The fastest blit, limited only by BSRAM bandwidth.
2D copy (rectangular blit)
Copy a rectangle from source bitmap to destination bitmap with independent source and destination strides. The workhorse operation for sprite and background rendering.
3-source blit (Amiga-style)
Three inputs: A (source), B (minterm mask), C (destination). Result = boolean function of A, B, C using any of the 256 possible 3-input logic functions. Covers: copy, masked copy, XOR, NOT, AND, OR, and every combination. The full Amiga blitter operation set is a subset of this.
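The 256 logic functions can be selected with a single 8-bit code, one bit per combination of the three inputs. The sketch below uses the Amiga blitter's minterm numbering, where bit n of the code gives the output for n = (A << 2) | (B << 1) | C. Whether FireStorm encodes the function the same way is an assumption, and `blit3` is a hypothetical name.

```python
def blit3(a, b, c, minterm, width=8):
    """Combine three words with one of the 256 three-input logic functions."""
    mask = (1 << width) - 1
    result = 0
    for n in range(8):
        if (minterm >> n) & 1:
            # Build the term that is 1 exactly where (A, B, C) matches combination n.
            term = mask
            term &= a if n & 4 else ~a
            term &= b if n & 2 else ~b
            term &= c if n & 1 else ~c
            result |= term
    return result & mask


COPY_A = 0xF0        # D = A
XOR_A_C = 0x5A       # D = A xor C
MASKED_COPY = 0xE2   # D = (A and B) or (C and not B)

a, b, c = 0b11001010, 0b11110000, 0b00110101
masked = blit3(a, b, c, MASKED_COPY)
```

With B as the mask, `MASKED_COPY` takes source bits where the mask is set and keeps destination bits elsewhere.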
Fill
Fill a rectangle with a solid colour or a repeating pattern. Pattern can be up to 16×16 pixels, tiled across the destination rectangle.
3.2 Sprite Blitting
Masked sprite blit
Blit a sprite rectangle to a destination bitmap, treating one colour index (typically index 0) as transparent. The fundamental sprite draw operation.
Flags: BLT_MASKED | BLT_4BPP | BLT_16BPP_DEST
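A software model of the masked blit, as a sketch only: it assumes the sprite has already been unpacked to one palette index per pixel, that the destination holds direct-colour values, and that the sprite lies fully inside the destination (no clipping). The name `blit_masked` and its parameters are hypothetical.

```python
def blit_masked(dest, dest_stride, dx, dy, sprite, width, height, palette, transparent=0):
    """Copy an indexed sprite into a direct-colour bitmap, skipping one index.

    dest:    flat list of destination pixels, dest_stride pixels per row
    sprite:  flat list of palette indices, `width` per row
    palette: list mapping palette index to destination colour
    """
    for y in range(height):
        for x in range(width):
            index = sprite[y * width + x]
            if index != transparent:          # transparent pixels leave dest untouched
                dest[(dy + y) * dest_stride + dx + x] = palette[index]


dest = [9] * 16                 # 4x4 bitmap filled with background colour 9
sprite = [0, 1,
          2, 0]                 # 2x2 sprite, index 0 is transparent
palette = [0, 100, 200]
blit_masked(dest, 4, 1, 1, sprite, 2, 2, palette)
```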
Alpha sprite blit
Blit with per-pixel or per-palette-entry alpha blend. Either the source carries alpha information (RGBA) or the palette entry's alpha byte is used.
Scaled sprite blit
Blit with nearest-neighbour or bilinear scaling. Scale factors are independent in X and Y: stretch, squash, zoom.
Affine sprite blit
Full affine transform: scale, rotate, shear. Specified as a 2×2 matrix plus translation. Used for Mode 7-style effects, rotating game objects, perspective sprites.
Software Sprite Throughput
The blitter is designed to draw large numbers of software sprites efficiently. Unlike the hardware sprite layer (which uses a dedicated sort/fetch pipeline operating on native-resolution pixel data), blitter sprites are drawn to an intermediate bitmap layer at blit time.
Sprite pixel data can come from any level of the texture source hierarchy — permanent BSRAM for the most frequently used sprites, DDR3-backed texture cache for the full sprite library. With the EE pre-warming the cache at the start of the frame, cache hits during blitting are the normal case.
Example: 16-colour (4bpp), 16×16 sprites from BSRAM cache
Each 16×16 4bpp sprite is 128 bytes. At ~8 pixels/cycle at 380MHz:
16×16 sprite = 256 output pixels
At ~8 pixels/cycle = 32 cycles per sprite
At 380MHz = ~11.875 million sprites/second
At 60fps = ~198,000 sprites per frame
In practice the limit is blitter job overhead, destination write bandwidth, and available V-blank/H-blank time rather than raw pixel throughput. A practical budget of several thousand 16×16 sprites per frame is realistic and well in excess of any game's requirements.
These are in addition to the hardware sprite layer's own budget of hundreds of sprites per scanline — the two systems operate independently and can both be active simultaneously.
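The raw-throughput arithmetic above can be reproduced in a few lines. The clock rate and the approximate 8 pixels/cycle figure come from the text; the function name `sprite_budget` is hypothetical, and the result is the theoretical ceiling rather than the practical budget.

```python
CLOCK_HZ = 380_000_000
PIXELS_PER_CYCLE = 8          # approximate figure quoted above


def sprite_budget(width, height, fps=60):
    """Return (cycles per sprite, sprites per second, sprites per frame)."""
    pixels = width * height
    cycles = pixels / PIXELS_PER_CYCLE
    per_second = CLOCK_HZ / cycles
    return cycles, per_second, per_second / fps


cycles, per_second, per_frame = sprite_budget(16, 16)
```

For a 16×16 sprite this gives 32 cycles, 11,875,000 sprites per second, and about 197,917 sprites per frame at 60fps.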
3.3 Tilemap Blitting
Flat tilemap blit
Render a tilemap into a bitmap layer: the same result as the hardware tilemap layer, but software-rendered to an intermediate bitmap. Useful when the tilemap needs post-processing before display.
Affine tilemap blit (Mode 7)
Render a tilemap with a full affine transform: floor/ceiling planes, rotating/scaling game boards, SNES Mode 7-style backgrounds. The transform matrix maps output pixel positions back to tilemap coordinates via inverse affine. Perspective can be approximated by varying the scale per scanline (driven by the EE computing a new matrix each line).
; Mode 7 floor plane
; A0 = tilemap data, A1 = destination bitmap
; D0-D3 = affine matrix (fixed-point 16.16)
; D4, D5 = translation (screen centre)
blit.tilemap_affine A0, A1, D0, D1, D2, D3, D4, D5, #BLT_WRAP
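The inverse mapping can be sketched in 16.16 fixed point as follows. This assumes the four matrix entries are the inverse transform (output pixel to tilemap coordinate) and that the translation is a pivot that maps to itself; both are assumptions about the register semantics, and `to_fix` and `screen_to_map` are hypothetical names.

```python
FIX_SHIFT = 16
ONE = 1 << FIX_SHIFT


def to_fix(value):
    """Convert a float to 16.16 fixed point."""
    return int(round(value * ONE))


def screen_to_map(x, y, a, b, c, d, cx, cy):
    """Map an output pixel to tilemap coordinates with a 16.16 inverse matrix.

    (a, b, c, d) is the inverse transform in 16.16; (cx, cy) is the pivot in
    integer pixels.
    """
    rx, ry = x - cx, y - cy
    u = (a * rx + b * ry) >> FIX_SHIFT
    v = (c * rx + d * ry) >> FIX_SHIFT
    return u + cx, v + cy


identity = screen_to_map(10, 20, ONE, 0, 0, ONE, 160, 120)
zoomed = screen_to_map(164, 120, to_fix(0.5), 0, 0, to_fix(0.5), 160, 120)
rotated = screen_to_map(161, 120, 0, -ONE, ONE, 0, 160, 120)
```

An inverse scale of 0.5 makes the output appear zoomed in by 2×: four output pixels to the right of the pivot sample only two tilemap pixels to the right.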
Perspective tilemap blit
A per-scanline affine blit where the EE provides a new matrix for each output line, creating a perspective projection. The blitter renders one scanline per matrix update. The EE computes the matrix sequence during the previous frame and hands it to the blitter as a matrix list.
3.4 Line and Shape Primitives
Line draw (Bresenham)
Draw a line between two points with a given colour. The fastest line: integer only, no antialiasing.
blit.line #x0, #y0, #x1, #y1, #colour
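For reference, a compact software version of the integer-only Bresenham walk, covering all octants. It returns the pixel list rather than writing to a bitmap; the hardware's exact pixel selection on ties may differ.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixels on the line from (x0, y0) to (x1, y1), inclusive."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return points
        e2 = 2 * err
        if e2 >= dy:      # step in x
            err += dy
            x0 += sx
        if e2 <= dx:      # step in y
            err += dx
            y0 += sy


horizontal = bresenham(0, 0, 3, 0)
diagonal = bresenham(0, 0, 3, 3)
shallow = bresenham(0, 0, 4, 2)
```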
Antialiased line (Wu)
Draw a line using Wu's algorithm: fractional pixel coverage at the endpoints and along diagonal edges, giving smooth sub-pixel accuracy. Single pixel wide with soft edges.
blit.line #x0, #y0, #x1, #y1, #colour, #BLT_ANTIALIAS
Bloom line — Vector CRT simulation
Draw an antialiased line with a phosphor bloom glow effect, simulating the characteristic aesthetic of vector arcade CRTs (Tempest, Asteroids, Star Wars, Battlezone). The bloom adds a coloured halo around the line core — bright and saturated at the centre, falling off with distance.
The bloom effect has two parameters beyond the line colour:
- Intensity (0–255) — drives the brightness of the line core and the width of the glow. Low intensity gives a dim line with a narrow soft edge. High intensity gives a saturated core with a wide bright halo, exactly as a vector CRT looks when the beam is driven hard
- Falloff (0–255) — controls how quickly the glow fades with distance from the line. High falloff gives a tight bright line. Low falloff gives a wide diffuse glow
The falloff model per pixel:
brightness(d) = intensity × falloff_factor^d
where d = perpendicular distance from line centre in pixels
falloff_factor = (255 - falloff) / 255 (0.0 = instant falloff, 1.0 = no falloff)
The glow is rendered in additional passes at d=1, 2, 3... pixels from the line centre, each at reduced intensity, stopping when brightness drops below a threshold (~4/255). At high intensity and low falloff the glow radius can extend 4–8 pixels from the line centre, matching real vector phosphor behaviour.
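The pass-by-pass glow can be tabulated with a short sketch. It assumes the falloff byte is inverted before use, falloff_factor = (255 - falloff) / 255, so that a higher falloff value decays faster, as the parameter descriptions and the preset table indicate. The names `bloom_profile` and `glow_radius` and the 8-pixel cap are hypothetical.

```python
THRESHOLD = 4     # stop once brightness falls below about 4/255
MAX_RADIUS = 8    # hypothetical cap so that falloff = 0 still terminates


def bloom_profile(intensity, falloff):
    """Return the brightness at d = 0, 1, 2, ... until it drops below THRESHOLD."""
    factor = (255 - falloff) / 255
    profile = []
    brightness = float(intensity)
    while brightness >= THRESHOLD and len(profile) <= MAX_RADIUS:
        profile.append(brightness)
        brightness *= factor
    return profile


def glow_radius(intensity, falloff):
    """Number of glow passes drawn beyond the line core (d = 0)."""
    return max(len(bloom_profile(intensity, falloff)) - 1, 0)


wide = glow_radius(220, 140)     # high intensity, lower falloff
tight = glow_radius(255, 240)    # maximum intensity, high falloff
```

Under these assumptions an intensity of 220 with falloff 140 reaches 5 pixels from the core, while 255 with falloff 240 reaches only 1.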
blit.line_bloom #x0, #y0, #x1, #y1, #colour, #intensity, #falloff
Destination bitmap must use additive blend mode for vector rendering. The bitmap layer is initialised to black. Each bloom line call uses BLT_ADDITIVE blend mode:
dest = clamp(dest + src, 0, max)
Colour values add together where lines overlap or cross — crossing points become brighter, exactly as they do on a real vector CRT where the electron beam passes twice. The destination saturates to maximum brightness at hot spots. This is the correct and authentic behaviour, not a workaround.
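The saturating add is simple enough to show per channel. This is a sketch with colours as tuples of channel values; since both inputs are non-negative, only the upper clamp is needed. The name `blend_additive` is hypothetical.

```python
def blend_additive(dest, src, max_value=255):
    """Saturating add of two colours given as tuples of channel values."""
    return tuple(min(d + s, max_value) for d, s in zip(dest, src))


crossing = blend_additive((200, 10, 0), (100, 20, 0))
```

Where two lines cross, the red channel here saturates at 255 while the green channel simply sums to 30.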
A complete vector frame is rendered by:
- Clear bitmap layer to black (one fast fill job)
- Dispatch bloom line list (one primitive list job, BLT_ADDITIVE)
- EE flips front/back buffer at V-blank
For a game with 200 lines per frame (Tempest has roughly this many), the entire frame renders in a few hundred microseconds, well within the V-blank budget. The bitmap layer is then composited by the scanline mixer alongside hardware sprite layers for UI elements.
Vector line list
Batch dispatch of many bloom lines in one job, using the same primitive list mechanism as sprite lists. Each entry specifies start point, end point, colour, intensity, and falloff. The blitter processes the list sequentially with a single completion interrupt at the end.
blit.linelist A0, ; list base — array of (x0,y0,x1,y1,colour,intensity,falloff)
#200, ; line count
A1, ; destination bitmap
#BLT_ADDITIVE | BLT_ANTIALIAS | BLT_BLOOM
Named vector presets — analogous to the display system's named CRT monitor presets:
| Preset | Intensity | Falloff | Character |
|---|---|---|---|
| Tempest | 200 | 180 | Wide saturated glow — Atari colour vector |
| Asteroids | 160 | 220 | Tighter glow, green-white phosphor feel |
| Star Wars | 220 | 140 | Very wide bloom, high-intensity battle look |
| Battlezone | 150 | 200 | Crisp wireframe, narrow green glow |
| Sharp | 255 | 240 | Maximum brightness, minimal bloom |
Particle (point sprite)
Draw a single dot at a given position. The simplest primitive, and much cheaper than a line since no angular rasterisation is needed.
Parameters:
- Position (x, y) — centre of the particle
- Colour — RGBA
- Size — radius in pixels (1 = single pixel, 2 = 2px radius disc, etc.)
- Intensity (0–255) — brightness, same scale as bloom lines
- Falloff (0–255) — phosphor glow falloff, same model as bloom lines. At falloff=255 the particle is a hard-edged disc. At lower values it has a soft glowing halo
With BLT_ADDITIVE | BLT_BLOOM, a particle glows exactly like a point light on a vector CRT — bright saturated core, soft halo, additive mixing with nearby particles and lines. Overlapping particles become brighter. This is the correct and authentic vector CRT behaviour.
A size-4 particle with bloom is visually similar to a filled circle but renders significantly faster — useful for explosion effects where exact circularity doesn't matter.
blit.particle #x, #y, #colour, #size, #intensity, #falloff, #BLT_ADDITIVE | BLT_BLOOM
Particle list
Batch dispatch of many particles as a single blit job, the natural companion to the vector line list. Each entry specifies position, colour, size, intensity, and falloff. The blitter renders all particles sequentially with one completion interrupt at the end.
blit.particlelist A0, ; list base — array of (x, y, colour, size, intensity, falloff)
#500, ; particle count
A1, ; destination bitmap
#BLT_ADDITIVE | BLT_BLOOM
A complete vector explosion frame — line list plus particle list — both writing additively to the same bitmap layer. Lines and particles mix naturally: a bright particle near a bright line creates a hot spot where they overlap, exactly as on a real vector display.
Spark trail
A particle list with a linearly or exponentially decreasing intensity along a path. Simulates the glowing tail of a missile, the decay arc of an explosion fragment, or the fading trace of a fast-moving object. The EE computes the intensity ramp and position array; the blitter renders the trail as a standard particle list.
; Spark trail: 20 particles along a path, intensity fading from 200 to 0
; EE pre-computes positions and intensities into a list at A0
blit.particlelist A0, #20, A1, #BLT_ADDITIVE | BLT_BLOOM
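The list that the EE pre-computes could be built along these lines. This sketch uses a linear ramp and the (x, y, colour, size, intensity, falloff) entry order shown for the particle list; the function name `spark_trail`, the default colour, size, and falloff, and the sample path are illustrative assumptions.

```python
def spark_trail(path, start_intensity=200, colour=0xFFFFFF, size=1, falloff=200):
    """Build particle-list entries along `path` with intensity fading linearly to 0."""
    count = len(path)
    entries = []
    for i, (x, y) in enumerate(path):
        if count > 1:
            # Integer ramp: full intensity at the head, 0 at the tail.
            intensity = start_intensity * (count - 1 - i) // (count - 1)
        else:
            intensity = start_intensity
        entries.append((x, y, colour, size, intensity, falloff))
    return entries


path = [(100 + 2 * i, 50 + i) for i in range(20)]
trail = spark_trail(path)
intensities = [entry[4] for entry in trail]
```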
Combined vector frame example
A Tempest-style frame — web wireframe + enemy shots + explosions + particles — all rendered additively to one bitmap layer:
; Frame job queue:
Job 1: blit.linelist web_lines, #180, bitmap, #BLT_ADDITIVE|BLT_BLOOM ; web
Job 2: blit.linelist enemy_lines, #40, bitmap, #BLT_ADDITIVE|BLT_BLOOM ; enemies
Job 3: blit.particlelist shot_particles, #12, bitmap, #BLT_ADDITIVE|BLT_BLOOM ; shots
Job 4: blit.particlelist explosion, #80, bitmap, #BLT_ADDITIVE|BLT_BLOOM ; explosion
; One completion interrupt after Job 4. EE flips front buffer at V-blank.
Total for ~310 primitives at typical complexity: well under 500 microseconds. With standard 1080p60 timing (45 blanking lines out of 1125 total), V-blank lasts approximately 0.67ms, which still leaves headroom for game logic and audio in the same V-blank period.
Filled rectangle
Fill a rectangle with solid colour or pattern. Faster than a general fill for rectangular regions.
Filled circle / ellipse
Draw a filled circle or ellipse. An antialiased outline variant is also available. Used for particles, explosions, simple UI elements.
Filled triangle
Draw a solid filled triangle with a flat colour. The fundamental primitive for 2D polygon rendering. Lists of triangles can be dispatched as a single blit job for polygon meshes.
Textured triangle
Draw a filled triangle sampling colour from a texture source (see Section 3 — Texture Source System). Each vertex carries UV texture coordinates; the blitter interpolates U and V linearly across the triangle and samples texture[u][v] for each output pixel.
Texture sampling modes:
- Nearest-neighbour — pixel art, no softening, fastest
- Bilinear — smooth scaling, 4 samples per pixel
- Perspective-correct — divide U/V by W per pixel, eliminating affine warping on 3D geometry
The texture source can be any level of the hierarchy — permanent BSRAM resident for small textures, DDR3-backed cache for large texture atlases. Triangle lists with a shared texture source are dispatched as a single blit job; the blitter rasterises each triangle in sequence without re-loading the texture.
; Textured triangle
blit.triangle_tex #x0,#y0,#u0,#v0, ; vertex 0: position + UV
#x1,#y1,#u1,#v1, ; vertex 1
#x2,#y2,#u2,#v2, ; vertex 2
A0, ; texture source descriptor
#BLT_BILINEAR | BLT_PERSP_CORRECT
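The difference between affine and perspective-correct sampling can be shown with barycentric weights. In the perspective-correct case u/w, v/w, and 1/w are interpolated linearly and the division happens per pixel. This is a floating-point sketch of the standard technique, not the blitter's fixed-point datapath, and `interpolate_uv` is a hypothetical name.

```python
def interpolate_uv(weights, vertices, perspective=True):
    """Interpolate (u, v) at a pixel from barycentric weights.

    weights:  (l0, l1, l2), barycentric coordinates summing to 1
    vertices: three (u, v, w) tuples, w being the per-vertex depth term
    """
    if not perspective:
        u = sum(l * vert[0] for l, vert in zip(weights, vertices))
        v = sum(l * vert[1] for l, vert in zip(weights, vertices))
        return u, v
    # Interpolate u/w, v/w and 1/w linearly, then divide per pixel.
    inv_w = sum(l / vert[2] for l, vert in zip(weights, vertices))
    u = sum(l * vert[0] / vert[2] for l, vert in zip(weights, vertices)) / inv_w
    v = sum(l * vert[1] / vert[2] for l, vert in zip(weights, vertices)) / inv_w
    return u, v


verts = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 1.0)]
midpoint = (0.5, 0.5, 0.0)                       # halfway along the edge v0 -> v1
affine = interpolate_uv(midpoint, verts, perspective=False)
correct = interpolate_uv(midpoint, verts)
```

At the screen-space midpoint of an edge whose far vertex has w = 3, affine interpolation gives u = 0.5 while the perspective-correct result is u = 0.25, which is the warping the per-pixel divide removes.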
Rounded rectangle
Rectangle with configurable corner radius. Common UI element.
Arc / sector
Circular arc or filled sector (pie slice). Useful for health/progress indicators.
3.5 Flood Fill
Flood fill (boundary fill)
Fill a connected region bounded by a specific colour, starting from a seed pixel. Uses a scanline-coherent algorithm for efficiency: it processes horizontal spans rather than individual pixels, keeping the work queue small.
blit.flood_fill #seed_x, #seed_y, #fill_colour, #boundary_colour
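A software sketch of a span-based boundary fill, to show why the work queue stays small: each queue entry seeds a whole horizontal run rather than a single pixel. The bitmap representation (a list of rows) and the name `boundary_fill` are assumptions, and the hardware's traversal order may differ.

```python
def boundary_fill(bitmap, seed_x, seed_y, fill, boundary):
    """Fill the region around the seed in place and return the pixel count."""
    height, width = len(bitmap), len(bitmap[0])

    def is_open(x, y):
        return bitmap[y][x] != boundary and bitmap[y][x] != fill

    filled = 0
    stack = [(seed_x, seed_y)]
    while stack:
        x, y = stack.pop()
        if not is_open(x, y):
            continue
        # Grow the span left and right from the seed.
        left = x
        while left > 0 and is_open(left - 1, y):
            left -= 1
        right = x
        while right < width - 1 and is_open(right + 1, y):
            right += 1
        for i in range(left, right + 1):
            bitmap[y][i] = fill
        filled += right - left + 1
        # Queue one seed per open run in the rows above and below.
        for ny in (y - 1, y + 1):
            if 0 <= ny < height:
                i = left
                while i <= right:
                    if is_open(i, ny):
                        stack.append((i, ny))
                        while i <= right and is_open(i, ny):
                            i += 1
                    else:
                        i += 1
    return filled


box = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
count = boundary_fill(box, 2, 2, fill=7, boundary=1)
```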
Span fill (non-boundary)
Fill a region defined by a left/right boundary table (same format as the layer clip table). More predictable performance than flood fill for programmatically-defined regions.
3.6 Bitmap Layer Operations
Bitmap-to-bitmap blit
Copy one intermediate bitmap to another with optional transform. Used for double-buffering, compositing multiple blit passes, and applying the Stage 2 affine transform to a completed intermediate bitmap before it goes to the scanline mixer.
Clip table application
Blit an intermediate bitmap to an output layer with a clip table applied. Pixels outside the clip table's left/right boundaries are not written. This is the mechanism for split-screen and portal effects; see the display system document for clip table format.
Colour space conversion
Convert a bitmap from one format to another: RGB→YCbCr, RGBA→BGRA, 8bpp indexed→16bpp direct colour, 4bpp→8bpp expansion, etc.
Downscale
Reduce a bitmap's resolution with box or bilinear filter. Used for the mixdown pipeline from 4K native content to HDMI/VGA output resolution.
3.7 Text Rendering
Glyph blit
Blit a single glyph from a font bitmap (1bpp, 4bpp, or 8bpp) to a destination bitmap with optional colour tint and background transparency.
String blit
Render a complete string: the blitter walks the character table, blits each glyph with appropriate advance width, and handles line breaks. Returns the bounding box of the rendered text.
Proportional and monospace fonts
Both supported. Font metrics (advance widths, kerning pairs) stored alongside the font bitmap.
3.8 Image Processing
Horizontal flip / vertical flip
Mirror a bitmap region. Also available as flags on any blit operation.
Rotation by 90° / 180° / 270°
Fast integer rotation: no interpolation needed for exact quarter-turns.
Arbitrary rotation
Affine blit with rotation matrix. Uses bilinear sampling to reduce aliasing.
Scale with filter
Nearest-neighbour (pixel art, no softening), bilinear (smooth scaling), or Lanczos (highest quality, sharpest).
Colour key extraction
Create a 1bpp mask bitmap from a source by testing each pixel against a colour key. Used to pre-compute masks for masked sprite blits.
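Colour key extraction reduces to one comparison per pixel, packed eight to a byte. In this sketch a set bit marks an opaque pixel and the most significant bit is the leftmost pixel; both the polarity and the bit order are assumptions, as is the name `extract_mask`.

```python
def extract_mask(pixels, width, height, key):
    """Return a 1bpp mask as one bytes object per row."""
    row_bytes = (width + 7) // 8
    rows = []
    for y in range(height):
        row = bytearray(row_bytes)
        for x in range(width):
            if pixels[y * width + x] != key:     # opaque pixel: set its mask bit
                row[x // 8] |= 0x80 >> (x % 8)
        rows.append(bytes(row))
    return rows


pixels = [0, 5, 5, 0, 0, 0, 0, 5, 5, 0]          # one row, 10 pixels wide
mask = extract_mask(pixels, width=10, height=1, key=0)
```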
5. Blit Job Descriptor
Every blit operation is described by a job descriptor written to the blitter's command queue by the EE. The descriptor format varies by operation type but always includes:
| Field | Width | Notes |
|---|---|---|
| Operation | 8 bits | Operation type code — also determines which sub-unit handles the job |
| Flags | 16 bits | BLT_MASKED, BLT_ALPHA, BLT_ANTIALIAS, BLT_WRAP, BLT_BILINEAR, BLT_PERSP_CORRECT, BLT_ADDITIVE, BLT_BLOOM, BLT_SEQUENTIAL, BLT_DEPTH_TEST, BLT_DEPTH_WRITE, BLT_DEPTH_CLEAR, etc. |
| Dependency count | 4 bits | Number of predecessor job IDs declared (0–4) |
| Predecessor IDs | 0–4 × 16 bits | Job IDs that must complete before this job dispatches |
| Texture source type | 4 bits | TEX_BSRAM_DIRECT, TEX_BSRAM_CACHE, TEX_GFXSRAM, TEX_FLASH, TEX_DDR3, TEX_INLINE |
| Texture source address | 32 bits | Address or inline data depending on source type |
| Texture width / height / stride | 16+16+16 bits | Source texture dimensions |
| Texture pixel format | 8 bits | 1bpp, 4bpp, 8bpp, 16bpp, 32bpp |
| Destination address | 32 bits | Target bitmap pointer |
| Destination stride | 16 bits | Bytes per row in destination |
| Dest X, Y | 16+16 bits | Top-left of destination rectangle |
| Width / Height | 16+16 bits | Operation dimensions in output pixels |
| Completion task | 16 bits | EE task address to launch on completion (0 = none) |
| Job ID (out) | 16 bits | Assigned by blitter scheduler, returned to EE |
Transform-capable operations (affine blit, Mode 7, textured triangles, etc.) add a transform/UV parameter block after the base descriptor.
BLT_SEQUENTIAL in the flags field is equivalent to declaring a dependency on every currently in-flight job — it provides a full memory fence without requiring the programmer to enumerate specific job IDs.
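On the software side, the base descriptor could be modelled as a record whose fields are range-checked against the widths in the table before being queued. This sketch covers only a subset of the fields, omits the texture fields, and treats every field as unsigned; the class name `BlitDescriptor`, the field names, and the absence of any binary layout are all assumptions.

```python
from dataclasses import dataclass, field

MAX_DEPENDENCIES = 4


@dataclass
class BlitDescriptor:
    operation: int                 # 8-bit operation type code
    flags: int                     # 16-bit flag word
    dest_address: int              # 32-bit target bitmap pointer
    dest_stride: int               # bytes per destination row
    dest_x: int
    dest_y: int
    width: int
    height: int
    predecessors: list = field(default_factory=list)   # job IDs to wait for
    completion_task: int = 0       # 0 = no task on completion

    def validate(self):
        limits = {
            "operation": 0xFF, "flags": 0xFFFF, "dest_address": 0xFFFFFFFF,
            "dest_stride": 0xFFFF, "dest_x": 0xFFFF, "dest_y": 0xFFFF,
            "width": 0xFFFF, "height": 0xFFFF, "completion_task": 0xFFFF,
        }
        for name, limit in limits.items():
            value = getattr(self, name)
            if not 0 <= value <= limit:
                raise ValueError(f"{name}={value} does not fit its field")
        if len(self.predecessors) > MAX_DEPENDENCIES:
            raise ValueError("at most 4 predecessor job IDs")
        if any(not 0 <= p <= 0xFFFF for p in self.predecessors):
            raise ValueError("job IDs are 16-bit")
        return True


sprite_job = BlitDescriptor(operation=1, flags=0x0003, dest_address=0x100000,
                            dest_stride=640, dest_x=10, dest_y=20,
                            width=16, height=16, predecessors=[7, 9])
ok = sprite_job.validate()
```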
5. Blitter vs Hardware Layer System
The blitter and the hardware layer system (tilemaps, hardware sprites) are independent and complementary:
| | Hardware layer system | Blitter to bitmap layer |
|---|---|---|
| Timing | Real-time, per-scanline during display | Pre-rendered during V-blank/H-blank |
| Sprite count | Hundreds per scanline (hardware pipeline) | Thousands per frame (throughput limited) |
| Transforms | None (pixel replication only) | Full affine per sprite |
| Tilemap scroll | Per-tile-row H / per-tile-col V (hardware) | Any affine including Mode 7 |
| Setup cost | Register writes only | Job descriptor queue |
| Latency | Zero — always live | One frame (rendered previous V-blank) |
| Colour depth | Indexed palette | Any (8bpp, 16bpp, 32bpp) |
| Clip | Layer clip table | Per-blit or clip table at composite time |
A typical game uses both: hardware sprites for the player character and key game objects (zero latency, always live), blitter sprites for large numbers of background objects, particles, or effects where affine transforms are needed or the count exceeds the hardware sprite budget.
6. Blitter in the Display Pipeline
The blitter's output is an intermediate bitmap layer. That layer then enters the scanline mixer at the same priority level as any other layer — hardware tilemap, hardware sprites, or framebuffer layers. The scanline mixer does not know or care whether a layer was rendered by the hardware tilemap engine or the blitter.
Previous frame V-blank:
EE dispatches blit jobs to blitter
Blitter renders primitive list → bitmap layer A
Blitter raises completion interrupt → EE launches "blit complete" task
EE dispatches Stage 2 blit: bitmap A → output layer, with clip table
Current frame display:
Scanline mixer composites all layers:
Hardware tilemap layer(s)
Hardware sprite layer
Blitter bitmap layer ← appears here
UI overlay
etc.
7. Clip Table Integration
The clip table (see display system document, Section 10) is applied when the blitter composites the intermediate bitmap to the output layer. This is the mechanism for split-screen, portals, and shaped display regions.
The EE generates the clip table during V-blank, then dispatches:
; Stage 2: blit intermediate bitmap to output layer with clip table
blit.composite A0, ; source: intermediate bitmap
A1, ; destination: output layer
A2, ; clip table pointer
#BLT_CLIPPED
The blitter walks the clip table scanline by scanline as it writes to the output layer, skipping pixels outside the left/right boundaries. The operation is otherwise identical to a 2D blit — the clip table adds no significant cost per pixel.
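The per-scanline walk can be modelled as a row copy restricted to each line's span. The clip table format used here, one inclusive (left, right) pair per scanline with left > right meaning a fully clipped line, is an assumption for illustration; the real format is defined in the display system document. `composite_clipped` is a hypothetical name.

```python
def composite_clipped(src, dest, clip_table):
    """Copy src rows into dest, writing only pixels inside each row's clip span.

    src, dest:  lists of rows with equal dimensions
    clip_table: one (left, right) pair per scanline, inclusive
    Returns the number of pixels written.
    """
    written = 0
    for y, (left, right) in enumerate(clip_table):
        left = max(left, 0)
        right = min(right, len(src[y]) - 1)
        if left <= right:
            dest[y][left:right + 1] = src[y][left:right + 1]
            written += right - left + 1
    return written


src = [[1] * 6 for _ in range(3)]
dest = [[0] * 6 for _ in range(3)]
clip = [(0, 5), (2, 3), (4, 1)]      # full line, narrow window, fully clipped
written = composite_clipped(src, dest, clip)
```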
8. FireStorm EE and Blitter — Division of Labour
| Task | EE | Blitter |
|---|---|---|
| Compute sprite positions, animation | ✓ | — |
| Dispatch blit job descriptors | ✓ | — |
| Generate clip tables | ✓ | — |
| Generate Mode 7 matrix lists | ✓ | — |
| Fine-grained pixel ops (register blitting) | ✓ | — |
| Bulk pixel copy | — | ✓ |
| Affine transform rendering | — | ✓ |
| Line/shape/fill primitives | — | ✓ |
| Flood fill | — | ✓ |
| Text rendering | — | ✓ |
| Completion interrupt → EE task | — | ✓ (raises interrupt) |
| Job queue management | shared | shared |
The EE is the programmer-facing control surface. The blitter is the pixel-pushing engine. The EE tells the blitter what to draw; the blitter draws it and tells the EE when it's done.
9. Primitive List — Batch Dispatch
For rendering many sprites, lines, or particles in one frame, the EE builds a primitive list — a compact array of blit descriptors — and dispatches the entire list as a single job. The job's sub-unit processes the list sequentially from start to finish, in order, raising one completion interrupt when done.
; Sprite list: 500 sprites, 16x16, 4bpp, masked
; Executes in list order — correct Z ordering guaranteed
blit.primlist A0, ; list base address
#500, ; entry count
A1, ; destination bitmap
#BLT_4BPP | BLT_MASKED
; Particle list: additive blend — order irrelevant, but still in-order within job
blit.primlist A2, ; particle list
#800,
A3, ; destination bitmap
#BLT_ADDITIVE | BLT_BLOOM
The EE is free to dispatch other jobs — to different sub-units — while this job runs. A sprite list and a text pass and a memory copy dispatched at the same time all run in parallel on their respective sub-units. The sprite list's internal sequencing is preserved regardless.
Throughput: 500 × 16×16 4bpp masked sprites
500 sprites × 256 pixels = 128,000 pixels
At ~8 pixels/cycle at 380MHz ≈ 42 microseconds
Several thousand sprites per frame readily achievable,
in addition to the hardware sprite layer's hundreds per scanline.
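The throughput estimate above is simple enough to reproduce for other sprite sizes. The sketch below uses the same figures (about 8 pixels per cycle at 380 MHz) as defaults. The function name is hypothetical, and the result is a peak estimate that ignores setup cost, cache misses, and sub-unit contention.

```python
def blit_time_us(count, width, height, pixels_per_cycle=8, clock_mhz=380):
    """Estimated time in microseconds to draw `count` sprites of width x height."""
    total_pixels = count * width * height
    cycles = total_pixels / pixels_per_cycle
    # clock_mhz is also the number of cycles per microsecond.
    return cycles / clock_mhz


# 500 sprites of 16x16: 128,000 pixels -> 16,000 cycles -> about 42.1 microseconds
sprite_list_us = blit_time_us(500, 16, 16)
```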
11. Primitive Types — Complete List
| Category | Primitive | Texture source | Notes |
|---|---|---|---|
| Copy | Linear copy | Any | With optional format conversion |
| | 2D rectangular blit | Any | With stride |
| | 3-source blit | Any | Any 3-input boolean function |
| Sprites | Masked sprite | Any | Colour-key transparency |
| | Alpha sprite | Any | Per-pixel or per-palette alpha |
| | Scaled sprite | Any | Nearest / bilinear |
| | Affine sprite | Any | Full scale/rotate/shear |
| Triangles | Filled triangle | — | Flat colour |
| | Textured triangle | Any | UV interpolated, nearest/bilinear/perspective-correct |
| | Triangle list | Any | Batch of triangles sharing one texture |
| Tilemaps | Flat tilemap | Any | Standard scrolling tilemap to bitmap |
| | Affine tilemap | Any | Mode 7-style with transform matrix |
| | Perspective tilemap | Any | Per-scanline matrix list |
| Lines | Line | — | Bresenham integer |
| | Antialiased line | — | Wu's algorithm |
| | Bloom line | — | Antialiased + phosphor glow, intensity + falloff, additive blend |
| | Polyline | — | Connected segment list, Bresenham/AA/bloom variants |
| | Vector line list | — | Batch bloom lines, additive blend, single completion interrupt |
| Particles | Particle | — | Point sprite, size + colour + bloom falloff, additive blend |
| | Particle list | — | Batch particles, additive blend, single completion interrupt |
| | Spark trail | — | Particle list with intensity ramp along path |
| Shapes | Filled rectangle | Any (pattern) | Solid or texture-pattern fill |
| | Filled circle | Any (pattern) | With optional antialiased outline |
| | Filled ellipse | Any (pattern) | With optional antialiased outline |
| | Filled triangle | — | Solid colour |
| | Rounded rectangle | Any (pattern) | Configurable corner radius |
| | Arc / sector | — | Circular arc or pie slice |
| Fill | Flood fill | — | Boundary fill, scanline-coherent |
| | Span fill | Any (pattern) | Left/right boundary table |
| Bitmap ops | Bitmap composite | Any | With clip table and optional affine |
| | Colour space convert | — | RGB↔YCbCr, format conversions |
| | Downscale | — | Box / bilinear filter |
| | Flip | — | Horizontal, vertical |
| | Rotate | — | 90°/180°/270° integer, arbitrary affine |
| Text | Glyph blit | BSRAM (permanent) | Single character from font bitmap |
| | String blit | BSRAM (permanent) | Full string with metrics |
| Prefetch | Texture prefetch | DDR3 → cache | Warm cache before blit list runs |
| Misc | Colour key mask | — | Generate 1bpp mask from colour key |
12. Performance Counters and Profiler Overlay
FireStorm includes hardware performance counters for the blitter sub-system. All counters are memory-mapped registers readable via the FRAM bus by the SG2000, the EE, Pulse, or DeMon. They are designed to answer the practical question: which sub-unit is the bottleneck, and how much would adding another instance help?
The Hard RISC-V Core
The GoWin GW5AT-138 contains a hardened RISC-V core built into the FPGA silicon — not implemented in LUTs, so it consumes zero fabric resources. It exists specifically for internal hardware monitoring and debugging, with direct access to all FPGA-internal signals and registers. This core is the natural home for blitter performance counter collection.
The hard RISC-V:
- Runs entirely independently of the FireStorm EE, SG2000, Pulse, and DeMon
- Has zero impact on the system it is measuring — it does not share any execution resource with the EE or blitter sub-units
- Reads all internal blitter counters directly, without going through the FRAM bus
- Formats and renders the profiler UI into the ImGui overlay layer's backing BSRAM
- Can be activated or deactivated by a single register write from any chip in the system
Because it is hardened silicon rather than LUT logic, enabling the profiler has no effect on FPGA timing closure or resource utilisation. It is always present and always collecting — the only choice is whether to display the data.
Counter Snapshot Model
All counters run continuously during blitter operation. At each V-blank, the hardware latches a snapshot of all counters into a separate read register bank. The live counters continue accumulating; the snapshot holds the previous frame's values undisturbed until the next V-blank. The hard RISC-V reads the snapshot bank during the frame with no race conditions.
A reset-on-snapshot option clears the live counters at each V-blank latch, giving per-frame deltas rather than cumulative totals. Either mode is selectable via a control register.
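The latch behaviour can be illustrated with a small software model. This is a sketch of the semantics only, assuming a single atomic latch at V-blank. The class and method names are hypothetical and say nothing about the actual register layout.

```python
class CounterBank:
    """Software model of live counters plus a V-blank snapshot bank."""

    def __init__(self, names, reset_on_snapshot=False):
        self.live = {name: 0 for name in names}
        self.snapshot = {name: 0 for name in names}
        self.reset_on_snapshot = reset_on_snapshot

    def count(self, name, cycles=1):
        """Accumulate cycles into a live counter during the frame."""
        self.live[name] += cycles

    def vblank(self):
        """Latch all live counters at once; readers only ever see this copy."""
        self.snapshot = dict(self.live)
        if self.reset_on_snapshot:
            # Per-frame delta mode: live counters restart from zero.
            for name in self.live:
                self.live[name] = 0
```

With `reset_on_snapshot=True` each snapshot holds one frame's worth of cycles. With the default, snapshots hold running totals and the reader subtracts consecutive snapshots to get a per-frame delta.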
Per-Sub-Unit Counters
For each sub-unit instance (pixel fill 0, pixel fill 1, memory/copy 0, etc.):
| Counter | Meaning |
|---|---|
| ACTIVE_CYCLES | Cycles this unit spent executing a primitive |
| STALL_CYCLES | Cycles a job was waiting for this unit to become free |
| IDLE_CYCLES | Cycles this unit had no work queued |
| JOBS_COMPLETED | Number of jobs (or job segments) completed by this unit |
| PRIMITIVES_COMPLETED | Number of individual primitives processed |
ACTIVE_CYCLES + STALL_CYCLES + IDLE_CYCLES = total frame cycles (sanity check).
Derived metrics:
Stall ratio = STALL_CYCLES / (STALL_CYCLES + ACTIVE_CYCLES)
Utilisation = ACTIVE_CYCLES / total_frame_cycles
Idle fraction = IDLE_CYCLES / total_frame_cycles
| Stall ratio | Interpretation |
|---|---|
| < 10% | Sub-unit not a bottleneck |
| 10–30% | Moderate contention — monitor |
| 30–60% | Significant bottleneck — second unit recommended |
| > 60% | Severe bottleneck — adding a unit would substantially reduce frame time |
A high idle fraction alongside a high stall ratio means work reaches this unit in bursts: jobs queue behind one another for part of the frame while the unit sits idle for the rest. That is a dependency ordering problem rather than a capacity problem. The fix is reviewing job dependency declarations, not adding units.
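The derived metrics and the interpretation table translate directly into code. The sketch below is one possible reading: the function names are hypothetical, and because the table's bands share their end points, it assumes 30% falls in the "significant" band and 60% stays there too.

```python
def sub_unit_metrics(active, stall, idle):
    """Derive the three ratios from one sub-unit's snapshot counters."""
    total = active + stall + idle  # equals total frame cycles per the sanity check
    busy = active + stall
    return {
        "stall_ratio": stall / busy if busy else 0.0,
        "utilisation": active / total if total else 0.0,
        "idle_fraction": idle / total if total else 0.0,
    }


def classify_stall(stall_ratio):
    """Map a stall ratio (0.0 to 1.0) onto the interpretation table."""
    if stall_ratio < 0.10:
        return "not a bottleneck"
    if stall_ratio < 0.30:
        return "moderate contention"
    if stall_ratio <= 0.60:
        return "significant bottleneck"
    return "severe bottleneck"
```

Because every metric is a ratio, the inputs can be raw cycle counts or times in any consistent unit.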
Per-Job Counters
For each job ID:
| Counter | Meaning |
|---|---|
| DISPATCH_TIME | Cycle timestamp when the job was submitted |
| START_TIME | Cycle timestamp when the job first acquired a sub-unit |
| COMPLETE_TIME | Cycle timestamp when the job's last primitive completed |
| STALL_TIME | Total cycles this job spent waiting for a sub-unit |
| EXECUTE_TIME | Total cycles this job spent actively executing primitives |
Queue latency = START_TIME - DISPATCH_TIME
Total latency = COMPLETE_TIME - DISPATCH_TIME
Stall fraction = STALL_TIME / EXECUTE_TIME
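A short sketch of the per-job calculations, using the three formulas exactly as given. The `JobCounters` record and its field names are illustrative, and the code assumes the cycle timestamps do not wrap between dispatch and completion.

```python
from dataclasses import dataclass


@dataclass
class JobCounters:
    """One job's counters, all in cycles (hypothetical record layout)."""
    dispatch_time: int
    start_time: int
    complete_time: int
    stall_time: int
    execute_time: int


def job_metrics(job):
    """Compute queue latency, total latency, and stall fraction for a job."""
    return {
        "queue_latency": job.start_time - job.dispatch_time,
        "total_latency": job.complete_time - job.dispatch_time,
        # Guard against a job that recorded no execution cycles.
        "stall_fraction": job.stall_time / job.execute_time if job.execute_time else 0.0,
    }
```

Note that the stall fraction as defined here is relative to execution time, so a job that waited longer than it ran reports a value above 1.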
Global Blitter Counters
| Counter | Meaning |
|---|---|
| FRAME_CYCLES | Total cycles in the last frame |
| BLITTER_BUSY_CYCLES | Cycles at least one sub-unit was active |
| ALL_IDLE_CYCLES | Cycles all sub-units were idle simultaneously |
| JOBS_DISPATCHED | Jobs submitted this frame |
| JOBS_COMPLETED | Jobs completed this frame |
| DEPENDENCY_STALL_CYCLES | Cycles jobs spent held on dependency (not sub-unit contention) |
| PEAK_QUEUE_DEPTH | Maximum jobs simultaneously queued or in-flight |
DEPENDENCY_STALL_CYCLES separates programmer-imposed ordering overhead from genuine sub-unit contention. Only the latter is improved by adding units.
The ImGui Profiler Overlay
The hard RISC-V renders blitter profiling data into the ImGui overlay layer — a system-reserved layer in the FireStorm scanline mixer that sits above all application content. The layer is owned by the monitoring subsystem, composited in hardware, and visible on all display outputs simultaneously.
Activation: A single write to the profiler enable register, accessible via FRAM from the SG2000, EE, DeMon, or Copper. A keyboard shortcut, a debug menu option, a Copper trap on a specific raster line — any mechanism that can write a register can toggle the overlay. The hard RISC-V continues collecting data regardless of whether the overlay is visible.
Overlay contents (suggested layout):
┌─ FireStorm Blitter Profiler ────────────────────────────────┐
│ Frame: 16.7ms Busy: 12.3ms (73%) Idle: 4.4ms (26%) │
│ │
│ Sub-unit Active Stall Idle Stall% Jobs │
│ Pixel fill 0 8.2ms 1.1ms 7.4ms 12% 47 │
│ Pixel fill 1 7.9ms 0.8ms 8.0ms 9% 44 │
│ Line/part 0 2.1ms 0.0ms 14.6ms 0% 3 │
│ Memory/copy 0 9.8ms 3.2ms 3.7ms 25% ← watch │
│ Composite 0 1.2ms 0.0ms 15.5ms 0% 8 │
│ Ray/DDA 0 4.4ms 0.0ms 12.3ms 0% 12 │
│ │
│ Dependency stalls: 0.4ms Peak queue depth: 7 │
│ │
│ Last 8 frames: ▁▂▃▂▂▃▂▂ (frame time sparkline) │
└─────────────────────────────────────────────────────────────┘
The ← watch annotation is generated automatically when a unit's stall ratio exceeds the warning threshold. The hard RISC-V computes the stall ratios and derived metrics directly from the counter snapshot, applies the threshold logic, and renders the formatted text into the overlay BSRAM each frame — typically completing well within V-blank.
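The per-row logic (compute the stall ratio, compare it with the threshold, append the annotation) might look like the sketch below. The warning threshold is not specified in this document, so the 20% value is purely an assumption. The function name and column widths are illustrative too.

```python
WATCH_THRESHOLD = 0.20  # assumed value; the actual warning threshold is not stated


def format_row(name, active_ms, stall_ms, idle_ms, jobs):
    """Format one sub-unit row, flagging it when the stall ratio is too high."""
    busy = active_ms + stall_ms
    stall_ratio = stall_ms / busy if busy else 0.0
    flag = "  ← watch" if stall_ratio > WATCH_THRESHOLD else ""
    return (
        f"{name:<14}{active_ms:>6.1f}ms{stall_ms:>6.1f}ms{idle_ms:>6.1f}ms"
        f"{stall_ratio:>6.0%}{jobs:>5}{flag}"
    )
```

With this threshold, the Pixel fill 0 figures from the layout above (8.2 ms active, 1.1 ms stall) come out at 12% and are not flagged.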
Impact on the system being profiled: zero. The hard RISC-V runs on its own clock domain. The ImGui layer sits above all application layers in the compositor priority stack — it never overwrites application BSRAM. The blitter sub-units are unaffected by the overlay rendering because the hard RISC-V writes to BSRAM regions not used by the blitter. The frame time numbers shown are the frame times of the system running normally, not the system running with profiling overhead.
Persistence: The hard RISC-V can also write each counter snapshot to a circular buffer in FireStorm DDR3, giving a rolling history that can be read out post-session by the SG2000, saved to disk, or streamed over the network via AntOS. This makes long-running profiling sessions that capture rare frame spikes feasible.
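The rolling history can be pictured as a fixed-capacity ring buffer that overwrites its oldest entry once full. The sketch below models that behaviour in software. The class name, the capacity, and the use of a Python list in place of a DDR3 region are all assumptions for illustration.

```python
class SnapshotRing:
    """Fixed-capacity circular buffer of per-frame counter snapshots."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_index = 0  # slot the next push will write
        self.count = 0       # number of valid snapshots stored

    def push(self, snapshot):
        """Store a snapshot, overwriting the oldest one when full."""
        self.slots[self.next_index] = snapshot
        self.next_index = (self.next_index + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def history(self):
        """Return the stored snapshots, oldest first."""
        start = (self.next_index - self.count) % len(self.slots)
        return [self.slots[(start + i) % len(self.slots)] for i in range(self.count)]
```

A reader that wants to catch a rare frame spike only needs to drain `history()` more often than the buffer wraps.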
Workflow: Identifying and Fixing a Bottleneck
- Enable the overlay: write the profiler enable register. The UI appears immediately.
- Find the bottleneck: look for the sub-unit with the highest stall% in the overlay. The ← watch annotation appears automatically at the threshold.
- Check dependency stalls: if DEPENDENCY_STALL_CYCLES is large relative to sub-unit stall cycles, the problem is job ordering rather than capacity. Review dependency declarations.
- If it is a capacity problem: increment the HDL instance count for the bottleneck sub-unit and rebuild the bitstream. No code changes. Re-enable the overlay and confirm the stall% has dropped.
- Check idle fractions: a unit with high idle% is using LUTs unnecessarily. Reducing its count recovers LUT budget for units that are actually busy.
- Disable the overlay: write the profiler enable register. The layer disappears. The hard RISC-V continues collecting in the background.
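The decision steps in this workflow can be expressed as a small triage function. This is only a sketch under stated assumptions: "large relative to sub-unit stall cycles" is read as "greater than the sum of all sub-unit stall cycles", the 30% capacity threshold is taken from the stall ratio table, and the function name and return strings are hypothetical.

```python
def triage(units, dependency_stall_cycles):
    """Pick the worst sub-unit and suggest a next step.

    units maps a sub-unit name to its (active, stall, idle) cycle counts.
    Returns (worst_unit_name, recommendation).
    """
    def stall_ratio(counters):
        active, stall, _ = counters
        return stall / (active + stall) if active + stall else 0.0

    worst = max(units, key=lambda name: stall_ratio(units[name]))
    total_unit_stall = sum(stall for _, stall, _ in units.values())

    # Ordering overhead dominates: extra units would not help.
    if dependency_stall_cycles > total_unit_stall:
        return worst, "review job dependency declarations"
    # Genuine contention at or above the "second unit recommended" band.
    if stall_ratio(units[worst]) >= 0.30:
        return worst, "add a sub-unit instance and rebuild the bitstream"
    return worst, "no capacity change needed"
```

The idle-fraction check is left out to keep the sketch short. It would be a separate pass over the same counters, looking for units whose idle cycles dominate the frame.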