Personality Cartridges — CPU recreation

Overview

An Ant64 personality cartridge recreates a historical computer or console as synthesised logic inside FireStorm's GoWin GW5AT-138K FPGA fabric. The target CPU and all companion chips are implemented as a modern, optimised microarchitecture that runs the original instruction set faithfully — same opcodes, same registers, same behaviour — but unconstrained by the transistor budgets, memory speeds, and bus limitations of the original era.

The result is not emulation. It is the original architecture rebuilt with the advantages of modern silicon:

Zero-wait-state SRAM with burst fetch — replacing DRAM that was 40–60× slower; burst mode delivers 32 bytes of sequential instructions in a single operation into a BSRAM prefetch buffer, keeping the decode stage fed at 380 MHz
Pipelined execution collapsing multi-cycle instructions to 1–2 stages
Barrel shifters making variable-time shifts unconditionally one cycle
No shared bus contention — CPU and display hardware access separate memory simultaneously
All custom chips running in parallel with no software budget spent on any of them
Optional cycle-accurate throttling for software that depends on exact original timing
Rewind — the rewind capture block makes it possible to run the machine backwards in time as well as forwards

When throttling is off, the system runs as fast as the fabric allows. When throttling is on, the cycle counter gate brings it back to exact original timing. Both modes use identical hardware; throttling is a pipeline stall signal, not a different implementation.
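The stall-gate idea can be illustrated with a small behavioural model. This is only a sketch in Python, not the personality HDL: the function name `run`, the `fabric_per_guest_cycle` ratio, and the instruction mix are all assumptions made for illustration.

```python
def run(instructions, fabric_per_guest_cycle, throttle):
    """Return the total fabric clock ticks consumed.

    instructions: list of (fast_cycles, original_cycles) pairs, where
    fast_cycles is what the pipelined core needs and original_cycles is
    the cost of the same instruction on the original CPU.
    """
    fabric_now = 0        # fabric ticks elapsed so far
    guest_deadline = 0    # tick at which the original machine would finish
    for fast_cycles, original_cycles in instructions:
        fabric_now += fast_cycles
        guest_deadline += original_cycles * fabric_per_guest_cycle
        if throttle and fabric_now < guest_deadline:
            # Stall asserted: the same pipeline simply waits for the deadline.
            fabric_now = guest_deadline
    return fabric_now

program = [(1, 2), (2, 6), (1, 4)]  # hypothetical instruction mix
turbo = run(program, fabric_per_guest_cycle=100, throttle=False)
exact = run(program, fabric_per_guest_cycle=100, throttle=True)
```

Both calls walk the same instruction list through the same code path; only the stall decision differs, which mirrors the claim that throttling is a gate on one implementation rather than a second implementation.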

The SG2000 small core runs AntOS alongside the personality at all times, providing networking, storage, scripting, and general system services. The SG2000 big core is available to run a dedicated debug and development application — with the full 1 GHz C906 and no OS overhead — giving it read/write access to the softcore's registers, system RAM, and custom chipset registers via the QSPI FRAM interface.

The Ant64 Personality Interface Block

Every personality bitstream includes the Ant64 Personality Interface Block — pre-written, verified HDL supplied to personality developers that synthesises into the bitstream and uses LUTs from the 138K budget. It provides everything Ant64-specific so the personality developer's only job is implementing the chipset:

QSPI FRAM slave — the memory-mapped window through which the SG2000, DeMon, and Pulse communicate with the personality. Register reads and writes, memory access, chipset register access, debug control — all via ordinary memory-mapped reads and writes over QSPI.
Display output path — from the chipset's pixel and sync signals to FireStorm's display pipeline and on to HDMI / VGA / DisplayPort.
Audio output path — from the chipset's audio output to the FireStorm audio DSP chain and codec.
Clock domain management — PLLs and clock crossing logic for CPU, pixel, audio, and SRAM clock domains.
Debug register bank — halt, single-step, breakpoints, watchpoints, and full register file access, all mapped into the FRAM address space automatically. Every personality gets hardware debugging at no extra implementation cost.
Rewind capture block — monitors the CPU data bus, RAM writes, and chipset register writes, capturing XOR deltas of all three into a ring buffer in FireStorm DDR3, enabling full-system forward and backward time scrubbing with no CPU overhead.
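The XOR-delta idea behind the rewind capture block can be sketched in software. The model below covers RAM writes only, uses a Python `deque` as a stand-in for the DDR3 ring buffer, and invents the names `RewindBuffer`, `record_write`, and `step_back`; the real block works in hardware and also captures the CPU data bus and chipset register writes.

```python
from collections import deque

class RewindBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest delta when full, like a ring buffer.
        self.deltas = deque(maxlen=capacity)

    def record_write(self, memory, addr, new_value):
        old = memory[addr]
        self.deltas.append((addr, old ^ new_value))  # store the XOR delta only
        memory[addr] = new_value

    def step_back(self, memory):
        addr, delta = self.deltas.pop()
        memory[addr] ^= delta  # new ^ (old ^ new) == old
        return addr

ram = bytearray(4)
rb = RewindBuffer(capacity=1024)
rb.record_write(ram, 2, 0x5A)
rb.record_write(ram, 2, 0xFF)
rb.step_back(ram)   # undoes the second write
rb.step_back(ram)   # undoes the first write
```

Because XOR is its own inverse, the same stored delta moves a location backwards or forwards in time, which is why a single capture stream can serve scrubbing in both directions.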

The interface block is instantiated once per bitstream. In a multi-system bitstream a small mux sits between the chipsets and the single interface block — its LUT cost is constant regardless of how many personalities are present.

The exact LUT cost of the interface block is documented separately. All chipset LUT figures in this document are chipset only — add the interface block for total bitstream size.

Memory Architecture in Personality Mode

In personality mode, FireStorm's memory resources are assigned around the softcore chipset. See memory architecture for full specifications.

The FireStorm Execution Engine is not present in personality mode — the fabric is given over to the chipset. This frees the EE Code SRAM bus, making both SRAM buses available to the personality.

BSRAM — Video Pipeline

The 340 on-chip BSRAM blocks (~765 KB, 380 MHz, dual-port) serve the emulated video pipeline — sprite line buffers, palette RAM, tilemap data, copper lists, clip tables, CRT simulation LUTs. The video chips access BSRAM at full speed with zero contention.

Graphics SRAM — Guest System RAM

The Graphics SRAM (2× IS61LPS51236B, 4.5 MB combined, 200 MHz, 36-bit bus, 1-cycle pipeline latency, burst mode, no wait states, no refresh) becomes the emulated machine's system RAM. The 36-bit bus delivers 4 bytes per read cycle: for a 6502 that is up to four one-byte opcodes at once; for a 68000 it is two 16-bit words, typically an opcode and its first extension word, in a single fetch.

EE Code SRAM — ROM Shadow and Instruction Fetch

The EE Code SRAM (identical second pair of IS61LPS51236B chips, 4.5 MB, 200 MHz, 36-bit, on its own fully dedicated and independent bus) is freed from EE instruction fetch duty in personality mode and serves as the ROM shadow and instruction fetch bus. ROM is streamed from QSPI NOR flash into EE Code SRAM at boot and cached there permanently — ROM cannot be written, so the cache never needs invalidation. After the initial load the softcore fetches instructions entirely from this bus at 200 MHz with 1-cycle pipeline latency.

The two SRAM buses are completely independent — instruction fetch and data access never contend. The softcore's read of the next instruction and its read or write of a data address happen simultaneously on separate physical buses. This is a Harvard-style memory architecture that even the original machines never had — they shared the address and data bus between ROM and RAM regardless of whether the chips were physically separate.

Combined Fast SRAM

Note for early adopters: Initial Ant64 prototypes use the 138K FPGA with 32-bit memory rather than the 36-bit SRAM pairs described below. Transitioning to 36-bit SRAM is a priority immediately following the initial prototype phase — the dual independent 36-bit buses, hardware debug tags, and associated features are core to the Ant64 design and will arrive as early as possible in the hardware revision cycle. Prototype memory behaves similarly to the Ant64S architecture in the meantime.

Bus	Capacity	Clock	Width	Use in personality mode
Graphics SRAM	4.5 MB	200 MHz	36-bit	Guest system RAM
EE Code SRAM	4.5 MB	200 MHz	36-bit	ROM shadow + instruction fetch
Total	9 MB			Two independent buses, zero contention

Most retro systems' combined ROM and RAM fit comfortably within this budget; the largest configurations spill their fast RAM into DDR3:

System	ROM	RAM	Total	% of 9 MB
NES	512 KB typical	2 KB	~514 KB	5.6%
ZX Spectrum 128	32 KB	128 KB	160 KB	1.7%
Commodore 64	64 KB	64 KB	128 KB	1.4%
BBC Micro	48 KB	32 KB	80 KB	0.9%
SNES	up to 6 MB	128 KB	~6.1 MB	68% — ROM on EE bus, RAM on Graphics bus
Amiga 500	512 KB Kickstart	512 KB chip + 512 KB slow	~1.5 MB	17%
Amiga 1200	512 KB Kickstart	2 MB chip + 8 MB fast	~10.5 MB	Exceeds 9 MB: Kickstart ROM on EE bus; chip RAM in Graphics SRAM; fast RAM overflows to DDR3
386 PC	varies	up to 4 MB	fits with room	~50%
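The percentage column can be reproduced with a few lines of arithmetic. This sketch assumes binary units (1 KB = 1024 bytes, 9 MB = 9 × 1024 KB) and uses the ROM and RAM sizes from the table; the helper name `percent_of_budget` is illustrative.

```python
KB = 1024
BUDGET = 9 * 1024 * KB  # two 4.5 MB SRAM buses combined

def percent_of_budget(rom_kb, ram_kb):
    """Share of the 9 MB fast-SRAM budget used, to one decimal place."""
    return round(100 * (rom_kb + ram_kb) * KB / BUDGET, 1)

nes = percent_of_budget(512, 2)
spectrum128 = percent_of_budget(32, 128)
c64 = percent_of_budget(64, 64)
bbc = percent_of_budget(48, 32)
```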

The Extra 4 Bits — Hardware Debug Tags

Production hardware only. Initial prototypes use 32-bit memory — the extra 4 bits and the zero-cost SRAM tag mechanism described here are not available on prototype units. Prototypes use fabric comparator logic for breakpoints and watchpoints instead, as described in the Ant64S Compatibility section.

The IS61LPS51236B is 36-bit wide: 32 data bits plus 4 bits. In native mode these 4 bits are fully utilised on both buses — the EE Code SRAM uses all 36 bits to carry the FireStorm Execution Engine's 36-bit instruction words exactly, and the Graphics SRAM uses all 36 bits because 36 is divisible by 3, enabling clean RGB pixel packing with no wasted bits or alignment overhead (36-bit words hold 3 pixels at 12bpp, 2 at 18bpp, or 1 at R12G12B12).
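As an illustration of why a 36-bit word packs RGB pixels cleanly, the sketch below packs three 12-bit pixels into one word and unpacks them again. The pixel order (first pixel in the low bits) and the function names are assumptions; the source does not specify the bit layout.

```python
MASK36 = (1 << 36) - 1

def pack_12bpp(p0, p1, p2):
    """Pack three 12-bit pixels into one 36-bit word (p0 in the low bits)."""
    for p in (p0, p1, p2):
        if not 0 <= p < (1 << 12):
            raise ValueError("pixel must fit in 12 bits")
    return p0 | (p1 << 12) | (p2 << 24)

def unpack_12bpp(word):
    """Recover the three 12-bit pixels from a packed 36-bit word."""
    return tuple((word >> shift) & 0xFFF for shift in (0, 12, 24))

word = pack_12bpp(0xF00, 0x0F0, 0x00F)
pixels = unpack_12bpp(word)
```

All 36 bits carry pixel data, so no padding or alignment handling is needed between pixels.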

In personality mode the FireStorm EE is not instantiated — the fabric is given over to the chipset. The EE instruction format that used those 4 bits on the EE Code SRAM bus is absent. Those 4 bits are genuinely free on that bus and can be repurposed as per-word hardware debug flags. The Graphics SRAM 36-bit width remains useful for delivering 4 guest bytes per read cycle; its 4 bits can also carry debug tags on RAM addresses.

Each 36-bit word covers 4 consecutive guest bytes. The 4 extra bits are assigned as per-word hardware debug flags. The two buses have naturally different primary roles for their tags:

EE Code SRAM — ROM / instruction fetch bus:

Bit	Tag	Fires when
0	Execute breakpoint	This ROM address is fetched as an instruction
1	(reserved)	ROM cannot be written; data read watchpoint not applicable
2	(reserved)	ROM cannot be written
3	Trace flag	This ROM address is fetched — log to trace buffer without halting

Graphics SRAM — RAM / data bus:

Bit	Tag	Fires when
0	(reserved)	RAM is not used for instruction fetch on this bus
1	Read watchpoint	This RAM address is read as data
2	Write watchpoint	This RAM address is written
3	Trace flag	This RAM address is accessed — log to trace buffer without halting

The check is performed in the same cycle as the data read — the 4 extra bits arrive alongside the 32 data bits and feed directly into the debug logic. There is no comparison circuit, no cycle penalty, no LUT cost beyond a 4-input OR gate. A tag costs nothing to execution speed until it fires.

Setting a tag is a short read-modify-write over QSPI from the debug application: read the current 36-bit word, OR in the tag bit, write it back. The tag takes effect on the very next access to that address. Clearing is the same with AND NOT. The softcore does not need to halt for tag setup or removal; tags can be set and cleared while the machine is running at full speed.
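The set, clear, and check operations can be sketched as follows. A Python list stands in for the SRAM as seen through the QSPI window, and the placement of the four tag bits above the 32 data bits (`TAG_SHIFT = 32`) is an assumption, as are all function names.

```python
EXECUTE_BP = 1 << 0   # tag bit 0 on the EE Code SRAM bus
TRACE      = 1 << 3   # tag bit 3 on both buses
TAG_SHIFT  = 32       # assumption: tags sit above the 32 data bits
DATA_MASK  = (1 << 32) - 1

def set_tag(sram, word_index, tag):
    word = sram[word_index]                       # read the 36-bit word
    sram[word_index] = word | (tag << TAG_SHIFT)  # OR in the tag, write back

def clear_tag(sram, word_index, tag):
    word = sram[word_index]
    sram[word_index] = word & ~(tag << TAG_SHIFT)  # AND NOT, write back

def fetch(sram, word_index):
    """Return (data, tags); a non-zero tags value is the debug event."""
    word = sram[word_index]
    return word & DATA_MASK, word >> TAG_SHIFT

rom = [0xDEADBEEF, 0x4E714E71]
set_tag(rom, 1, EXECUTE_BP)
data, tags = fetch(rom, 1)
clear_tag(rom, 1, EXECUTE_BP)
```

The data and the tags come out of the same word in `fetch`, which is the software analogue of the tags arriving alongside the data with no separate comparison step.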

Granularity by architecture:

Architecture	Word size	Tag covers
8-bit (6502, Z80)	1 byte	4 consecutive bytes — most instructions are 1–3 bytes
16-bit (68000)	2 bytes	2 instruction words — one 68000 opcode + extension word
32-bit RISC (MIPS, ARM)	4 bytes	Exactly one instruction
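Since one tag covers four guest bytes, a debugger has to map a byte address to the word that carries its tag. A minimal sketch, assuming guest bytes are packed four per word in address order; `tag_coverage` is a hypothetical helper name.

```python
def tag_coverage(guest_addr):
    """Return (word_index, first_byte, last_byte) for a guest byte address."""
    word_index = guest_addr >> 2   # four guest bytes per 36-bit word
    first = word_index << 2
    return word_index, first, first + 3

# A breakpoint requested at 6502 address $C003 tags the word holding $C000-$C003.
cov = tag_coverage(0xC003)
```

On 8-bit and 16-bit guests a tag can therefore fire for a neighbouring byte in the same word; a debug application would presumably compare the exact address after the tag fires and resume if it does not match.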

The trace flag on the ROM bus is particularly useful for understanding which code paths a program takes — mark a subroutine entry point and every call to it is logged. On the RAM bus it captures every read or write to a variable, building a complete history of how a memory location changes over time, stored in the FireStorm DDR3 trace buffer without halting execution.

FireStorm DDR3 — Rewind, Recording, Expanded RAM, and Trace Buffer

FireStorm DDR3 holds the rewind ring buffer, the trace buffer for tagged memory accesses, recording data, and any overflow for systems whose fast RAM exceeds 4.5 MB. It is not in the critical path for the softcore CPU. On the Ant64C, FireStorm has access to ~2 GB DDR3; on the Ant64 it is 1 GB.

The split between rewind buffer and expanded system RAM is user-configurable. A personality exposes a slider or option in its settings: more rewind depth at one end, more RAM available to the emulated machine at the other. This means the simulated system can be given far more memory than the original hardware ever had:

System	Original RAM	With DDR3 expansion (Ant64C, minimal rewind)
Commodore 64	64 KB	Up to ~1.9 GB — nearly 30,000× original
ZX Spectrum 128	128 KB	Up to ~1.9 GB
Amiga 500	512 KB chip + 512 KB slow	Up to ~1.9 GB fast RAM alongside chip RAM in SRAM
SNES	128 KB	Up to ~1.8 GB (after ROM)
386 PC	4 MB typical	Up to ~1.9 GB — full extended memory

Software that was written to probe the memory size at boot and use whatever it finds will simply see a much larger machine. Software that was written for a fixed memory map will need the expansion mapped appropriately for that system — this is part of the personality's memory map configuration, not something the user needs to set manually.

The rewind buffer and expanded RAM share the same DDR3 pool via the FireStorm DDR3 arbiter. A larger rewind allocation shrinks the RAM expansion ceiling; a larger RAM expansion shrinks rewind depth. The default split is set per personality — a game-focused personality might default to maximum rewind; a development personality might default to maximum expanded RAM. The user overrides this in the personality options.
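The trade-off between rewind depth and expanded RAM is a simple partition of one pool. The sketch below assumes the slider is expressed as a fraction between 0 and 1 and ignores the trace buffer and recording data that also live in DDR3; `split_pool` and `rewind_fraction` are illustrative names.

```python
GIB = 1 << 30

def split_pool(pool_bytes, rewind_fraction):
    """Divide the DDR3 pool between the rewind buffer and guest RAM expansion."""
    if not 0.0 <= rewind_fraction <= 1.0:
        raise ValueError("rewind_fraction must be between 0 and 1")
    rewind = int(pool_bytes * rewind_fraction)
    return rewind, pool_bytes - rewind

rewind, expansion = split_pool(2 * GIB, 0.75)
```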

On the Ant64C, a personality with modest DDR3 requirements can dedicate the majority of the 2 GB to the rewind ring buffer, giving the depth figures in the Ring Buffer Depth table.

Why Original Machines Were Slow — And Why That No Longer Applies

Memory Was the Real Constraint

Original CPU designers were brilliant engineers making optimal decisions within the technology of their era. Their microarchitectures were designed around one fundamental constraint: memory was slow.

Machine	CPU	Memory type	Access time	CPU clock period	Wait states / effect
Altair 8800 / S-100 machines	Intel 8080	DRAM	300–450 ns	500 ns @ 2 MHz	1–2 wait states per access
Apple I / II	MOS 6502	DRAM	300 ns	1000 ns @ 1 MHz	~2 wait states; bus cycle = 2 clocks minimum
Commodore PET	MOS 6502	DRAM	300 ns	1000 ns @ 1 MHz	~2 wait states
TRS-80 Model I / III	Zilog Z80	DRAM	250–300 ns	565 ns @ 1.77 MHz	1–2 wait states + refresh cycles
Atari 400 / 800	MOS 6502	DRAM	250 ns	559 ns @ 1.79 MHz	ANTIC DMA steals up to 50% of cycles in graphics modes
Atari XL / XE	MOS 6502	DRAM	250 ns	559 ns @ 1.79 MHz	ANTIC DMA steals; same constraint as 400/800
NES / Famicom	Ricoh RP2A03	DRAM	200–250 ns	559 ns @ 1.79 MHz	PPU bus contention during sprite/tile fetch
Commodore 64	MOS 6510	DRAM	200 ns	1015 ns @ 0.985 MHz	VIC-II bad line steals ~40 cycles/line; ~15% overhead
ZX Spectrum 48K	Zilog Z80	DRAM	200 ns	286 ns @ 3.5 MHz	ULA steals ~29% of cycles during display
ZX Spectrum 128K	Zilog Z80	DRAM	200 ns	282 ns @ 3.55 MHz	ULA contention; slightly less severe than 48K
BBC Micro Model B	MOS 6502	DRAM	200 ns	500 ns @ 2 MHz	6845 CRTC takes bus priority during display fetch
Acorn Electron	MOS 6502	DRAM	200 ns	500 ns @ 2 MHz	ULA contention worse than BBC; ~50% loss in high-res modes
Amstrad CPC	Zilog Z80	DRAM	150 ns	250 ns @ 4 MHz	Gate array inserts 1 wait state every single memory cycle
MSX	Zilog Z80	DRAM	150–200 ns	280 ns @ 3.58 MHz	DRAM wait states; VDP on separate bus but CPU still contended
Atari ST	Motorola 68000	DRAM	150 ns	125 ns @ 8 MHz	2 wait states per access typical
Amiga 500	Motorola 68000	DRAM (chip RAM)	280 ns	140 ns @ 7.16 MHz	Agnus owns bus; 68000 gets at most 50% bandwidth
Master System	Zilog Z80	DRAM	150 ns	280 ns @ 3.58 MHz	DRAM wait states; VDP on shared bus
IBM PC XT	Intel 8088	DRAM	200 ns	210 ns @ 4.77 MHz	8-bit external bus; 3–4 wait states per 16-bit access
IBM PC AT	Intel 80286	DRAM	120 ns	125–167 ns @ 6–8 MHz	1–2 wait states; no cache
IBM PC 386	Intel 386	DRAM	80–120 ns	40 ns @ 25 MHz	2–4 wait states; no on-chip cache
IBM PC 486	Intel 486	DRAM	60–80 ns	30 ns @ 33 MHz	On-chip cache helps for hot code; misses very expensive
ZX Spectrum Next KS2	Z80 softcore (Artix-7)	SRAM IS61WV5128-10	10 ns	36 ns @ 28 MHz	Already uses SRAM — ~20× faster than original Spectrum DRAM — but still has wait states at 28 MHz due to interface routing overhead

Every timing quirk that makes these CPUs interesting to work with was a direct consequence of slow memory. The 6502's minimum 2-cycle instruction exists not because the logic needed two cycles, but because the bus needed one phase to drive the address and one phase to sample the data. The Z80's built-in refresh cycles are dedicated to DRAM refresh so that the programmer does not have to manage it. The 68000's 4–14 cycle effective address decode arises because each extension word is a separate DRAM bus cycle.

With two independent 200 MHz SRAM buses at 5 ns access time, none of these constraints exist. The wait states are gone. The refresh cycles are gone. The extension words arrive in a single 36-bit fetch. Instruction fetch and data access run simultaneously on separate buses — something the original hardware could never achieve regardless of how fast its individual chips were.

Shared Bus Contention

Many popular machines shared a single memory bus between the CPU and the display hardware. The display always won — it had hard real-time deadlines to feed the video signal. The CPU was left waiting.

Commodore 64 — the VIC-II steals the 6510's bus for sprite and character data fetching. On "bad lines" (the first raster line of each character row, so every eighth line and 25 per frame on a standard text screen) the VIC-II takes the bus for 40 cycles. The 6510 runs at approximately 90% speed on most lines, dropping to around 50% on bad lines. Programmers timed their code around bad lines for raster effects.

ZX Spectrum — the ULA steals the Z80's bus during display fetch for the active display area. Approximately 29% of Z80 cycles per frame are stolen. The Z80 is held in wait states while the ULA reads video RAM. The entire Spectrum demoscene is built around working with and around this constraint.

Acorn Electron — the ULA's contention was so severe that the 6502 at 2 MHz ran slower in practice than the BBC Micro's 6502 at 2 MHz with its separate CRTC bus. Programmers moved everything performance-critical to page zero and the small regions of RAM that escaped contention.

BBC Micro — the 6845 CRTC and 6502 share video RAM. The CRTC takes priority during display fetch. Programmers used shadow RAM and MODE 7 (which used a teletext chip with its own ROM, not contended RAM) for performance-critical work.

Amiga 500 — Agnus owns the chip RAM bus and grants the 68000 access in alternating cycles. The 68000 gets at best half the bus bandwidth, and less when DMA channels are active — bitplane fetch, sprite fetch, blitter, audio, disk. In a busy display frame the 68000 could be bus-starved to 30–40% of theoretical throughput. The distinction between "chip RAM" and "fast RAM" was fundamental to Amiga programming — code running in fast RAM (on the CPU's private bus) was dramatically faster.

Atari 2600 — the TIA chip drives the CPU's RDY line directly. The CPU is halted during horizontal blank and active display to synchronise with the video beam. Writing a 2600 game required cycle-counting every instruction against the TV beam position. There was no display list, no interrupt — the programmer was the display controller.

Atari 8-bit (400/800/XL/XE) — the ANTIC chip is a DMA processor that steals cycles from the 6502 for display list fetch and character/bitmap data. The number of cycles stolen per scanline depends on the display mode. In graphics-heavy modes the 6502 can lose 50% or more of its cycles.

Atari Falcon030 — the Motorola DSP56001 shares the same memory bus as the 68030. DSP DMA transfers compete directly with CPU memory access. The Falcon also connected the 32-bit 68030 to memory through only a 16-bit external data path on a 16 MHz bus, compounding the contention with a bus-width penalty on every 32-bit operation. Audio and video work that ran alongside CPU tasks on the original Falcon required careful partitioning of CPU and DSP time on the shared bus. On the Ant64, the DSP runs as a parallel hardware block accessing BSRAM on its own bus, with no contention with the CPU at all.

In the Ant64 personality architecture, the CPU accesses Graphics SRAM on its own bus and the video pipeline accesses BSRAM on its own bus. There is no shared bus. There is no arbiter. There are no DMA steals. The contention that consumed 10–50% of original CPU throughput does not exist.

In cycle-accurate mode this contention can be reintroduced artificially for software that depends on it. In turbo mode it is gone entirely — and for the overwhelming majority of software that tried to work around contention rather than exploit it, this is a pure and unconditional gain.

Optimised Softcore Microarchitecture

The softcore implements the original instruction set exactly — same opcodes, same architectural registers, same observable behaviour — using a modern FPGA microarchitecture:

Barrel shifter — the FPGA's LUT-based barrel shifter shifts by any amount in one clock cycle. The 68000's LSL Dn, Dm took 6 + 2n cycles for a word operand on original hardware, so shifting by 8 cost 22 cycles. On the softcore it costs one pipeline stage unconditionally.
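The reason a barrel shifter needs no loop can be shown with a software model of its structure. Each bit of the shift amount selects one fixed-distance stage (1, 2, 4, 8, 16), so a 32-bit shift is five multiplexer layers evaluated together rather than n sequential single-bit shifts. The function below is a sketch for shift amounts from 0 to width minus 1; its name and the default width are assumptions.

```python
def barrel_shift_left(value, amount, width=32):
    """Logical left shift built from log2(width) fixed-distance stages."""
    mask = (1 << width) - 1
    stage = 1
    while stage < width:
        if amount & stage:                  # mux select for this stage
            value = (value << stage) & mask
        stage <<= 1
    return value

result = barrel_shift_left(0x000000FF, 8)
```

The Python loop iterates over stages, not over shift distance: the number of stages is fixed by the operand width, which is what makes the hardware latency independent of the shift amount.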

BSRAM register file — the softcore's register file lives in on-chip BSRAM at 380 MHz. Register read and write complete in one clock cycle with no external bus traffic.

Pipelined execution — fetch, decode, execute, and writeback overlap. Many instructions that took 6–12 original cycles execute in 2–4 pipeline stages. The 36-bit SRAM fetch delivers opcode and extension words together, collapsing what were multiple sequential bus cycles into a single fetch.

Fast carry chain — N-bit addition using FPGA carry chain primitives completes in one clock cycle regardless of operand width.

Parallel custom chips — all custom chips (PPU, VDP, Agnus/Denise/Paula, sound chips) are synthesised as separate hardware blocks in the same fabric, running simultaneously with the CPU. No software cycles are spent on them. No time-slicing between CPU emulation and chip emulation.

Prefetch queue and SRAM burst cache — the IS61LPS51236B supports burst mode, delivering consecutive addresses at one cycle per word after the initial access latency. A burst of 8 × 36-bit words (32 bytes) from the EE Code SRAM costs 1 + 8 = 9 cycles rather than the 8 × (1 + 1) = 16 cycles of eight separate accesses that each pay the latency again. The personality developer can land this burst into a dedicated BSRAM prefetch buffer — a byte FIFO between the SRAM fetch unit and the decode stage. The decoder then consumes from BSRAM at 380 MHz with zero SRAM traffic until the buffer is exhausted, at which point the next burst is already in flight. Since BSRAM is dual-port, the burst controller writes the next block into one port while the decoder reads from the other — the refill is invisible to the pipeline.

For a 6502 tight inner loop, 32 bytes holds up to 32 one-byte instructions or a typical mix of 1–3 byte instructions. The entire loop body likely fits in one burst fetch and executes entirely from BSRAM — this is effectively a small instruction cache with fully predictable behaviour. For the 68000, 32 bytes covers several instructions including extension words, feeding the decoder without stalls even for variable-length instruction streams.

Branches and jumps invalidate the prefetch buffer, paying the initial SRAM latency plus a new burst on the taken path. For the tight inner loops that dominate retro CPU workloads, branches are infrequent and the sequential fetch win is substantial.

The optimal burst size and buffer depth vary by architecture — fixed-width ISAs (MIPS, ARM) benefit most predictably; variable-length ISAs (6502, 65816, Z80, 68000) still benefit significantly for sequential code. This is a recommended implementation pattern for personality developers rather than a base interface block feature. The original machines had no equivalent — any prefetch queue they had was limited to a few bytes and constrained by the same slow DRAM bus that limited everything else.
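The burst arithmetic from the prefetch discussion above can be checked with a short calculation. The burst cost follows the stated model of one latency cycle followed by one word per cycle; the cost of separate accesses is this sketch's own assumption (each standalone access pays the latency again), and the function names are illustrative.

```python
def burst_cycles(words, latency=1):
    """Cycles for one burst: pay the latency once, then one word per cycle."""
    return latency + words

def separate_cycles(words, latency=1):
    """Assumed cost if every word were fetched as a standalone access."""
    return words * (latency + 1)

def bytes_per_burst(words, data_bytes_per_word=4):
    """Guest bytes delivered, counting 4 data bytes per 36-bit word."""
    return words * data_bytes_per_word

burst = burst_cycles(8)
separate = separate_cycles(8)
payload = bytes_per_burst(8)
```

Under these assumptions the saving grows with burst length, which is one reason the best burst size depends on how far a given ISA typically runs before a taken branch discards the buffer.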

Optional CPU Enhancements and Dialects

Because the CPU is synthesised logic rather than a fixed chip, its instruction set and behaviour can be extended or upgraded at the HDL level. The spare LUT budget — substantial for all but the most complex personalities — accommodates these enhancements alongside the faithful core. Standard software always runs in the faithful mode; enhanced software uses extended registers or instructions that the original hardware never had.

Z80 — Three Mutually Exclusive Dialects

The Z80 softcore supports three CPU modes, selectable via a FRAM register and taking effect on the next CPU reset. All three share the same base Z80 HDL module; only the decode path changes:

Mode 0: Standard Z80  — authentic instruction set and timing
Mode 1: Z80N          — Spectrum Next extended opcodes ($ED prefix additions)
Mode 2: eZ80          — Zilog eZ80 extended opcodes ($ED prefix, 24-bit ADL mode)

Modes 1 and 2 are not supersets of each other — they are alternative dialects using the same encoding space differently. The LUT overhead of carrying both decode paths in fabric and switching between them is small. The bitstream does not need to be rebuilt to change CPU dialect.
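The "takes effect on the next CPU reset" behaviour amounts to a pending value that is latched at reset. A minimal sketch, with the class name `DialectSelect` and its methods invented for illustration; the actual FRAM register address and layout are not given in this document.

```python
MODES = {0: "Standard Z80", 1: "Z80N", 2: "eZ80"}

class DialectSelect:
    def __init__(self):
        self.pending = 0   # value last written to the mode register
        self.active = 0    # decode path the core is actually using

    def write_register(self, mode):
        if mode not in MODES:
            raise ValueError("mode must be 0, 1 or 2")
        self.pending = mode          # no effect on the running core yet

    def reset(self):
        self.active = self.pending   # latched on CPU reset

sel = DialectSelect()
sel.write_register(1)
before = MODES[sel.active]
sel.reset()
after = MODES[sel.active]
```

Latching at reset means the decode path never changes in the middle of an instruction stream, so running software always sees one consistent dialect.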

Z80N (used by the ZX Spectrum Next) adds instructions for hardware multiply, pixel manipulation, and memory paging to support the Next's extended hardware. Z80N software runs correctly on any Spectrum or Z80-based personality with the mode register set accordingly.

eZ80 (Zilog's own extended Z80, used in TI's CE-series graphing calculators and embedded systems) adds a 24-bit addressing mode (ADL — Address/Data Long) that expands the addressable memory space from 64 KB to 16 MB, plus additional instructions operating on 24-bit values. eZ80 mode is relevant for personalities targeting the TI-84 Plus CE family of calculators and embedded eZ80 systems.

68000 — Optional ISA Extensions

The 68000 softcore's decode table is writable via the FRAM window. Undefined opcode slots can be mapped to extended instruction implementations in fabric:

68010 extensions — loop mode (DBcc optimisation) and MOVEC/MOVES instructions. The 68010 was the first minor revision; loop mode measurably speeds tight iteration.
68020 instruction set — full 32-bit multiply/divide (MULS.L, DIVS.L), 32-bit PC-relative addressing, BFINS/BFEXTS bit field instructions, PACK/UNPK. Software compiled for the 68020 runs on the enhanced 68000 softcore without a different CPU core being present.

The decode table switch is a FRAM register write — no bitstream reload. The extension level is a personality component selectable from the OSD. Standard 68000 software is unaffected when extensions are off.

65816 — Optional Upgrade for All 6502-Based Machines

The WDC 65816 is a 16-bit extension of the 6502 architecture, backward-compatible with 6502 code in emulation mode. It adds a 24-bit address space (16 MB), 16-bit accumulator and index registers in native mode, a 16-bit stack pointer that lets the stack sit anywhere in bank 0 rather than being confined to page 1, and additional addressing modes. It was designed explicitly as an upgrade path from the 6502 — any 6502 machine can optionally run a 65816 softcore instead, with existing software running without modification in the 65816's emulation mode.

Since the CPU is synthesised logic, swapping the 6502 for a 65816 is a FRAM register write selecting a different decode path in the same softcore module. The 65816 is available as an optional enhancement on every 6502-based personality:

Machine	Original CPU	65816 upgrade gain
SNES	WDC 65C816	Native — the 65816 is the SNES's actual CPU
Apple IIgs	WDC 65C816	Native — IIgs runs natively at 2.8 MHz (fast) or 1 MHz (compat)
Apple II / IIe / IIc	MOS 6502 / 65C02	65816 gives full IIgs-compatible mode with 24-bit addressing
C64 SuperCPU mode	MOS 6510	65816 at effective 20 MHz equivalent — SuperCPU-compatible register map
BBC Micro	MOS 6502	24-bit address space, 16-bit registers, proper hardware stack — replaces Tube co-processor need for most use cases
Atari 8-bit	MOS 6502C	24-bit addressing expands beyond the ANTIC/GTIA 64KB limit for enhanced software
NES / Famicom	Ricoh RP2A03	65816 native mode available; expanded addressing useful for homebrew beyond the 64KB map
Atari 2600	MOS 6507	65816 emulation mode available; limited practical gain given TIA-constrained architecture

In every case the 65816 starts in its emulation mode by default: behaviour matches the original 6502's documented instruction set, with the stack fixed to page 1 and the direct page register (D) at zero, just as 6502 code expects. Switching to native mode enables the full 16-bit registers and 24-bit addressing. Software that never clears the emulation flag never notices the difference; standard 6502 code runs unchanged.
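On the 65816 the switch between emulation and native mode is made with the XCE instruction, which exchanges the carry flag with the emulation flag: CLC then XCE enters native mode, SEC then XCE returns to emulation mode. The sketch below models only those two flags and the stack pointer's high byte; the class name is an assumption, and the register-width flags are left out.

```python
class Flags65816:
    def __init__(self):
        self.carry = False
        self.emulation = True   # E = 1 after reset: 6502 emulation mode
        self.sp = 0x01FF

    def clc(self):
        self.carry = False

    def sec(self):
        self.carry = True

    def xce(self):
        # XCE exchanges the carry flag with the emulation flag.
        self.carry, self.emulation = self.emulation, self.carry
        if self.emulation:
            # Emulation mode pins the stack to page 1.
            self.sp = 0x0100 | (self.sp & 0xFF)

cpu = Flags65816()
started_in_emulation = cpu.emulation
cpu.clc(); cpu.xce()          # CLC / XCE: enter native mode
native = not cpu.emulation
cpu.sec(); cpu.xce()          # SEC / XCE: back to emulation mode
```

Code that never executes XCE stays in emulation mode for its whole run, which is why existing 6502 software is unaffected by the upgrade.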

The 65816 softcore on the Ant64 runs at full FPGA speed in turbo mode — substantially faster than any 65816 hardware ever shipped. The Apple IIgs ran its 65816 at 2.8 MHz fast mode; the Ant64 equivalent runs at hundreds of MHz effective throughput.

SNES Cartridge Co-Processors

The SNES was designed from the outset with a co-processor strategy — 16 additional pins on the cartridge edge allowed game cartridges to include dedicated chips for capabilities the base hardware lacked. These chips are synthesised in spare GoWin LUTs and enabled automatically when a ROM is identified as requiring one (via the ROM header's co-processor type byte):

Co-processor	Description	Notable games
Super FX GSU-1	Argonaut RISC CPU @ 10.5 MHz — polygon renderer	Star Fox, Stunt Race FX
Super FX 2 GSU-2	Super FX at up to 21 MHz, more ROM support	Yoshi's Island, Doom
SA-1	Full 65C816 @ 10.7 MHz + DMA + decompression — effectively a second SNES CPU	Super Mario RPG, Kirby Super Star
DSP-1 / 1A / 1B	NEC µPD77C25 — 16-bit multiply, sin/cos, vector/rotation	Super Mario Kart, Pilotwings
DSP-2	Converts Atari ST bitmap format to SNES bitplane format	Dungeon Master
DSP-3 / DSP-4	Single-game chips — AI and procedural track rendering	SD Gundam GX, Top Gear 3000
CX4	Hitachi HG51B169 @ 20 MHz — trig, wireframe, rotation	Mega Man X2, X3
ST-010 / ST-011	NEC µPD96050 — AI for Shogi games	Various Shogi titles
ST-018	21.44 MHz ARM60 — a 32-bit ARMv3 processor inside a SNES cartridge	Hayazashi Nidan Morita Shogi 2
Super Game Boy	Sharp SM83 core — the complete Game Boy CPU in a SNES cartridge	Super Game Boy

All fit comfortably in the 138K fabric alongside the main SNES core. The SA-1 and Super FX are the largest; both fit with room to spare. The ST-018 is the most exotic — a genuine ARMv3 softcore running inside a synthesised cartridge that is itself inside an FPGA personality, which is a satisfying level of nesting. In the GoWin fabric the Super FX runs at full FPGA speed, removing the original clock throttling.

BBC Micro — Tube Co-Processor

The BBC Micro's Tube interface connected a second processor via four bidirectional FIFOs. The co-processor ran user programs; the host 6502 handled all I/O. On original hardware the co-processor's speed was unconstrained — the 2 MHz Tube bus was the only bottleneck, and it only activated during OS calls. Software that used OS calls for all I/O (BBC BASIC does; properly-written machine code does) ran at whatever rate the co-processor ran.

On the Ant64, the Tube co-processor is synthesised in spare LUTs of the BBC Micro bitstream — not a physical second board. The host 6502 communicates via the standard Tube register addresses; the co-processor runs at full FPGA clock rate. The effective speed uplift for BASIC and OS-calling code is substantial.

C64 — MEGA65 / GS4510 Mode

The MEGA65 is an open-source implementation of the never-released Commodore 65, developed by the Museum of Electronic Games and Art. Its GS4510 CPU is an enhanced 65CE02 derivative with a 28-bit address space, 32-bit far-JSR/JMP/RTS, and software-selectable speeds from 1 MHz (authentic C64) to 40.5 MHz. The Ant64 C64 personality's enhanced mode adopts the MEGA65's VIC-IV and GS4510 register interface — MEGA65-aware software runs without modification, and standard C64 software is unaffected in the standard mode.

MSX — OCM-PLD / MSX++

The OCM-PLD project (originally the One Chip MSX, or 1chipMSX) is a mature FPGA reimplementation of the MSX2+ platform with an active community that has been enhancing it for well over a decade. The current firmware is branded MSX++ and runs on multiple FPGA boards (SX-2, SM-X, Zemmix Neo, and others). It is the reference enhancement target for the Ant64 MSX personality.

Key OCM-PLD enhancements the Ant64 MSX personality adopts:

Turbo CPU speeds — Z80 core switchable between 3.58 MHz (authentic), 5.37 MHz, 8 MHz, and faster, software-controlled via the switched I/O port scheme
PSG2 — a second AY-3-8910 PSG synthesised alongside the original, providing stereo audio for software that addresses the second PSG at its standard address
OPL3 — FM synthesis beyond the original MSX-Music (OPLL, YM2413) — richer FM audio for software that detects and uses it
V9990 GPU — the Yamaha V9990 was designed as an MSX VDP successor but was too expensive to include in standard hardware; the OCM synthesises it in spare LUTs, adding 256-colour bitmap modes, hardware sprites, and a pattern generator. V9990-aware software runs without modification against the standard register map
MegaRAM expansion — up to 4096 KB ASCII-mapped MegaRAM in spare LUTs, accessible to MSX-DOS and OS-9

The OCM-PLD's switched I/O port scheme ($40–$4F) is the compatibility target — software written for any OCM-compatible MSX++ machine runs on the Ant64 MSX personality without modification.

Dragon 32/64 and CoCo — CoCo3 / GIME Enhancements

The Dragon 32/64 and TRS-80 Color Computer share the same underlying architecture: both derive from a Motorola 6809 reference design pairing the MC6809E CPU with the MC6847 VDG and MC6883 SAM. The CoCo3 introduced the GIME chip (Graphics Interrupt Memory Enhancement), which added a paged MMU, extended video modes, and a software-selectable 1.79 MHz / 0.895 MHz CPU speed switch.

The CoCo3FPGA project by Gary Becker implemented the CoCo3 and its GIME chip in FPGA fabric, running the 6809 core at 25 MHz — over 13× the original speed — and adding 256-colour graphics modes including a 640×450 mode the original GIME never offered. Roger Taylor's RealCoCo (later ported to MiSTer as a combined Dragon32/64 + CoCo2/3 core) extends this work with further accuracy improvements.

The Ant64 Dragon/CoCo personality uses the GIME register interface as the compatibility target. Enhanced modes expose:

6809 at full FPGA turbo speed — the 6809's clean orthogonal architecture and 16-bit register operations benefit substantially from zero-wait-state SRAM and pipelined execution
GIME extended video — 256-colour modes, hardware text with true lowercase at 32/40/64/80 columns
Paged MMU — the GIME's 8 × 8KB page scheme gives the Dragon/CoCo a 512KB address space, extended to DDR3-backed memory for larger configurations
6809 → 6309 upgrade — Hitachi's HD6309 was a licensed 6809 clone with additional undocumented instructions (TFM block transfer, additional registers, hardware divide) that the Dragon/CoCo community has documented thoroughly. The 6309 mode is a FRAM register switch, and 6309-aware software runs at full speed

The Dragon 32/64 and CoCo share enough hardware that they coexist in a single bitstream, switchable from the OSD — the same combined-machine approach as the Atari 16/32-bit personality.

Throttling Modes

The softcore's execution speed is controlled by a fixed-point cycle budget accumulator in the interface block. Each host clock cycle the accumulator advances by a programmable speed_ratio value. When it reaches 1.0 the pipeline is allowed one guest CPU cycle; otherwise it stalls. Disabling the gate altogether gives full turbo speed; setting speed_ratio to 1.0 gives exact original timing.

// Each host clock cycle:
cycle_budget += speed_ratio;       // fixed-point register, written via FRAM
if (cycle_budget >= 1.0) {
    cycle_budget -= 1.0;
    allow_one_guest_cycle();       // pipeline gate opens
}
// else: pipeline stalls for one host cycle

Changing speed_ratio is a single FRAM register write, and the new value takes effect on the next host cycle: no reset, no reconfiguration, no glitch. The custom chips always run at their correct video-synchronised rate regardless of the CPU throttle setting.
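As a rough behavioural model of the gate described above, the following Python sketch simulates the accumulator in integer fixed point. The Q16.16 format, the function names, and in particular the normalisation of speed_ratio by the guest-to-host clock ratio are assumptions made for illustration; the text only states that 1.0 corresponds to original timing.

```python
FRAC_BITS = 16                 # assumed Q16.16 fixed-point format
ONE = 1 << FRAC_BITS           # fixed-point representation of 1.0


def increment_for(speed_ratio, guest_hz, host_hz):
    """Per-host-cycle increment (assumption: ratio scaled by guest/host clock)."""
    return round(speed_ratio * guest_hz / host_hz * ONE)


def run_gate(increment, host_cycles):
    """Count the guest cycles granted over a number of host clock ticks."""
    budget = 0
    granted = 0
    for _ in range(host_cycles):
        budget += increment
        if budget >= ONE:
            budget -= ONE
            granted += 1       # pipeline gate opens for one guest cycle
        # otherwise the pipeline stalls for this host cycle
    return granted
```

With a toy 256 Hz host clock and a 1 Hz guest, `run_gate(increment_for(1.0, 1, 256), 2560)` grants 10 guest cycles and a ratio of 0.5 grants 5. Sixteen fractional bits are coarse for a real clock ratio of several hundred to one, so an actual register would presumably be wider.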

Decoupled and Coupled Speed

Two modes control the relationship between CPU speed and custom chip speed:

Decoupled — the CPU softcore runs at maximum FPGA speed; the custom chips run at their original pixel/bus clock. The custom chips see correctly-timed bus cycles for any shared memory access via wait state insertion at the bus interface. Private fast RAM accesses complete with zero wait states. This is equivalent to adding fast RAM to the original machine: software runs faster wherever it uses private memory, while timing-sensitive display hardware is unaffected.

Coupled — the custom chips are also clocked faster via their FPGA clock divider. The whole machine accelerates uniformly. The display output is re-timed by the overlay block so HDMI output stays at standard frame rates while the hardware runs multiple frames per display frame internally.

Preset	CPU	Custom chips	Notes
Authentic	Throttled to original	1×	Maximum compatibility — exact original timing
Turbo	Uncapped	1×	Decoupled — fastest compatible mode
2×	2×	2×	Coupled uniform speedup
4×	4×	4×	Coupled maximum speedup
Maximum	Full FPGA rate	Decoupled	Timing-sensitive software may break

Mode	Speed ratio	Effective guest speed	Primary use
Turbo	Gate disabled	Maximum (machine-dependent; see the speed multiplier tables)	Normal production use
×10	10.0	10× original	Fast-forward through slow sections
×4	4.0	4× original	Rapid test of timing-sensitive code
×2	2.0	2× original	Gentle fast-forward while behaviour stays easy to follow
×1 (original)	1.0	Exact original speed	Full cycle accuracy — CPU timing mode
×0.5	0.5	Half original speed	Watch raster effects build line by line
×0.25	0.25	Quarter original speed	Single-scanline debugging
×0.1	0.1	One tenth speed	Instruction-by-instruction visual tracing
Step	0 (halted)	One instruction on demand	Deep instruction-level debugging

Any fractional value is valid — speed_ratio is a full fixed-point register, not a discrete selector. A developer can dial in 0.03× to step through code almost frame by frame.

An important property of fractional modes: the display output continues running at full frame rate even while the simulated CPU runs at a fraction of original speed. The video chips are locked to the pixel clock, not the CPU throttle. At ×0.25, the display still refreshes at its normal rate, but the CPU completes only a quarter of a frame's worth of instructions per displayed frame, so you can watch a raster effect build up scanline by scanline on the live display while the CPU moves forward at a controlled pace.
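To put numbers on the fractional modes, the sketch below works out how much guest CPU time elapses per displayed frame. The function names are illustrative; the PAL C64 figures (63 CPU cycles per raster line, 312 lines per frame) are the standard timings for that machine.

```python
def guest_cycles_per_displayed_frame(cycles_per_line, lines_per_frame, speed_ratio):
    """Guest CPU cycles executed while the display draws one full frame."""
    return cycles_per_line * lines_per_frame * speed_ratio


def scanlines_per_displayed_frame(lines_per_frame, speed_ratio):
    """Guest scanlines' worth of CPU time that elapses per displayed frame."""
    return lines_per_frame * speed_ratio


# PAL C64 timing: 63 CPU cycles per raster line, 312 lines per frame.
PAL_C64_CYCLES_PER_LINE = 63
PAL_C64_LINES = 312
```

At ×1 the PAL C64 executes 19,656 cycles per frame. At ×0.25 it executes 4,914, which is 78 raster lines' worth of CPU work per displayed frame, so a full-screen raster effect takes four displayed frames to complete.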

Typical Debug Workflow

1. Start at Turbo — run at full speed, gameplay is normal
2. Approach region of interest
3. FRAM write: set speed to ×0.5 — everything slows, timing issues become visible
4. Tagged memory address fires watchpoint — pipeline halts automatically
5. Inspect registers and memory state via QSPI
6. FRAM write: set speed to Step
7. Single-step through instructions, watching registers and display update
8. FRAM write: set speed to ×1 — resume at original speed for cycle-accurate validation
9. FRAM write: Turbo — return to full speed

This transition from turbo to slow-motion to halted to stepping and back is entirely register writes — no bitstream reload, no reset, no loss of machine state at any point.

Interaction with SRAM Debug Tags

Speed throttling and SRAM debug tags compose naturally. A common pattern:

Run at Turbo with a trace tag on a RAM region — collect an access log without slowing down
Switch to ×0.25 when approaching a known-problematic area — slow enough to observe
Execute breakpoint on ROM fires — pipeline halts mid-instruction
Inspect state, modify a RAM value via QSPI, resume at ×0.5 to watch the corrected behaviour

Rewind — Running Backwards

The rewind capture block records CPU and RAM state changes into a ring buffer in FireStorm DDR3. Each entry stores an XOR delta — the old value XOR'd with the new value.

The reason XOR works symmetrically in both directions:

delta = old XOR new

Undo:  current_value XOR delta  =  new XOR (old XOR new)  =  old   ✓
Redo:  current_value XOR delta  =  old XOR (old XOR new)  =  new   ✓

The same stored entry applies equally for undo and redo. Directionality comes entirely from which way the ring buffer pointer moves — backwards undoes, forwards redoes. The ring buffer pointer position is the current point in time.
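The symmetry can be checked with a small Python model of a single tracked value. The `Timeline` class and its method names are hypothetical, and a plain list stands in for the DDR3 ring buffer.

```python
class Timeline:
    """Tracks one value; stores only XOR deltas plus a position pointer."""

    def __init__(self, initial):
        self.value = initial
        self.deltas = []
        self.pos = 0            # pointer position = current point in time

    def record(self, new):
        del self.deltas[self.pos:]            # assumption: new writes discard redo history
        self.deltas.append(self.value ^ new)  # delta = old XOR new
        self.value = new
        self.pos = len(self.deltas)

    def step_back(self):
        if self.pos == 0:
            return False
        self.pos -= 1
        self.value ^= self.deltas[self.pos]   # undo: new XOR delta = old
        return True

    def step_forward(self):
        if self.pos == len(self.deltas):
            return False
        self.value ^= self.deltas[self.pos]   # redo: old XOR delta = new
        self.pos += 1
        return True
```

The same list entry is used by `step_back` and `step_forward`; only the direction in which `pos` moves differs.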

Entry Format — Bitmask Frames

Rather than emitting a separate entry per register or per memory write, the capture block packages each instruction's changes into a single delta frame:

Delta frame:
  [entry type: 2 bits]
  [changed_mask: N bits]       ← one bit per CPU register/flag, CPU-specific
  [PC delta: 16 bits]          ← signed; absolute entry emitted for jumps > ±32KB
  [XOR delta per set bit]      ← only CPU registers that actually changed
  [RAM writes: address + XOR delta, one per write in this instruction]
  [chipset reg writes: reg ID + XOR delta, one per chipset write]

Checkpoint frame (emitted periodically):
  [entry type: 2 bits]
  [full CPU register snapshot]
  [full chipset register snapshot]

PC absolute frame (for long jumps):
  [entry type: 2 bits]
  [full PC value]

The changed_mask means zero-delta registers are never stored — if X, SP and most flags don't change in a tight loop, they contribute nothing to the ring buffer. The capture block maintains a small register mirror in BSRAM (6 bytes for a 6502, 72 bytes for a 68000) and compares on each instruction retirement to build the mask.

The capture block computes RAM write deltas by reading the old value from SRAM before the write completes — this read-before-write happens in the capture block's own independent pipeline and adds no stall to the CPU.
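A minimal sketch of the mask-building step, assuming a 6502-style register set. The bit order in `REGS` and the function names are illustrative assumptions, not the actual frame layout.

```python
from typing import Dict, List, Tuple

REGS = ("A", "X", "Y", "SP", "P")   # assumed bit order for the changed_mask


def build_frame(mirror: Dict[str, int], retired: Dict[str, int]) -> Tuple[int, List[int]]:
    """Compare the register mirror with the retired state; keep non-zero deltas only."""
    mask = 0
    deltas = []
    for bit, name in enumerate(REGS):
        delta = mirror[name] ^ retired[name]
        if delta:
            mask |= 1 << bit
            deltas.append(delta)
    return mask, deltas


def apply_frame(state: Dict[str, int], mask: int, deltas: List[int]) -> Dict[str, int]:
    """Apply a frame to a state; works identically for undo and redo."""
    restored = dict(state)
    pending = iter(deltas)
    for bit, name in enumerate(REGS):
        if mask & (1 << bit):
            restored[name] ^= next(pending)
    return restored
```

For `LDA #$42` executed with A = $00 and the zero flag set, only A and the status byte change, so the frame carries two deltas and X, Y and SP cost nothing.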

Chipset Registers Are Fully Reversible

On original hardware, many chipset registers were write-only — the VIC-II's sprite coordinates, the SID's envelope parameters, Agnus DMA pointers, the SNES PPU scroll registers. The CPU could write them but never read them back, so their internal state was inaccessible.

Since the chipset is synthesised logic running in the FPGA, every register has a readable internal state regardless of what the original chip exposed. The personality's address decoder routes reads on write-only addresses to the actual internal flip-flops rather than the original chip's external read path — transparently, with no special addressing required. This is described in detail in Chipset Register Read-Back.

The capture block uses this read-back path internally to fetch the old value before each chipset write, enabling correct XOR delta computation even for registers that were write-only on the original hardware. Chipset register writes are captured in the same ring buffer as CPU registers and RAM writes. They are far less frequent than RAM writes: a busy C64 frame might involve a few hundred VIC-II register writes compared to many thousands of RAM accesses, so the additional ring buffer bandwidth is modest.

Bytes per instruction after compression:

CPU	Registers tracked	Typical bytes/instruction	Example instruction
6502	A, X, Y, SP, PC, 5 flags	~4–6 bytes	`LDA #$42` = mask + PC delta + A delta + NZ delta
Z80	14 main + 4 alt + 2 index + flags	~5–8 bytes	`LD A,n` = mask + PC delta + A delta + flags delta
68000	8 D + 7 A + PC + SR	~8–14 bytes	`MOVE.L D0,D1` = mask + PC delta + D1 delta + SR delta
MIPS R3000	32 GPR + PC + HI/LO	~10–16 bytes	`ADDU D,S,T` = mask + PC delta + dest delta

Chipset register writes add a few bytes per write event, but as these are infrequent compared to CPU instructions the per-instruction average impact is small.
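The per-instruction sizes in the table can be reproduced with simple byte counting. The sketch below assumes the entry type and changed_mask are packed together and rounded up to whole bytes, and that a 6502 RAM write costs a 2-byte address plus a 1-byte delta; neither packing detail is specified here, so treat the results as plausibility checks only.

```python
def frame_bytes(header_bits, pc_delta_bytes, reg_delta_bytes,
                ram_writes=0, addr_bytes=2, data_bytes=1):
    """Estimate the size of one delta frame in bytes."""
    header = (header_bits + 7) // 8                  # entry type + changed_mask, byte-aligned
    ram = ram_writes * (addr_bytes + data_bytes)     # address + XOR delta per RAM write
    return header + pc_delta_bytes + sum(reg_delta_bytes) + ram


# 6502 examples, assuming an 8-bit header (2-bit type + 6-bit mask):
lda_immediate = frame_bytes(8, 2, [1, 1])            # A delta + flags delta
sta_absolute = frame_bytes(8, 2, [], ram_writes=1)   # no register change, one RAM write
```

Under these assumptions `LDA #$42` costs 5 bytes and a store to an absolute address costs 6, both inside the ~4–6 byte range quoted for the 6502.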

Checkpoint Entries for Fast Seeking

A full snapshot — CPU registers and full chipset register state — is emitted every N instructions (configurable). DeMon's jog dial fast-rotation mode jumps checkpoint-to-checkpoint rather than entry-by-entry, enabling coarse seeking across long timelines. The debug application displays a timeline bar with checkpoint markers as visible anchors.
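Seeking with checkpoints amounts to finding the nearest snapshot at or before the target and replaying deltas forward from there. The sketch below models this for a single tracked value; the function names and the list-based storage are illustrative assumptions.

```python
import bisect


def build_log(values, every):
    """Turn a history of values into XOR deltas plus periodic full snapshots."""
    deltas = [a ^ b for a, b in zip(values, values[1:])]
    checkpoint_idx = list(range(0, len(values), every))
    snapshots = [values[i] for i in checkpoint_idx]
    return deltas, checkpoint_idx, snapshots


def nearest_checkpoint(checkpoint_idx, target):
    """Index of the last checkpoint at or before target (coarse jog-dial seek)."""
    return checkpoint_idx[bisect.bisect_right(checkpoint_idx, target) - 1]


def seek(deltas, checkpoint_idx, snapshots, target):
    """Reconstruct the value at instruction index target (target >= 0)."""
    k = bisect.bisect_right(checkpoint_idx, target) - 1
    state = snapshots[k]
    for i in range(checkpoint_idx[k], target):
        state ^= deltas[i]        # replay forward from the checkpoint
    return state
```

The cost of a seek is bounded by the checkpoint interval rather than by the distance from the current position, which is the property coarse timeline navigation relies on.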

Ring Buffer Depth

The ring buffer is a power-of-2 block of FireStorm DDR3. On the Ant64C, FireStorm has access to ~2 GB DDR3 — the ring buffer can occupy all of it if the personality does not need DDR3 for other purposes. On the Ant64 it is 1 GB. The Ant64S has 8 MB PSRAM, giving much shallower depth.

CPU	Compressed rate	Ant64S (8 MB)	Ant64 (1 GB)	Ant64C (~2 GB)
6502 @ 1 MHz	~1.1 MB/s	~7 seconds	~15 minutes	~29 minutes
Z80 @ 3.5 MHz	~3.5 MB/s	~2 seconds	~5 minutes	~9.5 minutes
68000 @ 7.16 MHz	~28 MB/s	<1 second	~36 seconds	~72 seconds
MIPS R3000 @ 33.8 MHz	~120 MB/s	<1 second	~8 seconds	~17 seconds

These figures assume the full DDR3 pool is allocated to the rewind buffer. Users can trade rewind depth for expanded system RAM — giving the emulated machine more memory than the original hardware ever had. See FireStorm DDR3 for the full tradeoff options.
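The depth figures follow from dividing buffer size by the compressed capture rate. The sketch below assumes binary megabytes and gigabytes and treats the table values as rounded.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def rewind_depth_seconds(buffer_bytes, rate_mb_per_s):
    """Seconds of history that fit in the ring buffer at a given capture rate."""
    return buffer_bytes / (rate_mb_per_s * MB)


depth_6502_ant64s = rewind_depth_seconds(8 * MB, 1.1)      # 8 MB PSRAM
depth_6502_ant64 = rewind_depth_seconds(1 * GB, 1.1)       # 1 GB DDR3
depth_68000_ant64 = rewind_depth_seconds(1 * GB, 28)
depth_r3000_ant64 = rewind_depth_seconds(1 * GB, 120)
```

These evaluate to roughly 7 seconds, 15.5 minutes, 37 seconds and 8.5 seconds respectively, in line with the table. Halving the buffer to free DDR3 for guest RAM halves each figure.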

What rewind can and cannot do. The capture block records CPU registers, RAM writes, and chipset register writes, which together constitute the complete internal state of the emulated machine. Rewinding fully restores the CPU, all RAM, and all chipset state including display registers, audio parameters, DMA pointers, and sprite tables. The display and audio output resettle to match the restored state within one frame.

Interrupts are handled correctly. An interrupt firing between two instructions produces a sequence of RAM writes (return address and pushed registers onto the stack), a PC delta (jump to the interrupt vector), and register deltas (status flags changed). From the capture block's perspective this is indistinguishable from any other sequence of writes — it records what the hardware actually did, regardless of why. Rewinding through an interrupt restores the stack to its pre-interrupt state, restores the PC to the instruction that was about to execute, and restores the flags. The interrupt appears to un-fire cleanly.

The interrupt pending state in the chipset is also restored — the VIC-II raster interrupt flag, the Z80 interrupt acknowledge, the timer register that caused the interrupt — because chipset register writes are captured alongside everything else. When the machine runs forward again from the rewound point, the same interrupt will re-fire at the same instruction boundary, which is the correct and expected behaviour.

Multi-CPU systems. Machines with more than one CPU — the Mega Drive (68000 + Z80), SNES (65816 + SPC700), Saturn (two SH-2s) — need all CPUs captured in the same ring buffer. Each delta frame includes a cpu_id field identifying which CPU produced the entry:

Delta frame:
  [entry type: 2 bits]
  [cpu_id: 2 bits]         ← 0 = main CPU, 1 = sub CPU, 2+ = additional
  [changed_mask: N bits]
  [PC delta + register deltas + RAM/chipset writes]

A single ordered buffer is important for machines where the two CPUs share memory or interact via bus arbitration. On the Mega Drive, the 68000 and Z80 share the Z80 RAM region and communicate via the bus request/grant mechanism — their writes to shared memory are interleaved, and the relative ordering is semantically meaningful. A single buffer preserves that ordering exactly; two separate buffers cannot. Rewinding replays the interleaved stream in reverse, restoring both CPUs together.

For loosely coupled systems like the SNES — where the 65816 and SPC700 interact only through four dedicated communication ports — two separate buffers are optionally supported, allowing each CPU's timeline to be scrubbed independently. Single-buffer mode remains the default and is always correct.
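The value of a single ordered buffer shows up as soon as two CPUs write the same location. The sketch below logs writes from two CPUs into one list and rewinds them in reverse; the entry layout, function names, and the shared address are hypothetical, and popping from a list stands in for moving the ring buffer pointer backwards.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]    # (cpu_id, address, xor_delta)


def record_write(log: List[Entry], memory: Dict[int, int],
                 cpu_id: int, addr: int, value: int) -> None:
    """Append one write to the shared log, in the order it actually happened."""
    old = memory.get(addr, 0)
    log.append((cpu_id, addr, old ^ value))
    memory[addr] = value


def rewind(log: List[Entry], memory: Dict[int, int], steps: int) -> None:
    """Undo the most recent writes, newest first, regardless of which CPU made them."""
    for _ in range(min(steps, len(log))):
        _cpu_id, addr, delta = log.pop()
        memory[addr] = memory.get(addr, 0) ^ delta
```

If the main CPU writes $11, the sub CPU writes $22, and the main CPU then writes $33 to the same address, rewinding one step at a time yields $22, then $11. With two separate per-CPU logs the intermediate value $22 could not be placed between the two main-CPU writes without extra ordering information.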

The only state that cannot be reversed is external I/O — signals that left the FPGA into the real world: MIDI output to external hardware, serial data, disk writes to physical media. These happened and cannot be un-happened. For the overwhelming majority of debug and game use cases this is irrelevant — the interesting state is internal.

Jog Dial Time Scrubbing

DeMon has a dedicated jog dial used for system supervision and debug tasks — separate from the 8 jog dials on Pulse that control sequencer and MIDI parameters. In personality debug mode this dial becomes a physical time scrub control — no PC, no debug application window required, hands directly on the timeline:

Rotate clockwise — advance time (positive speed_ratio); faster rotation = higher speed
Rotate anticlockwise — rewind (negative speed_ratio); faster rotation = faster rewind
Push — pause / resume (toggle between halted and last active speed)
Push and hold + rotate — fine single-step in either direction; each detent = one guest instruction
Fast spin — jumps checkpoint to checkpoint for coarse timeline navigation

DeMon translates jog dial events into FRAM register writes to the speed_ratio register via QSPI — the same mechanism the debug application uses. The transition from running forward to halted to rewinding to stepping is entirely physical, with the display updating in real time as the ring buffer is traversed in either direction.

This makes the debug workflow tactile: overshoot a breakpoint, spin the dial back, find the exact instruction, push to pause, inspect via the debug application. The combination of hardware-speed execution, XOR delta rewind, bitmask-compressed frames, and physical scrub control gives a debugging experience that no original hardware developer of any of these machines could have imagined.

Debugging and Development via the FRAM Interface

The SG2000 small core runs AntOS and has access to the personality via the QSPI FRAM interface for system-level tasks — save states, rewind management, scripted automation, system switching. The SG2000 big core, when a debug session is active, runs a dedicated debug application with the full 1 GHz C906 to itself and no OS overhead. This application has read/write access to the entire personality via the same FRAM interface:

Softcore CPU registers — read or write any register at any time. When the softcore is halted, register values are stable and coherent.

Guest system memory — read or write any address in the emulated machine's RAM or ROM shadow. Inspect the stack, patch variables, inject test data, verify game state.

Custom chipset registers — read or write any register in the emulated custom chips via the FRAM interface. This includes registers that were write-only on the original hardware — the VIC-II's sprite coordinates, the SID's envelope state, Agnus DMA pointers, SNES PPU scroll registers. See Chipset Register Read-Back below.

Debug control — halt, run, single-step, reset. Read the cycle counter.

Speed throttle and time scrubbing — write the speed_ratio FRAM register to switch between Turbo, any fractional speed, Step, or negative (rewind) values instantly without losing machine state. DeMon's dedicated jog dial maps directly to this register for hands-on time scrubbing without a PC. See Throttling Modes.

SRAM debug tags — set execute breakpoints, read/write watchpoints, and trace flags on any memory address by writing the 4 extra bits of the relevant 36-bit SRAM word. Tags take effect on the very next access. No LUT cost; no cycle penalty until fired. See The Extra 4 Bits — Hardware Debug Tags.

Trace buffer — read the log of all accesses to trace-tagged addresses, stored in FireStorm DDR3.
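The SRAM debug tags listed above can be pictured as four flag bits stored above the 32 data bits of each 36-bit word. The bit assignment and function names below are assumptions for illustration; the list only states that the four extra bits carry execute breakpoints, read/write watchpoints, and trace flags.

```python
DATA_MASK = (1 << 32) - 1
TAG_SHIFT = 32

# Assumed bit assignment for the 4 tag bits.
TAG_EXEC_BREAK = 0b0001
TAG_READ_WATCH = 0b0010
TAG_WRITE_WATCH = 0b0100
TAG_TRACE = 0b1000

_TRIGGERS = {"execute": TAG_EXEC_BREAK, "read": TAG_READ_WATCH, "write": TAG_WRITE_WATCH}


def pack_word(data, tags):
    """Combine 32 data bits and 4 tag bits into one 36-bit SRAM word."""
    return ((tags & 0xF) << TAG_SHIFT) | (data & DATA_MASK)


def unpack_word(word):
    """Split a 36-bit word back into (data, tags)."""
    return word & DATA_MASK, (word >> TAG_SHIFT) & 0xF


def access(word, kind):
    """Return (data, halt, trace) for an 'execute', 'read' or 'write' access."""
    data, tags = unpack_word(word)
    halt = bool(tags & _TRIGGERS[kind])
    trace = bool(tags & TAG_TRACE)
    return data, halt, trace
```

Because the tags arrive with the data in the same memory cycle, checking them needs no separate comparator lookup, which is consistent with the "no cycle penalty until fired" property.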

The big core debug application exposes a GDB remote stub over TCP/WiFi via the AntOS network stack, so a developer connects from a standard IDE or debugger on a PC with no special hardware. The personality runs at full hardware speed and halts on command — the entire machine state readable and writable in microseconds over QSPI.
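For readers unfamiliar with what a GDB remote stub exchanges on the wire, the sketch below frames and validates packets in the GDB Remote Serial Protocol format: a `$`, the payload, a `#`, and a two-digit hex checksum that is the payload byte sum modulo 256. It is a generic illustration of the protocol framing, not the Ant64 stub itself, and it omits escaping and acknowledgements.

```python
def rsp_checksum(payload):
    """Modulo-256 sum of the payload bytes."""
    return sum(payload.encode("ascii")) % 256


def rsp_frame(payload):
    """Wrap a payload as $payload#cc."""
    return "$%s#%02x" % (payload, rsp_checksum(payload))


def rsp_unframe(packet):
    """Validate framing and checksum, returning the payload."""
    if not (packet.startswith("$") and len(packet) >= 4 and packet[-3] == "#"):
        raise ValueError("malformed packet")
    payload = packet[1:-3]
    if int(packet[-2:], 16) != rsp_checksum(payload):
        raise ValueError("checksum mismatch")
    return payload
```

A debugger's "read all registers" request is the payload `g`, which frames as `$g#67`; a stub answers with the register contents as hex, which in this design would come from reading the softcore's registers over QSPI.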

Chipset Register Read-Back

On original hardware, chipset register access was highly asymmetric. Many registers were write-only — reading the same address returned open bus noise, the last data bus value, or nothing meaningful. In some cases read and write were mapped to entirely different addresses. Some familiar examples:

Machine	Register	Original write	Original read
C64	SID frequency	`$D400` — sets voice 1 freq low	`$D400` — open bus or last byte
Amiga	Colour register	`$DFF180` (COLOR00) — sets colour	`$DFF180` — no-op on original hardware
Amiga	Blitter status	write registers only	`$DFF002` (DMACONR) — separate read address
SNES	BGMode	`$2105` — write-only	`$213C–$213F` — completely separate status ports
NES	PPUCTRL	`$2000` — write-only	`$2002` (PPUSTATUS) — different address, different data

Since the entire chipset address space is defined by the personality developer at implementation time, the interface block's address decoder already knows which addresses correspond to write-only registers. There is no need for any special convention — reading $D400 simply returns the actual SID flip-flop value, because the decoder routes reads on that address to the internal flip-flop rather than the original chip's external read path. This is a detail of the personality's address decoder wiring, invisible to everything above it.

From the debug application's perspective, every chipset register is readable at its normal address. From the capture block's perspective, it always reads internal flip-flops — it has no concept of the original chip's external interface at all. The "hidden read" is just how the address decoder is built; no extra bits, no address space doubling, no user-visible change to the memory map.

This serves two distinct purposes:

For the rewind capture block — to compute delta = old XOR new on a write to a write-only register, the capture block reads the current flip-flop state through the internal path before each write. This is what makes XOR delta rewind correct for the full chipset, including registers the original programmer could never read.

For the debug application — the SG2000 big core can inspect the complete internal state of the emulated chipset at any normal address. The SID's internal envelope phase, the Copper's current instruction pointer, the blitter's internal accumulator, the PPU's internal scroll latch — anything that exists as a flip-flop in the design is readable at its normal address.

The personality developer wires the internal flip-flop outputs to both the capture block and the read path during chipset implementation. Since the full address map is known at synthesis time, the routing is determined once and baked into the bitstream.
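The two uses of the internal read path can be modelled in a few lines. The class below is a hypothetical behavioural sketch: a dictionary stands in for the chip's flip-flops, registers are assumed to reset to zero, and the method names are invented for illustration.

```python
class ChipsetModel:
    """Write-only registers whose internal state is still readable internally."""

    def __init__(self):
        self.flops = {}          # internal flip-flop state, keyed by address
        self.capture_log = []    # (address, xor_delta) entries for rewind

    def cpu_write(self, addr, value):
        old = self.flops.get(addr, 0)               # internal read-back before the write
        self.capture_log.append((addr, old ^ value))
        self.flops[addr] = value

    def debug_read(self, addr):
        """Read at the normal address, routed to the flip-flops."""
        return self.flops.get(addr, 0)

    def rewind_one(self):
        addr, delta = self.capture_log.pop()
        self.flops[addr] ^= delta
```

Writing $34 and then $12 to $D400 leaves `debug_read(0xD400)` returning $12, and one `rewind_one()` restores $34, even though the original SID would never have let the CPU read either value back.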

All figures in the tables below represent the estimated maximum speed multiplier of the Ant64 FPGA personality relative to the original hardware running at its original specification. The multiplier accounts for three compounding factors: clock speed ratio, CPI improvement from pipelined zero-wait-state execution, and removal of bus contention where it applied on the original hardware.

Turbo mode only — the multiplier shown is the maximum speed the softcore achieves with throttling disabled. Cycle-accurate mode always runs at 1× original speed by definition and is not listed.

A note on fabric speed. The GoWin GW5AT-138K uses a 22nm process, placing it in a comparable generation to the 28nm Xilinx Kintex-7 and 20nm Intel Arria 10. The hard RISC-V A25 core embedded in the related GW5AST variant runs at 400 MHz in silicon; BSRAM is rated at 380 MHz. For pipelined CPU softcores on this fabric, achievable synthesis frequencies are substantially higher than on the older GoWin GW1N (~55nm) devices where the open-source A500 and SNES cores were originally demonstrated. The multipliers below reflect the 22nm capability.

Original RAM speed, bus contention severity, and clock speed all feed into the figure — which is why some slower machines have higher multipliers than faster ones. A C64 at 1 MHz with severe bus contention gains more from the architecture than a machine that was already running from fast RAM.
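The "Est. effective speed" column in the tables is simply the original clock multiplied by the estimated multiplier. A short sketch, using figures taken from the table rows:

```python
def effective_mhz(original_mhz, multiplier):
    """Equivalent original-architecture clock implied by a speed multiplier."""
    return original_mhz * multiplier


c64 = effective_mhz(0.985, 660)      # Commodore 64
zx81 = effective_mhz(3.25, 540)      # ZX81
a500 = effective_mhz(7.16, 230)      # Amiga 500
```

These come to about 650 MHz, 1,755 MHz and 1,647 MHz, matching the rounded table entries. The effective speed is a throughput equivalence, not a physical clock frequency.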

A note on bus width and prefetch queues. Several CPUs had internal registers wider than their external data bus — the 68008's 32-bit registers on an 8-bit bus, the 65816's 16-bit registers on an 8-bit bus, the 8088's 16-bit registers on an 8-bit bus. On the Ant64, the 36-bit SRAM delivers 4 bytes per cycle, collapsing multiple original bus cycles into one. Where multipliers reflect this gain, it is noted per machine.

Some of these CPUs had instruction prefetch queues (68000: 4 bytes, 8088: 4 bytes, 8086/286: 6 bytes, 386+: 16 bytes) which partially hid the instruction fetch bus width penalty on original hardware by fetching ahead during execution. This means the bus-width gain on instruction fetch is slightly smaller than it would appear — the prefetch queue was already doing useful work. However prefetch queues only buffer instruction fetches, not data accesses. A MOVE.L reading or writing a data address still required multiple bus cycles on original hardware regardless of the prefetch queue. The multiplier adjustments for bus width are therefore most accurate for data-heavy code and modestly conservative for pure instruction throughput.
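The bus-width effect on data accesses reduces to a ceiling division. The sketch below ignores alignment, wait states, and prefetch, so it only captures the cycle-count ratio described above.

```python
def bus_cycles(operand_bytes, bus_bytes):
    """Bus cycles needed to move one aligned operand across a bus of given width."""
    return -(-operand_bytes // bus_bytes)    # ceiling division


move_l_68008 = bus_cycles(4, 1)    # 32-bit operand, 8-bit bus
move_l_68000 = bus_cycles(4, 2)    # 32-bit operand, 16-bit bus
word_8088 = bus_cycles(2, 1)       # 16-bit operand, 8-bit bus
move_l_ant64 = bus_cycles(4, 4)    # 32 data bits per SRAM word
```

This gives 4, 2, 2 and 1 cycles respectively, which is where the "collapses to 1" notes in the tables come from.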

1970s

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
Altair 8800	Intel 8080	2 MHz	DRAM wait states, no cache	~280×	~560 MHz
Apple I	MOS 6502	1 MHz	DRAM, 2-cycle minimum enforced by bus	~480×	~480 MHz
Commodore PET 2001	MOS 6502	1 MHz	DRAM wait states	~480×	~480 MHz
TRS-80 Model I	Zilog Z80	1.77 MHz	DRAM wait states, refresh cycles	~330×	~584 MHz
Apple II	MOS 6502	1.023 MHz	DRAM, soft switch contention	~460×	~471 MHz
Atari 2600	MOS 6507	1.19 MHz	TIA halts CPU every scanline — severe	~570×	~678 MHz

1980s — 8-bit home computers

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
ZX80	Zilog Z80	3.25 MHz	CPU HALTed during entire display generation	~500×	~1.6 GHz
ZX81	Zilog Z80	3.25 MHz	CPU HALTed during display; ~75% of cycles lost	~540×	~1.8 GHz
Sinclair ZX Spectrum 48K	Zilog Z80	3.5 MHz	ULA steals ~29% of cycles during display	~310×	~1.1 GHz
Sinclair ZX Spectrum 128K	Zilog Z80	3.5469 MHz	ULA contention; slightly less severe than 48K	~300×	~1.1 GHz
BBC Micro Model B	MOS 6502	2 MHz	CRTC bus steal; less severe than Spectrum	~410×	~820 MHz
Acorn Electron	MOS 6502	2 MHz	ULA contention — worse than BBC Micro	~440×	~880 MHz
Commodore 64	MOS 6510	0.985 MHz	VIC-II bad line steals; ~15% overhead	~660×	~650 MHz
Commodore 128	MOS 8502	2 MHz (fast mode)	Less contention than C64	~365×	~730 MHz
Atari 400 / 800	MOS 6502	1.79 MHz	ANTIC DMA steals up to 50% in graphics modes	~450×	~806 MHz
Atari XL / XE	MOS 6502	1.79 MHz	ANTIC DMA steals	~450×	~806 MHz
Dragon 32 / 64	Motorola 6809	0.89 MHz	DRAM wait states, SAM chip contention; 16-bit D and index registers on 8-bit bus — 16-bit ops took 2 cycles	~650×	~579 MHz
TRS-80 Color Computer	Motorola 6809	0.89 MHz	DRAM wait states; same 8-bit bus improvement as Dragon	~640×	~570 MHz
TRS-80 Model III / 4	Zilog Z80	2–4 MHz	DRAM wait states, refresh	~190–280×	~380–1,120 MHz
Oric-1 / Atmos	MOS 6502	1 MHz	DRAM wait states	~480×	~480 MHz
Amstrad CPC 464	Zilog Z80	4 MHz	Gate array inserts 1 wait state every cycle	~255×	~1.0 GHz
Amstrad CPC 6128	Zilog Z80	4 MHz	Same as 464	~255×	~1.0 GHz
MSX (standard)	Zilog Z80	3.58 MHz	DRAM wait states; VDP contention	~225×	~806 MHz
MSX2	Zilog Z80	3.58 MHz	Similar to MSX1	~225×	~806 MHz
Thomson MO5 / TO7	Motorola 6809	1 MHz	DRAM wait states	~540×	~540 MHz
Mattel Aquarius	Zilog Z80	3.5 MHz	DRAM wait states	~230×	~805 MHz
Sinclair QL	Motorola 68008	7.5 MHz	8-bit external bus on 32-bit CPU — every MOVE.L required 4 bus cycles; FPGA collapses this to 1	~250×	~1.9 GHz
Sam Coupé	Zilog Z80	6 MHz	ASIC contention during display	~155×	~930 MHz
ZX Spectrum Next (KS2)	Z80 softcore @ 28 MHz	3.5–28 MHz	SRAM (10 ns) but wait states remain at 28 MHz turbo	~35× at 28 MHz turbo	~980 MHz

1980s — 16-bit home computers and workstations

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
Atari ST	Motorola 68000	8 MHz	DRAM wait states; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1	~200×	~1.6 GHz
Atari STE	Motorola 68000	8 MHz	Same as ST; slightly improved DMA	~200×	~1.6 GHz
Amiga 500	Motorola 68000	7.16 MHz	Chip RAM shared with Agnus — 50% max bandwidth; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1	~230×	~1.6 GHz
Amiga 1000	Motorola 68000	7.16 MHz	Same as A500	~230×	~1.6 GHz
Amiga 2000	Motorola 68000	7.16 MHz	Chip RAM contention; fast RAM optional; 16-bit bus — same improvement as A500	~230×	~1.6 GHz
IBM PC XT	Intel 8088	4.77 MHz	8-bit external bus on 16-bit CPU — word ops took 2 bus cycles; FPGA collapses to 1; 3–4 wait states	~310×	~1.5 GHz
IBM PC AT	Intel 80286	6–8 MHz	16-bit registers and 16-bit bus — no bus-width gain; 1–2 wait states	~185×	~1.1–1.5 GHz
IBM PC AT 286	Intel 80286	10–12 MHz	Same as above; no cache	~130×	~1.3–1.6 GHz
Acorn Archimedes A305/A310	ARM2	8 MHz	DRAM wait states; relatively clean bus	~155×	~1.2 GHz

1980s — Consoles and handhelds

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
Atari 5200	MOS 6502C	1.79 MHz	DRAM; ANTIC DMA steals	~440×	~788 MHz
ColecoVision	Zilog Z80	3.58 MHz	DRAM wait states	~225×	~806 MHz
Intellivision	GI CP1610	0.894 MHz	16-bit bus but very slow DRAM	~500×	~447 MHz
Vectrex	Motorola 6809	1.5 MHz	DRAM wait states; 16-bit D/index registers on 8-bit bus — modest additional gain	~375×	~563 MHz
NES / Famicom	MOS 6502 (RP2A03)	1.79 MHz	PPU shares bus; DRAM	~415×	~743 MHz
Master System	Zilog Z80	3.58 MHz	DRAM wait states	~225×	~806 MHz
Game Boy (DMG)	Sharp SM83	4.19 MHz	DRAM wait states; PPU steals during fetch	~235×	~985 MHz
Atari 7800	MOS 6502C	1.79 MHz	MARIA DMA steals — severe in graphics-heavy games	~430×	~769 MHz
TurboGrafx-16	Hudson HuC6280	7.16 MHz	Fast for era; dedicated video bus helps	~135×	~967 MHz
Mega Drive / Genesis	Motorola 68000 + Z80	7.67 MHz + 3.58 MHz	DRAM wait states; VDP contention; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1	~195×	~1.5 GHz + ~698 MHz
SNES / Super Famicom	WDC 65816	3.58 MHz (2.68 slow)	8-bit bus on 16-bit CPU — 16-bit ops took 2 bus cycles; PPU DMA; slow ROM	~310×	~1.1 GHz
Neo Geo AES	Motorola 68000	12 MHz	Fast SRAM on cartridge; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1	~120×	~1.4 GHz

1990s — Home computers and workstations

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
Amiga 1200	Motorola 68020	14 MHz	AGA chip RAM is 32-bit and 68020 bus is 32-bit — no bus-width gain; Agnus/Alice bus contention remains	~105×	~1.5 GHz
Amiga 4000	Motorola 68030	25 MHz	AGA chip RAM is 32-bit; fast RAM 32-bit — no bus-width gain anywhere; contention remains on chip RAM	~65×	~1.6 GHz
Atari Falcon030	Motorola 68030	16 MHz	32-bit CPU on 16-bit external bus — MOVE.L took 2 bus cycles; DRAM wait states; DSP on same bus	~110×	~1.8 GHz
Acorn Archimedes A3000	ARM2	8 MHz	DRAM wait states	~155×	~1.2 GHz
Acorn RiscPC 600	ARM610	30 MHz	DRAM; reasonable for era	~53×	~1.6 GHz
IBM PC 386DX	Intel 386	16–40 MHz	DRAM 2–4 wait states	~45–80×	~720 MHz–3.2 GHz
IBM PC 486DX	Intel 486	25–66 MHz	On-chip cache helps; misses expensive	~20–40×	~500 MHz–2.6 GHz
IBM PC 486DX2	Intel 486	66 MHz	Cache-hot: good; cold: expensive	~20×	~1.3 GHz

1990s — Consoles and handhelds

Machine	CPU	Original clock	Key bottlenecks	Est. speed multiplier	Est. effective speed
Game Boy Color	Sharp SM83	8 MHz (double speed)	DRAM; PPU contention	~120×	~960 MHz
Game Gear	Zilog Z80	3.58 MHz	DRAM wait states	~225×	~806 MHz
Lynx	MOS 65C02	4 MHz	DRAM; Mikey DMA steals	~248×	~992 MHz
Neo Geo Pocket	Toshiba TLCS-900H	6.144 MHz	SRAM-based; 16/32-bit registers on 8-bit external bus — 16-bit ops 2 cycles, 32-bit ops 4 cycles	~110×	~676 MHz
PlayStation	MIPS R3000A	33.8 MHz	DRAM; scratchpad SRAM limited	~55×	~1.9 GHz
Saturn	Hitachi SH-2 × 2	28.6 MHz each	DRAM; complex bus arbitration between CPUs	~53× per CPU	~1.5 GHz per CPU
Nintendo 64	MIPS R4300i	93.75 MHz	64-bit registers on 32-bit bus — 64-bit ops took 2 bus cycles (though most N64 code used 32-bit ops); RDRAM high latency	~27×	~2.5 GHz
Virtual Boy	NEC V810	20 MHz	32-bit registers on 16-bit external bus — 32-bit ops took 2 bus cycles; DRAM wait states	~95×	~1.9 GHz
Game Boy Advance	ARM7TDMI	16.78 MHz	32-bit registers on 16-bit bus to ROM and EWRAM — 32-bit ops took 2 cycles on those; IWRAM was 32-bit	~90×	~1.5 GHz
WonderSwan	NEC V30MZ	3.072 MHz	16-bit registers on 8-bit external bus — word ops took 2 bus cycles; DRAM	~280×	~860 MHz

Multi-System Bitstreams

The GoWin GW5AT-138K's 138K LUT budget is large enough to hold multiple complete chipsets simultaneously in a single bitstream, switchable instantly via AntOS or the boot menu with no bitstream reload. The full LUT budget table, case studies for the Amiga, Atari 16/32-bit, and Sinclair/ZX lineage, and the manifest.json component switching examples are documented in personality.

As a reference: a complete late-era 16-bit chipset (full Amiga 500 ECS, full SNES, full Mega Drive) costs approximately 18–22K LUTs chipset-only. The combined Nintendo NES+SNES bitstream fits in ~28–30K LUTs and runs on both Ant64 and Ant64S. The combined Atari ST + STE + TT + Falcon030 bitstream runs to ~65–80K LUTs; the ZX lineage (ZX80 through Spectrum Next including QL and Jupiter Ace) fits in ~35–45K LUTs — comfortably within the 138K budget alongside the interface block.

Ant64S Compatibility

The Ant64S uses a GoWin 60K FPGA and a fundamentally different memory architecture from the Ant64 and Ant64C. Where the Ant64 uses twin pairs of 36-bit IS61LPS51236B SRAM giving two independent 200 MHz buses, the Ant64S uses 32-bit DDR3 and PSRAM. The difference runs deeper than fabric size: the memory controller HDL, the bus interface, and the memory-mapped layout are all different. Every personality therefore requires a dedicated Ant64S bitstream written against the Ant64S memory interface, not a reduced version of the Ant64 bitstream. Initial Ant64 prototypes also use 32-bit memory with the 138K fabric, so prototype personality bitstreams share this memory architecture with the Ant64S, which makes the Ant64S bitstream a useful starting point for prototype development.

The personality cartridge ships both bitstreams. DeMon detects which model is present at boot and loads the appropriate one automatically.
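
As a rough illustration of the selection step, the sketch below maps a detected model to one of the two shipped bitstreams by memory architecture. Everything named here is hypothetical: the file names, the model strings, and the `select_bitstream` helper are assumptions for illustration and do not describe DeMon's actual interface.

```python
# Hypothetical sketch: file names and model strings are illustrative, not DeMon's.
BITSTREAMS = {
    "sram36": "personality_ant64.fs",     # twin 36-bit SRAM buses (Ant64, Ant64C)
    "ddr3_32": "personality_ant64s.fs",   # 32-bit DDR3 + PSRAM (Ant64S)
}

MEMORY_ARCH = {"Ant64": "sram36", "Ant64C": "sram36", "Ant64S": "ddr3_32"}

def select_bitstream(model: str) -> str:
    """Pick the bitstream that matches the detected model's memory architecture."""
    try:
        return BITSTREAMS[MEMORY_ARCH[model]]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None

print(select_bitstream("Ant64S"))
```

Keying the choice on memory architecture rather than on model name reflects the point made above: the two bitstreams differ because of the memory interface, not just the fabric size.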

The Ant64S memory architecture has different characteristics from the SRAM-based Ant64:

No 36-bit bus width — DDR3 is 32-bit, so the extra 4 bits used for SRAM hardware debug tags are not available. Execute breakpoints and read/write watchpoints on the Ant64S use conventional comparator logic in fabric rather than the zero-cost SRAM tag approach
No dual independent buses — DDR3 is a single shared bus, so the Harvard-style simultaneous instruction fetch and data access that eliminates contention on the Ant64 is not available in the same form. The Ant64S bitstream handles this through burst prefetch buffering into BSRAM
Higher initial latency — DDR3 has longer initial access latency than SRAM, partially offset by burst mode and BSRAM prefetch buffering
Rewind ring buffer from PSRAM — 8 MB of PSRAM gives a shallow rewind buffer compared to the Ant64's 1 GB DDR3 or the Ant64C's ~2 GB
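
The first point in the list, tag bits versus comparators, can be modelled in a few lines. This is a toy model under stated assumptions: the tag bit assignments, the packing of tags into the top four bits of a 36-bit word, and the fixed comparator slot count are all hypothetical and are not taken from the Ant64 design.

```python
# Toy model only: the tag bit layout below is an assumption, not the Ant64 format.
TAG_EXEC, TAG_READ, TAG_WRITE = 0x1, 0x2, 0x4   # hypothetical tag assignments
DATA_MASK = 0xFFFF_FFFF

def tagged_word(data: int, tags: int) -> int:
    """Pack 32 data bits plus 4 tag bits into one 36-bit SRAM word."""
    return ((tags & 0xF) << 32) | (data & DATA_MASK)

def sram_hit(word: int, access: int) -> bool:
    """Tag approach: the tag arrives with the fetched word, so no lookup is needed."""
    return bool((word >> 32) & access)

class ComparatorBank:
    """Comparator approach: a fixed number of address comparators held in fabric."""

    def __init__(self, slots: int):
        self.slots = slots
        self.entries = []  # list of (address, access mask) pairs

    def add(self, address: int, access: int) -> bool:
        if len(self.entries) >= self.slots:
            return False  # every comparator is already in use
        self.entries.append((address, access))
        return True

    def hit(self, address: int, access: int) -> bool:
        return any(a == address and (m & access) for a, m in self.entries)

word = tagged_word(0xDEADBEEF, TAG_EXEC)
print(sram_hit(word, TAG_EXEC), sram_hit(word, TAG_WRITE))

bank = ComparatorBank(slots=2)
print(bank.add(0x1000, TAG_EXEC), bank.hit(0x1000, TAG_EXEC))
```

In the tag model every word can carry its own breakpoint marker at no extra cost, whereas the comparator model spends fabric on each watched address, which is why the sketch gives it a finite slot count.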

Most 8-bit and simple 16-bit chipsets fit within the 60K fabric alongside the Ant64S interface block. There is no software fallback — if a chipset cannot be made to fit, that personality is not available on the Ant64S.

What may differ between Ant64 and Ant64S bitstreams for the same personality:

Feature	Ant64 / Ant64C	Ant64S
Memory architecture	Twin 36-bit SRAM buses	32-bit DDR3 + PSRAM
Instruction / data bus	Independent — zero contention	Shared DDR3 — managed via burst prefetch
Hardware debug tags	Zero-cost SRAM bit tags	Fabric comparator logic
Rewind depth	Up to ~29 min (Ant64C) / ~15 min (Ant64)	~7 seconds (8 MB PSRAM)
Multi-system simultaneous	Up to ~6 systems	1–2 systems
Enhanced display features	Full FireStorm integration	May be reduced
Audio post-processing	Full DSP chain	May be reduced
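
The rewind depths in the table are consistent with a simple linear scaling from buffer size. A minimal sketch, assuming the depth scales linearly with ring buffer size at a fixed delta rate and taking the quoted Ant64C figure (about 2 GB giving about 29 minutes) as the reference point; the helper name is illustrative.

```python
# Assumes rewind depth scales linearly with ring buffer size at a fixed delta rate.
GIB = 1024 ** 3
MIB = 1024 ** 2

REFERENCE_BYTES = 2 * GIB      # Ant64C ring buffer ("~2 GB" in the text)
REFERENCE_SECONDS = 29 * 60    # quoted rewind depth for that buffer

def rewind_seconds(buffer_bytes: int) -> float:
    """Scale the reference depth by the ratio of buffer sizes."""
    return REFERENCE_SECONDS * buffer_bytes / REFERENCE_BYTES

print(f"Ant64  (1 GB): {rewind_seconds(1 * GIB) / 60:.1f} min")   # 14.5 min
print(f"Ant64S (8 MB): {rewind_seconds(8 * MIB):.1f} s")          # 6.8 s
```

The reference point implies a delta stream of roughly 1.2 MB per second, and the scaled results line up with the ~15 minute and ~7 second entries in the table.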

Summary

A personality cartridge for the Ant64 recreates a historical computer or console as an optimised FPGA softcore: the original instruction set implemented with modern microarchitecture, zero-wait-state SRAM replacing DRAM that was 40–60× slower, all custom chips running in parallel with no shared bus contention, and optional cycle-accurate throttling for software that depends on exact original timing.

The rewind capture block records XOR deltas of CPU register changes, RAM writes, and chipset register writes into a bitmask-compressed ring buffer in FireStorm DDR3 — up to ~2 GB on the Ant64C, giving up to 29 minutes of full rewind depth for an 8-bit system and over a minute for a 68000. Since the entire chipset is synthesised logic, every register is readable and reversible regardless of whether the original hardware exposed it — display state, audio state, DMA pointers, sprite tables, all of it. Because XOR is self-inverse, the same stored entry applies equally for undo and redo. Combined with the programmable throttle and DeMon's dedicated jog dial, the complete internal state of the machine is navigable physically with one hand on the dial.
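
The self-inverse property can be shown with a toy model. The sketch below logs the XOR of old and new values for each write and then uses a single routine for both undo and redo. The state layout, the entry format, and the function names are assumptions for illustration; the bitmask compression and ring buffer of the real capture block are not modelled.

```python
# Toy model of XOR-delta rewind; state layout and entry format are assumptions.

def record_write(state: bytearray, log: list, addr: int, new_value: int) -> None:
    """Apply a write and log (old XOR new) rather than either value."""
    delta = state[addr] ^ new_value
    state[addr] = new_value
    log.append((addr, delta))

def apply_entry(state: bytearray, entry: tuple) -> None:
    """XOR is self-inverse, so this one routine serves both undo and redo."""
    addr, delta = entry
    state[addr] ^= delta

state = bytearray(4)
log = []
record_write(state, log, 1, 0x3C)
record_write(state, log, 1, 0xA5)

for entry in reversed(log):    # rewind: newest entry first
    apply_entry(state, entry)
rewound = bytes(state)         # back to the initial all-zero state

for entry in log:              # replay: oldest entry first
    apply_entry(state, entry)
replayed = bytes(state)        # byte 1 holds 0xA5 again
```

Only the direction of traversal differs between rewind and replay; the stored entries and the code that applies them are identical.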

The SG2000 small core runs AntOS alongside the personality at all times, providing networking, storage, and system services. The SG2000 big core runs an optional debug and development application, with the full 1 GHz C906 dedicated to it, providing full read/write access to the softcore's registers, system memory, and custom chipset registers via the QSPI FRAM interface. The result is hardware debugging at full hardware speed, through the same memory-mapped mechanism used for every other Ant64 subsystem.

The Ant64 Personality Interface Block handles all Ant64-specific interfacing. The personality developer implements the chipset and connects it to known ports. Everything else — HDMI output, audio output, FRAM windowing, debug register bank, SRAM debug tags, rewind capture — is provided.

For the Ant64S, personalities ship a dedicated bitstream written against the Ant64S's 32-bit DDR3 + PSRAM memory architecture — a different memory interface from the Ant64's twin 36-bit SRAM buses, requiring its own HDL rather than a reduced version of the Ant64 bitstream. If the chipset fits in the 60K fabric, it runs in hardware. If it doesn't fit, that personality is not available on the Ant64S — there is no software fallback.