Personality Cartridges — CPU recreation
Overview
An Ant64 personality cartridge recreates a historical computer or console as synthesised logic inside FireStorm's GoWin GW5AT-138K FPGA fabric. The target CPU and all companion chips are implemented as a modern, optimised microarchitecture that runs the original instruction set faithfully — same opcodes, same registers, same behaviour — but unconstrained by the transistor budgets, memory speeds, and bus limitations of the original era.
The result is not emulation. It is the original architecture rebuilt with the advantages of modern silicon:
- Zero-wait-state SRAM with burst fetch — replacing DRAM that was 40–60× slower; burst mode delivers 32 bytes of sequential instructions in a single operation into a BSRAM prefetch buffer, keeping the decode stage fed at 380 MHz
- Pipelined execution collapsing multi-cycle instructions to 1–2 stages
- Barrel shifters making variable-time shifts unconditionally one cycle
- No shared bus contention — CPU and display hardware access separate memory simultaneously
- All custom chips running in parallel with no software budget spent on any of them
- Optional cycle-accurate throttling for software that depends on exact original timing
- Rewind — the rewind capture block makes it possible to run the machine backwards in time as well as forwards
When throttling is off, the system runs as fast as the fabric allows. When throttling is on, the cycle counter gate brings it back to exact original timing. Both modes use identical hardware; throttling is a pipeline stall signal, not a different implementation.
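The stall-signal idea can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `throttle_enables` and the accumulator scheme (a fractional clock-enable divider) are illustrative and are not taken from the Ant64 HDL. It shows how a single gate signal can hold an unchanged pipeline to a target rate.

```python
def throttle_enables(fabric_hz: int, target_hz: int, cycles: int) -> list[bool]:
    """Model a cycle-counter gate: True = pipeline advances, False = stall.

    Hypothetical scheme: a phase accumulator adds target_hz every fabric
    cycle and releases one guest cycle each time it crosses fabric_hz.
    """
    if not 0 < target_hz <= fabric_hz:
        raise ValueError("target_hz must be in the range 1..fabric_hz")
    acc = 0
    enables = []
    for _ in range(cycles):
        acc += target_hz
        if acc >= fabric_hz:
            acc -= fabric_hz
            enables.append(True)   # advance the pipeline this cycle
        else:
            enables.append(False)  # assert the stall signal
    return enables


# Scaled-down example: 3 guest cycles released in every 8 fabric cycles.
pattern = throttle_enables(fabric_hz=8, target_hz=3, cycles=8)
print(pattern)
```

In this model, turbo mode corresponds to leaving the enable permanently true; the pipeline itself is the same in both cases.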
The SG2000 small core runs AntOS alongside the personality at all times, providing networking, storage, scripting, and general system services. The SG2000 big core is available to run a dedicated debug and development application — with the full 1 GHz C906 and no OS overhead — giving it read/write access to the softcore's registers, system RAM, and custom chipset registers via the QSPI FRAM interface.
The Ant64 Personality Interface Block
Every personality bitstream includes the Ant64 Personality Interface Block — pre-written, verified HDL supplied to personality developers that synthesises into the bitstream and uses LUTs from the 138K budget. It provides everything Ant64-specific so the personality developer's only job is implementing the chipset:
- QSPI FRAM slave — the memory-mapped window through which the SG2000, DeMon, and Pulse communicate with the personality. Register reads and writes, memory access, chipset register access, debug control — all via ordinary memory-mapped reads and writes over QSPI.
- Display output path — from the chipset's pixel and sync signals to FireStorm's display pipeline and on to HDMI / VGA / DisplayPort.
- Audio output path — from the chipset's audio output to the FireStorm audio DSP chain and codec.
- Clock domain management — PLLs and clock crossing logic for CPU, pixel, audio, and SRAM clock domains.
- Debug register bank — halt, single-step, breakpoints, watchpoints, and full register file access, all mapped into the FRAM address space automatically. Every personality gets hardware debugging at no extra implementation cost.
- Rewind capture block — monitors the CPU data bus, RAM writes, and chipset register writes, capturing XOR deltas of all three into a ring buffer in FireStorm DDR3, enabling full-system forward and backward time scrubbing with no CPU overhead.
The interface block is instantiated once per bitstream. In a multi-system bitstream a small mux sits between the chipsets and the single interface block — its LUT cost is constant regardless of how many personalities are present.
The exact LUT cost of the interface block is documented separately. All chipset LUT figures in this document are chipset only — add the interface block for total bitstream size.
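The XOR-delta approach used by the rewind capture block can be sketched in software. This is a minimal model, assuming a byte-wide memory and a fixed-capacity ring; the class and method names are hypothetical and the real block works on bus traffic in hardware, not on Python objects.

```python
from collections import deque


class RewindBuffer:
    """Toy model of XOR-delta capture into a ring buffer (names are illustrative)."""

    def __init__(self, capacity: int):
        # A bounded deque drops the oldest delta when full, like a ring buffer.
        self.ring = deque(maxlen=capacity)

    def record_write(self, memory: bytearray, addr: int, new_value: int) -> None:
        # Store old XOR new; the same delta undoes or redoes the write.
        delta = memory[addr] ^ new_value
        self.ring.append((addr, delta))
        memory[addr] = new_value

    def step_back(self, memory: bytearray) -> bool:
        # Undo the most recent captured write, if any remain.
        if not self.ring:
            return False
        addr, delta = self.ring.pop()
        memory[addr] ^= delta
        return True


mem = bytearray(4)
rewind = RewindBuffer(capacity=1024)
rewind.record_write(mem, 1, 0x12)
rewind.record_write(mem, 1, 0x34)
rewind.step_back(mem)  # mem[1] returns to 0x12
```

Because XOR is its own inverse, one stored delta serves both directions of scrubbing, which is why a delta is enough and no full snapshot is needed per write.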
Memory Architecture in Personality Mode
In personality mode, FireStorm's memory resources are assigned around the softcore chipset. See memory architecture for full specifications.
The FireStorm Execution Engine is not present in personality mode — the fabric is given over to the chipset. This frees the EE Code SRAM bus, making both SRAM buses available to the personality.
BSRAM — Video Pipeline
The 340 on-chip BSRAM blocks (~765 KB, 380 MHz, dual-port) serve the emulated video pipeline — sprite line buffers, palette RAM, tilemap data, copper lists, clip tables, CRT simulation LUTs. The video chips access BSRAM at full speed with zero contention.
Graphics SRAM — Guest System RAM
The Graphics SRAM (2× IS61LPS51236B, 4.5 MB combined, 200 MHz, 36-bit bus, 1-cycle pipeline latency, burst mode, no wait states, no refresh) becomes the emulated machine's system RAM. The 36-bit bus delivers 4 guest bytes per read cycle: for a 6502 that is up to four single-byte opcodes at once; for a 68000 it is two 16-bit words, typically an opcode plus one extension word, in a single fetch.
EE Code SRAM — ROM Shadow and Instruction Fetch
The EE Code SRAM (identical second pair of IS61LPS51236B chips, 4.5 MB, 200 MHz, 36-bit, on its own fully dedicated and independent bus) is freed from EE instruction fetch duty in personality mode and serves as the ROM shadow and instruction fetch bus. ROM is streamed from QSPI NOR flash into EE Code SRAM at boot and cached there permanently — ROM cannot be written, so the cache never needs invalidation. After the initial load the softcore fetches instructions entirely from this bus at 200 MHz with 1-cycle pipeline latency.
The two SRAM buses are completely independent — instruction fetch and data access never contend. The softcore's read of the next instruction and its read or write of a data address happen simultaneously on separate physical buses. This is a Harvard-style memory architecture that even the original machines never had — they shared the address and data bus between ROM and RAM regardless of whether the chips were physically separate.
Combined Fast SRAM
Note for early adopters: Initial Ant64 prototypes use the 138K FPGA with 32-bit memory rather than the 36-bit SRAM pairs described below. Transitioning to 36-bit SRAM is a priority immediately following the initial prototype phase — the dual independent 36-bit buses, hardware debug tags, and associated features are core to the Ant64 design and will arrive as early as possible in the hardware revision cycle. Prototype memory behaves similarly to the Ant64S architecture in the meantime.
| Bus | Capacity | Clock | Width | Use in personality mode |
|---|---|---|---|---|
| Graphics SRAM | 4.5 MB | 200 MHz | 36-bit | Guest system RAM |
| EE Code SRAM | 4.5 MB | 200 MHz | 36-bit | ROM shadow + instruction fetch |
| Total | 9 MB | | | Two independent buses, zero contention |
The combined ROM and RAM of most retro systems fits comfortably within this budget; the largest configurations overflow into DDR3:
| System | ROM | RAM | Total | % of 9 MB |
|---|---|---|---|---|
| NES | 512 KB typical | 2 KB | ~514 KB | 5.6% |
| ZX Spectrum 128 | 32 KB | 128 KB | 160 KB | 1.7% |
| Commodore 64 | 64 KB | 64 KB | 128 KB | 1.4% |
| BBC Micro | 48 KB | 32 KB | 80 KB | 0.9% |
| SNES | up to 6 MB | 128 KB | ~6.1 MB | 68% — ROM on EE bus, RAM on Graphics bus |
| Amiga 500 | 512 KB Kickstart | 512 KB chip + 512 KB slow | ~1.5 MB | 17% |
| Amiga 1200 | 512 KB Kickstart | 2 MB chip + 8 MB fast | ~10.5 MB | Exceeds 9 MB: Kickstart ROM on EE bus; chip RAM in Graphics SRAM; fast RAM overflows to DDR3 |
| 386 PC | varies | up to 4 MB | fits with room | ~50% |
The Extra 4 Bits — Hardware Debug Tags
Production hardware only. Initial prototypes use 32-bit memory — the extra 4 bits and the zero-cost SRAM tag mechanism described here are not available on prototype units. Prototypes use fabric comparator logic for breakpoints and watchpoints instead, as described in the Ant64S Compatibility section.
The IS61LPS51236B is 36-bit wide: 32 data bits plus 4 bits. In native mode these 4 bits are fully utilised on both buses — the EE Code SRAM uses all 36 bits to carry the FireStorm Execution Engine's 36-bit instruction words exactly, and the Graphics SRAM uses all 36 bits because 36 is divisible by 3, enabling clean RGB pixel packing with no wasted bits or alignment overhead (36-bit words hold 3 pixels at 12bpp, 2 at 18bpp, or 1 at R12G12B12).
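The pixel-packing arithmetic can be shown with a short sketch. The bit ordering (first pixel in the least significant bits) and the function names are assumptions made for illustration; the source only states how many pixels fit per word.

```python
WORD_BITS = 36


def pack_pixels(pixels: list[int], bits_per_pixel: int) -> int:
    """Pack pixels into one 36-bit word, first pixel in the low bits (assumed order)."""
    if WORD_BITS % bits_per_pixel != 0:
        raise ValueError("pixel width must divide 36 evenly")
    if len(pixels) != WORD_BITS // bits_per_pixel:
        raise ValueError("wrong number of pixels for this depth")
    word = 0
    for i, pixel in enumerate(pixels):
        if not 0 <= pixel < (1 << bits_per_pixel):
            raise ValueError("pixel value out of range")
        word |= pixel << (i * bits_per_pixel)
    return word


def unpack_pixels(word: int, bits_per_pixel: int) -> list[int]:
    """Inverse of pack_pixels for the same assumed bit order."""
    mask = (1 << bits_per_pixel) - 1
    count = WORD_BITS // bits_per_pixel
    return [(word >> (i * bits_per_pixel)) & mask for i in range(count)]


word = pack_pixels([0xABC, 0x123, 0xFFF], 12)  # three 12bpp pixels, no spare bits
print(hex(word))
```

A 32-bit word would leave 8 bits unused with three 12bpp pixels; with 36 bits the three supported depths (12, 18, 36) all divide the word exactly.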
In personality mode the FireStorm EE is not instantiated — the fabric is given over to the chipset. The EE instruction format that used those 4 bits on the EE Code SRAM bus is absent. Those 4 bits are genuinely free on that bus and can be repurposed as per-word hardware debug flags. The Graphics SRAM 36-bit width remains useful for delivering 4 guest bytes per read cycle; its 4 bits can also carry debug tags on RAM addresses.
Each 36-bit word covers 4 consecutive guest bytes. The 4 extra bits are assigned as per-word hardware debug flags. The two buses have naturally different primary roles for their tags:
EE Code SRAM — ROM / instruction fetch bus:
| Bit | Tag | Fires when |
|---|---|---|
| 0 | Execute breakpoint | This ROM address is fetched as an instruction |
| 1 | (reserved) | Data read watchpoints are not applicable on the instruction fetch bus |
| 2 | (reserved) | Write watchpoints are not applicable: ROM cannot be written |
| 3 | Trace flag | This ROM address is fetched — log to trace buffer without halting |
Graphics SRAM — RAM / data bus:
| Bit | Tag | Fires when |
|---|---|---|
| 0 | (reserved) | RAM is not used for instruction fetch on this bus |
| 1 | Read watchpoint | This RAM address is read as data |
| 2 | Write watchpoint | This RAM address is written |
| 3 | Trace flag | This RAM address is accessed — log to trace buffer without halting |
The check is performed in the same cycle as the data read — the 4 extra bits arrive alongside the 32 data bits and feed directly into the debug logic. There is no comparison circuit, no cycle penalty, no LUT cost beyond a 4-input OR gate. A tag costs nothing to execution speed until it fires.
Setting a tag is a read-modify-write over QSPI from the debug application: read the current 36-bit word, OR in the tag bit, and write it back. The tag takes effect on the very next access to that address. Clearing works the same way with AND NOT. The softcore does not need to halt for tag setup or removal; tags can be set and cleared while the machine is running at full speed.
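The tag update can be sketched as follows. Everything named here is hypothetical: `TaggedSram` stands in for the QSPI memory window, the tag bits are assumed to sit in bits 32 to 35 above the data, and the guest address is assumed to map to a word by dropping its low two bits. Only the tag bit numbers come from the tables above.

```python
DATA_BITS = 32
WORD_MASK = (1 << 36) - 1

# Tag bit numbers as listed in the tag tables.
TAG_EXEC_BREAK = 1 << 0
TAG_READ_WATCH = 1 << 1
TAG_WRITE_WATCH = 1 << 2
TAG_TRACE = 1 << 3


class TaggedSram:
    """Stand-in for a 36-bit SRAM reached through the QSPI window (illustrative)."""

    def __init__(self, words: int):
        self.words = [0] * words

    def read_word(self, index: int) -> int:
        return self.words[index]

    def write_word(self, index: int, value: int) -> None:
        self.words[index] = value & WORD_MASK


def word_index(guest_addr: int) -> int:
    # Each 36-bit word covers 4 consecutive guest bytes.
    return guest_addr >> 2


def set_tag(sram: TaggedSram, guest_addr: int, tag: int) -> None:
    idx = word_index(guest_addr)
    sram.write_word(idx, sram.read_word(idx) | (tag << DATA_BITS))


def clear_tag(sram: TaggedSram, guest_addr: int, tag: int) -> None:
    idx = word_index(guest_addr)
    sram.write_word(idx, sram.read_word(idx) & ~(tag << DATA_BITS))


def tags_at(sram: TaggedSram, guest_addr: int) -> int:
    return sram.read_word(word_index(guest_addr)) >> DATA_BITS


ram = TaggedSram(words=16)
set_tag(ram, 0x0006, TAG_WRITE_WATCH)  # also covers guest bytes 0x0004..0x0007
```

Note the granularity consequence: a tag set for address 0x0006 is visible from any of the four bytes in the same word, which matches the per-word coverage described in the granularity table.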
Granularity by architecture:
| Architecture | Word size | Tag covers |
|---|---|---|
| 8-bit (6502, Z80) | 1 byte | 4 consecutive bytes — most instructions are 1–3 bytes |
| 16-bit (68000) | 2 bytes | 2 instruction words — one 68000 opcode + extension word |
| 32-bit RISC (MIPS, ARM) | 4 bytes | Exactly one instruction |
The trace flag on the ROM bus is particularly useful for understanding which code paths a program takes — mark a subroutine entry point and every call to it is logged. On the RAM bus it captures every read or write to a variable, building a complete history of how a memory location changes over time, stored in the FireStorm DDR3 trace buffer without halting execution.
FireStorm DDR3 — Rewind, Recording, Expanded RAM, and Trace Buffer
FireStorm DDR3 holds the rewind ring buffer, the trace buffer for tagged memory accesses, recording data, and any overflow for systems whose fast RAM exceeds 4.5 MB. It is not in the critical path for the softcore CPU. On the Ant64C, FireStorm has access to ~2 GB DDR3; on the Ant64 it is 1 GB.
The split between rewind buffer and expanded system RAM is user-configurable. A personality exposes a slider or option in its settings: more rewind depth at one end, more RAM available to the emulated machine at the other. This means the simulated system can be given far more memory than the original hardware ever had:
| System | Original RAM | With DDR3 expansion (Ant64C, minimal rewind) |
|---|---|---|
| Commodore 64 | 64 KB | Up to ~1.9 GB — nearly 30,000× original |
| ZX Spectrum 128 | 128 KB | Up to ~1.9 GB |
| Amiga 500 | 512 KB chip + 512 KB slow | Up to ~1.9 GB fast RAM alongside chip RAM in SRAM |
| SNES | 128 KB | Up to ~1.8 GB (after ROM) |
| 386 PC | 4 MB typical | Up to ~1.9 GB — full extended memory |
Software that was written to probe the memory size at boot and use whatever it finds will simply see a much larger machine. Software that was written for a fixed memory map will need the expansion mapped appropriately for that system — this is part of the personality's memory map configuration, not something the user needs to set manually.
The rewind buffer and expanded RAM share the same DDR3 pool via the FireStorm DDR3 arbiter. A larger rewind allocation shrinks the RAM expansion ceiling; a larger RAM expansion shrinks rewind depth. The default split is set per personality — a game-focused personality might default to maximum rewind; a development personality might default to maximum expanded RAM. The user overrides this in the personality options.
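The trade-off between rewind depth and expanded RAM reduces to simple arithmetic over one pool. The sketch below is illustrative only: the function name, the `reserved_bytes` parameter (standing in for the trace buffer and recording data), and the fractional slider are assumptions, not the personality options interface.

```python
GIB = 1024 ** 3


def split_ddr3(pool_bytes: int, reserved_bytes: int, rewind_fraction: float) -> tuple[int, int]:
    """Return (rewind_bytes, expanded_ram_bytes) for a slider position in 0.0..1.0."""
    if not 0.0 <= rewind_fraction <= 1.0:
        raise ValueError("rewind_fraction must be between 0 and 1")
    if not 0 <= reserved_bytes <= pool_bytes:
        raise ValueError("reserved_bytes must fit inside the pool")
    free = pool_bytes - reserved_bytes
    rewind = int(free * rewind_fraction)
    # Whatever rewind does not take is available as expanded guest RAM.
    return rewind, free - rewind


# Slider at the midpoint of a 2 GiB pool with nothing reserved.
rewind_bytes, ram_bytes = split_ddr3(2 * GIB, 0, 0.5)
```

The two results always sum to the free pool, which is the property the slider description relies on: growing one allocation shrinks the other by the same amount.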
On the Ant64C, a personality with modest DDR3 requirements can dedicate the majority of the 2 GB to the rewind ring buffer, giving the depth figures in the Ring Buffer Depth table.
Why Original Machines Were Slow — And Why That No Longer Applies
Memory Was the Real Constraint
Original CPU designers were brilliant engineers making optimal decisions within the technology of their era. Their microarchitectures were designed around one fundamental constraint: memory was slow.
| Machine | CPU | Memory type | Access time | CPU clock period | Wait states / effect |
|---|---|---|---|---|---|
| Altair 8800 / S-100 machines | Intel 8080 | DRAM | 300–450 ns | 500 ns @ 2 MHz | 1–2 wait states per access |
| Apple I / II | MOS 6502 | DRAM | 300 ns | 1000 ns @ 1 MHz | ~2 wait states; bus cycle = 2 clocks minimum |
| Commodore PET | MOS 6502 | DRAM | 300 ns | 1000 ns @ 1 MHz | ~2 wait states |
| TRS-80 Model I / III | Zilog Z80 | DRAM | 250–300 ns | 565 ns @ 1.77 MHz | 1–2 wait states + refresh cycles |
| Atari 400 / 800 | MOS 6502 | DRAM | 250 ns | 559 ns @ 1.79 MHz | ANTIC DMA steals up to 50% of cycles in graphics modes |
| Atari XL / XE | MOS 6502 | DRAM | 250 ns | 559 ns @ 1.79 MHz | ANTIC DMA steals; same constraint as 400/800 |
| NES / Famicom | MOS RP2A03 | DRAM | 200–250 ns | 559 ns @ 1.79 MHz | PPU bus contention during sprite/tile fetch |
| Commodore 64 | MOS 6510 | DRAM | 200 ns | 1000 ns @ 0.985 MHz | VIC-II bad line steals ~40 cycles/line; ~15% overhead |
| ZX Spectrum 48K | Zilog Z80 | DRAM | 200 ns | 286 ns @ 3.5 MHz | ULA steals ~29% of cycles during display |
| ZX Spectrum 128K | Zilog Z80 | DRAM | 200 ns | 282 ns @ 3.55 MHz | ULA contention; slightly less severe than 48K |
| BBC Micro Model B | MOS 6502 | DRAM | 200 ns | 500 ns @ 2 MHz | 6845 CRTC takes bus priority during display fetch |
| Acorn Electron | MOS 6502 | DRAM | 200 ns | 500 ns @ 2 MHz | ULA contention worse than BBC; ~50% loss in high-res modes |
| Amstrad CPC | Zilog Z80 | DRAM | 150 ns | 250 ns @ 4 MHz | Gate array inserts 1 wait state every single memory cycle |
| MSX | Zilog Z80 | DRAM | 150–200 ns | 280 ns @ 3.58 MHz | DRAM wait states; VDP on separate bus but CPU still contended |
| Atari ST | Motorola 68000 | DRAM | 150 ns | 125 ns @ 8 MHz | 2 wait states per access typical |
| Amiga 500 | Motorola 68000 | DRAM (chip RAM) | 280 ns | 140 ns @ 7.16 MHz | Agnus owns bus; 68000 gets at most 50% bandwidth |
| Master System | Zilog Z80 | DRAM | 150 ns | 280 ns @ 3.58 MHz | DRAM wait states; VDP on shared bus |
| IBM PC XT | Intel 8088 | DRAM | 200 ns | 210 ns @ 4.77 MHz | 8-bit external bus; 3–4 wait states per 16-bit access |
| IBM PC AT | Intel 80286 | DRAM | 120 ns | 125–167 ns @ 6–8 MHz | 1–2 wait states; no cache |
| IBM PC 386 | Intel 386 | DRAM | 80–120 ns | 40 ns @ 25 MHz | 2–4 wait states; no on-chip cache |
| IBM PC 486 | Intel 486 | DRAM | 60–80 ns | 30 ns @ 33 MHz | On-chip cache helps for hot code; misses very expensive |
| ZX Spectrum Next KS2 | Z80 softcore (Artix-7) | SRAM IS61WV5128-10 | 10 ns | 36 ns @ 28 MHz | Already uses SRAM — ~20× faster than original Spectrum DRAM — but still has wait states at 28 MHz due to interface routing overhead |
Every timing quirk that makes these CPUs interesting to work with was a direct consequence of slow memory. The 6502's shortest instruction takes 2 cycles not because the logic needed two cycles, but because the bus needed one phase to drive the address and one phase to sample the data. The Z80 has built-in refresh cycles so that the programmer does not have to manage DRAM refresh. The 68000's effective address calculation costs 4–14 cycles because each extension word is a separate DRAM bus cycle.
With two independent 200 MHz SRAM buses at 5 ns access time, none of these constraints exist. The wait states are gone. The refresh cycles are gone. The extension words arrive in a single 36-bit fetch. Instruction fetch and data access run simultaneously on separate buses — something the original hardware could never achieve regardless of how fast its individual chips were.
Shared Bus Contention
Many popular machines shared a single memory bus between the CPU and the display hardware. The display always won — it had hard real-time deadlines to feed the video signal. The CPU was left waiting.
Commodore 64 — the VIC-II steals the 6510's bus for sprite and character data fetching. On "bad lines" (one per character row, so every eighth raster line and 25 per frame in the standard text display) the VIC-II takes the bus for 40 cycles. The 6510 runs at approximately 90% speed on most lines, dropping to around 50% on bad lines. Programmers timed their code around bad lines for raster effects.
ZX Spectrum — the ULA steals the Z80's bus during display fetch for the active display area. Approximately 29% of Z80 cycles per frame are stolen. The Z80 is held in wait states while the ULA reads video RAM. The entire Spectrum demoscene is built around working with and around this constraint.
Acorn Electron — the ULA's contention was so severe that the 6502 at 2 MHz ran slower in practice than the BBC Micro's 6502 at 2 MHz with its separate CRTC bus. Programmers moved everything performance-critical to page zero and the small regions of RAM that escaped contention.
BBC Micro — the 6845 CRTC and 6502 share video RAM. The CRTC takes priority during display fetch. Programmers used shadow RAM and MODE 7 (which used a teletext chip with its own ROM, not contended RAM) for performance-critical work.
Amiga 500 — Agnus owns the chip RAM bus and grants the 68000 access in alternating cycles. The 68000 gets at best half the bus bandwidth, and less when DMA channels are active — bitplane fetch, sprite fetch, blitter, audio, disk. In a busy display frame the 68000 could be bus-starved to 30–40% of theoretical throughput. The distinction between "chip RAM" and "fast RAM" was fundamental to Amiga programming — code running in fast RAM (on the CPU's private bus) was dramatically faster.
Atari 2600 — the TIA chip drives the CPU's RDY line directly. The CPU is halted during horizontal blank and active display to synchronise with the video beam. Writing a 2600 game required cycle-counting every instruction against the TV beam position. There was no display list, no interrupt — the programmer was the display controller.
Atari 8-bit (400/800/XL/XE) — the ANTIC chip is a DMA processor that steals cycles from the 6502 for display list fetch and character/bitmap data. The number of cycles stolen per scanline depends on the display mode. In graphics-heavy modes the 6502 can lose 50% or more of its cycles.
Atari Falcon030 — the Motorola DSP 56001 shares the same memory bus as the 68030. DSP DMA transfers compete directly with CPU memory access. The Falcon's 16 MHz 68030 also reached memory through only a 16-bit external data path, so every 32-bit operation paid a bus-width penalty on top of the contention. Audio and video work that the original Falcon could do simultaneously with CPU tasks required careful partitioning between CPU and DSP time on a shared bus. On the Ant64, the DSP runs as a parallel hardware block accessing BSRAM on its own bus, with no contention with the CPU at all.
In the Ant64 personality architecture, the CPU accesses Graphics SRAM on its own bus and the video pipeline accesses BSRAM on its own bus. There is no shared bus. There is no arbiter. There are no DMA steals. The contention that consumed 10–50% of original CPU throughput does not exist.
In cycle-accurate mode this contention can be reintroduced artificially for software that depends on it. In turbo mode it is gone entirely — and for the overwhelming majority of software that tried to work around contention rather than exploit it, this is a pure and unconditional gain.
Optimised Softcore Microarchitecture
The softcore implements the original instruction set exactly — same opcodes, same architectural registers, same observable behaviour — using a modern FPGA microarchitecture:
Barrel shifter — the FPGA's LUT-based barrel shifter shifts by any amount in one clock cycle. The 68000's LSL Dn, Dm took 6 + 2n cycles on original hardware for byte and word operands (8 + 2n for long operands), so a word shift by 8 cost 22 cycles. On the softcore it costs one pipeline stage unconditionally.
BSRAM register file — the softcore's register file lives in on-chip BSRAM at 380 MHz. Register read and write complete in one clock cycle with no external bus traffic.
Pipelined execution — fetch, decode, execute, and writeback overlap. Many instructions that took 6–12 original cycles execute in 2–4 pipeline stages. The 36-bit SRAM fetch delivers opcode and extension words together, collapsing what were multiple sequential bus cycles into a single fetch.
Fast carry chain — N-bit addition using FPGA carry chain primitives completes in one clock cycle regardless of operand width.
Parallel custom chips — all custom chips (PPU, VDP, Agnus/Denise/Paula, sound chips) are synthesised as separate hardware blocks in the same fabric, running simultaneously with the CPU. No software cycles are spent on them. No time-slicing between CPU emulation and chip emulation.
Prefetch queue and SRAM burst cache — the IS61LPS51236B supports burst mode, delivering consecutive addresses at one cycle per word after the initial access latency. A burst of 8 × 36-bit words (32 bytes) from the EE Code SRAM costs 1 + 8 = 9 cycles, rather than the 8 × 2 = 16 cycles of eight separate accesses that each pay the initial latency. The personality developer can land this burst into a dedicated BSRAM prefetch buffer, a byte FIFO between the SRAM fetch unit and the decode stage. The decoder then consumes from BSRAM at 380 MHz with zero SRAM traffic until the buffer is exhausted, at which point the next burst is already in flight. Since BSRAM is dual-port, the burst controller writes the next block into one port while the decoder reads from the other, so the refill is invisible to the pipeline.
For a 6502 tight inner loop, 32 bytes holds up to 32 one-byte instructions or a typical mix of 1–3 byte instructions. The entire loop body likely fits in one burst fetch and executes entirely from BSRAM — this is effectively a small instruction cache with fully predictable behaviour. For the 68000, 32 bytes covers several instructions including extension words, feeding the decoder without stalls even for variable-length instruction streams.
Branches and jumps invalidate the prefetch buffer, paying the initial SRAM latency plus a new burst on the taken path. For the tight inner loops that dominate retro CPU workloads, branches are infrequent and the sequential fetch win is substantial.
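The fetch-cost reasoning above can be put into a small calculator. This is a rough model under stated assumptions: each non-burst access is assumed to pay the full initial latency plus one transfer cycle, the function names are illustrative, and overlap between refill and decode is ignored.

```python
import math

WORD_BYTES = 4  # guest bytes carried per 36-bit SRAM word


def burst_cycles(words: int, initial_latency: int = 1) -> int:
    """Cycles for one burst: latency once, then one cycle per word."""
    return initial_latency + words


def separate_access_cycles(words: int, initial_latency: int = 1) -> int:
    """Cycles if every word is fetched individually and pays the latency each time."""
    return words * (initial_latency + 1)


def bursts_needed(code_bytes: int, burst_words: int = 8) -> int:
    """How many bursts a straight-line run of code occupies."""
    return math.ceil(code_bytes / (burst_words * WORD_BYTES))


# A 32-byte inner loop fits in a single 8-word burst.
print(burst_cycles(8), separate_access_cycles(8), bursts_needed(32))
```

Under these assumptions a taken branch costs one fresh `burst_cycles` call on the new path, so the model favours long sequential runs and tight loops that fit in one buffer.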
The optimal burst size and buffer depth vary by architecture: fixed-width ISAs (MIPS, ARM) benefit most predictably; variable-length ISAs (6502, 65816, Z80, 68000) still benefit significantly for sequential code. This is a recommended implementation pattern for personality developers rather than a base interface block feature. The original machines had no equivalent: any prefetch queue they had was limited to a few bytes and constrained by the same slow DRAM bus that limited everything else.
Optional CPU Enhancements and Dialects
Because the CPU is synthesised logic rather than a fixed chip, its instruction set and behaviour can be extended or upgraded at the HDL level. The spare LUT budget — substantial for all but the most complex personalities — accommodates these enhancements alongside the faithful core. Standard software always runs in the faithful mode; enhanced software uses extended registers or instructions that the original hardware never had.
Z80 — Three Mutually Exclusive Dialects
The Z80 softcore supports three CPU modes, selectable via a FRAM register and taking effect on the next CPU reset. All three share the same base Z80 HDL module; only the decode path changes:
- Mode 0: Standard Z80 — authentic instruction set and timing
- Mode 1: Z80N — Spectrum Next extended opcodes ($ED prefix additions)
- Mode 2: eZ80 — Zilog eZ80 extended opcodes ($ED prefix, 24-bit ADL mode)
Modes 1 and 2 are not supersets of each other — they are alternative dialects using the same encoding space differently. The LUT overhead of carrying both decode paths in fabric and switching between them is small. The bitstream does not need to be rebuilt to change CPU dialect.
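The "select now, apply at next reset" behaviour can be modelled in a few lines. The class and method names below are hypothetical and no real register address is implied; only the three mode numbers come from the list above.

```python
from enum import IntEnum


class Z80Dialect(IntEnum):
    STANDARD = 0
    Z80N = 1
    EZ80 = 2


class Z80ModeControl:
    """Toy model: a mode register whose value is latched on CPU reset (illustrative)."""

    def __init__(self) -> None:
        self.pending = Z80Dialect.STANDARD
        self.active = Z80Dialect.STANDARD

    def write_mode_register(self, value: int) -> None:
        # Z80Dialect(value) raises ValueError for anything other than 0, 1, or 2.
        self.pending = Z80Dialect(value)

    def reset(self) -> None:
        # The decode path only changes here, never mid-execution.
        self.active = self.pending


cpu = Z80ModeControl()
cpu.write_mode_register(1)  # request Z80N
cpu.reset()                 # Z80N decode is now active
```

Latching on reset means running code never sees its own opcode map change underneath it, which matters because modes 1 and 2 assign different meanings to the same encodings.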
Z80N (used by the ZX Spectrum Next) adds instructions for hardware multiply, pixel manipulation, and memory paging to support the Next's extended hardware. Z80N software runs correctly on any Spectrum or Z80-based personality with the mode register set accordingly.
eZ80 (Zilog's own extended Z80, used in the TI-84 Plus CE family of graphing calculators and in embedded systems) adds a 24-bit addressing mode (ADL, Address/Data Long) that expands the addressable memory space from 64KB to 16MB, plus additional instructions operating on 24-bit values. eZ80 mode is relevant for personalities targeting eZ80-based calculators such as the TI-84 Plus CE, and for embedded eZ80 systems.
68000 — Optional ISA Extensions
The 68000 softcore's decode table is writable via the FRAM window. Undefined opcode slots can be mapped to extended instruction implementations in fabric:
- 68010 extensions — loop mode (`DBcc` optimisation) and `MOVEC`/`MOVES` instructions. The 68010 was the first minor revision; loop mode measurably speeds tight iteration.
- 68020 instruction set — full 32-bit multiply/divide (`MULS.L`, `DIVS.L`), 32-bit PC-relative addressing, `BFINS`/`BFEXTS` bit field instructions, `PACK`/`UNPK`. Software compiled for the 68020 runs on the enhanced 68000 softcore without a different CPU core being present.
The decode table switch is a FRAM register write — no bitstream reload. The extension level is a personality component selectable from the OSD. Standard 68000 software is unaffected when extensions are off.
65816 — Optional Upgrade for All 6502-Based Machines
The WDC 65816 is a 16-bit extension of the 6502 architecture, fully backward-compatible with 6502 code in emulation mode. It adds a 24-bit address space (16MB), 16-bit accumulator and index registers in native mode, a 16-bit stack pointer that frees the stack from page 1, and additional addressing modes. It was designed explicitly as an upgrade path from the 6502: any 6502 machine can optionally run a 65816 softcore instead, with existing software running without modification in the 65816's emulation mode.
Since the CPU is synthesised logic, swapping the 6502 for a 65816 is a FRAM register write selecting a different decode path in the same softcore module. The 65816 is available as an optional enhancement on every 6502-based personality:
| Machine | Original CPU | 65816 upgrade gain |
|---|---|---|
| SNES | WDC 65C816 | Native — the 65816 is the SNES's actual CPU |
| Apple IIgs | WDC 65C816 | Native — IIgs runs natively at 2.8 MHz (fast) or 1 MHz (compat) |
| Apple II / IIe / IIc | MOS 6502 / 65C02 | 65816 gives full IIgs-compatible mode with 24-bit addressing |
| C64 SuperCPU mode | MOS 6510 | 65816 at effective 20 MHz equivalent — SuperCPU-compatible register map |
| BBC Micro | MOS 6502 | 24-bit address space, 16-bit registers, proper hardware stack — replaces Tube co-processor need for most use cases |
| Atari 8-bit | MOS 6502C | 24-bit addressing expands beyond the ANTIC/GTIA 64KB limit for enhanced software |
| NES / Famicom | MOS RP2A03 | 65816 native mode available; expanded addressing useful for homebrew beyond the 64KB map |
| Atari 2600 | MOS 6507 | 65816 emulation mode available; limited practical gain given TIA-constrained architecture |
In every case the softcore starts in the 65816's emulation mode, where behaviour matches the original 6502: the stack is fixed to page 1 and the direct page register resets to zero, so zero-page addressing works as before. Switching to native mode enables the full 16-bit registers and 24-bit addressing. Software that never leaves emulation mode never notices the difference; standard 6502 code runs unchanged.
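The switch between emulation and native mode on a 65816 is made with the `XCE` instruction, which exchanges the carry flag with the emulation flag. The sketch below models just those two flags; the dataclass and helper names are illustrative, while the `CLC`/`XCE` and `SEC`/`XCE` sequences are the standard 65816 idioms.

```python
from dataclasses import dataclass


@dataclass
class Flags65816:
    carry: bool = False
    emulation: bool = True  # the CPU comes out of reset in emulation mode


def xce(flags: Flags65816) -> None:
    """XCE: exchange the carry flag and the emulation flag."""
    flags.carry, flags.emulation = flags.emulation, flags.carry


def enter_native(flags: Flags65816) -> None:
    flags.carry = False  # CLC
    xce(flags)           # XCE: emulation becomes 0, carry holds the old E


def enter_emulation(flags: Flags65816) -> None:
    flags.carry = True   # SEC
    xce(flags)           # XCE: emulation becomes 1, carry holds the old E


flags = Flags65816()
enter_native(flags)
print(flags)
```

Code written for the 6502 never executes `XCE`, since that opcode did not exist there as a documented instruction, so it stays in emulation mode for its whole run.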
The 65816 softcore on the Ant64 runs at full FPGA speed in turbo mode — substantially faster than any 65816 hardware ever shipped. The Apple IIgs ran its 65816 at 2.8 MHz fast mode; the Ant64 equivalent runs at hundreds of MHz effective throughput.
SNES Cartridge Co-Processors
The SNES was designed from the outset with a co-processor strategy — 16 additional pins on the cartridge edge allowed game cartridges to include dedicated chips for capabilities the base hardware lacked. These chips are synthesised in spare GoWin LUTs and enabled automatically when a ROM is identified as requiring one (via the ROM header's co-processor type byte):
| Co-processor | Description | Notable games |
|---|---|---|
| Super FX GSU-1 | Argonaut RISC CPU @ 10.5 MHz — polygon renderer | Star Fox, Stunt Race FX |
| Super FX 2 GSU-2 | Super FX at up to 21 MHz, more ROM support | Yoshi's Island, Doom |
| SA-1 | Full 65C816 @ 10.7 MHz + DMA + decompression — effectively a second SNES CPU | Super Mario RPG, Kirby Super Star |
| DSP-1 / 1A / 1B | NEC µPD77C25 — 16-bit multiply, sin/cos, vector/rotation | Super Mario Kart, Pilotwings |
| DSP-2 | Converts Atari ST bitmap format to SNES bitplane format | Dungeon Master |
| DSP-3 / DSP-4 | Single-game chips — AI and procedural track rendering | SD Gundam GX, Top Gear 3000 |
| CX4 | Hitachi HG51B169 @ 20 MHz — trig, wireframe, rotation | Mega Man X2, X3 |
| ST-010 / ST-011 | NEC µPD96050 — AI for Shogi games | Various Shogi titles |
| ST-018 | 21.44 MHz ARM60 — a 32-bit ARMv3 processor inside a SNES cartridge | Hayazashi Nidan Morita Shogi 2 |
| Super Game Boy | Sharp SM83 core — the complete Game Boy CPU in a SNES cartridge | Super Game Boy |
All fit comfortably in the 138K fabric alongside the main SNES core. The SA-1 and Super FX are the largest; both fit with room to spare. The ST-018 is the most exotic: a genuine ARMv3 softcore running inside a synthesised cartridge that itself sits inside an FPGA personality, which is a satisfying level of nesting. In the GoWin fabric the Super FX runs at full FPGA speed, removing the original clock throttling.
BBC Micro — Tube Co-Processor
The BBC Micro's Tube interface connected a second processor via four bidirectional FIFOs. The co-processor ran user programs; the host 6502 handled all I/O. On original hardware the co-processor's speed was unconstrained — the 2 MHz Tube bus was the only bottleneck, and it only activated during OS calls. Software that used OS calls for all I/O (BBC BASIC does; properly-written machine code does) ran at whatever rate the co-processor ran.
On the Ant64, the Tube co-processor is synthesised in spare LUTs of the BBC Micro bitstream — not a physical second board. The host 6502 communicates via the standard Tube register addresses; the co-processor runs at full FPGA clock rate. The effective speed uplift for BASIC and OS-calling code is substantial.
C64 — MEGA65 / GS4510 Mode
The MEGA65 is an open-source implementation of the never-released Commodore 65, developed by the Museum of Electronic Games and Art. Its GS4510 CPU is an enhanced 65CE02 derivative with a 28-bit address space, 32-bit far-JSR/JMP/RTS, and software-selectable speeds from 1 MHz (authentic C64) to 40.5 MHz. The Ant64 C64 personality's enhanced mode adopts the MEGA65's VIC-IV and GS4510 register interface, so MEGA65-aware software runs without modification, and standard C64 software is unaffected when the enhanced mode is off.
MSX — OCM-PLD / MSX++
The OCM-PLD project (originally the One Chip MSX, or 1chipMSX) is a mature FPGA reimplementation of the MSX2+ platform with an active community that has been enhancing it for well over a decade. The current firmware is branded MSX++ and runs on multiple FPGA boards (SX-2, SM-X, Zemmix Neo, and others). It is the reference enhancement target for the Ant64 MSX personality.
Key OCM-PLD enhancements the Ant64 MSX personality adopts:
- Turbo CPU speeds — Z80 core switchable between 3.58 MHz (authentic), 5.37 MHz, 8 MHz, and faster, software-controlled via the switched I/O port scheme
- PSG2 — a second AY-3-8910 PSG synthesised alongside the original, providing stereo audio for software that addresses the second PSG at its standard address
- OPL3 — FM synthesis beyond the original MSX-Music (YM2413 OPLL) — richer FM audio for software that detects and uses it
- V9990 GPU — the Yamaha V9990 was designed as an MSX VDP successor but was too expensive to include in standard hardware; the OCM synthesises it in spare LUTs, adding 256-colour bitmap modes, hardware sprites, and a pattern generator. V9990-aware software runs without modification against the standard register map
- MegaRAM expansion — up to 4096 KB ASCII-mapped MegaRAM in spare LUTs, accessible to MSX-DOS and OS-9
The OCM-PLD's switched I/O port scheme ($40–$4F) is the compatibility target — software written for any OCM-compatible MSX++ machine runs on the Ant64 MSX personality without modification.
Dragon 32/64 and CoCo — CoCo3 / GIME Enhancements
The Dragon 32/64 and TRS-80 Color Computer share the same underlying architecture — both derived from a Motorola 6809 reference design pairing the MC6809E CPU with the MC6847 VDG and MC6883 SAM. The CoCo3 introduced Tandy's GIME chip (Graphics Interrupt Memory Enhancement), which added a paged MMU, extended video modes, and a software-selectable 1.79 MHz / 0.895 MHz CPU speed switch.
The CoCo3FPGA project by Gary Becker implemented the CoCo3 and its GIME chip in FPGA fabric, running the 6809 core at 25 MHz — over 13× the original speed — and adding 256-colour graphics modes including a 640×450 mode the original GIME never offered. Roger Taylor's RealCoCo (later ported to MiSTer as a combined Dragon32/64 + CoCo2/3 core) extends this work with further accuracy improvements.
The Ant64 Dragon/CoCo personality uses the GIME register interface as the compatibility target. Enhanced modes expose:
- 6809 at full FPGA turbo speed — the 6809's clean orthogonal architecture and 16-bit register operations benefit substantially from zero-wait-state SRAM and pipelined execution
- GIME extended video — 256-colour modes, hardware text with true lowercase at 32/40/64/80 columns
- Paged MMU — the GIME's 8 × 8KB page scheme gives the Dragon/CoCo a 512KB address space, extended to DDR3-backed memory for larger configurations
- 6809 → 6309 upgrade — Hitachi's HD6309 was a licensed 6809 clone with additional undocumented instructions (TFM block transfer, additional registers, hardware divide) that the Dragon/CoCo community has documented thoroughly. The 6309 mode is a FRAM register switch, and 6309-aware software runs at full speed
The Dragon 32/64 and CoCo share enough hardware that they coexist in a single bitstream, switchable from the OSD — the same combined-machine approach as the Atari 16/32-bit personality.
Throttling Modes
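The softcore's execution speed is controlled by a fixed-point cycle budget accumulator in the interface block. Each host clock cycle the accumulator advances by an increment set by the programmable speed_ratio value. When it reaches 1.0 the pipeline is allowed one guest CPU cycle; otherwise it stalls. speed_ratio is quoted relative to original speed, so the increment added per host cycle is speed_ratio scaled by the original guest clock divided by the host clock: a speed_ratio of 1.0 yields one guest cycle per original clock period, giving exact original timing. Disabling the gate altogether (the limiting case of an unboundedly large speed_ratio) gives full turbo speed.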
// Each host clock cycle:
cycle_budget += speed_ratio; // fixed-point register, written via FRAM
if (cycle_budget >= 1.0) {
cycle_budget -= 1.0;
allow_one_guest_cycle(); // pipeline gate opens
}
// else: pipeline stalls for one host cycle
speed_ratio is a single FRAM register write. Changing speed takes effect on the next host cycle — no reset, no reconfiguration, no glitch. The custom chips always run at their correct video-synchronised rate regardless of the CPU throttle setting.
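The gate can be modelled in a few lines to see how a given speed_ratio turns into guest cycles. This is a minimal sketch, not the interface block's actual logic: the 16-bit fractional precision, the function names, and the scaling of speed_ratio by guest clock over host clock (so that 1.0 means original speed) are assumptions made for illustration.

```python
FRAC_BITS = 16            # assumed fixed-point precision
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a non-negative real value to unsigned fixed point."""
    return round(x * ONE)


def run_gate(speed_ratio: float, host_hz: int, guest_hz: int, host_cycles: int) -> int:
    """Count guest cycles granted over a number of host clock cycles."""
    # Assumption: the per-host-cycle step is speed_ratio scaled by guest/host clock.
    step = to_fixed(speed_ratio * guest_hz / host_hz)
    # At most one guest cycle per host cycle; a clamped step behaves like Turbo.
    step = min(step, ONE)
    budget = 0
    granted = 0
    for _ in range(host_cycles):
        budget += step
        if budget >= ONE:
            budget -= ONE
            granted += 1      # pipeline gate opens for one guest cycle
        # else: pipeline stalls for this host cycle
    return granted
```

With a 100 MHz host and a 25 MHz guest (round numbers chosen only for the example), 1,000 host cycles grant 250 guest cycles at a ratio of 1.0 and 125 at 0.5.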
Decoupled and Coupled Speed
Two modes control the relationship between CPU speed and custom chip speed:
Decoupled — the CPU softcore runs at maximum FPGA speed; the custom chips run at their original pixel/bus clock. The custom chips see correctly-timed bus cycles for any shared memory access via wait state insertion at the bus interface. Private fast RAM accesses are instant. This is equivalent to adding fast RAM to the original machine — software runs faster wherever it uses private memory, while timing-sensitive display hardware is unaffected.
Coupled — the custom chips are also clocked faster via their FPGA clock divider. The whole machine accelerates uniformly. The display output is re-timed by the overlay block so HDMI output stays at standard frame rates while the hardware runs multiple frames per display frame internally.
| Preset | CPU | Custom chips | Notes |
|---|---|---|---|
| Authentic | Throttled to original | 1× | Maximum compatibility — exact original timing |
| Turbo | Uncapped | 1× | Decoupled — fastest compatible mode |
| 2× | 2× | 2× | Coupled uniform speedup |
| 4× | 4× | 4× | Coupled maximum speedup |
| Maximum | Full FPGA rate | Decoupled | Timing-sensitive software may break |
| Mode | Speed ratio | Effective guest speed | Primary use |
|---|---|---|---|
| Turbo | Gate disabled | Maximum (~200–400× original) | Normal production use |
| ×10 | 10.0 | 10× original | Fast-forward through slow sections |
| ×4 | 4.0 | 4× original | Rapid test of timing-sensitive code |
| ×2 | 2.0 | 2× original | Mild fast-forward between regions of interest |
| ×1 (original) | 1.0 | Exact original speed | Full cycle accuracy — CPU timing mode |
| ×0.5 | 0.5 | Half original speed | Watch raster effects build line by line |
| ×0.25 | 0.25 | Quarter original speed | Single-scanline debugging |
| ×0.1 | 0.1 | One tenth speed | Instruction-by-instruction visual tracing |
| Step | 0 (halted) | One instruction on demand | Deep instruction-level debugging |
Any fractional value is valid — speed_ratio is a full fixed-point register, not a discrete selector. A developer can dial in 0.03× to step through code almost frame by frame.
An important property of fractional modes: the display output continues running at full frame rate even while the simulated CPU runs at a fraction of original speed. The video chips are locked to the pixel clock, not the CPU throttle. At ×0.25, the display updates 60 times per second but the CPU completes only a quarter of a frame's worth of instructions per displayed frame — you can literally watch a raster effect build up scanline by scanline on the live display while the CPU churns forward at a controlled pace.
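As a worked example of the ratio, take a PAL C64 frame of 312 raster lines at 63 CPU cycles per line (19,656 cycles). The helper names below are illustrative only; the arithmetic simply scales one frame of CPU work by speed_ratio.

```python
def guest_cycles_per_display_frame(lines_per_frame: int, cycles_per_line: int,
                                   speed_ratio: float) -> float:
    """Guest CPU cycles that complete while the video hardware draws one frame."""
    return lines_per_frame * cycles_per_line * speed_ratio


def scanlines_of_work_per_display_frame(lines_per_frame: int, speed_ratio: float) -> float:
    """The same quantity expressed as scanlines' worth of original CPU time."""
    return lines_per_frame * speed_ratio
```

At a ratio of 0.25 the CPU gets through 4,914 cycles per displayed frame, which is 78 scanlines' worth of original CPU time, so a full frame of CPU work is spread across four displayed frames.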
Typical Debug Workflow
1. Start at Turbo — run at full speed, gameplay is normal
2. Approach region of interest
3. FRAM write: set speed to ×0.5 — everything slows, timing issues become visible
4. Tagged memory address fires watchpoint — pipeline halts automatically
5. Inspect registers and memory state via QSPI
6. FRAM write: set speed to Step
7. Single-step through instructions, watching registers and display update
8. FRAM write: set speed to ×1 — resume at original speed for cycle-accurate validation
9. FRAM write: Turbo — return to full speed
This transition from turbo to slow-motion to halted to stepping and back is entirely register writes — no bitstream reload, no reset, no loss of machine state at any point.
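The workflow above reduces to a short sequence of register writes. The sketch below is hypothetical throughout: the register offsets, the separate gate-control register, the signed 16.16 encoding, and the FakeFram class are invented for illustration and are not the documented FRAM register map.

```python
FRAC_BITS = 16
SPEED_RATIO_REG = 0x0010   # hypothetical register offset
GATE_CTRL_REG = 0x0014     # hypothetical: 1 = gate disabled (Turbo)


def encode_speed(ratio: float) -> int:
    """Encode a ratio as signed 16.16 fixed point in a 32-bit two's complement word."""
    return round(ratio * (1 << FRAC_BITS)) & 0xFFFFFFFF


class FakeFram:
    """Stand-in for the QSPI FRAM interface; it only records the writes."""

    def __init__(self):
        self.log = []

    def write32(self, reg: int, value: int) -> None:
        self.log.append((reg, value))


def set_speed(fram: FakeFram, ratio: float) -> None:
    fram.write32(GATE_CTRL_REG, 0)                   # enable the throttle gate
    fram.write32(SPEED_RATIO_REG, encode_speed(ratio))


def set_turbo(fram: FakeFram) -> None:
    fram.write32(GATE_CTRL_REG, 1)                   # bypass the gate entirely


def debug_session(fram: FakeFram) -> None:
    set_turbo(fram)        # step 1: full speed
    set_speed(fram, 0.5)   # step 3: slow motion
    set_speed(fram, 0.0)   # step 6: Step (halted, advance on demand)
    set_speed(fram, 1.0)   # step 8: original speed
    set_turbo(fram)        # step 9: back to full speed
```

Every transition is a write to the same small set of registers, which is why no reset or bitstream reload is involved.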
Interaction with SRAM Debug Tags
Speed throttling and SRAM debug tags compose naturally. A common pattern:
- Run at Turbo with a trace tag on a RAM region — collect an access log without slowing down
- Switch to ×0.25 when approaching a known-problematic area — slow enough to observe
- Execute breakpoint on ROM fires — pipeline halts mid-instruction
- Inspect state, modify a RAM value via QSPI, resume at ×0.5 to watch the corrected behaviour
Rewind — Running Backwards
The rewind capture block records CPU and RAM state changes into a ring buffer in FireStorm DDR3. Each entry stores an XOR delta — the old value XOR'd with the new value.
The reason XOR works symmetrically in both directions:
delta = old XOR new
Undo: current_value XOR delta = new XOR (old XOR new) = old ✓
Redo: current_value XOR delta = old XOR (old XOR new) = new ✓
The same stored entry applies equally for undo and redo. Directionality comes entirely from which way the ring buffer pointer moves — backwards undoes, forwards redoes. The ring buffer pointer position is the current point in time.
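The symmetry is easy to check in software. The sketch below tracks a single byte with a plain Python list standing in for the DDR3 ring buffer; the Timeline class and its method names are illustrative, not part of the capture block.

```python
class Timeline:
    """One value plus its XOR delta history; pos is the current point in time."""

    def __init__(self, initial: int):
        self.value = initial
        self.deltas = []
        self.pos = 0                      # number of deltas currently applied

    def write(self, new: int) -> None:
        del self.deltas[self.pos:]        # a new write discards any redo tail
        self.deltas.append(self.value ^ new)
        self.value = new
        self.pos += 1

    def undo(self) -> bool:
        if self.pos == 0:
            return False
        self.pos -= 1                     # pointer moves backwards
        self.value ^= self.deltas[self.pos]
        return True

    def redo(self) -> bool:
        if self.pos == len(self.deltas):
            return False
        self.value ^= self.deltas[self.pos]
        self.pos += 1                     # pointer moves forwards
        return True
```

Both undo and redo apply the same stored delta with the same XOR; only the direction of the pointer movement differs.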
Entry Format — Bitmask Frames
Rather than emitting a separate entry per register or per memory write, the capture block packages each instruction's changes into a single delta frame:
Delta frame:
[entry type: 2 bits]
[changed_mask: N bits] ← one bit per CPU register/flag, CPU-specific
[PC delta: 16 bits] ← signed; absolute entry emitted for jumps > ±32KB
[XOR delta per set bit] ← only CPU registers that actually changed
[RAM writes: address + XOR delta, one per write in this instruction]
[chipset reg writes: reg ID + XOR delta, one per chipset write]
Checkpoint frame (emitted periodically):
[entry type: 2 bits]
[full CPU register snapshot]
[full chipset register snapshot]
PC absolute frame (for long jumps):
[entry type: 2 bits]
[full PC value]
The changed_mask means zero-delta registers are never stored — if X, SP and most flags don't change in a tight loop, they contribute nothing to the ring buffer. The capture block maintains a small register mirror in BSRAM (6 bytes for a 6502, 72 bytes for a 68000) and compares on each instruction retirement to build the mask.
The capture block computes RAM write deltas by reading the old value from SRAM before the write completes — this read-before-write happens in the capture block's own independent pipeline and adds no stall to the CPU.
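A software model of the mask-building step for a 6502 might look like the following. It is a sketch under assumptions: the bit order of changed_mask, the use of a single P byte for the flags, and the tuple layout of a frame are all chosen for illustration, and RAM and chipset writes are left out.

```python
REGS_6502 = ("A", "X", "Y", "SP", "P")    # assumed mask bit order; P = flags byte


def build_delta(before: dict, after: dict):
    """Compare the register mirror with the retired state and build a frame."""
    mask = 0
    deltas = []
    for bit, name in enumerate(REGS_6502):
        d = before[name] ^ after[name]
        if d:                             # unchanged registers store nothing
            mask |= 1 << bit
            deltas.append(d)
    pc_delta = after["PC"] - before["PC"]
    return mask, pc_delta, deltas


def apply_delta(state: dict, frame, reverse: bool = False) -> dict:
    """Apply a frame forwards (redo) or backwards (undo)."""
    mask, pc_delta, deltas = frame
    out = dict(state)
    it = iter(deltas)
    for bit, name in enumerate(REGS_6502):
        if mask & (1 << bit):
            out[name] ^= next(it)         # XOR is its own inverse
    out["PC"] = (out["PC"] + (-pc_delta if reverse else pc_delta)) & 0xFFFF
    return out
```

For LDA #$42 executed with A = 0 and the Z flag set, only A and P change, so the frame carries two register deltas and a PC delta of 2.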
Chipset Registers Are Fully Reversible
On original hardware, many chipset registers were write-only — the VIC-II's sprite coordinates, the SID's envelope parameters, Agnus DMA pointers, the SNES PPU scroll registers. The CPU could write them but never read them back, so their internal state was inaccessible.
Since the chipset is synthesised logic running in the FPGA, every register has a readable internal state regardless of what the original chip exposed. The personality's address decoder routes reads on write-only addresses to the actual internal flip-flops rather than the original chip's external read path — transparently, with no special addressing required. This is described in detail in Chipset Register Read-Back.
The capture block uses this read-back path internally to fetch the old value before each chipset write, enabling correct XOR delta computation even for registers that were write-only on the original hardware. Chipset register writes are captured in the same ring buffer as CPU registers and RAM writes. They are far less frequent than RAM writes — a busy C64 frame might involve a few hundred VIC-II register writes compared to millions of RAM accesses — so the additional ring buffer bandwidth is modest.
Bytes per instruction after compression:
| CPU | Registers tracked | Typical bytes/instruction | Example instruction |
|---|---|---|---|
| 6502 | A, X, Y, SP, PC, 5 flags | ~4–6 bytes | LDA #$42 = mask + PC delta + A delta + NZ delta |
| Z80 | 14 main + 4 alt + 2 index + flags | ~5–8 bytes | LD A,n = mask + PC delta + A delta + flags delta |
| 68000 | 8 D + 7 A + PC + SR | ~8–14 bytes | MOVE.L D0,D1 = mask + PC delta + D1 delta + SR delta |
| MIPS R3000 | 32 GPR + PC + HI/LO | ~10–16 bytes | ADDU D,S,T = mask + PC delta + dest delta |
Chipset register writes add a few bytes per write event, but as these are infrequent compared to CPU instructions the per-instruction average impact is small.
Checkpoint Entries for Fast Seeking
A full snapshot — CPU registers and full chipset register state — is emitted every N instructions (configurable). DeMon's jog dial fast-rotation mode jumps checkpoint-to-checkpoint rather than entry-by-entry, enabling coarse seeking across long timelines. The debug application displays a timeline bar with checkpoint markers as visible anchors.
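Checkpoint-to-checkpoint navigation amounts to a search over sorted checkpoint positions. The sketch below assumes checkpoints fall at fixed instruction intervals and that a seek restores the nearest earlier checkpoint and then replays deltas forward; both the function names and the replay strategy are assumptions rather than documented behaviour.

```python
import bisect


def checkpoint_positions(total_instructions: int, interval: int) -> list:
    """Instruction indices at which a checkpoint frame was emitted."""
    return list(range(0, total_instructions + 1, interval))


def jump(checkpoints: list, current: int, direction: int) -> int:
    """Next checkpoint after current (direction > 0) or before it (otherwise)."""
    if direction > 0:
        i = bisect.bisect_right(checkpoints, current)
        return checkpoints[i] if i < len(checkpoints) else checkpoints[-1]
    i = bisect.bisect_left(checkpoints, current) - 1
    return checkpoints[i] if i >= 0 else checkpoints[0]


def seek_plan(checkpoints: list, target: int):
    """Checkpoint to restore and number of delta frames to replay after it."""
    i = bisect.bisect_right(checkpoints, target) - 1
    base = checkpoints[max(i, 0)]
    return base, target - base
```

A smaller interval gives finer coarse seeking at the cost of more ring buffer space spent on snapshots.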
Ring Buffer Depth
The ring buffer is a power-of-2 block of FireStorm DDR3. On the Ant64C, FireStorm has access to ~2 GB DDR3 — the ring buffer can occupy all of it if the personality does not need DDR3 for other purposes. On the Ant64 it is 1 GB. The Ant64S has 8 MB PSRAM, giving much shallower depth.
| CPU | Compressed rate | Ant64S (8 MB) | Ant64 (1 GB) | Ant64C (~2 GB) |
|---|---|---|---|---|
| 6502 @ 1 MHz | ~1.1 MB/s | ~7 seconds | ~15 minutes | ~29 minutes |
| Z80 @ 3.5 MHz | ~3.5 MB/s | ~2 seconds | ~5 minutes | ~9.5 minutes |
| 68000 @ 7.16 MHz | ~28 MB/s | <1 second | ~36 seconds | ~72 seconds |
| MIPS R3000 @ 33.8 MHz | ~120 MB/s | <1 second | ~8 seconds | ~17 seconds |
These figures assume the full DDR3 pool (PSRAM on the Ant64S) is allocated to the rewind buffer. Users can trade rewind depth for expanded system RAM — giving the emulated machine more memory than the original hardware ever had. See FireStorm DDR3 for the full tradeoff options.
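The depth figures in the table follow from dividing buffer size by compressed rate. The sketch below assumes the table's MB and GB are binary units (2^20 and 2^30 bytes) and also shows why a power-of-2 buffer is convenient: the pointer wraps with a single mask. Function names are illustrative.

```python
MIB = 1024 * 1024


def rewind_depth_seconds(buffer_bytes: int, rate_mib_per_s: float) -> float:
    """Seconds of history the ring buffer holds at a given capture rate."""
    return buffer_bytes / (rate_mib_per_s * MIB)


def wrap(pointer: int, buffer_bytes: int) -> int:
    """Wrap a ring buffer pointer; requires a power-of-2 buffer size."""
    assert buffer_bytes & (buffer_bytes - 1) == 0, "buffer must be a power of two"
    return pointer & (buffer_bytes - 1)
```

For example, 8 MB at about 1.1 MB/s gives a little over 7 seconds, and 1 GB at about 28 MB/s gives roughly 36.6 seconds, in line with the 6502 and 68000 rows.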
What rewind can and cannot do. The capture block records CPU registers, RAM writes, and chipset register writes — which together constitute the complete internal state of the emulated machine. Rewinding fully restores the CPU, all RAM, and all chipset state including display registers, audio parameters, DMA pointers, and sprite tables. The display and audio output resettles to match the restored state within one frame.
Interrupts are handled correctly. An interrupt firing between two instructions produces a sequence of RAM writes (return address and pushed registers onto the stack), a PC delta (jump to the interrupt vector), and register deltas (status flags changed). From the capture block's perspective this is indistinguishable from any other sequence of writes — it records what the hardware actually did, regardless of why. Rewinding through an interrupt restores the stack to its pre-interrupt state, restores the PC to the instruction that was about to execute, and restores the flags. The interrupt appears to un-fire cleanly.
The interrupt pending state in the chipset is also restored — the VIC-II raster interrupt flag, the Z80 interrupt acknowledge, the timer register that caused the interrupt — because chipset register writes are captured alongside everything else. When the machine runs forward again from the rewound point, the same interrupt will re-fire at the same instruction boundary, which is the correct and expected behaviour.
Multi-CPU systems. Machines with more than one CPU — the Mega Drive (68000 + Z80), SNES (65816 + SPC700), Saturn (two SH-2s) — need all CPUs captured in the same ring buffer. Each delta frame includes a cpu_id field identifying which CPU produced the entry:
Delta frame:
[entry type: 2 bits]
[cpu_id: 2 bits] ← 0 = main CPU, 1 = sub CPU, 2+ = additional
[changed_mask: N bits]
[PC delta + register deltas + RAM/chipset writes]
A single ordered buffer is important for machines where the two CPUs share memory or interact via bus arbitration. On the Mega Drive, the 68000 and Z80 share the Z80 RAM region and communicate via the bus request/grant mechanism — their writes to shared memory are interleaved, and the relative ordering is semantically meaningful. A single buffer preserves that ordering exactly; two separate buffers cannot. Rewinding replays the interleaved stream in reverse, restoring both CPUs together.
For loosely coupled systems like the SNES — where the 65816 and SPC700 interact only through four dedicated communication ports — two separate buffers are optionally supported, allowing each CPU's timeline to be scrubbed independently. Single-buffer mode remains the default and is always correct.
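The ordering argument can be demonstrated with two CPUs writing the same shared byte. This is a toy model: the log is a Python list, the entries carry only cpu_id, address, and XOR delta, and the names are illustrative.

```python
def record(ram: bytearray, log: list, cpu_id: int, addr: int, new: int) -> None:
    """Capture one write into the single ordered log, then perform it."""
    log.append((cpu_id, addr, ram[addr] ^ new))
    ram[addr] = new


def rewind(ram: bytearray, log: list, steps: int) -> None:
    """Undo the most recent writes in strict reverse order, whichever CPU made them."""
    for _ in range(min(steps, len(log))):
        _cpu_id, addr, delta = log.pop()
        ram[addr] ^= delta
```

If CPU 0 writes 0x11, CPU 1 writes 0x22, and CPU 0 writes 0x44, rewinding one entry at a time yields 0x22, then 0x11, then the initial 0x00. Undoing CPU 1's delta out of order from 0x44 would instead produce 0x77, a value the byte never held, which is the failure a single interleaved buffer avoids.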
The only state that cannot be reversed is external I/O — signals that left the FPGA into the real world: MIDI output to external hardware, serial data, disk writes to physical media. These happened and cannot be un-happened. For the overwhelming majority of debug and game use cases this is irrelevant — the interesting state is internal.
Jog Dial Time Scrubbing
DeMon has a dedicated jog dial used for system supervision and debug tasks — separate from the 8 jog dials on Pulse that control sequencer and MIDI parameters. In personality debug mode this dial becomes a physical time scrub control — no PC, no debug application window required, hands directly on the timeline:
- Rotate clockwise — advance time (positive speed_ratio); faster rotation = higher speed
- Rotate anticlockwise — rewind (negative speed_ratio); faster rotation = faster rewind
- Push — pause / resume (toggle between halted and last active speed)
- Push and hold + rotate — fine single-step in either direction; each detent = one guest instruction
- Fast spin — jumps checkpoint to checkpoint for coarse timeline navigation
DeMon translates jog dial events into FRAM register writes to the speed_ratio register via QSPI — the same mechanism the debug application uses. The transition from running forward to halted to rewinding to stepping is entirely physical, with the display updating in real time as the ring buffer is traversed in either direction.
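One plausible mapping from dial events to speed_ratio values is sketched below. The linear scaling, the 4.0 full-scale ratio, the 20 detents-per-second limit, and the class itself are invented for illustration; DeMon's real response curve is not specified here.

```python
def dial_to_speed_ratio(detents_per_second: float, full_scale: float = 4.0,
                        max_rate: float = 20.0) -> float:
    """Clockwise (positive) rotation runs forward, anticlockwise rewinds."""
    rate = max(-max_rate, min(max_rate, detents_per_second))
    return full_scale * rate / max_rate


class JogDial:
    def __init__(self):
        self.speed = 0.0          # value that would be written to speed_ratio
        self.last_active = 1.0    # speed restored when resuming from pause

    def rotate(self, detents_per_second: float) -> float:
        self.speed = dial_to_speed_ratio(detents_per_second)
        if self.speed != 0.0:
            self.last_active = self.speed
        return self.speed

    def push(self) -> float:
        """Toggle between halted and the last active speed."""
        self.speed = self.last_active if self.speed == 0.0 else 0.0
        return self.speed
```

Each returned value corresponds to one speed_ratio register write, so the dial and the debug application drive the machine through exactly the same path.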
This makes the debug workflow tactile: overshoot a breakpoint, spin the dial back, find the exact instruction, push to pause, inspect via the debug application. The combination of hardware-speed execution, XOR delta rewind, bitmask-compressed frames, and physical scrub control gives a debugging experience that no original hardware developer of any of these machines could have imagined.
Debugging and Development via the FRAM Interface
The SG2000 small core runs AntOS and has access to the personality via the QSPI FRAM interface for system-level tasks — save states, rewind management, scripted automation, system switching. The SG2000 big core, when a debug session is active, runs a dedicated debug application with the full 1 GHz C906 to itself and no OS overhead. This application has read/write access to the entire personality via the same FRAM interface:
Softcore CPU registers — read or write any register at any time. When the softcore is halted, register values are stable and coherent.
Guest system memory — read or write any address in the emulated machine's RAM or ROM shadow. Inspect the stack, patch variables, inject test data, verify game state.
Custom chipset registers — read or write any register in the emulated custom chips via the FRAM interface. This includes registers that were write-only on the original hardware — the VIC-II's sprite coordinates, the SID's envelope state, Agnus DMA pointers, SNES PPU scroll registers. See Chipset Register Read-Back below.
Debug control — halt, run, single-step, reset. Read the cycle counter.
Speed throttle and time scrubbing — write the speed_ratio FRAM register to switch between Turbo, any fractional speed, Step, or negative (rewind) values instantly without losing machine state. DeMon's dedicated jog dial maps directly to this register for hands-on time scrubbing without a PC. See Throttling Modes.
SRAM debug tags — set execute breakpoints, read/write watchpoints, and trace flags on any memory address by writing the 4 extra bits of the relevant 36-bit SRAM word. Tags take effect on the very next access. No LUT cost; no cycle penalty until fired. See The Extra 4 Bits — Hardware Debug Tags.
Trace buffer — read the log of all accesses to trace-tagged addresses, stored in FireStorm DDR3.
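The tag mechanism can be pictured as bit packing on a 36-bit word: 32 data bits plus 4 tag bits. The assignment of individual tag bits below is hypothetical, as are the helper names; only the 32 + 4 split comes from the description above.

```python
DATA_MASK = 0xFFFF_FFFF
TAG_SHIFT = 32

# Hypothetical bit assignment for the 4 tag bits.
TAG_EXEC_BREAK = 0b0001
TAG_READ_WATCH = 0b0010
TAG_WRITE_WATCH = 0b0100
TAG_TRACE = 0b1000


def pack_word(data: int, tags: int) -> int:
    """Combine 32 data bits and 4 tag bits into one 36-bit SRAM word."""
    return ((tags & 0xF) << TAG_SHIFT) | (data & DATA_MASK)


def unpack_word(word: int):
    """Split a 36-bit word back into (data, tags)."""
    return word & DATA_MASK, (word >> TAG_SHIFT) & 0xF


def access_fires(word: int, kind: str) -> bool:
    """True if an access of the given kind should halt the pipeline."""
    _, tags = unpack_word(word)
    wanted = {"exec": TAG_EXEC_BREAK, "read": TAG_READ_WATCH,
              "write": TAG_WRITE_WATCH}[kind]
    return bool(tags & wanted)
```

Because the tags travel with the data word, checking them costs nothing beyond the access that was already happening.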
The big core debug application exposes a GDB remote stub over TCP/WiFi via the AntOS network stack, so a developer connects from a standard IDE or debugger on a PC with no special hardware. The personality runs at full hardware speed and halts on command — the entire machine state readable and writable in microseconds over QSPI.
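For readers unfamiliar with the GDB remote serial protocol, its packet framing is simple: a payload wrapped as `$payload#cc`, where cc is the modulo-256 sum of the payload bytes in two hex digits. The sketch below shows only that framing; it omits escaping, run-length encoding, and acknowledgements, and says nothing about how the Ant64 stub is actually implemented.

```python
def rsp_checksum(payload: str) -> int:
    """Modulo-256 sum of the payload bytes."""
    return sum(payload.encode("ascii")) % 256


def rsp_frame(payload: str) -> str:
    """Wrap a payload as $payload#cc with a two-digit hex checksum."""
    return f"${payload}#{rsp_checksum(payload):02x}"


def rsp_parse(packet: str) -> str:
    """Validate framing and checksum, returning the payload."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        raise ValueError("malformed packet")
    payload, received = packet[1:-3], int(packet[-2:], 16)
    if rsp_checksum(payload) != received:
        raise ValueError("checksum mismatch")
    return payload
```

The `g` packet (read all registers) frames as `$g#67`, and an `OK` reply as `$OK#9a`.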
Chipset Register Read-Back
On original hardware, chipset register access was highly asymmetric. Many registers were write-only — reading the same address returned open bus noise, the last data bus value, or nothing meaningful. In some cases read and write were mapped to entirely different addresses. Some familiar examples:
| Machine | Register | Original write | Original read |
|---|---|---|---|
| C64 | SID frequency | $D400 — sets voice 1 freq low | $D400 — open bus or last byte |
| Amiga | Colour register | $DFF180 (COLOR00) — sets colour | $DFF180 — no-op on original hardware |
| Amiga | Blitter status | write registers only | $DFF002 (DMACONR) — separate read address |
| SNES | BGMode | $2105 — write-only | $213C–$213F — completely separate status ports |
| NES | PPUCTRL | $2000 — write-only | $2002 (PPUSTATUS) — different address, different data |
Since the entire chipset address space is defined by the personality developer at implementation time, the interface block's address decoder already knows which addresses correspond to write-only registers. There is no need for any special convention — reading $D400 simply returns the actual SID flip-flop value, because the decoder routes reads on that address to the internal flip-flop rather than the original chip's external read path. This is a detail of the personality's address decoder wiring, invisible to everything above it.
From the debug application's perspective, every chipset register is readable at its normal address. From the capture block's perspective, it always reads internal flip-flops — it has no concept of the original chip's external interface at all. The "hidden read" is just how the address decoder is built; no extra bits, no address space doubling, no user-visible change to the memory map.
This serves two distinct purposes:
For the rewind capture block — to compute delta = old XOR new on a write to a write-only register, the capture block reads the current flip-flop state through the internal path before each write. This is what makes XOR delta rewind correct for the full chipset, including registers the original programmer could never read.
For the debug application — the SG2000 big core can inspect the complete internal state of the emulated chipset at any normal address. The SID's internal envelope phase, the Copper's current instruction pointer, the blitter's internal accumulator, the PPU's internal scroll latch — anything that exists as a flip-flop in the design is readable at its normal address.
The personality developer wires the internal flip-flop outputs to both the capture block and the read path during chipset implementation. Since the full address map is known at synthesis time, the routing is determined once and baked into the bitstream.
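A behavioural model of the read-back path makes the two uses concrete. The Chipset class below is a toy: registers are a dictionary keyed by address, every register resets to zero, and the capture log is a plain list. None of these names come from the actual personality sources.

```python
class Chipset:
    """Toy chipset whose every register is readable at its normal address."""

    def __init__(self):
        self.flops = {}                       # internal flip-flop state by address

    def read(self, addr: int) -> int:
        """Personality read path: always returns the internal flip-flop."""
        return self.flops.get(addr, 0)

    def write(self, addr: int, value: int, capture: list) -> None:
        old = self.read(addr)                 # read-back before the write lands
        capture.append((addr, old ^ value))   # XOR delta for the rewind buffer
        self.flops[addr] = value

    def undo_last(self, capture: list) -> None:
        addr, delta = capture.pop()
        self.flops[addr] = self.read(addr) ^ delta
```

Writing 0x34 and then 0x12 to $D400 leaves deltas 0x34 and 0x26 in the log, and two undos walk the register back through 0x34 to 0x00, even though the original SID never allowed that register to be read.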
All figures represent the estimated maximum speed multiplier of the Ant64 FPGA personality relative to the original hardware running at its original specification. The multiplier accounts for three compounding factors: clock speed ratio, CPI improvement from pipelined zero-wait-state execution, and removal of bus contention where it applied on the original hardware.
Turbo mode only — the multiplier shown is the maximum speed the softcore achieves with throttling disabled. Cycle-accurate mode always runs at 1× original speed by definition and is not listed.
A note on fabric speed. The GoWin GW5AT-138K uses a 22nm process, in the same class as Xilinx Kintex-7 (28nm) and Intel Arria 10 (20nm). The hard RISC-V A25 core embedded in the related GW5AST variant runs at 400 MHz in silicon; BSRAM is rated at 380 MHz. For pipelined CPU softcores on this fabric, achievable synthesis frequencies are substantially higher than on the older GoWin GW1N (~55nm) devices where the open-source A500 and SNES cores were originally demonstrated. The multipliers below reflect the 22nm capability.
Original RAM speed, bus contention severity, and clock speed all feed into the figure — which is why some slower machines have higher multipliers than faster ones. A C64 at 1 MHz with severe bus contention gains more from the architecture than a machine that was already running from fast RAM.
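The relationship between the table columns is a single multiplication, and dividing out the raw clock ratio shows how much of a multiplier is left to CPI and contention gains. The sketch assumes a 380 MHz softcore clock (borrowed from the BSRAM rating) purely to have a number to work with; the function names are illustrative.

```python
def effective_mhz(original_mhz: float, multiplier: float) -> float:
    """Effective speed column = original clock x estimated multiplier."""
    return original_mhz * multiplier


def residual_gain(multiplier: float, host_mhz: float, original_mhz: float) -> float:
    """Part of the multiplier not explained by the raw clock ratio."""
    return multiplier / (host_mhz / original_mhz)
```

For the C64 row, 0.985 MHz at ~660× gives about 650 MHz effective, and under the 380 MHz assumption roughly 1.7× of that multiplier would come from factors other than clock speed.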
A note on bus width and prefetch queues. Several CPUs had internal registers wider than their external data bus — the 68008's 32-bit registers on an 8-bit bus, the 65816's 16-bit registers on an 8-bit bus, the 8088's 16-bit registers on an 8-bit bus. On the Ant64, the 36-bit SRAM delivers 4 bytes per cycle, collapsing multiple original bus cycles into one. Where multipliers reflect this gain, it is noted per machine.
Some of these CPUs had instruction prefetch queues (68000: 4 bytes, 8088: 4 bytes, 8086/286: 6 bytes, 386+: 16 bytes) which partially hid the instruction fetch bus width penalty on original hardware by fetching ahead during execution. This means the bus-width gain on instruction fetch is slightly smaller than it would appear — the prefetch queue was already doing useful work. However prefetch queues only buffer instruction fetches, not data accesses. A MOVE.L reading or writing a data address still required multiple bus cycles on original hardware regardless of the prefetch queue. The multiplier adjustments for bus width are therefore most accurate for data-heavy code and modestly conservative for pure instruction throughput.
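The bus-width effect on data accesses is a ceiling division of operand size by bus width. The helper below is a simplification that ignores alignment, wait states, and prefetch; the name is illustrative.

```python
def bus_cycles(operand_bytes: int, bus_bytes: int) -> int:
    """Bus cycles needed to move one aligned operand over a bus of given width."""
    return -(-operand_bytes // bus_bytes)     # ceiling division
```

A 32-bit MOVE.L needs 4 cycles on the 68008's 8-bit bus, 2 on the 68000's 16-bit bus, and 1 on a 4-byte-wide SRAM port; a 16-bit word on the 8088 needs 2.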
1970s
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| Altair 8800 | Intel 8080 | 2 MHz | DRAM wait states, no cache | ~280× | ~560 MHz |
| Apple I | MOS 6502 | 1 MHz | DRAM, 2-cycle minimum enforced by bus | ~480× | ~480 MHz |
| Commodore PET 2001 | MOS 6502 | 1 MHz | DRAM wait states | ~480× | ~480 MHz |
| TRS-80 Model I | Zilog Z80 | 1.77 MHz | DRAM wait states, refresh cycles | ~330× | ~584 MHz |
| Apple II | MOS 6502 | 1.023 MHz | DRAM, soft switch contention | ~460× | ~471 MHz |
| Atari 2600 | MOS 6507 | 1.19 MHz | TIA halts CPU every scanline — severe | ~570× | ~678 MHz |
1980s — 8-bit home computers
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| ZX80 | Zilog Z80 | 3.25 MHz | CPU HALTed during entire display generation | ~500× | ~1.6 GHz |
| ZX81 | Zilog Z80 | 3.25 MHz | CPU HALTed during display; ~75% of cycles lost | ~540× | ~1.8 GHz |
| Sinclair ZX Spectrum 48K | Zilog Z80 | 3.5 MHz | ULA steals ~29% of cycles during display | ~310× | ~1.1 GHz |
| Sinclair ZX Spectrum 128K | Zilog Z80 | 3.5469 MHz | ULA contention; slightly less severe than 48K | ~300× | ~1.1 GHz |
| BBC Micro Model B | MOS 6502 | 2 MHz | CRTC bus steal; less severe than Spectrum | ~410× | ~820 MHz |
| Acorn Electron | MOS 6502 | 2 MHz | ULA contention — worse than BBC Micro | ~440× | ~880 MHz |
| Commodore 64 | MOS 6510 | 0.985 MHz | VIC-II bad line steals; ~15% overhead | ~660× | ~650 MHz |
| Commodore 128 | MOS 8502 | 2 MHz (fast mode) | Less contention than C64 | ~365× | ~730 MHz |
| Atari 400 / 800 | MOS 6502 | 1.79 MHz | ANTIC DMA steals up to 50% in graphics modes | ~450× | ~806 MHz |
| Atari XL / XE | MOS 6502 | 1.79 MHz | ANTIC DMA steals | ~450× | ~806 MHz |
| Dragon 32 / 64 | Motorola 6809 | 0.89 MHz | DRAM wait states, SAM chip contention; 16-bit D and index registers on 8-bit bus — 16-bit ops took 2 cycles | ~650× | ~579 MHz |
| TRS-80 Color Computer | Motorola 6809 | 0.89 MHz | DRAM wait states; same 8-bit bus improvement as Dragon | ~640× | ~570 MHz |
| TRS-80 Model III / 4 | Zilog Z80 | 2–4 MHz | DRAM wait states, refresh | ~190–280× | ~380–1,120 MHz |
| Oric-1 / Atmos | MOS 6502 | 1 MHz | DRAM wait states | ~480× | ~480 MHz |
| Amstrad CPC 464 | Zilog Z80 | 4 MHz | Gate array inserts 1 wait state every cycle | ~255× | ~1.0 GHz |
| Amstrad CPC 6128 | Zilog Z80 | 4 MHz | Same as 464 | ~255× | ~1.0 GHz |
| MSX (standard) | Zilog Z80 | 3.58 MHz | DRAM wait states; VDP contention | ~225× | ~806 MHz |
| MSX2 | Zilog Z80 | 3.58 MHz | Similar to MSX1 | ~225× | ~806 MHz |
| Thomson MO5 / TO7 | Motorola 6809 | 1 MHz | DRAM wait states | ~540× | ~540 MHz |
| Mattel Aquarius | Zilog Z80 | 3.5 MHz | DRAM wait states | ~230× | ~805 MHz |
| Sinclair QL | Motorola 68008 | 7.5 MHz | 8-bit external bus on 32-bit CPU — every MOVE.L required 4 bus cycles; FPGA collapses this to 1 | ~250× | ~1.9 GHz |
| Sam Coupé | Zilog Z80 | 6 MHz | ASIC contention during display | ~155× | ~930 MHz |
| ZX Spectrum Next (KS2) | Z80 softcore @ 28 MHz | 3.5–28 MHz | SRAM (10 ns) but wait states remain at 28 MHz turbo | ~35× at 28 MHz turbo | ~980 MHz |
1980s — 16-bit home computers and workstations
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| Atari ST | Motorola 68000 | 8 MHz | DRAM wait states; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1 | ~200× | ~1.6 GHz |
| Atari STE | Motorola 68000 | 8 MHz | Same as ST; slightly improved DMA | ~200× | ~1.6 GHz |
| Amiga 500 | Motorola 68000 | 7.16 MHz | Chip RAM shared with Agnus — 50% max bandwidth; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1 | ~230× | ~1.6 GHz |
| Amiga 1000 | Motorola 68000 | 7.16 MHz | Same as A500 | ~230× | ~1.6 GHz |
| Amiga 2000 | Motorola 68000 | 7.16 MHz | Chip RAM contention; fast RAM optional; 16-bit bus — same improvement as A500 | ~230× | ~1.6 GHz |
| IBM PC XT | Intel 8088 | 4.77 MHz | 8-bit external bus on 16-bit CPU — word ops took 2 bus cycles; FPGA collapses to 1; 3–4 wait states | ~310× | ~1.5 GHz |
| IBM PC AT | Intel 80286 | 6–8 MHz | 16-bit registers and 16-bit bus — no bus-width gain; 1–2 wait states | ~185× | ~1.1–1.5 GHz |
| IBM PC AT 286 | Intel 80286 | 10–12 MHz | Same as above; no cache | ~130× | ~1.3–1.6 GHz |
| Acorn Archimedes A305/A310 | ARM2 | 8 MHz | DRAM wait states; relatively clean bus | ~155× | ~1.2 GHz |
1980s — Consoles and handhelds
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| Atari 5200 | MOS 6502C | 1.79 MHz | DRAM; ANTIC DMA steals | ~440× | ~788 MHz |
| ColecoVision | Zilog Z80 | 3.58 MHz | DRAM wait states | ~225× | ~806 MHz |
| Intellivision | GI CP1610 | 0.894 MHz | 16-bit bus but very slow DRAM | ~500× | ~447 MHz |
| Vectrex | Motorola 6809 | 1.5 MHz | DRAM wait states; 16-bit D/index registers on 8-bit bus — modest additional gain | ~375× | ~563 MHz |
| NES / Famicom | MOS 6502 (RP2A03) | 1.79 MHz | PPU shares bus; DRAM | ~415× | ~743 MHz |
| Master System | Zilog Z80 | 3.58 MHz | DRAM wait states | ~225× | ~806 MHz |
| Game Boy (DMG) | Sharp SM83 | 4.19 MHz | DRAM wait states; PPU steals during fetch | ~235× | ~985 MHz |
| Atari 7800 | MOS 6502C | 1.79 MHz | MARIA DMA steals — severe in graphics-heavy games | ~430× | ~769 MHz |
| TurboGrafx-16 | Hudson HuC6280 | 7.16 MHz | Fast for era; dedicated video bus helps | ~135× | ~967 MHz |
| Mega Drive / Genesis | Motorola 68000 + Z80 | 7.67 MHz + 3.58 MHz | DRAM wait states; VDP contention; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1 | ~195× | ~1.5 GHz + ~698 MHz |
| SNES / Super Famicom | WDC 65816 | 3.58 MHz (2.68 slow) | 8-bit bus on 16-bit CPU — 16-bit ops took 2 bus cycles; PPU DMA; slow ROM | ~310× | ~1.1 GHz |
| Neo Geo AES | Motorola 68000 | 12 MHz | Fast SRAM on cartridge; 16-bit bus means MOVE.L took 2 bus cycles — FPGA collapses to 1 | ~120× | ~1.4 GHz |
1990s — Home computers and workstations
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| Amiga 1200 | Motorola 68020 | 14 MHz | AGA chip RAM is 32-bit and 68020 bus is 32-bit — no bus-width gain; Agnus/Alice bus contention remains | ~105× | ~1.5 GHz |
| Amiga 4000 | Motorola 68030 | 25 MHz | AGA chip RAM is 32-bit; fast RAM 32-bit — no bus-width gain anywhere; contention remains on chip RAM | ~65× | ~1.6 GHz |
| Atari Falcon030 | Motorola 68030 | 16 MHz | 32-bit CPU on 16-bit external bus — MOVE.L took 2 bus cycles; DRAM wait states; DSP on same bus | ~110× | ~1.8 GHz |
| Acorn Archimedes A3000 | ARM2 | 8 MHz | DRAM wait states | ~155× | ~1.2 GHz |
| Acorn RiscPC 600 | ARM610 | 30 MHz | DRAM; reasonable for era | ~53× | ~1.6 GHz |
| IBM PC 386DX | Intel 386 | 16–40 MHz | DRAM 2–4 wait states | ~45–80× | ~720 MHz–3.2 GHz |
| IBM PC 486DX | Intel 486 | 25–66 MHz | On-chip cache helps; misses expensive | ~20–40× | ~500 MHz–2.6 GHz |
| IBM PC 486DX2 | Intel 486 | 66 MHz | Cache-hot: good; cold: expensive | ~20× | ~1.3 GHz |
1990s — Consoles and handhelds
| Machine | CPU | Original clock | Key bottlenecks | Est. speed multiplier | Est. effective speed |
|---|---|---|---|---|---|
| Game Boy Color | Sharp SM83 | 8 MHz (double speed) | DRAM; PPU contention | ~120× | ~960 MHz |
| Game Gear | Zilog Z80 | 3.58 MHz | DRAM wait states | ~225× | ~806 MHz |
| Lynx | MOS 65C02 | 4 MHz | DRAM; Mikey DMA steals | ~248× | ~992 MHz |
| Neo Geo Pocket | Toshiba TLCS-900H | 6.144 MHz | SRAM-based; 16/32-bit registers on 8-bit external bus — 16-bit ops 2 cycles, 32-bit ops 4 cycles | ~110× | ~676 MHz |
| PlayStation | MIPS R3000A | 33.8 MHz | DRAM; scratchpad SRAM limited | ~55× | ~1.9 GHz |
| Saturn | Hitachi SH-2 × 2 | 28.6 MHz each | DRAM; complex bus arbitration between CPUs | ~53× per CPU | ~1.5 GHz per CPU |
| Nintendo 64 | MIPS R4300i | 93.75 MHz | 64-bit registers on 32-bit bus — 64-bit ops took 2 bus cycles (though most N64 code used 32-bit ops); RDRAM high latency | ~27× | ~2.5 GHz |
| Virtual Boy | NEC V810 | 20 MHz | 32-bit registers on 16-bit external bus — 32-bit ops took 2 bus cycles; DRAM wait states | ~95× | ~1.9 GHz |
| Game Boy Advance | ARM7TDMI | 16.78 MHz | 32-bit registers on 16-bit bus to ROM and EWRAM — 32-bit ops took 2 cycles on those; IWRAM was 32-bit | ~90× | ~1.5 GHz |
| WonderSwan | NEC V30MZ | 3.072 MHz | 16-bit registers on 8-bit external bus — word ops took 2 bus cycles; DRAM | ~280× | ~860 MHz |
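
In every table above, the last column is the original clock multiplied by the estimated speed multiplier. The following is a minimal sketch of that arithmetic, using a few rows copied from the tables. The function names `effective_mhz` and `format_speed` are illustrative only and are not part of any Ant64 tooling.

```python
def effective_mhz(original_mhz, multiplier):
    """Estimated effective speed: original clock scaled by the speed multiplier."""
    return original_mhz * multiplier


def format_speed(mhz):
    """Render as whole MHz below 1000 MHz, otherwise as GHz with one decimal."""
    if mhz < 1000:
        return f"~{round(mhz)} MHz"
    return f"~{mhz / 1000:.1f} GHz"


# (machine, original clock in MHz, estimated multiplier), copied from the tables
ROWS = [
    ("NES / Famicom", 1.79, 415),
    ("Neo Geo Pocket", 6.144, 110),
    ("Nintendo 64", 93.75, 27),
    ("WonderSwan", 3.072, 280),
]

for name, clock, mult in ROWS:
    print(f"{name}: {format_speed(effective_mhz(clock, mult))}")
```

The multipliers are estimates, so the effective speeds are best read as an equivalent original-clock figure rather than as the fabric clock itself.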
Multi-System Bitstreams
The GoWin GW5AT-138K's 138K LUT budget is large enough to hold multiple complete chipsets simultaneously in a single bitstream, switchable instantly via AntOS or the boot menu with no bitstream reload. The full LUT budget table, case studies for the Amiga, Atari 16/32-bit, and Sinclair/ZX lineage, and the manifest.json component switching examples are documented in personality.
As a reference: a complete late-era 16-bit chipset (full Amiga 500 ECS, full SNES, full Mega Drive) costs approximately 18–22K LUTs chipset-only. The combined Nintendo NES+SNES bitstream fits in ~28–30K LUTs and runs on both Ant64 and Ant64S. The combined Atari ST + STE + TT + Falcon030 bitstream runs to ~65–80K LUTs; the ZX lineage (ZX80 through Spectrum Next including QL and Jupiter Ace) fits in ~35–45K LUTs — comfortably within the 138K budget alongside the interface block.
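The budget reasoning above can be sketched as a simple headroom check. This uses the upper-bound LUT estimates quoted in the paragraph. The size of the interface block is not stated in this section, so `interface_luts` is a hypothetical parameter the caller must supply, and the sketch counts raw LUTs only (routing and BSRAM usage are not modelled).

```python
FABRIC_LUTS = 138_000  # GW5AT-138K budget quoted in the text

# Upper-bound LUT estimates for the combined bitstreams quoted above
COMBINED_BITSTREAMS = {
    "NES + SNES": 30_000,
    "Atari ST + STE + TT + Falcon030": 80_000,
    "ZX lineage": 45_000,
}


def headroom(bitstream, interface_luts, fabric_luts=FABRIC_LUTS):
    """LUTs left after placing the chipsets and the interface block.

    interface_luts is an assumption: this section does not give the
    size of the interface block.
    """
    return fabric_luts - COMBINED_BITSTREAMS[bitstream] - interface_luts


def fits(bitstream, interface_luts, fabric_luts=FABRIC_LUTS):
    """True if the combined bitstream plus interface block fits the fabric."""
    return headroom(bitstream, interface_luts, fabric_luts) >= 0


# 10_000 is a placeholder interface-block size, not a documented figure.
for name in COMBINED_BITSTREAMS:
    print(name, headroom(name, interface_luts=10_000))
```

Passing `fabric_luts=60_000` applies the same check to the smaller Ant64S fabric.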
Ant64S Compatibility
The Ant64S uses a GoWin 60K FPGA and a fundamentally different memory architecture from the Ant64 and Ant64C. Where the Ant64 uses twin pairs of 36-bit IS61LPS51236B SRAM giving two independent 200 MHz buses, the Ant64S uses 32-bit DDR3 and PSRAM. This difference runs deeper than fabric size: the memory controller HDL, the bus interface, and the memory-mapped layout are all different. Every personality therefore requires a dedicated Ant64S bitstream written against the Ant64S memory interface, not a reduced version of the Ant64 bitstream. Initial Ant64 prototypes also use 32-bit memory with the 138K fabric, so prototype personality bitstreams share this memory architecture with the Ant64S, which makes the Ant64S bitstream a useful starting point for prototype development.
The personality cartridge ships both bitstreams. DeMon detects which model is present at boot and loads the appropriate one automatically.
The Ant64S memory architecture has different characteristics from the SRAM-based Ant64:
- No 36-bit bus width — DDR3 is 32-bit, so the extra 4 bits used for SRAM hardware debug tags are not available. Execute breakpoints and read/write watchpoints on the Ant64S use conventional comparator logic in fabric rather than the zero-cost SRAM tag approach
- No dual independent buses — DDR3 is a single shared bus, so the Harvard-style simultaneous instruction fetch and data access that eliminates contention on the Ant64 is not available in the same form. The Ant64S bitstream handles this through burst prefetch buffering into BSRAM
- Higher initial latency — DDR3 has longer initial access latency than SRAM, partially offset by burst mode and BSRAM prefetch buffering
- Rewind ring buffer from PSRAM — 8 MB of PSRAM gives a shallow rewind buffer compared to the Ant64's 1 GB DDR3 or the Ant64C's ~2 GB
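
The first bullet contrasts two ways of detecting an execute breakpoint. The sketch below is a software model of that contrast, not the actual HDL. The tag bit position, the class names, and the idea of a fixed number of comparator slots are all assumptions made for illustration. The source only states that the Ant64 keeps debug tags in the extra 4 bits of each 36-bit SRAM word and that the Ant64S uses comparator logic in fabric.

```python
DATA_MASK = (1 << 32) - 1
# Hypothetical layout: bit 32 of a 36-bit word marks an execute breakpoint.
# The remaining tag bits could mark read/write watchpoints in the same way.
TAG_EXEC = 1 << 32


class TaggedSram:
    """Ant64-style model: the tag arrives with the fetched word itself."""

    def __init__(self, words):
        self.mem = [0] * words

    def set_tag(self, addr, tag):
        self.mem[addr] |= tag

    def write_data(self, addr, value):
        tags = self.mem[addr] & ~DATA_MASK      # preserve the tag bits
        self.mem[addr] = tags | (value & DATA_MASK)

    def fetch(self, addr):
        word = self.mem[addr]
        return word & DATA_MASK, bool(word & TAG_EXEC)


class ComparatorBank:
    """Ant64S-style model: each breakpoint occupies an address comparator."""

    def __init__(self, slots):
        self.slots = slots        # hypothetical limit; comparators cost fabric
        self.addrs = set()

    def add(self, addr):
        if addr not in self.addrs and len(self.addrs) >= self.slots:
            raise RuntimeError("no free comparator")
        self.addrs.add(addr)

    def hit(self, addr):
        return addr in self.addrs


sram = TaggedSram(16)
sram.write_data(3, 0xDEADBEEF)
sram.set_tag(3, TAG_EXEC)
print(sram.fetch(3))   # data plus breakpoint flag from a single read
```

In the tagged model any number of words can carry a breakpoint at no extra cost, whereas the comparator model is limited by however many comparators the bitstream instantiates.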
Most 8-bit and simple 16-bit chipsets fit within the 60K fabric alongside the Ant64S interface block. There is no software fallback — if a chipset cannot be made to fit, that personality is not available on the Ant64S.
What may differ between Ant64 and Ant64S bitstreams for the same personality:
| Feature | Ant64 / Ant64C | Ant64S |
|---|---|---|
| Memory architecture | Twin 36-bit SRAM buses | 32-bit DDR3 + PSRAM |
| Instruction / data bus | Independent — zero contention | Shared DDR3 — managed via burst prefetch |
| Hardware debug tags | Zero-cost SRAM bit tags | Fabric comparator logic |
| Rewind depth | Up to ~29 min (Ant64C) / ~15 min (Ant64) | ~7 seconds (8 MB PSRAM) |
| Simultaneous systems in one bitstream | Up to ~6 systems | 1–2 systems |
| Enhanced display features | Full FireStorm integration | May be reduced |
| Audio post-processing | Full DSP chain | May be reduced |
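
The rewind depths in the table are consistent with depth scaling linearly with ring buffer size. The following worked check uses the Ant64C figure (~2 GB, ~29 min for an 8-bit system) as the reference point. Linear scaling and a constant delta rate are assumptions of this sketch. Real depth depends on how much state the running software changes.

```python
REFERENCE_BYTES = 2 * 1024**3   # ~2 GB ring buffer on the Ant64C
REFERENCE_SECONDS = 29 * 60     # ~29 min quoted for an 8-bit system


def rewind_seconds(buffer_bytes):
    """Estimated rewind depth, assuming depth is proportional to buffer size."""
    return REFERENCE_SECONDS * buffer_bytes / REFERENCE_BYTES


ant64_minutes = rewind_seconds(1 * 1024**3) / 60   # Ant64, 1 GB DDR3
ant64s_seconds = rewind_seconds(8 * 1024**2)       # Ant64S, 8 MB PSRAM

print(f"Ant64:  {ant64_minutes:.1f} min")
print(f"Ant64S: {ant64s_seconds:.1f} s")
```

This gives 14.5 min for the Ant64 and 6.8 s for the Ant64S, matching the ~15 min and ~7 seconds in the table.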
Summary
A personality cartridge for the Ant64 recreates a historical computer or console as an optimised FPGA softcore: the original instruction set implemented with modern microarchitecture, zero-wait-state SRAM replacing DRAM that was 40–60× slower, all custom chips running in parallel with no shared bus contention, and optional cycle-accurate throttling for software that depends on exact original timing.
The rewind capture block records XOR deltas of CPU register changes, RAM writes, and chipset register writes into a bitmask-compressed ring buffer in FireStorm DDR3 — up to ~2 GB on the Ant64C, giving up to 29 minutes of full rewind depth for an 8-bit system and over a minute for a 68000. Since the entire chipset is synthesised logic, every register is readable and reversible regardless of whether the original hardware exposed it — display state, audio state, DMA pointers, sprite tables, all of it. Because XOR is self-inverse, the same stored entry applies equally for undo and redo. Combined with the programmable throttle and DeMon's dedicated jog dial, the complete internal state of the machine is navigable physically with one hand on the dial.
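The self-inverse property of XOR described above can be shown with a toy model. This is a minimal sketch only: it logs uncompressed deltas in a plain list rather than a bitmask-compressed ring buffer, and the class name `RewindLog`, its methods, and the choice to discard the redo tail on a new write are assumptions made for illustration.

```python
class RewindLog:
    """Toy XOR-delta log over a flat list of register/RAM values."""

    def __init__(self, state):
        self.state = state
        self.log = []      # entries of (index, old_value ^ new_value)
        self.cursor = 0    # number of log entries currently applied

    def write(self, index, value):
        del self.log[self.cursor:]   # assumption: a new write drops the redo tail
        self.log.append((index, self.state[index] ^ value))
        self.state[index] = value
        self.cursor += 1

    def _apply(self, entry):
        index, delta = entry
        self.state[index] ^= delta   # identical operation for undo and redo

    def step_back(self):
        if self.cursor == 0:
            return False
        self.cursor -= 1
        self._apply(self.log[self.cursor])
        return True

    def step_forward(self):
        if self.cursor == len(self.log):
            return False
        self._apply(self.log[self.cursor])
        self.cursor += 1
        return True


state = [0x00, 0xFF]
log = RewindLog(state)
log.write(0, 0x12)
log.write(1, 0x0F)
log.write(0, 0x34)
log.step_back()        # undoes the last write: state[0] returns to 0x12
log.step_forward()     # the same stored delta redoes it: state[0] is 0x34
```

Because each entry stores only `old ^ new`, neither the old nor the new value has to be kept separately. Applying the entry to the current state yields whichever of the two it is not.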
The SG2000 small core runs AntOS alongside the personality at all times, providing networking, storage, and system services. The SG2000 big core runs an optional debug and development application, with the full 1 GHz C906 dedicated to it, providing full read/write access to the softcore's registers, system memory, and custom chipset registers via the QSPI FRAM interface. The result is hardware debugging at full hardware speed, via the same memory-mapped mechanism used for every other Ant64 subsystem.
The Ant64 Personality Interface Block handles all Ant64-specific interfacing. The personality developer implements the chipset and connects it to known ports. Everything else — HDMI output, audio output, FRAM windowing, debug register bank, SRAM debug tags, rewind capture — is provided.
For the Ant64S, personalities ship a dedicated bitstream written against the Ant64S's 32-bit DDR3 + PSRAM memory architecture — a different memory interface from the Ant64's twin 36-bit SRAM buses, requiring its own HDL rather than a reduced version of the Ant64 bitstream. If the chipset fits in the 60K fabric, it runs in hardware. If it doesn't fit, that personality is not available on the Ant64S — there is no software fallback.