Building a Compiler from Scratch
TypeScript · Node.js · Recursive Descent Parser · AST · Bytecode Generation · 2025
Overview
A complete compiler, built from scratch in TypeScript, that compiles a proprietary hardware scripting language to bytecode assembly and raw binary. Its five-stage pipeline (scanner, parser, semantic analyzer, compiler, code generator) achieves exact opcode parity with the official compiler: all 61 opcodes and every language feature.
The Challenge
The target language had no public specification, no open-source compiler, and no tooling ecosystem. The only reference was a closed-source compiler that produced binary output. Building a compatible compiler required reverse-engineering the binary format, deducing grammar rules from examples, and implementing every language feature from scratch.
Architecture
Source Code (.gpc)
|
[Scanner] -- Tokenization (keywords, literals, operators)
|
[Parser] -- Recursive descent, AST construction
|
[Analyzer] -- Semantic analysis, scope resolution, type checking
|
[Compiler] -- AST to intermediate representation
|
[Generator] -- IR to bytecode assembly + raw binary
|
Output (.bin)
Key Technical Decisions
Recursive Descent over Parser Generators — Chose hand-written recursive descent for full control over error messages and recovery. The language has unusual constructs (combo blocks, hardware-specific keywords) that would fight a generated parser.
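As a rough illustration of why a hand-written parser gives control over messages and recovery, here is a minimal sketch of recursive descent with panic-mode recovery. The grammar (numbers, `+`/`-`, parentheses, `;` terminators) and the names `Token`, `Expr`, `ParseError`, and `Parser` are assumptions made for this example; they are not the project's actual grammar or types.

```typescript
// Hypothetical token and AST shapes; the real compiler's types are not shown here.
type Token = { kind: "num" | "op" | "punct" | "eof"; text: string; line: number };
type Expr =
  | { type: "Num"; value: number }
  | { type: "Binary"; op: string; left: Expr; right: Expr };

class ParseError extends Error {
  readonly line: number;
  constructor(message: string, line: number) {
    super(message);
    this.name = "ParseError";
    this.line = line;
  }
}

// Checks the name tag so the guard also works if classes are downleveled to ES5.
function isParseError(e: unknown): e is ParseError {
  return e instanceof Error && e.name === "ParseError";
}

class Parser {
  readonly errors: ParseError[] = [];
  private readonly tokens: Token[];
  private pos = 0;

  // `tokens` must end with an "eof" token.
  constructor(tokens: Token[]) {
    this.tokens = tokens;
  }

  // program := (expr ";")*
  parseProgram(): Expr[] {
    const statements: Expr[] = [];
    while (this.peek().kind !== "eof") {
      try {
        const expr = this.parseExpr();
        this.expect(";");
        statements.push(expr);
      } catch (e) {
        if (!isParseError(e)) throw e;
        this.errors.push(e); // record the error, then keep parsing
        this.synchronize();
      }
    }
    return statements;
  }

  // expr := primary (("+" | "-") primary)*
  private parseExpr(): Expr {
    let left = this.parsePrimary();
    while (this.peek().kind === "op" && (this.peek().text === "+" || this.peek().text === "-")) {
      const op = this.advance().text;
      left = { type: "Binary", op, left, right: this.parsePrimary() };
    }
    return left;
  }

  // primary := NUMBER | "(" expr ")"
  private parsePrimary(): Expr {
    const tok = this.peek();
    if (tok.kind === "num") {
      this.advance();
      return { type: "Num", value: Number(tok.text) };
    }
    if (tok.kind === "punct" && tok.text === "(") {
      this.advance();
      const inner = this.parseExpr();
      this.expect(")");
      return inner;
    }
    throw new ParseError(`Expected an expression but found '${tok.text || "end of input"}'`, tok.line);
  }

  // Panic-mode recovery: discard tokens up to and including the next ";".
  private synchronize(): void {
    while (this.peek().kind !== "eof") {
      const tok = this.advance();
      if (tok.kind === "punct" && tok.text === ";") return;
    }
  }

  private expect(text: string): void {
    const tok = this.peek();
    if (tok.kind !== "punct" || tok.text !== text) {
      throw new ParseError(`Expected '${text}' but found '${tok.text || "end of input"}'`, tok.line);
    }
    this.advance();
  }

  private peek(): Token {
    return this.tokens[this.pos];
  }

  // Returns the current token and moves forward, but never past "eof".
  private advance(): Token {
    const tok = this.peek();
    if (tok.kind !== "eof") this.pos++;
    return tok;
  }
}
```

Because each grammar rule is an ordinary method, the error text and the recovery point are chosen in plain code at the exact place the problem is detected. A malformed statement produces one entry in `errors`, and parsing resumes at the next statement instead of stopping at the first failure.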
Two-Pass Semantic Analysis — First pass collects all declarations (functions, defines, data sections). Second pass resolves references, validates types, and checks constraints. This allows forward references without requiring declaration order.
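The two passes can be sketched as follows. This is a simplified model under stated assumptions: it covers only function declarations and calls, and the types `FunctionDecl`, `Call`, and `Diagnostic` plus the function `analyze` are hypothetical names for illustration, not the project's real analyzer.

```typescript
// Hypothetical, simplified program model: functions and the calls they contain.
type Call = { callee: string; argCount: number; line: number };
type FunctionDecl = { name: string; params: string[]; calls: Call[]; line: number };
type Diagnostic = { message: string; line: number };

function analyze(decls: FunctionDecl[]): Diagnostic[] {
  const diagnostics: Diagnostic[] = [];
  const symbols = new Map<string, FunctionDecl>();

  // Pass 1: collect every declaration before looking at any function body.
  for (const decl of decls) {
    const earlier = symbols.get(decl.name);
    if (earlier) {
      diagnostics.push({
        message: `Function '${decl.name}' is already declared on line ${earlier.line}`,
        line: decl.line,
      });
    } else {
      symbols.set(decl.name, decl);
    }
  }

  // Pass 2: resolve references against the complete symbol table.
  for (const decl of decls) {
    for (const call of decl.calls) {
      const target = symbols.get(call.callee);
      if (!target) {
        diagnostics.push({ message: `Call to undefined function '${call.callee}'`, line: call.line });
      } else if (target.params.length !== call.argCount) {
        diagnostics.push({
          message: `'${call.callee}' expects ${target.params.length} argument(s) but got ${call.argCount}`,
          line: call.line,
        });
      }
    }
  }
  return diagnostics;
}
```

Since the symbol table is complete before any call is checked, a function may call another that appears later in the file, and source order never affects the result.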
Binary Format Reverse Engineering — Documented the complete binary format: header structure, opcode encoding, operand formats, string table layout, data section alignment. Created a verification pipeline that compares output byte-for-byte against the reference compiler.
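The byte-for-byte comparison at the core of such a pipeline might look like the sketch below. It makes no assumptions about the binary layout itself, which this write-up does not specify; `firstMismatch` and `describeMismatch` are hypothetical helper names.

```typescript
// `undefined` marks the side that ran out of bytes first.
type Mismatch = { offset: number; expected: number | undefined; actual: number | undefined };

// Returns the first differing offset, or null when the two outputs are identical.
function firstMismatch(expected: Uint8Array, actual: Uint8Array): Mismatch | null {
  const shared = Math.min(expected.length, actual.length);
  for (let i = 0; i < shared; i++) {
    if (expected[i] !== actual[i]) {
      return { offset: i, expected: expected[i], actual: actual[i] };
    }
  }
  if (expected.length !== actual.length) {
    // One output is a strict prefix of the other; report where the shorter one ends.
    return { offset: shared, expected: expected[shared], actual: actual[shared] };
  }
  return null;
}

function describeMismatch(m: Mismatch): string {
  const hex = (b: number | undefined): string =>
    b === undefined ? "<end of file>" : "0x" + ("0" + b.toString(16)).slice(-2);
  return `offset 0x${m.offset.toString(16)}: expected ${hex(m.expected)}, got ${hex(m.actual)}`;
}
```

Reporting the offset in hex makes it easy to look the byte up in the documented format (header, opcode stream, string table, or data section) and see which stage produced it.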
Compiler Statistics
| Metric | Value |
|---|---|
| Opcodes implemented | 61 (full parity) |
| Language features | All (functions, combos, data sections, defines, remaps) |
| Scanner tokens | 45+ token types |
| AST node types | 30+ |
| Error codes | 41 with human-readable messages |
| Test coverage | Integration tests against disassembler output |
Verification Strategy
Every compiled binary is verified against the reference compiler's output using a custom disassembler. The test suite compiles real-world scripts (some longer than 6,000 lines) and compares the two disassemblies instruction by instruction. Any divergence fails the build.
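A harness for this kind of check could be structured roughly as below. The disassembler is passed in as a function because the project's real one is not shown; `toyDisassemble` is a stand-in that treats every byte as one instruction, and all names here are hypothetical.

```typescript
type Disassemble = (binary: Uint8Array) => string[];
type Case = { name: string; ours: Uint8Array; reference: Uint8Array };
type Divergence = { name: string; index: number; ours: string | undefined; reference: string | undefined };

// Compares both disassemblies per script and reports the first differing instruction.
function findDivergences(cases: Case[], disassemble: Disassemble): Divergence[] {
  const found: Divergence[] = [];
  for (const c of cases) {
    const ours = disassemble(c.ours);
    const reference = disassemble(c.reference);
    const count = Math.max(ours.length, reference.length);
    for (let i = 0; i < count; i++) {
      if (ours[i] !== reference[i]) {
        found.push({ name: c.name, index: i, ours: ours[i], reference: reference[i] });
        break; // later differences are usually knock-on effects of the first
      }
    }
  }
  return found;
}

// Stand-in disassembler for illustration only: one byte equals one instruction.
function toyDisassemble(binary: Uint8Array): string[] {
  const lines: string[] = [];
  for (let i = 0; i < binary.length; i++) lines.push(`op_${binary[i]}`);
  return lines;
}
```

A build script would fail whenever `findDivergences` returns a non-empty list. Comparing at the instruction level rather than the byte level means a report names the script and the instruction index, which is usually easier to trace back to a source construct than a raw offset.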
Key Learnings
- Reverse engineering is systematic, not guesswork — document every byte offset, build verification tooling early
- Error messages are a product feature — users see compiler errors more than they see working code
- The scanner is the simplest stage but has the most edge cases (string escaping, numeric formats, comment nesting)
- Forward references make users happy but make the compiler author's life harder — worth the trade
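The scanner edge cases mentioned in the list above (string escaping, numeric formats, comment nesting) can be illustrated with a small sketch. The lexical rules used here are assumptions chosen for the example: nestable C-style block comments, decimal and `0x` hex literals, and four backslash escapes. The target language's actual rules are not documented in this write-up.

```typescript
type Tok = { kind: "number"; value: number } | { kind: "string"; value: string };

// Skips a block comment that may contain nested comments; returns the offset after it.
function skipBlockComment(src: string, start: number): number {
  let depth = 0;
  let i = start;
  while (i < src.length) {
    if (src[i] === "/" && src[i + 1] === "*") {
      depth++;
      i += 2;
    } else if (src[i] === "*" && src[i + 1] === "/") {
      depth--;
      i += 2;
      if (depth === 0) return i;
    } else {
      i++;
    }
  }
  throw new Error(`Unterminated block comment starting at offset ${start}`);
}

// Scans a double-quoted string, decoding the assumed escapes \n, \t, \\ and \".
function scanString(src: string, start: number): { value: string; end: number } {
  const escapes: Record<string, string> = { n: "\n", t: "\t", "\\": "\\", '"': '"' };
  let value = "";
  let i = start + 1;
  while (i < src.length) {
    const c = src[i];
    if (c === '"') return { value, end: i + 1 };
    if (c === "\\") {
      const decoded = escapes[src[i + 1]];
      if (decoded === undefined) throw new Error(`Unknown escape sequence at offset ${i}`);
      value += decoded;
      i += 2;
    } else {
      value += c;
      i++;
    }
  }
  throw new Error(`Unterminated string starting at offset ${start}`);
}

// Scans a decimal literal or a hex literal with a 0x/0X prefix.
function scanNumber(src: string, start: number): { value: number; end: number } {
  const isHex = src[start] === "0" && (src[start + 1] === "x" || src[start + 1] === "X");
  const digit = isHex ? /[0-9a-fA-F]/ : /[0-9]/;
  const first = isHex ? start + 2 : start;
  let i = first;
  while (i < src.length && digit.test(src[i])) i++;
  if (i === first) throw new Error(`Hex literal without digits at offset ${start}`);
  return { value: parseInt(src.slice(first, i), isHex ? 16 : 10), end: i };
}

function scan(src: string): Tok[] {
  const out: Tok[] = [];
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (c === " " || c === "\t" || c === "\n" || c === "\r") {
      i++;
    } else if (c === "/" && src[i + 1] === "*") {
      i = skipBlockComment(src, i);
    } else if (c === '"') {
      const s = scanString(src, i);
      out.push({ kind: "string", value: s.value });
      i = s.end;
    } else if (c >= "0" && c <= "9") {
      const n = scanNumber(src, i);
      out.push({ kind: "number", value: n.value });
      i = n.end;
    } else {
      throw new Error(`Unexpected character '${c}' at offset ${i}`);
    }
  }
  return out;
}
```

Each helper is short, yet each hides a failure mode that is easy to miss: a comment opener with no closer, a backslash at the very end of input, or a `0x` prefix followed by no digits. These are the cases that tend to surface only when real scripts are fed through the scanner.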