JCCIDC
AI Systems Architect · Automation Strategist

Building a Compiler from Scratch

TypeScript · Node.js · Recursive Descent Parser · AST · Bytecode Generation · 2025

Overview

A complete compiler, built from scratch in TypeScript, that compiles a proprietary hardware scripting language to bytecode assembly and raw binary. Its five-stage pipeline (scanner, parser, semantic analyzer, compiler, code generator) achieves exact opcode parity with the official compiler: all 61 opcodes and every language feature.

The Challenge

The target language had no public specification, no open-source compiler, and no tooling ecosystem. The only reference was a closed-source compiler that produced binary output. Building a compatible compiler required reverse-engineering the binary format, deducing grammar rules from examples, and implementing every language feature from scratch.

Architecture

Source Code (.gpc)
      |
  [Scanner]   -- Tokenization (keywords, literals, operators)
      |
  [Parser]    -- Recursive descent, AST construction
      |
  [Analyzer]  -- Semantic analysis, scope resolution, type checking
      |
  [Compiler]  -- AST to intermediate representation
      |
  [Generator] -- IR to bytecode assembly + raw binary
      |
Output (.bin)
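
As a rough illustration of how the five stages above could be wired together, here is a minimal sketch. Every name in it (Token, Ast, Ir, Stages, runPipeline) is hypothetical and chosen for this example; the project's real types are not shown in this case study.

```typescript
// Hypothetical placeholder types; the real compiler's data structures are richer.
type Token = { kind: string; text: string };
type Ast = { type: string; children: Ast[] };
type Ir = string[]; // placeholder IR: one pseudo-instruction per entry

// One function per stage, mirroring the diagram above.
interface Stages {
  scan(source: string): Token[];
  parse(tokens: Token[]): Ast;
  analyze(ast: Ast): Ast; // returns the AST once it has been validated
  compile(ast: Ast): Ir;
  generate(ir: Ir): Uint8Array;
}

// Runs the stages strictly in order; each stage only sees the previous stage's output.
function runPipeline(stages: Stages, source: string): Uint8Array {
  const tokens = stages.scan(source);
  const ast = stages.analyze(stages.parse(tokens));
  return stages.generate(stages.compile(ast));
}
```

Keeping each stage behind a narrow function signature like this makes it possible to test a stage in isolation with stubbed neighbours.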

Key Technical Decisions

Recursive Descent over Parser Generators — Chose hand-written recursive descent for full control over error messages and recovery. The language has unusual constructs (combo blocks, hardware-specific keywords) that would fight a generated parser.
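
To make the trade-off concrete, here is a small recursive descent parser for a toy arithmetic grammar. The grammar, the Expr type, and parseExpression are assumptions made for this sketch and are unrelated to the actual target language. It shows only the error-reporting side (position-aware messages), not recovery.

```typescript
// Hypothetical toy grammar, not the project's language:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
type Expr =
  | { kind: "num"; value: number }
  | { kind: "bin"; op: string; left: Expr; right: Expr };

function parseExpression(src: string): Expr {
  let pos = 0;

  // Skips spaces, then returns the next character ("" at end of input).
  const peek = (): string => {
    while (src.charAt(pos) === " ") pos++;
    return src.charAt(pos);
  };
  const error = (message: string): SyntaxError =>
    new SyntaxError(`${message} at offset ${pos}`);

  // One function per grammar rule: the call structure mirrors the grammar.
  function expr(): Expr {
    let left = term();
    for (let op = peek(); op === "+" || op === "-"; op = peek()) {
      pos++;
      left = { kind: "bin", op, left, right: term() };
    }
    return left;
  }

  function term(): Expr {
    let left = factor();
    for (let op = peek(); op === "*" || op === "/"; op = peek()) {
      pos++;
      left = { kind: "bin", op, left, right: factor() };
    }
    return left;
  }

  function factor(): Expr {
    if (peek() === "(") {
      pos++;
      const inner = expr();
      if (peek() !== ")") throw error("expected ')'");
      pos++;
      return inner;
    }
    const start = pos;
    while (/[0-9]/.test(src.charAt(pos))) pos++;
    if (start === pos) throw error("expected a number or '('");
    return { kind: "num", value: Number(src.slice(start, pos)) };
  }

  const result = expr();
  if (peek() !== "") throw error(`unexpected '${peek()}'`);
  return result;
}
```

Because every rule is an ordinary function, each failure site can raise a message written for that exact context, which is the kind of control a generated parser tends to make harder.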

Two-Pass Semantic Analysis — First pass collects all declarations (functions, defines, data sections). Second pass resolves references, validates types, and checks constraints. This allows forward references without requiring declaration order.
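
The two-pass idea can be sketched in a few lines. The FunctionDecl shape and the analyze function are hypothetical simplifications for this example; the real analyzer also handles defines, data sections, and type checks.

```typescript
// Hypothetical minimal program shape: a list of function declarations,
// each of which calls other functions by name.
interface FunctionDecl {
  name: string;
  calls: string[];
}

function analyze(program: FunctionDecl[]): string[] {
  const errors: string[] = [];
  const declared = new Set<string>();

  // Pass 1: collect every declaration before looking at any body.
  for (const fn of program) {
    if (declared.has(fn.name)) errors.push(`duplicate function '${fn.name}'`);
    declared.add(fn.name);
  }

  // Pass 2: resolve references against the now-complete symbol table,
  // so a call to a function declared later in the file still resolves.
  for (const fn of program) {
    for (const callee of fn.calls) {
      if (!declared.has(callee)) {
        errors.push(`'${fn.name}' calls undefined function '${callee}'`);
      }
    }
  }
  return errors;
}
```

In this sketch a function may call another that is declared after it, and only genuinely missing names are reported.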

Binary Format Reverse Engineering — Documented the complete binary format: header structure, opcode encoding, operand formats, string table layout, data section alignment. Created a verification pipeline that compares output byte-for-byte against the reference compiler.
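
A byte-for-byte comparison of this kind can be sketched as follows. The function names are hypothetical, and the sketch assumes both binaries are already loaded into memory as Uint8Array values; it says nothing about the actual binary format.

```typescript
// Returns the offset of the first differing byte, or -1 if the inputs are identical.
// If one input is a strict prefix of the other, the mismatch is at the shorter length.
function firstMismatch(actual: Uint8Array, expected: Uint8Array): number {
  const shared = Math.min(actual.length, expected.length);
  for (let i = 0; i < shared; i++) {
    if (actual[i] !== expected[i]) return i;
  }
  return actual.length === expected.length ? -1 : shared;
}

// Produces a one-line report suitable for a failing test message.
function describeMismatch(actual: Uint8Array, expected: Uint8Array): string {
  const offset = firstMismatch(actual, expected);
  if (offset === -1) return "identical";
  const hex = (b: number | undefined): string =>
    b === undefined ? "<end>" : "0x" + b.toString(16).padStart(2, "0");
  return `first difference at offset ${offset}: got ${hex(actual[offset])}, expected ${hex(expected[offset])}`;
}
```

Reporting the first differing offset, rather than a plain pass or fail, points directly at the part of the documented format that needs another look.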

Compiler Statistics

Metric               Value
Opcodes implemented  61 (full parity)
Language features    All (functions, combos, data sections, defines, remaps)
Scanner tokens       45+ token types
AST node types       30+
Error codes          41 with human-readable messages
Test coverage        Integration tests against disassembler output

Verification Strategy

Every compiled binary is verified against the reference compiler's output using a custom disassembler. The test suite compiles real-world scripts (some longer than 6,000 lines) and compares the disassembly instruction by instruction. Any divergence fails the build.
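
The sketch below shows the general shape of such a check. The three-entry opcode table and its encoding (one opcode byte followed by a fixed number of operand bytes) are invented for illustration only and do not describe the real instruction set.

```typescript
// Hypothetical opcode table: mnemonic and operand byte count per opcode.
const OPCODES = new Map<number, { name: string; operands: number }>([
  [0x01, { name: "PUSH", operands: 1 }],
  [0x02, { name: "ADD", operands: 0 }],
  [0x03, { name: "RET", operands: 0 }],
]);

// Decodes a byte stream into one text line per instruction.
function disassemble(code: Uint8Array): string[] {
  const lines: string[] = [];
  let pc = 0;
  while (pc < code.length) {
    const opcode = code[pc];
    const info = OPCODES.get(opcode);
    if (info === undefined) {
      lines.push(`DB ${opcode}`); // unknown byte: emit it as raw data
      pc += 1;
      continue;
    }
    const operands = Array.from(code.subarray(pc + 1, pc + 1 + info.operands));
    lines.push([info.name, ...operands.map((b) => String(b))].join(" "));
    pc += 1 + info.operands;
  }
  return lines;
}

// Returns a description of the first differing instruction, or null if none differ.
function firstDivergence(actual: string[], expected: string[]): string | null {
  const count = Math.max(actual.length, expected.length);
  for (let i = 0; i < count; i++) {
    if (actual[i] !== expected[i]) {
      return `instruction ${i}: got ${actual[i] ?? "<none>"}, expected ${expected[i] ?? "<none>"}`;
    }
  }
  return null;
}
```

Comparing at the instruction level gives a failure message in terms of mnemonics and operands, which is usually easier to act on than a raw byte offset.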

Key Learnings

  • Reverse engineering is systematic, not guesswork — document every byte offset, build verification tooling early
  • Error messages are a product feature — users see compiler errors more than they see working code
  • The scanner is the simplest stage but has the most edge cases (string escaping, numeric formats, comment nesting)
  • Forward references make users happy but make the compiler author's life harder — worth the trade
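
As one example of the scanner edge cases mentioned above, here is a minimal sketch of scanning a string literal with escapes. The escape set and the scanString helper are assumptions for illustration; the language's actual lexical rules are not documented here.

```typescript
// Assumed escape set; the real language's rules may differ.
const ESCAPES = new Map<string, string>([
  ["n", "\n"],
  ["t", "\t"],
  ['"', '"'],
  ["\\", "\\"],
]);

// Scans a double-quoted string literal whose opening quote is at src[start].
// Returns the decoded value and the offset just past the closing quote.
function scanString(src: string, start: number): { value: string; end: number } {
  let value = "";
  let i = start + 1;
  while (i < src.length) {
    const ch = src.charAt(i);
    if (ch === '"') return { value, end: i + 1 };
    if (ch === "\\") {
      const next = src.charAt(i + 1);
      const decoded = ESCAPES.get(next);
      if (decoded === undefined) {
        throw new SyntaxError(`unknown escape '\\${next}' at offset ${i}`);
      }
      value += decoded;
      i += 2; // consume the backslash and the escaped character
    } else {
      value += ch;
      i += 1;
    }
  }
  throw new SyntaxError(`unterminated string starting at offset ${start}`);
}
```

Even this small routine has three distinct failure paths (escaped quote, unknown escape, missing closing quote), which hints at why the simplest stage collects the most edge cases.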