Code Generation
verum_codegen translates VBC bytecode into native machine code via
two backends: LLVM for ahead-of-time CPU compilation, and
MLIR for the GPU path (@device(gpu) functions, tensor
kernels). Neither is a JIT; Verum has exactly two execution
modes: the VBC interpreter and the AOT-lowered native binary.
Everything verum_codegen produces is compiled ahead of time.
LLVM backend
Located in verum_codegen::llvm. Huge — instruction.rs alone is
1.1 M LOC.
VBC → LLVM IR
vbc_lowering.rs walks VBC functions and emits LLVM IR:
- Values become SSA registers.
- Control flow maps to LLVM basic blocks plus br (conditional and unconditional).
- Function calls map to call / invoke (the latter for exception-propagating contexts).
- CBGR checks lower to a load-compare-branch sequence, which LLVM's passes then optimise.
- Cubical opcodes lower to identity / no-ops (proof erasure).
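The load-compare-branch shape of a CBGR check can be sketched as a small emitter. This is a minimal illustration, not the real `vbc_lowering.rs` output: the register names, block labels, and the assumption that the generation word sits at the start of the object header are all illustrative.

```rust
// Sketch: lower a CBGR generation check to a textual LLVM IR
// load-compare-branch sequence. Register/label names and the header
// layout are illustrative assumptions.
fn lower_cbgr_check(obj: &str, expected: &str, ok: &str, fail: &str) -> String {
    let mut ir = String::new();
    // 1. load the generation word from the object's CBGR header
    ir.push_str(&format!("  %gen = load i64, ptr {obj}\n"));
    // 2. compare against the expected generation
    ir.push_str(&format!("  %fresh = icmp eq i64 %gen, {expected}\n"));
    // 3. branch: fast path on match, stale-pointer handler otherwise
    ir.push_str(&format!("  br i1 %fresh, label %{ok}, label %{fail}\n"));
    ir
}

fn main() {
    let ir = lower_cbgr_check("%obj", "%expected", "deref.ok", "deref.stale");
    assert!(ir.contains("icmp eq i64"));
    print!("{ir}");
}
```

Because the sequence is plain loads and branches, LLVM's later passes can hoist, combine, or eliminate redundant checks like any other code.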
Runtime support
runtime.rs generates:
- CBGR helper functions (fast path in assembly; slow path in C).
- Alloc / dealloc entry points (calls into verum_toolchain::mem).
- Context stack primitives (push, pop, lookup).
- Panic and unwind machinery.
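The context stack primitives behave like a LIFO of scoped binding frames. The sketch below shows the push/pop/lookup semantics in plain Rust; the frame representation and string keys are assumptions (the real primitives are emitted as native code, not a library type).

```rust
use std::collections::HashMap;

// Sketch of the context stack that runtime.rs generates push / pop /
// lookup entry points for. Frame layout and key type are assumptions.
struct ContextStack {
    frames: Vec<HashMap<String, u64>>,
}

impl ContextStack {
    fn new() -> Self { Self { frames: Vec::new() } }
    fn push(&mut self) { self.frames.push(HashMap::new()); }
    fn pop(&mut self) { self.frames.pop(); }
    fn bind(&mut self, key: &str, val: u64) {
        if let Some(top) = self.frames.last_mut() {
            top.insert(key.to_string(), val);
        }
    }
    // Lookup walks from the innermost frame outward, so inner
    // bindings shadow outer ones.
    fn lookup(&self, key: &str) -> Option<u64> {
        self.frames.iter().rev().find_map(|f| f.get(key).copied())
    }
}

fn main() {
    let mut cx = ContextStack::new();
    cx.push();
    cx.bind("allocator", 1);
    cx.push();
    cx.bind("allocator", 2);
    assert_eq!(cx.lookup("allocator"), Some(2)); // inner shadows outer
    cx.pop();
    assert_eq!(cx.lookup("allocator"), Some(1)); // outer binding restored
}
```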
FFI trampolines
ffi.rs emits trampolines for extern "C" functions:
- Argument marshalling (Verum types ↔ C types).
- Return value handling.
- Exception conversion (C errno / return codes → Verum Result).
- CBGR guarding of outgoing pointers (wrapped as &unsafe T).
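The errno-style conversion a trampoline performs can be sketched as follows. The negative-return-means-errno convention and the error type are illustrative assumptions; the real trampoline does this in generated code at the FFI boundary.

```rust
// Sketch of C-return-code → Result conversion in an FFI trampoline.
// The error enum and the "negative return = -errno" convention are
// assumptions for illustration.
#[derive(Debug, PartialEq)]
enum FfiError {
    Errno(i32),
}

fn from_c_ret(ret: i32) -> Result<i32, FfiError> {
    if ret < 0 {
        Err(FfiError::Errno(-ret)) // surface the C error as a Verum-style Err
    } else {
        Ok(ret) // non-negative returns pass through unchanged
    }
}

fn main() {
    assert_eq!(from_c_ret(42), Ok(42));
    assert_eq!(from_c_ret(-2), Err(FfiError::Errno(2)));
}
```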
Tensor and SIMD
tensor_ir.rs (134 K LOC) and simd.rs (21 K LOC) map tensor and
SIMD opcodes to platform intrinsics — SSE/AVX/AVX-512 on x86_64, NEON
on aarch64, scalar fallbacks on other targets.
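The per-target selection can be pictured as a match over (architecture, vector width). The intrinsic names on the right are real SSE/AVX/NEON intrinsics; the selection function itself is an illustrative assumption about how simd.rs chooses.

```rust
// Sketch: choose a lowering for a f32 vector add by target, mirroring
// the SSE/AVX/AVX-512 → NEON → scalar fallback described above.
// The function shape is an assumption; the intrinsic names are real.
fn vadd_f32_lowering(arch: &str, vector_bits: u32) -> &'static str {
    match (arch, vector_bits) {
        ("x86_64", 512) => "avx512: _mm512_add_ps",
        ("x86_64", 256) => "avx: _mm256_add_ps",
        ("x86_64", _) => "sse: _mm_add_ps",
        ("aarch64", _) => "neon: vaddq_f32",
        _ => "scalar loop", // fallback on targets with no SIMD mapping
    }
}

fn main() {
    assert_eq!(vadd_f32_lowering("x86_64", 256), "avx: _mm256_add_ps");
    assert_eq!(vadd_f32_lowering("aarch64", 128), "neon: vaddq_f32");
    assert_eq!(vadd_f32_lowering("riscv64", 128), "scalar loop");
}
```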
GPU (Metal)
metal_ir.rs emits Metal shading-language IR for @gpu.kernel
functions.
Inline assembly
asm.rs supports @llvm_only unsafe asm!(...) for kernels and
drivers. Bitfield layout is handled by bitfield.rs for
memory-mapped I/O.
MLIR backend
Located in verum_codegen::mlir. Used for autodiff lowering and the
GPU path (@device(gpu) functions and tensor kernels). The LLVM
backend handles CPU AOT; MLIR owns the GPU target family.
MLIR is invoked only at AOT compile time. There is no
just-in-time compilation in Verum — GPU kernels are compiled to
their target IR (PTX, HSACO, SPIR-V, Metal AIR) during verum build and embedded into the final executable. The runtime looks a
kernel up by ID and launches it; it never invokes MLIR.
Dialect stack
VBC lowers progressively through a fixed ladder of dialects:
verum.tensor → linalg → standard dialects → target dialect.
Passes in passes/ drive each lowering step. Standard dialects
(arith, cf, math, memref) handle the plumbing; verum.tensor
keeps high-level shape information alive long enough for fusion
passes to run before dropping to linalg.
VBC → dialect mapping (selected)
| VBC opcode | verum.tensor | linalg | Target lowering |
|---|---|---|---|
| TENSOR_MATMUL (0xE8) | verum.matmul | linalg.matmul | nvvm.mma / gpu.wmma |
| TENSOR_CONV (0xEA) | verum.conv2d | linalg.conv_2d_nhwc | implicit GEMM |
| TENSOR_SOFTMAX (0xF4) | verum.softmax | scf.for + arith | online softmax |
| TENSOR_LAYERNORM (0xF5) | verum.layer_norm | custom | fused kernel |
| TENSOR_BATCHNORM (0xF6) | verum.batch_norm | custom | fused kernel |
| TENSOR_EINSUM (0xEC) | verum.einsum | linalg.generic | target-specific |
| TENSOR_FLASH_ATTENTION (0xFC) | verum.attention | — | intrinsic |
Flash attention stays a single op all the way to the target — the
lowering emits vendor intrinsics directly rather than reconstructing
the pattern from linalg.
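The opcode column of the table can be read as a lookup function. Opcode values and op names below come from the table; the function itself is only a sketch of the dispatch, not the real pass.

```rust
// The mapping table above as a lookup: VBC tensor opcode byte →
// linalg-level lowering. Values are from the table; the function
// shape is an illustrative assumption.
fn linalg_lowering(opcode: u8) -> Option<&'static str> {
    match opcode {
        0xE8 => Some("linalg.matmul"),
        0xEA => Some("linalg.conv_2d_nhwc"),
        0xEC => Some("linalg.generic"),
        0xF4 => Some("scf.for + arith"),
        0xF5 | 0xF6 => Some("custom"),
        // 0xFC (flash attention) deliberately has no linalg form:
        // it stays a single op down to the vendor intrinsic.
        _ => None,
    }
}

fn main() {
    assert_eq!(linalg_lowering(0xE8), Some("linalg.matmul"));
    assert_eq!(linalg_lowering(0xFC), None);
}
```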
GPU targets
Five target triples (seven device configurations) share the MLIR pipeline:
| Target | Triple | Matmul tile (default) |
|---|---|---|
| CUDA (NVIDIA, tensor cores) | nvptx64-nvidia-cuda | 128×128×32 on SM8x, 64×64×32 on SM7x |
| CUDA (NVIDIA, no TC) | nvptx64-nvidia-cuda | 32×32×8 |
| ROCm (AMD, matrix cores) | amdgcn-amd-amdhsa | 128×128×16 |
| ROCm (AMD, no MC) | amdgcn-amd-amdhsa | 32×32×8 |
| Metal (Apple) | air64-apple-macosx | 32×32×8 |
| Vulkan | spirv64-unknown-vulkan | 16×16×8 |
| SYCL / oneAPI | spir64-unknown-unknown | 16×16×8 |
Tile sizes come from GpuTarget::matmul_tile_sizes(); the compiler
picks the widest variant supported by the detected device.
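A sketch of `GpuTarget::matmul_tile_sizes()` reproducing the defaults in the table above. The tile values come from the table; the enum variants and the SM-generation parameter are assumptions about the real API's shape.

```rust
// Sketch of GpuTarget::matmul_tile_sizes() with the table's defaults.
// Variant names and the sm_major field are illustrative assumptions.
enum GpuTarget {
    CudaTensorCore { sm_major: u32 },
    Cuda,            // NVIDIA without tensor cores
    RocmMatrixCore,
    Rocm,            // AMD without matrix cores
    Metal,
    Vulkan,
    Sycl,
}

impl GpuTarget {
    fn matmul_tile_sizes(&self) -> (u32, u32, u32) {
        match self {
            GpuTarget::CudaTensorCore { sm_major } if *sm_major >= 8 => (128, 128, 32),
            GpuTarget::CudaTensorCore { .. } => (64, 64, 32), // SM7x
            GpuTarget::RocmMatrixCore => (128, 128, 16),
            GpuTarget::Cuda | GpuTarget::Rocm | GpuTarget::Metal => (32, 32, 8),
            GpuTarget::Vulkan | GpuTarget::Sycl => (16, 16, 8),
        }
    }
}

fn main() {
    assert_eq!(GpuTarget::CudaTensorCore { sm_major: 8 }.matmul_tile_sizes(), (128, 128, 32));
    assert_eq!(GpuTarget::CudaTensorCore { sm_major: 7 }.matmul_tile_sizes(), (64, 64, 32));
    assert_eq!(GpuTarget::Vulkan.matmul_tile_sizes(), (16, 16, 8));
}
```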
AOT compilation
aot/ runs the full pass pipeline and emits an object file linked
alongside the LLVM-produced host code. Every GPU kernel in a Verum
binary is produced this way — during verum build, never at
runtime. The output is a target-specific binary (PTX, HSACO,
SPIR-V, or Metal AIR) embedded into the executable and loaded by
the runtime on first kernel launch.
Caching
Compiled kernel binaries are cached keyed by
(vbc_fingerprint, target, opt_level) so an incremental rebuild
touches only the kernels whose VBC actually changed. First-build
times vary by target (~50 ms per kernel for NVPTX on modern
hardware) but are paid once per edit, not per program run.
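The cache discipline can be sketched as a map keyed by the tuple the text describes. The key comes from the text; the in-memory map and the stand-in compile step are illustrative assumptions (the real cache persists on disk between builds).

```rust
use std::collections::HashMap;

// Sketch of the kernel cache keyed by (vbc_fingerprint, target,
// opt_level). The map and compile stand-in are assumptions; only the
// key shape is from the text.
type CacheKey = (u64, String, u8);

struct KernelCache {
    compiled: HashMap<CacheKey, Vec<u8>>,
    compiles: usize, // number of cache misses (actual compilations)
}

impl KernelCache {
    fn new() -> Self {
        Self { compiled: HashMap::new(), compiles: 0 }
    }

    fn get_or_compile(&mut self, key: CacheKey) -> &[u8] {
        let compiles = &mut self.compiles;
        self.compiled.entry(key).or_insert_with(|| {
            *compiles += 1;
            vec![0u8; 4] // stand-in for a PTX / HSACO / SPIR-V blob
        })
    }
}

fn main() {
    let mut cache = KernelCache::new();
    let key: CacheKey = (0xdead_beef, "nvptx64-nvidia-cuda".to_string(), 2);
    cache.get_or_compile(key.clone());
    cache.get_or_compile(key); // hit: same fingerprint, no recompile
    assert_eq!(cache.compiles, 1);
}
```

Because the fingerprint is part of the key, editing one kernel's VBC invalidates only that kernel's entry; changing opt_level or target naturally misses and recompiles everything.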
On the absence of a JIT
Verum evaluated an ORC-based JIT tier during early design and
removed it. The reasoning: the interpreter already starts in
milliseconds and handles the full VBC instruction set (including
cubical and HoTT opcodes that a JIT would have to recompile every
time the type checker changes); the AOT path produces native code
that beats any JIT's warm-up and peak performance; and a third
tier doubles the combinatorial surface area of the backend without
providing a use case that the other two do not already cover. A
two-mode design (verum run for iteration, verum build for
production) turns out to be enough.
GPU binaries
gpu_binary.rs assembles Metal .metallib, SPIR-V modules, or PTX /
HSACO blobs from MLIR-lowered kernels and embeds them into the final
executable via the linker's __TEXT,__const section (macOS) or a
dedicated .rodata.verum_gpu section (Linux / Windows). The runtime
looks them up by kernel ID at launch time; the launch path has no
dependency on a live MLIR context.
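The launch-time lookup can be sketched as a registry populated at startup from the embedded sections. The registry type and numeric kernel IDs are illustrative assumptions; the point is that the launch path is a map lookup, with no MLIR involved.

```rust
use std::collections::HashMap;

// Sketch of the runtime's kernel lookup. Blobs stand in for sections
// the linker embedded; the registry type and IDs are assumptions.
struct GpuBinaryRegistry {
    blobs: HashMap<u32, &'static [u8]>,
}

impl GpuBinaryRegistry {
    // Fetch the embedded binary for a kernel ID, or fail with a
    // descriptive error. No compiler machinery is touched here.
    fn fetch(&self, kernel_id: u32) -> Result<&'static [u8], String> {
        self.blobs
            .get(&kernel_id)
            .copied()
            .ok_or_else(|| format!("unknown kernel id {kernel_id}"))
    }
}

fn main() {
    // Stand-in for a blob the linker placed in .rodata.verum_gpu.
    static MATMUL_PTX: &[u8] = b"// ptx blob";
    let registry = GpuBinaryRegistry { blobs: HashMap::from([(7, MATMUL_PTX)]) };
    assert!(registry.fetch(7).is_ok());
    assert!(registry.fetch(8).is_err());
}
```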
Autodiff lowering
@differentiable functions go through a source-transformation pass
that runs on the MLIR side so VJP rules can be expressed over
linalg ops rather than bytecode:
- The primal function is lowered to verum.tensor as usual.
- A reverse-mode pass walks the op graph and emits a companion function using VJP rules registered per op.
- Tape storage (for activations that the backward pass needs) uses a stack-allocated GradientTape when possible, falling back to Heap<...> when shapes are dynamic.
The GradientTape context sees every linalg op, so the backward
pass fuses into the forward kernel whenever the dataflow allows
(saving a materialisation of the primal output).
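The tape-plus-VJP structure can be illustrated at scalar granularity. The real pass works over linalg ops on tensors; scalars, the two-op rule set, and the flat index scheme below are simplifying assumptions made so the reverse walk fits in a few lines.

```rust
// Minimal scalar sketch of the reverse-mode pass: the forward walk
// records ops on a tape, the backward walk applies one VJP rule per
// op. Tensors-over-linalg is the real setting; scalars are an
// illustrative simplification.
#[derive(Clone, Copy)]
enum Op {
    Mul(usize, usize), // vals[out] = vals[a] * vals[b]
    Add(usize, usize), // vals[out] = vals[a] + vals[b]
}

fn grad(tape: &[Op], vals: &[f64], n_inputs: usize) -> Vec<f64> {
    let mut adj = vec![0.0; vals.len()];
    *adj.last_mut().unwrap() = 1.0; // seed: d(out)/d(out) = 1
    // Walk the tape in reverse, accumulating adjoints via VJP rules.
    for (i, op) in tape.iter().enumerate().rev() {
        let out = n_inputs + i; // op i produced value n_inputs + i
        match *op {
            // VJP for mul: d/da = adj_out * b, d/db = adj_out * a
            Op::Mul(a, b) => {
                adj[a] += adj[out] * vals[b];
                adj[b] += adj[out] * vals[a];
            }
            // VJP for add: gradient passes through unchanged
            Op::Add(a, b) => {
                adj[a] += adj[out];
                adj[b] += adj[out];
            }
        }
    }
    adj[..n_inputs].to_vec()
}

fn main() {
    // f(x, y) = x * y + x, at x = 3, y = 4
    let tape = [Op::Mul(0, 1), Op::Add(2, 0)];
    let vals = [3.0, 4.0, 12.0, 15.0]; // inputs, then one value per op
    assert_eq!(grad(&tape, &vals, 2), vec![5.0, 3.0]); // df/dx = y + 1, df/dy = x
}
```

The `vals` array plays the role of the tape's saved activations: the Mul rule needs the primal operands, which is exactly why the backward pass wants them kept live rather than rematerialised.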
Linking
link.rs invokes the platform linker:
- Linux: lld with musl for static builds; ld.bfd as fallback.
- macOS: ld64, dynamic system libraries (libSystem).
- Windows: lld-link with MSVC CRT.
- Cross-compilation: pre-staged sysroots bundled in the verum binary.
LTO options:
- thin (default): fast, good inlining.
- full: slower, maximum cross-module optimisation.
Debug information
- DWARF on Linux / macOS.
- PDB on Windows.
- Source-level debugging: variables, types, expressions.
- CBGR header awareness: debuggers can pretty-print generation and epoch.
Artefact layout
target/
├── debug/
│ ├── <name> (executable or .cog)
│ ├── <name>.vbc (bytecode, always emitted)
│ └── <name>.dwarf
└── release/
└── <name> (LTO'd, stripped)
.cog is a library artefact: VBC + metadata + optional proof certs.
See also
- VBC bytecode — the input to codegen.
- Runtime tiers — when each backend runs.
- Language → FFI — user-facing FFI boundary.