Code Generation

verum_codegen translates VBC bytecode into native machine code via two backends: LLVM for ahead-of-time CPU compilation, and MLIR for the GPU path (@device(gpu) functions, tensor kernels). Neither is a JIT; Verum has two and only two execution modes, the VBC interpreter and the AOT-lowered native binary. Everything verum_codegen produces is compiled ahead of time.

LLVM backend

Located in verum_codegen::llvm. It is the largest component of the crate: instruction.rs alone is 1.1 M LOC.

VBC → LLVM IR

vbc_lowering.rs walks VBC functions and emits LLVM IR:

  • Values become SSA registers.
  • Control flow maps to LLVM basic blocks + br / br-cond.
  • Function calls map to call / invoke (the latter for exception-propagating contexts).
  • CBGR checks lower to a load-compare-branch sequence, which LLVM's passes then optimise.
  • Cubical opcodes lower to identity / noop (proof erasure).
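As a concrete illustration of the load-compare-branch shape a CBGR check lowers to, here is a minimal sketch that emits the corresponding textual LLVM IR. The helper name, block labels, and register naming are assumptions for illustration, not verum_codegen's actual API.

```rust
// Hypothetical sketch: emit the load-compare-branch sequence a CBGR
// generation check lowers to. Names and labels are illustrative only.
fn emit_cbgr_check(ptr_reg: &str, expected_gen: u64, id: u32) -> String {
    let mut ir = String::new();
    // Load the generation word from the object header.
    ir += &format!("  %gen.{id} = load i64, ptr {ptr_reg}\n");
    // Compare against the generation recorded in the reference.
    ir += &format!("  %ok.{id} = icmp eq i64 %gen.{id}, {expected_gen}\n");
    // Branch: fast path falls through, slow path calls the runtime helper.
    ir += &format!("  br i1 %ok.{id}, label %cbgr.ok.{id}, label %cbgr.fail.{id}\n");
    ir
}

fn main() {
    let ir = emit_cbgr_check("%obj", 42, 0);
    assert!(ir.contains("icmp eq i64"));
    assert!(ir.contains("br i1"));
    print!("{ir}");
}
```

The branch-to-slow-path structure is what lets LLVM's later passes hoist or merge redundant checks.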

Runtime support

runtime.rs generates:

  • CBGR helper functions (fast path in inline assembly; slow path in C).
  • Alloc / dealloc entry points (call into verum_toolchain::mem).
  • Context stack primitives (push, pop, lookup).
  • Panic and unwind machinery.
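The context stack primitives in the list above can be modelled in a few lines. This is an illustrative sketch only; the type and method names are assumptions, and the real entry points are generated machine code, not a Rust struct.

```rust
// Toy model of the context stack primitives (push, pop, lookup) that
// runtime.rs generates entry points for. Names are assumptions.
struct ContextStack {
    frames: Vec<(u32, u64)>, // (context key, value)
}

impl ContextStack {
    fn new() -> Self { Self { frames: Vec::new() } }
    fn push(&mut self, key: u32, value: u64) { self.frames.push((key, value)); }
    fn pop(&mut self) -> Option<(u32, u64)> { self.frames.pop() }
    // Lookup walks from the innermost frame outward, so the most
    // recently pushed binding for a key wins (dynamic scoping).
    fn lookup(&self, key: u32) -> Option<u64> {
        self.frames.iter().rev().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }
}

fn main() {
    let mut ctx = ContextStack::new();
    ctx.push(1, 10);
    ctx.push(1, 20); // shadows the outer binding
    assert_eq!(ctx.lookup(1), Some(20));
    ctx.pop();
    assert_eq!(ctx.lookup(1), Some(10));
}
```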

FFI trampolines

ffi.rs emits trampolines for extern "C" functions:

  • Argument marshalling (Verum types ↔ C types).
  • Return value handling.
  • Exception conversion (C errno / return codes → Verum Result).
  • CBGR guarding of outgoing pointers (wrapped as &unsafe T).
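The errno-to-Result conversion mentioned above follows the usual C convention: a negative return value signals failure and errno carries the reason. A minimal sketch, with an error type and mapping that are assumptions rather than Verum's actual ones:

```rust
// Sketch of the C-return-code → Result conversion an FFI trampoline
// performs. The error enum and mapping are illustrative.
#[derive(Debug, PartialEq)]
enum FfiError { NotFound, PermissionDenied, Other(i32) }

fn c_return_to_result(ret: i32, errno: i32) -> Result<i32, FfiError> {
    if ret >= 0 {
        return Ok(ret); // C convention: non-negative return means success
    }
    Err(match errno {
        2 => FfiError::NotFound,          // ENOENT
        13 => FfiError::PermissionDenied, // EACCES
        e => FfiError::Other(e),
    })
}

fn main() {
    assert_eq!(c_return_to_result(3, 0), Ok(3));
    assert_eq!(c_return_to_result(-1, 2), Err(FfiError::NotFound));
    assert_eq!(c_return_to_result(-1, 99), Err(FfiError::Other(99)));
}
```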

Tensor and SIMD

tensor_ir.rs (134 K LOC) and simd.rs (21 K LOC) map tensor and SIMD opcodes to platform intrinsics — SSE/AVX/AVX-512 on x86_64, NEON on aarch64, scalar fallbacks on other targets.
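The scalar fallbacks rely on the standard strip-mining loop structure: process the bulk of the data in vector-width chunks, then handle the leftover elements scalar. The sketch below shows only that loop shape; the chunked body is plain Rust here, where the real backend would emit one SSE/AVX/NEON operation per chunk.

```rust
// Sketch of the strip-mining pattern behind the SIMD lowering: a main
// loop over vector-width chunks plus a scalar tail. LANES is an
// illustrative assumption (8 × f32 fits one AVX register).
const LANES: usize = 8;

fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = out.len();
    let chunked = n - n % LANES;
    // Main loop: whole vector-width chunks (one vector add each in the
    // real lowering).
    for i in (0..chunked).step_by(LANES) {
        for l in 0..LANES {
            out[i + l] = a[i + l] + b[i + l];
        }
    }
    // Scalar tail for the remaining elements.
    for i in chunked..n {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = [1.0f32; 11];
    let b = [2.0f32; 11];
    let mut out = [0.0f32; 11];
    add_arrays(&a, &b, &mut out);
    assert!(out.iter().all(|&x| x == 3.0));
}
```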

GPU (Metal)

metal_ir.rs emits Metal shading-language IR for @gpu.kernel functions.

Inline assembly

asm.rs supports @llvm_only unsafe asm!(...) for kernels and drivers. Bitfield layout is handled by bitfield.rs for memory-mapped I/O.
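The core of bitfield layout for memory-mapped I/O is mask-and-shift packing into a register word. A minimal sketch, with field offsets and the example register invented for illustration:

```rust
// Sketch of the bit packing bitfield.rs performs for memory-mapped I/O
// registers. Field positions below are made up for illustration.
fn set_field(reg: u32, shift: u32, width: u32, value: u32) -> u32 {
    let mask = ((1u32 << width) - 1) << shift;
    (reg & !mask) | ((value << shift) & mask)
}

fn get_field(reg: u32, shift: u32, width: u32) -> u32 {
    (reg >> shift) & ((1u32 << width) - 1)
}

fn main() {
    // Hypothetical control register: bits 4..8 hold a 4-bit clock divider.
    let reg = set_field(0, 4, 4, 0b1010);
    assert_eq!(get_field(reg, 4, 4), 0b1010);
    assert_eq!(reg, 0b1010_0000);
    // Writing a field leaves the rest of the register untouched.
    let reg2 = set_field(0xFFFF_FFFF, 4, 4, 0);
    assert_eq!(reg2, 0xFFFF_FF0F);
}
```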

MLIR backend

Located in verum_codegen::mlir. Used for autodiff lowering and the GPU path (@device(gpu) functions and tensor kernels). The LLVM backend handles CPU AOT; MLIR owns the GPU target family.

MLIR is invoked only at AOT compile time. There is no just-in-time compilation in Verum — GPU kernels are compiled to their target IR (PTX, HSACO, SPIR-V, Metal AIR) during verum build and embedded into the final executable. The runtime looks a kernel up by ID and launches it; it never invokes MLIR.

Dialect stack

VBC lowers progressively through a fixed ladder of dialects:

  verum.tensor → linalg → standard dialects (arith, cf, math, memref) → target dialect (nvvm / gpu, spirv, …)

Passes in passes/ drive each lowering step. The standard dialects handle the plumbing; verum.tensor keeps high-level shape information alive long enough for fusion passes to run before dropping to linalg.
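The ladder can be modelled as a strictly ordered sequence that each pass advances by one level. A toy sketch in which only the ordering is meaningful; the level names mirror the dialects named above:

```rust
// Toy model of the lowering ladder the pass pipeline walks. Only the
// ordering matters; each pass moves a module down exactly one level.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Level { VerumTensor, Linalg, Standard, Target }

fn next(level: Level) -> Option<Level> {
    match level {
        Level::VerumTensor => Some(Level::Linalg), // fusion runs before this drop
        Level::Linalg => Some(Level::Standard),
        Level::Standard => Some(Level::Target),
        Level::Target => None, // nothing below the target dialect
    }
}

fn main() {
    // Walking from the top visits every level exactly once.
    let mut level = Level::VerumTensor;
    let mut path = vec![level];
    while let Some(n) = next(level) {
        level = n;
        path.push(n);
    }
    assert_eq!(path.len(), 4);
    assert_eq!(level, Level::Target);
}
```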

VBC → dialect mapping (selected)

| VBC opcode | verum.tensor | linalg | Target lowering |
|---|---|---|---|
| TENSOR_MATMUL (0xE8) | verum.matmul | linalg.matmul | nvvm.mma / gpu.wmma |
| TENSOR_CONV (0xEA) | verum.conv2d | linalg.conv_2d_nhwc | implicit GEMM |
| TENSOR_SOFTMAX (0xF4) | verum.softmax | scf.for + arith | online softmax |
| TENSOR_LAYERNORM (0xF5) | verum.layer_norm | custom | fused kernel |
| TENSOR_BATCHNORM (0xF6) | verum.batch_norm | custom | fused kernel |
| TENSOR_EINSUM (0xEC) | verum.einsum | linalg.generic | target-specific |
| TENSOR_FLASH_ATTENTION (0xFC) | verum.attention | — | intrinsic |

Flash attention stays a single op all the way to the target — the lowering emits vendor intrinsics directly rather than reconstructing the pattern from linalg.
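The "online softmax" target lowering listed for TENSOR_SOFTMAX refers to the single-pass formulation: track a running maximum and rescale the running sum whenever the maximum changes, so no separate max-finding pass or materialised exponentials are needed. A scalar sketch of that algorithm (the real lowering operates on tensor tiles, not a flat slice):

```rust
// Single-pass ("online") softmax: maintain the running max and a sum
// rescaled to it, then normalise. Numerically stable without a second
// pass over the data to find the maximum.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &v in x {
        if v > max {
            // New maximum: rescale the partial sum to it.
            sum = sum * (max - v).exp() + 1.0;
            max = v;
        } else {
            sum += (v - max).exp();
        }
    }
    x.iter().map(|&v| (v - max).exp() / sum).collect()
}

fn main() {
    let p = online_softmax(&[1.0, 2.0, 3.0]);
    let total: f32 = p.iter().sum();
    assert!((total - 1.0).abs() < 1e-6); // probabilities sum to 1
    assert!(p[2] > p[1] && p[1] > p[0]); // monotone in the input
}
```

The same running-max-and-rescale trick is the building block of the flash-attention pattern, which is why the lowering can keep verum.attention fused as a single op.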

GPU targets

Five backends share the MLIR pipeline:

| Target | Triple | Matmul tile (default) |
|---|---|---|
| CUDA (NVIDIA, tensor cores) | nvptx64-nvidia-cuda | 128×128×32 on SM8x, 64×64×32 on SM7x |
| CUDA (NVIDIA, no TC) | nvptx64-nvidia-cuda | 32×32×8 |
| ROCm (AMD, matrix cores) | amdgcn-amd-amdhsa | 128×128×16 |
| ROCm (AMD, no MC) | amdgcn-amd-amdhsa | 32×32×8 |
| Metal (Apple) | air64-apple-macosx | 32×32×8 |
| Vulkan | spirv64-unknown-vulkan | 16×16×8 |
| SYCL / oneAPI | spir64-unknown-unknown | 16×16×8 |

Tile sizes come from GpuTarget::matmul_tile_sizes(); the compiler picks the widest variant supported by the detected device.
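A selection function in the spirit of GpuTarget::matmul_tile_sizes() can be sketched as a match over detected device capability, returning the tile values from the table above. The device-capability enum below is an assumption for illustration:

```rust
// Sketch of tile selection: pick the widest matmul tile the detected
// device supports. Capability names are illustrative; tile values come
// from the target table above.
#[derive(Clone, Copy)]
enum Device { CudaSm8x, CudaSm7x, CudaNoTensorCores, Vulkan }

fn matmul_tile_sizes(d: Device) -> (u32, u32, u32) {
    match d {
        Device::CudaSm8x => (128, 128, 32),       // tensor cores, SM8x
        Device::CudaSm7x => (64, 64, 32),         // tensor cores, SM7x
        Device::CudaNoTensorCores => (32, 32, 8), // plain CUDA cores
        Device::Vulkan => (16, 16, 8),
    }
}

fn main() {
    assert_eq!(matmul_tile_sizes(Device::CudaSm8x), (128, 128, 32));
    assert_eq!(matmul_tile_sizes(Device::Vulkan), (16, 16, 8));
}
```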

AOT compilation

aot/ runs the full pass pipeline and emits an object file linked alongside the LLVM-produced host code. Every GPU kernel in a Verum binary is produced this way — during verum build, never at runtime. The output is a target-specific binary (PTX, HSACO, SPIR-V, or Metal AIR) embedded into the executable and loaded by the runtime on first kernel launch.

Caching

Compiled kernel binaries are cached keyed by (vbc_fingerprint, target, opt_level) so an incremental rebuild touches only the kernels whose VBC actually changed. First-build times vary by target (~50 ms per kernel for NVPTX on modern hardware) but are paid once per edit, not per program run.
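The cache key described above is just a composite of the three fields; anything that changes the generated code must appear in the key. A sketch using an in-memory map (the real cache presumably persists to disk; field types here are assumptions):

```rust
use std::collections::HashMap;

// Sketch of the kernel cache keyed by (vbc_fingerprint, target, opt_level).
#[derive(Hash, PartialEq, Eq)]
struct KernelKey {
    vbc_fingerprint: u64,
    target: &'static str,
    opt_level: u8,
}

fn main() {
    let mut cache: HashMap<KernelKey, Vec<u8>> = HashMap::new();
    let key = KernelKey { vbc_fingerprint: 0xDEAD, target: "nvptx64", opt_level: 2 };
    cache.insert(key, vec![0x7F; 4]); // stands in for a compiled PTX blob

    // Same fingerprint + target + opt level: cache hit, kernel is reused.
    let hit = KernelKey { vbc_fingerprint: 0xDEAD, target: "nvptx64", opt_level: 2 };
    assert!(cache.contains_key(&hit));

    // Edited VBC changes the fingerprint, so only that kernel rebuilds.
    let miss = KernelKey { vbc_fingerprint: 0xBEEF, target: "nvptx64", opt_level: 2 };
    assert!(!cache.contains_key(&miss));
}
```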

On the absence of a JIT

Verum evaluated an ORC-based JIT tier during early design and removed it, for three reasons. First, the interpreter already starts in milliseconds and handles the full VBC instruction set, including the cubical and HoTT opcodes a JIT would have to recompile every time the type checker changes. Second, the AOT path produces native code that beats any JIT's warm-up and peak performance. Third, a third tier doubles the combinatorial surface area of the backend without providing a use case the other two do not already cover. A two-mode design (verum run for iteration, verum build for production) turns out to be enough.

GPU binaries

gpu_binary.rs assembles Metal .metallib, SPIR-V modules, or PTX / HSACO blobs from MLIR-lowered kernels and embeds them into the final executable via the linker's __TEXT,__const section (macOS) or a dedicated .rodata.verum_gpu section (Linux / Windows). The runtime looks them up by kernel ID at launch time; the launch path has no dependency on a live MLIR context.
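The launch-time lookup is by design trivial: a map from kernel ID to an embedded blob, with no compiler machinery on the path. A toy model (the registry type and its method are illustrative assumptions):

```rust
use std::collections::HashMap;

// Toy model of the launch path: kernel blobs are embedded at build time
// and found by ID, with no live MLIR context involved.
struct KernelRegistry {
    blobs: HashMap<u32, &'static [u8]>, // kernel ID → embedded binary
}

impl KernelRegistry {
    fn launch(&self, id: u32) -> Result<usize, String> {
        // A real launch would hand the blob to the driver (CUDA, Metal, …);
        // this sketch just reports the blob's size.
        self.blobs
            .get(&id)
            .map(|b| b.len())
            .ok_or_else(|| format!("unknown kernel id {id}"))
    }
}

fn main() {
    static MATMUL_BLOB: &[u8] = b"// stand-in for an embedded PTX blob";
    let reg = KernelRegistry { blobs: HashMap::from([(7u32, MATMUL_BLOB)]) };
    assert!(reg.launch(7).is_ok());
    assert!(reg.launch(8).is_err());
}
```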

Autodiff lowering

@differentiable functions go through a source-transformation pass that runs on the MLIR side so VJP rules can be expressed over linalg ops rather than bytecode:

  1. The primal function is lowered to verum.tensor as usual.
  2. A reverse-mode pass walks the op graph and emits a companion function using VJP rules registered per op.
  3. Tape storage (for activations that the backward pass needs) uses a stack-allocated GradientTape when possible, falling back to Heap<...> when shapes are dynamic.

The GradientTape context sees every linalg op, so the backward pass fuses into the forward kernel whenever the dataflow allows (saving a materialisation of the primal output).
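Step 2 above, recording ops forward and walking them backward with per-op VJP rules, can be sketched at scalar granularity. The real pass operates on linalg ops and tensor values rather than scalars, and the types below are illustrative, but the tape-then-reverse structure is the same:

```rust
// Minimal tape-based reverse mode: record each op with its operands,
// then walk the tape in reverse, applying the op's VJP rule. Only
// mul/add rules are shown.
enum Op { Mul(usize, usize), Add(usize, usize) }

struct Tape { vals: Vec<f64>, ops: Vec<(usize, Op)> } // (output slot, op)

impl Tape {
    fn leaf(&mut self, v: f64) -> usize { self.vals.push(v); self.vals.len() - 1 }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let v = self.vals[a] * self.vals[b];
        let out = self.leaf(v);
        self.ops.push((out, Op::Mul(a, b)));
        out
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.vals[a] + self.vals[b];
        let out = self.leaf(v);
        self.ops.push((out, Op::Add(a, b)));
        out
    }
    // Backward pass: seed d(out) = 1, apply each VJP in reverse order.
    fn grads(&self, out: usize) -> Vec<f64> {
        let mut g = vec![0.0; self.vals.len()];
        g[out] = 1.0;
        for (o, op) in self.ops.iter().rev() {
            match *op {
                Op::Mul(a, b) => {
                    g[a] += g[*o] * self.vals[b]; // d(a*b)/da = b
                    g[b] += g[*o] * self.vals[a]; // d(a*b)/db = a
                }
                Op::Add(a, b) => {
                    g[a] += g[*o];
                    g[b] += g[*o];
                }
            }
        }
        g
    }
}

fn main() {
    // y = x * w + x  →  dy/dx = w + 1, dy/dw = x
    let mut t = Tape { vals: vec![], ops: vec![] };
    let (x, w) = (t.leaf(3.0), t.leaf(4.0));
    let xw = t.mul(x, w);
    let y = t.add(xw, x);
    let g = t.grads(y);
    assert_eq!(g[x], 5.0); // w + 1
    assert_eq!(g[w], 3.0); // x
}
```

The `vals` vector plays the role of the tape's activation storage; in Verum this is the GradientTape, stack-allocated when shapes are static.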

Linking

link.rs invokes the platform linker:

  • Linux: lld with musl for static builds; ld.bfd as fallback.
  • macOS: ld64, dynamic system libraries (libSystem).
  • Windows: lld-link with MSVC CRT.
  • Cross-compilation: pre-staged sysroots bundled in the verum binary.

LTO options:

  • thin (default): fast, good inlining.
  • full: slower, maximum cross-module optimisation.

Debug information

  • DWARF on Linux / macOS.
  • PDB on Windows.
  • Source-level debugging: variables, types, expressions.
  • CBGR header awareness: debuggers can pretty-print generation and epoch.

Artefact layout

target/
├── debug/
│   ├── <name>         (executable or .cog)
│   ├── <name>.vbc     (bytecode, always emitted)
│   └── <name>.dwarf
└── release/
    └── <name>         (LTO'd, stripped)

.cog is a library artefact: VBC + metadata + optional proof certs.

See also