verum benchmark — Head-to-head vs Coq / Lean4 / Isabelle / Agda
verum benchmark is the head-to-head comparison surface for
proof-assistant landscape evaluation. Run a fixed reference suite
against one or more systems and emit a typed comparison matrix
across the canonical performance / trust / capability dimensions.
The point is reproducibility: anyone can re-run the suite and verify the numbers. No marketing claims — only verifiable measurements.
Mental model
Verum benchmarks five proof assistants:
- Verum — what we're benchmarking against the rest.
- Coq (also reachable via the
rocqalias) —coqc. - Lean 4 / Mathlib (aliases:
lean,mathlib,mathlib4) —lean. - Isabelle/HOL (aliases:
isabelle/hol,hol) —isabelle. - Agda —
agda.
For each (system, theorem) pair, the runner emits a
BenchmarkResult carrying the system tag, the suite + theorem
identifiers, the measured metric, the value, a Unix timestamp, and
an optional reproducibility envelope (hash of the input corpus).
Per-theorem metrics aggregate to a per-suite leader board via the
ComparisonMatrix.
The 9 canonical metrics
| Metric | Direction | What it measures |
|---|---|---|
kernel_loc | lower is better | Trusted Computing Base size: kernel LOC + transitive trust dependencies. |
lines_per_second | higher is better | Compilation / verification speed (LOC/s). |
theorems_per_second | higher is better | Theorems verified per second. |
peak_rss_bytes | lower is better | Per-theorem peak resident-set size (bytes). |
elapsed_ms | lower is better | Per-theorem wall-clock duration (milliseconds). |
cross_format_exports | higher is better | Independent kernel formats that re-check the proof. Verum target: 4 (Coq + Lean + Isabelle + Dedukti); foreign systems re-check themselves only ⇒ 1. |
tactic_coverage_percent | higher is better | Fraction ([0, 100]) of standard obligations the tactic library closes via 1-line invocations. |
trust_diversification_count | higher is better | Number of distinct kernels in the trust circle that agree on each theorem. |
llm_acceptance_percent | higher is better | Fraction ([0, 100]) of LLM-proposed proofs that pass the kernel. Only Verum supports this today via the LCF-style fail-closed loop (see LLM tactic protocol); comparable systems baseline at 0%. |
verum benchmark metrics prints this list with the
higher_is_better direction.
Subcommand reference
verum benchmark run --system <S> --suite-name <N> [--theorem <T>]…
[--format plain|json|markdown|csv]
verum benchmark compare [--system <S>]… --suite-name <N> [--theorem <T>]…
[--format plain|json|markdown|csv]
verum benchmark metrics [--format plain|json|markdown|csv]
run
Run the suite against a single system. Emits one
BenchmarkResult per (theorem, metric) for per-theorem metrics
plus one per metric for suite-level metrics (e.g. kernel_loc).
$ verum benchmark run --system verum --suite-name mathlib-basics \
--theorem addnC --theorem addn0
Benchmark transcript (5 result(s)):
verum addnC elapsed_ms 0
verum addn0 elapsed_ms 0
verum (suite) kernel_loc 5000
verum (suite) lines_per_second 50000
verum (suite) llm_acceptance_percent 65
compare
Run the suite against every requested system (or all five when
--system is omitted) and emit the comparison matrix.
$ verum benchmark compare --suite-name mathlib-basics \
--theorem addnC --format markdown
# Benchmark comparison — `mathlib-basics`
| Metric | verum | coq | lean4 | isabelle | agda |
|---|---|---|---|---|---|
| `kernel_loc` | 5,000 LOC ⭐ | 200,000 LOC | 50,000 LOC | 10,000 LOC | 30,000 LOC |
| `lines_per_second` | 50,000 LOC/s ⭐ | 20,000 LOC/s | 30,000 LOC/s | 25,000 LOC/s | 15,000 LOC/s |
| `cross_format_exports` | 4 ⭐ | 1 | 1 | 1 | 1 |
| `llm_acceptance_percent` | 65.0% ⭐ | 0.0% | 0.0% | 0.0% | 0.0% |
The ⭐ marker decorates the leader cell per metric. Markdown
output is direct-paste-able into a release announcement or a
comparison page. JSON output ships the full structured matrix
plus a leaders summary suitable for CI gates.
metrics
Print every supported metric with its higher_is_better direction
— useful for CI scripts that need to know which direction means
"better" for a given metric.
$ verum benchmark metrics
Benchmark metrics (9):
Name Direction
──────────────────────────────── ────────────────────
kernel_loc lower is better
lines_per_second higher is better
theorems_per_second higher is better
peak_rss_bytes lower is better
elapsed_ms lower is better
cross_format_exports higher is better
tactic_coverage_percent higher is better
trust_diversification_count higher is better
llm_acceptance_percent higher is better
V0 vs V1 production runners
V0 ships mock runners that emit canned values reflecting the documented landscape claims:
- Verum kernel TCB: held to a single-reviewer / single-session audit budget — the trust-boundary subset that produces verdicts. Surrounding kernel infrastructure (apply-graph, ATS-V dispatch, bridge-audit) is part of the same crate but consumes kernel verdicts rather than producing them, so it sits outside the claimed TCB. By contrast, mainstream proof assistants ship trusted bases that are roughly two orders of magnitude larger: Coq's kernel is on the order of a hundred thousand lines of OCaml, Lean 4's kernel is in the tens of thousands, Isabelle is in the low tens of thousands, and Agda is similarly sized. Verum's smaller TCB is the point of the LCF-style design — the rest of this benchmark is about measuring whether that smaller surface still catches what the larger ones do.
- Compilation speed: target ≥ 50 KLOC/s for Verum; mainstream ahead-of-time compilers in the comparison set typically measure in the 15–30 KLOC/s range.
- LLM acceptance rate: Verum's LCF-style fail-closed loop turns speculative tactics into accept-or-reject verdicts; the mainstream comparison set has no equivalent feature.
These canned values are placeholders for the protocol shape, not
real measurements. V1+ swaps in production runners that actually
invoke coqc / lean / isabelle / agda and measure timings on
disk. The trait surface is unchanged, so the CLI / docs / CI
scripts continue to work identically.
The point of shipping V0 today: lock in the schema, the matrix
shape, the leader-detection contract, the JSON / markdown / csv
output formats. Once production runners land, every CI workflow
that already consumes verum benchmark compare --format json
auto-upgrades.
Reproducibility envelope
Every result carries an optional repro_envelope field — a hash
of the input corpus + tool version. When set, the same hash + same
tool version on the same hardware should reproduce the same value.
The envelope is the auditable claim:
{
"system": "verum",
"suite": "mathlib-basics",
"theorem": "addnC",
"metric": "elapsed_ms",
"value": 142.0,
"timestamp": 1714478400,
"repro_envelope": "blake3-of-corpus:0a1b2c3d..."
}
A reviewer can take the envelope hash, fetch the matching corpus snapshot, install the matching tool versions, and re-run — getting the same number (within hardware noise). Drift surfaces as a CI diff against the baseline.
CI usage
Pin the cross-system leader board so a regression in any system surfaces immediately:
# .github/workflows/bench.yml — runs weekly
verum benchmark compare --suite-name mathcomp-basics \
--format json > today.json
LEADER=$(jq -r '.leaders[] | select(.metric == "kernel_loc").leader' today.json)
[ "$LEADER" = "verum" ] || (echo "kernel_loc leadership lost" && exit 1)
The standard release-announcement workflow:
verum benchmark compare --suite-name mathlib-basics \
--format markdown > release/bench-comparison.md
verum benchmark compare --suite-name mathlib-basics \
--format csv > release/bench-comparison.csv
Output schemas
JSON (compare --format json)
{
"schema_version": 1,
"suite": "mathlib-basics",
"results": [ <BenchmarkResult>, ... ],
"matrix": [
{ "metric": "kernel_loc", "system": "verum", "value": 5000.0 },
...
],
"leaders": [
{ "metric": "kernel_loc", "leader": "verum" },
{ "metric": "lines_per_second", "leader": "verum" },
...
]
}
CSV (compare --format csv)
metric,system,value,is_leader
kernel_loc,verum,5000,true
kernel_loc,coq,200000,false
...
Markdown (compare --format markdown)
A pivot table with the leader cell decorated as ⭐ (see live demo
above). Direct-paste-able into release announcements.
Cross-references
- Tactic catalogue — the
combinator surface that drives
tactic_coverage_percent. - LLM tactic protocol — the
LCF-style fail-closed loop that drives
llm_acceptance_percent. - Auto-paper generator — every
exported paper draft can include the latest comparison matrix
via
verum benchmark compare --format markdownpiped into the paper template. - Foreign-system import — the
inverse of
cross_format_exports: import a foreign theorem to Verum. - Incremental cache — once a benchmark is verified, the closure cache makes subsequent runs cheap.