What the project has actually achieved, stated against verified ground truth rather than ambition. The house style is honesty: each claim is followed by its limit.
In roughly three days (v0.1.0 → v1.6.2, ~195 merged PRs), unsorry went from a coordination
skeleton to a swarm that has proved five mathlib-absent results, demonstrated its two
load-bearing mechanisms end-to-end (decomposition and dependency reuse), survived its own
operational failure modes (three CLI quota outages in one day, now absorbed unattended; its
first gate false negative caught, fixed, and canary-guarded within two hours), passed
three adversarial red-team rounds including the CRITICAL hole an external review found, and
built — and auto-runs — a policy-compliant pipeline to upstream its novel lemmas to mathlib.
The honest frontier is unchanged: every result so far is elementary, the hard-target ceiling is
untested, and nothing has landed in the commons yet.
--wfail, axiom
audit against the {propext, Classical.choice, Quot.sound} whitelist, leanchecker kernel replay,
a regenerated statement-binding obligation) and Gate B (deterministic AISP hygiene). No human is
in the correctness path.{n : ℕ} (hn : 1 < n)) made the regenerated binding obligation trip
linter.unusedVariables under --wfail, so Gate A rejected every correct proof of that goal —
a false negative, never a false positive. Fixed by scoping the lint suppression to the generated
bindings, whose force is the type-check, not lints (#225), and pinned by binder-shape-canary, a
permanent sound goal carrying exactly that binder shape whose binding the gate regenerates and
rebuilds every run (#233) — a regression now goes red at the gate, not on a contributor’s PR.Five results verified mathlib-absent before the run and proved by the swarm:
| Result | What | How |
|---|---|---|
nicomachus_sum_cubes |
∑k³ = (∑k)² | direct, 24-line multi-step (phase2-run-001) |
sum_range_pow_four_closed |
Faulhaber k=4 closed form | direct induction |
platonic_schlafli_pairs |
the five Platonic Schläfli pairs | forced depth-3 decomposition, 13 lemmas (phase3-run-001) |
sum_range_cube_eq_triangular_sq |
triangular closed form of Nicomachus | dependency reuse of the above, 4m57s (phase3-run-002) |
not_prime_pow_four_add_four |
n⁴ + 4 is never prime for n > 1 (Sophie Germain) | direct, by external machine binto-labs (#221) — the proof that surfaced gate bug #231 |
Born from three real quota outages in one day: the infrastructure-failure guard (ADR-016 — a fast-failed call + failed health probe is never read as goal evidence) and the supervisor (ADR-017 — exponential backoff, PR hygiene, claim guard). The third outage was absorbed with zero queue corruption and no human intervention; the first two had each demoted a whole goal tree and needed manual repair. The progressive effort ladder (ADR-015) spends deep reasoning only where a cheaper attempt failed — 11 of 13 run-001 proofs closed at the cheapest rung.
Issue #190’s CRITICAL (goal tampering),
HIGH (workflow self-protection), MEDIUM (unpinned actions) and LOW (audit corpus) findings are all
addressed — goal-statement immutability (ADR-018), SHA-pinned actions + CODEOWNERS + a settings
checklist (ADR-019), and an opaque audit fixture. The review’s own verdict — “serious,
well-engineered work… README’s framing is accurate” — and its one stale finding (“decomposition
never completed”, true when written, closed by run-001) are recorded honestly.
The upstream pipeline (ADR-020) scans nightly for packet-eligible lemmas (proved +
absence-verified + unpacketed), re-checks absence at mathlib HEAD, and auto-opens a sponsor packet
PR — a git apply-able patch with the human author’s header, gate evidence, a paste-ready factual
AI disclosure, and an explicit rewrite-in-own-words boundary (mathlib forbids LLM-written
conversation). The first run produced 9 packets, and the first of them (Nicomachus) is stamped
kernel-verified at mathlib HEAD. The last mechanical mile is now one command (ADR-021,
v1.6.1): python3 -m tools.upstream.raise_pr turns a HEAD-verified packet into a draft
mathlib PR — clone, apply, push, pre-filled factual disclosure — with the policy boundary enforced
in code: it refuses without the sponsor’s --understood attestation, refuses an unverified
packet, opens drafts only, and writes no review words. The irreducibly human stages — understand
the proof, ask Zulip, write the narrative, face review — stay human
(docs/upstreaming.md). The model is policy-compliant by construction:
machine prepares, the human sponsor (signed up) owns every word the community reads.
Merged work now comes from the agent swarm, the maintainer, and three external contributor
machines (mac-158f, binto-labs, ohdearquant) — proving lemmas and fixing tooling through
the same gates, with no coordination beyond the claims branch. One of those machines
(binto-labs) contributed the fifth headline result above and, in doing so, found the gate’s
first false negative — external traffic is now hardening the gates, not just adding lemmas. The
contributor path is documented (CONTRIBUTING.md, v1.6.2), and the Apache-2.0 LICENSE file
that the README and every packet’s copyright header already referenced now actually exists in
the tree.
PROVE_CLAIM_CAP=1 is real but unreproduced;
its cost is bounded by PR dedupe, not eliminated.The frontier is the difficulty ceiling: a target a working mathematician would call non-trivial, reached by a queue that genuinely compounds at depth — and a first lemma actually merged into the commons. Everything built so far is the apparatus for asking that question honestly.
Generated as part of the v1.6.0 release; updated the same day through v1.6.2 (the fifth result #221, the #231 gate fix #225 and canary #233, the ADR-021 sponsor helper #229, CONTRIBUTING/LICENSE #232). Verified against the repository at the time of the update: 40 library modules, 39 proved goals (one of them the permanent binder-shape canary), ~195 merged PRs, 9 upstream-eligible lemmas (the fifth result awaits the nightly packet scan), ADRs through ADR-021, three green red-team rounds.