Status: proposal / for discussion · 2026-06-10
Phase 2 demonstrated the loop can produce novel, kernel-verified mathematics: the machinery (decomposition, affinity/gap selection, the statement-binding gate) is built and red-team-proven, and the swarm proved its first mathlib-absent lemma (Nicomachus, phase2-run-001). That is a proof of concept. Phase 3 asks the question a proof of concept cannot: does it scale and sharpen? Does a swarm pointed at genuinely hard, mathlib-absent targets grow the verified commons faster than people can — or does it stall at the elementary band?
This is a menu of the work that question breaks into, not a committed plan. The threads are roughly independent; pick the order by what we most want to learn.
phase1-run-002; absence is a grep pre-filter with a shelf life; Agent SDK credit economics are unbudgeted at scale.Drive the swarm at a target hard enough that a direct proof fails and decomposition must fire, then show the full chain close through Gate A. Exit met: platonic-schlafli-core reached via decompose → recompose (phase3-run-001) — a forced depth-3 tree, 13 goals proved, 4 recompositions, the parent’s proof file literally composing its sub-lemmas. The run also forged the operational layer (ADR-015/016/017) that let it finish unattended through three quota outages.
Pick one unformalised result whose proof genuinely needs several lemmas in order, seed the target plus its dependency edges, and let affinity/gap selection route the swarm bottom-up. Exit met: sum_range_cube_eq_triangular_sq closed by importing and invoking the swarm’s own nicomachus_sum_cubes (phase3-run-002) — merged work reused as an importable dependency (ADR-014), with run-001’s four recompositions as corroboration. Still open: the full ambition — a several-lemma tree routed bottom-up at depth — this was one declared edge.
A path from a verified library/Unsorry/*.lean lemma to a mathlib-ready PR: naming/style conformance, dedup against mathlib HEAD at submission time, and an artifact a human can open upstream. The value of the commons is only realised in the commons. Planned: mathlib-upstream-plan.md — mathlib’s AI policy (verified 2026-06-11) requires disclosure, an LLM-generated label, and an author who understands every line without AI, so the model is machine-prepared packets, human-sponsored PRs; a fully autonomous pipeline is against policy and a non-goal.
Invite external people to run agents on their own machines/subscriptions; observe claim-collision, merge-rate, and coordination dynamics at higher concurrency than the maintainer’s few agents. Tests the design’s core promise (untrusted, heterogeneous, rag-tag) for real. Prereq: the dedup fix below, or collision/merge churn will dominate.
Fix the merge-lag duplicate fan-out (a “sha-already-in-library / claimed-by-me” short-circuit decoupled from the main-branch status flip — flagged in phase1-run-002); budget Agent SDK credit for sustained runs; consider a GitHub merge queue if volume grows. Unglamorous, but it is the difference between a demo and a service.
Failure notes. Today a failed prove attempt records that the goal resisted (the −10 affinity demote PR) but not what was tried: the failed Lean text is discarded with the worktree, so the next agent re-walks the same dead ends blind — and with concurrency, every agent pays full price for the same lesson. Add a one-line approach note to the demote path (e.g. note≜induction on n stalls at Even/ℕ-division bookkeeping, written by the prover on give-up, carried on the demote PR and surfaced to the next claimant’s prove prompt). Cheap to record, compounds across agents, and turns the affinity scalar into transferable knowledge. Keep it one line — full failed proofs are noise; the diagnosis is the asset.
For non-trivial targets the formalisation being faithful matters more than for trivia. Strengthen the dual-translation/fidelity gate (and the human-flag path) so a hard target’s Lean statement provably captures its informal claim — the residue ADR-011 explicitly leaves to fidelity, not binding.
The coordination layer is written in AISP (ADR-003) on two claims — compactness in LLM context and drift-resistant machine validation — and neither has ever been measured against the boring alternative (JSON/YAML with a schema validator). Honest starting observation: most records are machine-rendered (py_helper) and machine-parsed (Gate B), so the LLM rarely touches AISP directly; where it does is the swarm contract (protocol.aisp + the ~19 KB grammar guide loaded into every agent session) and records quoted into prompts. The benchmark, in two layers:
metrics.jsonl. Lands in every run report.format=aisp vs format=json; compare first-try record validity, protocol-discipline errors (claim/TTL misuse), end-to-end cycle success, and tokens consumed. → docs/metrics/aisp-benchmark-001.The honest possible outcome — AISP’s value is marginal for machine-rendered records and real only as context compression — is exactly worth knowing before thread D invites contributors to learn a bespoke notation. If the benchmark says JSON does the job, that’s a finding, not a failure; the validator and the gates, not the glyphs, were always the load-bearing part.
Still not open conjectures — the swarm formalises existing proofs, it does not discover research-frontier mathematics. Still upstream of welfare — an enabling public good, not a direct one.
Thread A, then B. Close the decomposition demonstration (it is nearly there and it is the load-bearing claim), then point the swarm at one real target result and watch the dependency tree compound. C (upstreaming) is the highest-leverage for impact but needs the most human-norm judgement; D/E/F are what make it a service rather than a demo. Each thread is an ADR + spec + a measured run, in the same idiom as Phases 0–2.