unsorry

unsorry Phase 2 — Implementation Plan: Open Lemmas and Target Decomposition

Field	Value
Document	Phase-2 implementation plan
Initiative	unsorry Phase 2 — open lemmas and target decomposition
Proposed By	unsorry maintainers
Date	2026-06-10
Status	Proposed

This is an execution plan, not an architecture document and not an ADR. It scopes what gets built, in what order, gated by what evidence to take the swarm from Phase 1 (a validated loop that proves known-true theorems already in mathlib) to Phase 2 (a loop that drives verified proofs to a chosen result not already in mathlib, by decomposition). It mirrors the staging discipline of distributed-research-swarm-plan.md but at the granularity of pull requests and the specs that land in each. The design doc remains the master; this plan does not reopen any of its decisions. Where it names new ADRs and specs (ADR-009/010/011, SPEC-009-A/010-A/011-A, docs/phase2-targets.md, phase1-run-002), those are the artifacts Phase 2 must produce — the plan defines the work, the ADRs ratify the decisions, the specs constrain the build.

1. Context

Phase 1 did exactly what it was designed to do, and it is important to be precise about what that was. The prove cycle ran end-to-end against agenticsnz/unsorry: distinctly-identified swarm agents claimed unproved goals, drove claude to write Lean proofs, self-verified locally, and merged through Gate A and Gate B (phase1-run-001: 5 prove attempts, 3 merges, merge rate 0.6, 0 coordination errors, 0 unsound proofs). That is a working autonomous research loop with a real kernel-enforced soundness boundary. But the three theorems it banked — int-add-neg (#72), int-neg-neg (#74), and-comm-imp (#70) — are one-line citations of existing mathlib lemmas (:= Int.neg_neg n and the like). They carry zero new mathematical value, by design. Phase 1 proved the machine, not mathematics. The merged proofs were instrumentation: they exist to show the claim/verify/merge plumbing, the gates, and the agent identity trail all hold under a real contributor workflow, which phase1-run-001 and the Round-001 red team (gate-a-redteam-001.md) jointly establish.

Phase 2 is where solved theorems start to matter. The objective is no longer “show the loop runs”; it is “drive verified proofs to a result that is not already in mathlib, by decomposing it into sub-lemmas that the swarm proves and recomposes.” This is the first phase whose output is a contribution rather than a self-test. It exercises the three mechanisms the design doc describes but Phase 1 deliberately left unbuilt: decomposition on prove-failure (Components §6), affinity-weighted and gap-based selection (Components §6), and the library index as a compounding substrate (Component §7). None of these is wired today — Phase-1 selection is plain lexicographic order, prove-failure just releases and flags with no decomposition (SPEC-007-A step 11), and affinity is never computed.

The honest framing from the master design carries straight through and must not be oversold: formal mathematics is an enabling public good, not a direct-welfare deliverable. It sits upstream of human welfare — verified software, clean cryptography, an error-free mathematical record, a sound substrate for AI reasoning — rather than at the point of delivery. A first lemma proved that is genuinely absent from mathlib is a real and lasting contribution; it is not a cure for a disease. Phase 2 is judged against that upstream standard, deliberately.

2. What must be true before Phase 2 starts (prerequisites)

Phase 2 points an autonomous swarm at a hard, unsolved-by-the-library target. That is only safe and only meaningful once five things hold. None is optional; pointing the machine at hard targets on an unproven loop, or with an unbound notion of “proved,” wastes budget and produces results no one can trust.

(a) Decomposition is built — ADR-009 / SPEC-009-A. Today a prove-failure ends the line (SPEC-007-A step 11: release + flag, “Phase 1 keeps it simple — no decomposition”). Phase 2 needs the failure path to instead produce a decompositions/<parent>.<agent>.aisp record (schema already in SPEC-003-C), generate the sub-goal records, wire the Post(A) ⊆ Pre(B) dependency edges, re-queue the subs, and block/unblock the parent. Without this the swarm cannot make progress on anything it cannot one-shot — which a real target, by definition, is.
(b) Affinity and gap-based selection are wired — ADR-010 / SPEC-010-A. The protocol already specifies the mechanism (⟦Γ:Affinity⟧: +1 on merge, −10 on fail, viability threshold τ_v = −5, select ≜ argmax(aff(g), −gap(g, library)), gap ≜ |deps(g) \ proved|). It is not implemented: SPEC-007-A selects the first candidate in lexicographic goal-id order and never reads or writes affinity. At Phase-2 scale a target fans out into many sub-goals of very different value; lexicographic order would have the swarm grind through them blindly. Selection must prefer the smallest viable gap and proven patterns before the machine is pointed at a target.
(c) The statement-binding check is live — ADR-011 / SPEC-011-A. This is the load-bearing prerequisite. Gate A proves a theorem is sound; it does not prove the theorem says what the goal asks — the gap the Round-001 red team exposed with PR #64 (autoImplicit vacuity, axioms: [], sound but meaningless). gate-a-redteam-001.md records the proper fix explicitly: a statement-vs-canonical-sha binding check that lowers the goal’s canonical statement to Lean and checks the merged theorem matches it. Until that exists, “this target is proved” is an unbound claim. Phase 2 cannot declare a target solved against a check that does not bind meaning — and decomposition makes this strictly worse (see §5), because every generated sub-statement is a fresh place to be vacuous or over-general. The binding check must extend to generated subs, not just top-level goals.
(d) A stable, measured Phase-1 run exists first — phase1-run-002. phase1-run-001 is a floor, not a clean baseline: merge rate 0.6 was dragged down by an infrastructure fault (the bare prove worktree never runs lake exe cache get, so local verify rebuilds ~8486 mathlib modules from source and times out), a full /tmp, and two redundant duplicate PRs from agents re-selecting the same goal under pending auto-merge. Before Phase 2, re-run the prove swarm with the cache fix in place, close most of the 20-goal backlog cleanly, and record phase1-run-002 with a merge rate that reflects the loop’s real capability rather than disk and cache friction. Do not point the machine at hard targets on a loop whose merge rate is still an infrastructure artifact.
(e) A target is chosen — docs/phase2-targets.md. This is an explicit human curation call and cannot be automated: the kernel guarantees soundness, not relevance. docs/phase2-targets.md holds the shortlist, grounded in the swarm’s real Phase-1 band (Int/Nat one-liners delegating to mathlib). The recommended first target is the Nicomachus identity, Σk³ = (Σk)² — an unambiguous definitional statement that sidesteps the binding gap, genuinely needs two or three lemmas depending on an existing mathlib Σk lemma (so it exercises decomposition records and gap-selection), and is cheap and true. A genuine first contribution follows: a LeanComb/CombiBench combinatorial identity (high-confidence absence, mathlib’s thinnest area) or an unsolved PutnamBench item. The AISP-15 dogfooding set is kept as a later flagship, not first: its claims have no Lean statement (paper/natural-deduction only), so it would co-load autoformalisation with the still-open vacuity/fidelity gap. Absence claims are 2026-06-10 snapshots and must be re-grepped against mathlib HEAD at commit time — several “obvious” candidates (Bertrand, Stirling, Frobenius/Chicken-McNugget, sum-of-two-squares) were dropped precisely because they are already in mathlib, and Pick’s theorem was downgraded to unverified after a Lean formalization appeared (arXiv 2603.23095, Mar 2026).

3. Staged delivery

Five stages, gated so each lands its specs and its evidence before the next begins. The dependency order is deliberate: stabilise and measure the loop, give it good selection, give it decomposition, bind its claims to meaning, then — and only then — point it at a target. Statement-binding (Stage D) lands before the first Phase-2 run because decomposition (Stage C) multiplies the binding gap; the two are sequenced so that no target run happens against an unbound notion of “proved.”

Stage A — Stabilise and measure Phase 1 (`phase1-run-002`)

Re-run the prove swarm on the existing 20-goal backlog with the infrastructure faults from phase1-run-001 fixed first: add lake exe cache get to the prove worktree setup (the one-line fix phase1-run-001 flagged but the observer was not permitted to make), and run with clone + workdir + CLAUDE_CODE_TMPDIR on a roomy filesystem so no agent hard-blocks on a full /tmp. Close most of the backlog. Record phase1-run-002 with merge rate, collision rate, coordination-error count, and the duplicate-PR/fan-out behaviour, so the Phase-2 baseline is the loop’s real capability rather than disk-and-cache friction.

Lands: the cache-warm prove-worktree change to swarm/agent.sh (SPEC-007-A step 6 amendment); docs/metrics/phase1-run-002.md + .json.
Exit: most of the 20-goal backlog proved and merged; a trustworthy merge rate recorded with the cache fix in place; the redundant-PR/fan-out behaviour characterised (input to the Stage C fan-out caps).

Stage B — Affinity and gap-based selection (ADR-010 / SPEC-010-A)

Wire the selection mechanism the protocol already specifies but the script ignores. Implement affinity bookkeeping on the index (+1 on merge, −10 on fail, τ_v = −5 viability skip + re-queue for re-decomposition) and replace lexicographic selection with argmax(aff(g), −gap(g, library)), gap ≜ |deps(g) \ proved|. This is pure coordination/queue logic — it never touches Gate A, never touches soundness — so it is built and tested before decomposition gives it many sub-goals of differing value to choose between.

Lands: ADR-010 (selection and affinity decision); SPEC-010-A (selection algorithm, affinity update rules, index-entry aff lifecycle); swarm/agent.sh selection step rewrite + index affinity-update on merge/fail; --self-test cases for ranking, the viability skip, gap computation, and re-queue.
Exit: selection demonstrably prefers smaller-gap, higher-affinity goals over lexicographic order on a fixture tree; affinity updates land on the index on merge and fail; below-τ_v patterns are skipped and re-queued, all under hermetic --self-test.

Stage C — Decomposition (ADR-009 / SPEC-009-A)

Turn the prove-failure path from “release + flag” into “decompose.” On a prove-failure within budget, the agent produces a decompositions/<parent>.<agent>.aisp record (SPEC-003-C schema): sub-lemma statements, fresh goals/<sub>.aisp records with src pointing at the decomposition, and Post(A) ⊆ Pre(B) dependency edges. The parent is marked blocked until its subs prove; unblock logic re-opens it when its dependency set is covered. Guards are non-negotiable: the SPEC-003-C cap of 8 subs per decomposition, plus tight depth and budget caps to prevent runaway fan-out (a sibling flood worsens the duplicate-PR throughput risk phase1-run-001 already observed). The dependency edges must form a DAG — Post(A) ⊆ Pre(B) edges that cycle would deadlock the queue. SPEC-003-C defines the edge type but specifies no acyclicity check; Gate B must reject cycles, and this plan treats closing that gap as in-scope for Stage C, not a follow-up.

The load-bearing soundness rule for this stage, stated once and enforced: sub-lemmas alone prove nothing about the target. A Decomp record plus merged sub-index entries must not flip the parent to proved. The parent counts as proved only when an agent writes a library module that imports the subs, proves the parent’s exact signature, and that module passes Gate A — the same trust model as every other proof. The parent’s kernel recomposition is the proof; the decomposition is only queue structure.

Lands: ADR-009 (decomposition decision, the no-auto-prove-on-subs rule, fan-out/depth caps); SPEC-009-A (decomposition-record production on prove-failure, sub-goal generation and re-queue, parent blocked/unblock semantics, depth/breadth guards, the Gate B acyclicity check on Post(A) ⊆ Pre(B) edges); swarm/agent.sh prove-failure path rewrite (replacing SPEC-007-A step 11); Gate B DAG validation; --self-test for record production, edge validity, the blocked/unblock transitions, the cap/depth guards, and rejection of a planted cyclic decomposition.
Exit: a prove-failure produces a valid, acyclic, capped decomposition with re-queued subs; the parent is blocked and only unblocks when its subs are covered; a parent never flips to proved without a kernel-recomposing library module passing Gate A; Gate B rejects a planted cyclic decomposition; all under --self-test plus a live decomposition smoke.

Stage D — Statement-binding (ADR-011 / SPEC-011-A)

Close the gap PR #64 exposed and the Round-001 red team recorded as deferred. Add a defeq meta-check to Gate A that lowers a goal’s canonical statement to Lean and checks the merged theorem’s statement is definitionally equal to it — binding soundness (which Gate A already enforces) to meaning (which it does not). Because decomposition (Stage C) creates many new sub-statements, and every generated sub-statement is a new place to be vacuous or over-general, the binding check must extend to generated subs, not only top-level goals — this is why Stage D lands before any Phase-2 target run, not after. Build AuditFixtures for vacuous and weakened statements (the autoImplicit-class vacuity vector from #64 among them) and re-run a red-team round to prove the gate now blocks what it previously let through.

Lands: ADR-011 (statement-binding decision, scope over goals and generated subs); SPEC-011-A (the defeq meta-check in Gate A, canonical-statement lowering, the sub-statement extension, fixture catalogue); tools/gate_a binding check; AuditFixtures for vacuous/weakened/over-general statements; a fresh red-team round (gate-a-redteam-002) including the #64 autoImplicit payload.
Exit: Gate A rejects a vacuously-true or weakened restatement under a plausible name (the #64 payload now fails on the binding check, not only on the option scan); the binding check fires on generated subs as well as top-level goals; the new red-team round records every binding vector blocked.

Stage E — Choose target and first Phase-2 run

With Stages A–D landed, make the human curation call (docs/phase2-targets.md), re-grep the chosen target’s absence against mathlib HEAD at commit time, and run the first Phase-2 orchestration: point N agents at the target with decomposition on (Stage C), affinity and gap-selection on (Stage B), and the binding check live (Stage D). Recommended first target is the Nicomachus identity Σk³ = (Σk)² — it exercises decomposition and gap-selection end-to-end on an unambiguous, definitionally-clean statement, cheaply. Observe and record: did the swarm decompose, prove the subs, and recompose to the parent’s exact signature through Gate A — and did it reach the target rather than merely fragmenting it.

Lands: the finalised docs/phase2-targets.md target selection with a HEAD re-grep note; the Phase-2 run record (phase2-run-001).
Exit: the success metric in §4 — a first lemma proved that was not already in mathlib — or a recorded, diagnosed failure to reach it.

4. Exit and success metric

The single number that matters for Phase 2 is the first lemma proved that was not already in mathlib, recomposed to the target’s exact signature and passing Gate A with the binding check live. Not sub-lemma throughput. Not decomposition-record count. Not PRs merged.

State the trap explicitly, because the architecture makes it easy to fall into: a swarm can decompose busily and prove a hundred trivial fragments while never reaching the target. High sub-lemma throughput with zero target progress is failure, not partial success. Affinity and gap-selection (Stage B) are meant to pull the swarm toward the target rather than into a comfortable thicket of easy subs, but the metric is the backstop — Phase 2 is measured by target reach, and the run record must report distance-to-target, not just activity. The Nicomachus first target is chosen partly so this metric has an unambiguous yes/no answer on a cheap run before any real-contribution target is attempted.

5. Risks

Runaway decomposition fan-out. Decomposition that re-decomposes its own subs can fan out exponentially; a sibling flood also worsens the duplicate-PR throughput problem phase1-run-001 already observed (agents re-selecting the same goal under pending auto-merge). Mitigate: the SPEC-003-C cap of 8 subs per decomposition, plus the tight depth and total-budget caps that land with Stage C, plus the affinity viability skip (τ_v) that stops re-decomposing unproductive patterns.
Affinity local optima. Affinity favours proven approaches (+1/−10 is deliberately asymmetric), which can trap the swarm in a locally-good decomposition that never reaches the target. Mitigate: the −gap(g, library) term keeps pulling toward the target; re-queue-for-re-decomposition on sub-threshold patterns; and the §4 metric makes “busy but not progressing” visible rather than rewarded.
Build and throughput cost at Phase-2 scale; Agent SDK credit economics. Phase 2 runs more agents over more cycles, each a claude call plus a lake build (the phase1-run-001 cache fix is a prerequisite precisely to keep per-cycle cost bounded). At scale this is a real credit and wall-clock budget question, not a rounding error. Mitigate: the warm-cache fix from Stage A; the per-cycle wall/turn/attempt budgets the protocol already enforces (budget ≜ ⟨turns ≤ 40, wall ≤ 1800s, attempts ≤ 2⟩); fan-out caps that bound the total work a single target can spawn; cheap first target (Nicomachus) before any expensive one.
Target chosen too hard, or already in mathlib. Too hard → no progress and burnt budget; already in mathlib → no value even on success. Mitigate: the staged target ladder in docs/phase2-targets.md (Nicomachus → combinatorial identity → PutnamBench), the explicit human curation call, and the mandatory re-grep against mathlib HEAD at commit time (several “obvious” candidates were dropped exactly for being already present; Pick’s theorem was downgraded after a 2026 Lean formalization appeared).
Statement-binding defeq edge cases. The Stage D defeq meta-check is not a fully solved problem — two faithful statements can differ in ways defeq does or does not see, and decomposition multiplies the surface (every generated sub is a new binding site). Mitigate: the AuditFixtures catalogue of vacuous/weakened/over-general statements; flag-don’t-block where the check is uncertain (mirroring the dual-translation fidelity gate’s discipline); and treat the sub-statement extension as a prerequisite of Stage D, not a follow-up, so no Phase-2 target run happens against an unbound sub.

References

Reference ID	Title	Type	Location
REF-1	Distributed Autonomous Research Swarm: Architecture and Plan	Design document	distributed-research-swarm-plan.md
REF-2	Swarm contract (Affinity, Records, Loop)	Protocol	../../swarm/protocol.aisp
REF-3	SPEC-003-C — Translation and Decomposition Records	Specification	../adrs/specs/SPEC-003-C-Translation-and-Decomposition-Records.md
REF-4	SPEC-007-A — Agent Loop Script	Specification	../adrs/specs/SPEC-007-A-Agent-Loop-Script.md
REF-5	ADR-006 — Gate A Soundness Enforcement	Decision	../adrs/ADR-006-Gate-A-Soundness-Enforcement.md
REF-6	Gate A Red Team — Round 001	Metrics	../metrics/gate-a-redteam-001.md
REF-7	Phase-1 swarm trial — run 001	Metrics	../metrics/phase1-run-001.md
REF-8	Phase-2 target shortlist	Curation	../phase2-targets.md
REF-9	ADR-009 — Decomposition (to be authored)	Decision	../adrs/ADR-009-Decomposition.md
REF-10	ADR-010 — Affinity and Gap-Based Selection (to be authored)	Decision	../adrs/ADR-010-Affinity-Gap-Selection.md
REF-11	ADR-011 — Statement-Binding Check (to be authored)	Decision	../adrs/ADR-011-Statement-Binding.md

Status History

Status	Approver	Date
Proposed	unsorry maintainers	2026-06-10

This site is open source. Improve this page.