unsorry

Contributor-Readiness Checklist Evidence (v1.0.0)

This document is the contributor-readiness evidence record for agenticsnz/unsorry and forms the body of the v1.0.0 release. Each of the six readiness items (a)–(f) was assessed by an independent adversarial verifier that cloned the repo fresh, re-ran the load-bearing checks rather than trusting the metrics prose, pulled live CI logs, and probed for evasion paths. A separate runner executed the README quickstart end-to-end from a clean clone.

Result: all six items are sufficient (high confidence). No verifier returned insufficient. The caveats each verifier raised are reproduced honestly below — especially the (b) shared-GitHub-account authorship caveat and the (a) statement-vs-canonical-binding gap deferred to Phase 1. Honesty is the product: nothing here is papered over to reach v1.0.0.

Verification environment: repo cloned at main HEAD 1771cd8. Date of assessment: 2026-06-10.


(a) Gate A has rejected real bad input on real PRs

Claim. Gate A (the soundness gate) has rejected real bad input on real PRs — not just in theory.

Verdict: sufficient · confidence: high.

Evidence.

Representative CI evidence: Actions run 27253447146 / job 80482647068.

Caveat (honest, disclosed by the project, does not undermine this item). Gate A has no statement-fidelity layer: no check binds a merged library theorem’s actual statement to its claimed canonical goal/sha. A vacuously-true or mis-stated theorem under a plausible name is sound but meaningless and would pass build + audit + leanchecker. #65 closed only the specific autoImplicit instance; the general class (statement-vs-canonical binding) remains open and is documented verbatim in the metrics doc, which defers the proper fix (statement-vs-canonical-sha binding / autoformalisation) to Phase 1. Because the disclosure is honest and this item is narrowly “rejected real bad input on real PRs” — soundness-breaking input, which it provably does on real closed PRs with red CI — the verdict is sufficient. Minor: the attack fixtures (Attack0-6.lean) live on the closed branches and were not merged to main (correct), so the durable evidence is the closed-PR CI history + the metrics doc, both verified live. The verifier pulled per-step failing logs for 4 of 9 vectors (#56/#61/#62/#64) in full detail and confirmed gate-a=FAILURE for the other five.


(b) A non-author agent has merged a proof end-to-end

Claim. A non-author (swarm) agent has merged a proof end-to-end.

Verdict: sufficient · confidence: high.

Evidence.

Caveat — the GitHub-author caveat (disclosed, prominent, matches reality). “Non-author agent” holds in the swarm AGENT_ID sense, not cryptographically. By ADR-007 design the agent runs under the maintainer’s shared GitHub account and git-author identity: gh pr view #74 shows author cgbarlow, is_bot:false, and git author Chris Barlow. The merged proofs are trivial one-liners (Int.neg_neg n, Int.add_right_neg n, ⟨h.2, h.1⟩), so the artifacts alone cannot distinguish “autonomous agent wrote it” from “maintainer hand-wrote it under an agent label.” The claim of non-human authorship rests on trusting the orchestration trail (claim/release commit markers + PR workflow), not on cryptographic proof. This is exactly the limitation the §HONEST CAVEAT (ADR-007) section of phase1-run-001.md discloses, in a prominently labeled section that matches reality. Second disclosed gap: the metrics.jsonl event stream is not committed to the repo (lives uncommitted on the workspace; bravo’s jsonl was unreadable due to ENOSPC), so the “jsonl events carry agent:prover-alpha” leg cannot be reproduced from a clone — only the commit-subject and PR-title legs are independently verifiable (and they suffice). Verdict sufficient because the claim is scoped to “a distinctly-identified swarm agent produced a novel proof that passed the real soundness gate and merged to main, with the shared-account caveat disclosed” — every part independently confirmed.


(c) Claim-collision handling works under concurrency (first-push-wins + TTL reaping)

Claim. Claim collisions and TTL reaping were observed under concurrency (first-push-wins arbitration + TTL reaping of dead claims).

Verdict: sufficient · confidence: high.

Evidence.

Caveats (three findings, none fatal).

  1. Metrics prose inaccuracy. The file claims the 00:44:22Z dispatch run “correctly kept the claim,” implying deliberate live-claim sparing — but the dead claim file was not committed until 03:03:30Z (commit a5dd46b, ~2h19m later). At 00:44 the reaper saw an empty claims dir (kept:0, reaped:[]); it kept nothing because there was nothing there. The narrative overstates this specific run. Live-claim safety is still genuinely shown structurally (only reap: commit; all else release:) and by the before-expiry re-run.
  2. Live-trial first-push-wins is asserted, not artifact-reproducible. Rejected pushes leave no commit, and the metrics honestly states the rebase-retry race counts (alpha 2, bravo 4, etc.) are “visible only in supervisor stderr,” which is not in the repo. The mechanism is reproducible via --self-test; the live-trial arbitration tallies cannot be independently verified from committed artifacts.
  3. The metrics’ own open caveat (“observing a cron-triggered reap remains open”) is now actually closed by the real event:schedule run 27254569302 — evidence is stronger than the file claims.

Net: the claim survives. Tightening to consider for the public invitation: correct the misleading 00:44-run gloss and add an on-repo collision-event log so live first-push-wins is not stderr-only.


(d) Statement-diff false-positive rate < 20%

Claim. The statement-diff (fidelity) false-positive rate is < 20%.

Verdict: sufficient · confidence: high — but rescued at the boundary; read the caveat.

Evidence.

Caveat — the headline “< 20%” is genuinely fragile; a skeptic should note it crossed the line. The frozen at-observation rate of 0.125 is below threshold only because 2 of 10 goals were still undecided (1 flag / 8 decided). Once all 10 goals were decided pre-fix, the strict rate was exactly 0.20 = 2/10, which by the repo’s own documented kill criterion (fp_rate >= 0.20) trips the criterion — the doc admits this verbatim (“0.20, exactly at the kill-criterion boundary”). So “< 20%” does not hold as-originally-measured; it survives only because same-day PR #50’s normalizer fix re-diffs both flags to MATCH for a true 0/10. That fix was verified legitimate (re-run; shas match; no over-stripping regressions; meaning-preserving guards present), so the final 0/10 is honest and the claim is ultimately sufficient — but it is a “rescued at the boundary by a post-hoc fix” result, not a comfortably-margined measurement. Two narrower caveats: (a) both flags shared a single mechanical root cause (binder-body paren wrap), so 0/10 reflects one fix class, not breadth across diverse divergence types; (b) the trial is on a 10-goal known-true elementary backlog where every flag is an FP by construction — the metric says nothing about the false-negative rate (truly-divergent translations wrongly matched), which is the more dangerous failure mode and is untested here.


(e) swarm/protocol.aisp exists and validates

Claim. swarm/protocol.aisp exists and validates.

Verdict: sufficient · confidence: high.

Evidence.

Caveats (bound the strength of “validates”; do not change the verdict).

  1. The upstream aisp-validator@0.3.0 is lenient. A negative control showed a corrupted file (appended garbage with an unbalanced bracket) still passes at Platinum with exit 0 (ρ dipped to 1.711); only an empty file is rejected (✗ INVALID, ⊘ Reject, exit 1). So “validates at Platinum” certifies generic AISP syntax/density/tier, not strict structural correctness or any unsorry domain schema. ADR-003 explicitly characterizes the package this way and deliberately makes it advisory-only, with the load-bearing checks in tools/gate_b (which also passes).
  2. Minor doc inconsistency. SPEC-003-D §22 states the bar as “Gold (◊⁺) or better” while swarm/README.md and CHANGELOG.md headline “◊⁺⁺ Platinum.” Not contradictory (Platinum exceeds Gold) and §23 reconciles them, but the advertised tier differs across docs.

The file exists, parses, exits 0, and clears the required ≥◊⁺ bar with margin — verified by independent re-run and by the green CI job.


(f) The README quickstart actually runs

Claim. The README “Running an agent” quickstart actually runs from a fresh clone.

Verdict: sufficient · confidence: high. Clean-clone runner: ran_clean: true.

Static verification. Every command in README L60–76 was checked against the real repo: clone + cd unsorry (dir name matches); lake exe cache get (canonical mathlib incantation, mathlib4 present in lake-manifest.json); lake build (lakefile.toml defaultTargets=[UnsorryLibrary, UnsorryGoals]); python3 -m tools.gate_b validate . (actually run from a clean clone → exit 0, 30 records, ok:true; negative control on a corrupted tree → ok:false count:9, proving the pass is real); ./swarm/agent.sh --prove --once (executable; flags parsed at lines 1623–1632; bash -n passes; --self-test run → 19/19 PASS, exit 0).

Clean-clone runner — step table. The dedicated runner cloned fresh and executed the quickstart end-to-end:

# Command Exit Note
1 git clone https://github.com/agenticsnz/unsorry && cd unsorry 0 Clean clone. Path adjusted to /workspaces because /tmp had only ~20M free (environmental, not a README defect); command otherwise verbatim.
2 export PATH="$HOME/.elan/bin:$PATH" (env setup) 0 lake/lean not on PATH in a bare shell; elan toolchain present. README prerequisites cover the Lean toolchain via elan, so expected env setup. lake --version → Lake 5.0.0 / Lean 4.30.0, matching lean-toolchain pin leanprover/lean4:v4.30.0.
3 lake exe cache get 0 Fetched/decompressed prebuilt mathlib (8283 cached files, “Completed successfully in 14435 ms”). Did exactly what README claims — fetch prebuilt mathlib, never builds from source. Warm cache here; README warns it can take minutes cold.
4 lake build 0 “Build completed successfully (586 jobs).” Ends clean as README claims. The declaration uses sorry warnings are on goals/*.lean placeholder theorems (the open backlog the swarm is meant to prove) — expected, not errors; the Unsorry.* library builds warning-free.
5 python3 -m tools.gate_b validate . 0 Pass (validator silent on success). Ran under Python 3.11.2; README prereqs name 3.12, but gate_b ran fine on 3.11 (environmental, no defect).
6 ./swarm/agent.sh --self-test 0 “self-test: all 19 tests passed.” README L74 says “Run ./swarm/agent.sh --self-test to check your setup” — does exactly that.
7 ./swarm/agent.sh --prove --dry-run 0 “dry-run: would claim goal int-sub-eq-add-neg,” 17 candidates listed, claimed nothing — exactly as README describes for --dry-run. Substituted for the literal --prove --once per task instruction to avoid opening a real PR / a long mathlib build; the full prove cycle is exercised by Phase-1 metrics.

Deviations (no undocumented fixes needed). Every documented command ran with exit 0 and did what the README claims. The three adjustments (clone path to /workspaces for disk space; export PATH for the elan toolchain covered by README prereqs; substituting --prove --dry-run for --prove --once per explicit task instruction) are all environmental or task-sanctioned, none are README defects.

Caveats. Two Lean-dependent commands (lake exe cache get, lake build) could not be executed in the static-verifier env (no Lean toolchain) — but the dedicated runner executed both with exit 0. Residual unfalsifiable-here points: (a) lake exe cache get depends on the upstream mathlib cache being published for the pinned v4.30.0 rev (c5ea0035) — ADR-002 says release-tag caches are guaranteed, and the runner’s warm fetch confirms it on at least one host; (b) the full --prove --once path additionally needs authenticated claude + gh and a claimable goal, exercised by Phase-1 metrics rather than in the quickstart run. Minor non-blocking nit: README prereqs say Python 3.12 but gate_b runs cleanly on 3.11 — the stated prereq is stricter than reality, not a failure.


Overall readiness statement

agenticsnz/unsorry is ready for public contributor invitation at v1.0.0. All six contributor-readiness items (a)–(f) were independently verified sufficient at high confidence by adversarial verifiers who cloned the repo fresh, re-ran the load-bearing checks, pulled live CI logs, and probed for evasion paths; the README quickstart runs clean from a fresh clone. The soundness gate (Gate A) has demonstrably rejected real soundness-breaking input on real, now-closed PRs with red CI; a distinctly-identified swarm agent has merged a novel, sorry/axiom-free proof end-to-end through that gate; claim-collision arbitration and TTL reaping are observed and reproducible; the statement-diff false-positive rate finished at 0/10 after a verified normalizer fix; the coordination protocol validates at the required tier; and a newcomer following the README can build the library and run the agent.

This readiness is asserted with its limitations stated, not hidden — the caveats below are carried forward into the public invitation. Honesty is the product.

Known limitations carried into the public invitation