unsorry

Swarm Metrics

Index of metric runs and the schema each machine record (phase0-run-NNN.json) follows.

Metric runs are recorded by a trial observer after each swarm trial. Each run lands as a docs PR through Gate B like everything else — the observer opens a PR with the machine record (.json), the human narrative (.md), and any index updates; required checks must be green before auto-merge, and no metric run is ever pushed directly to main.

Runs

Run	Date	Trial	Records
`phase0-run-001`	2026-06-10	Phase-0 swarm trial W2 (3 supervised agents + 1 sweep agent, 10 backlog goals)	phase0-run-001.json · phase0-run-001.md
`gate-a-redteam-001`	2026-06-10	Gate A red team W3 (9 adversarial bypass vectors, real PRs)	gate-a-redteam-001.md
`phase1-run-001`	2026-06-10	Phase-1 swarm W4 (first prove cycle: 3 prover agents on sonnet; merge_rate 0.6, 3 proofs merged to the verified library, 0 gate-a failures on the merged path)	phase1-run-001.json · phase1-run-001.md
`phase1-run-002`	2026-06-10	Phase-1 swarm rerun (post-cache-fix prove baseline: 3 prover agents, 12 cycles each; merge_rate 0.889 (up from 0.6), 0 prove_failures (down from 6 build timeouts), 0 gate-a failures on the merged path; 13 new goals proved, library 3→16 modules; dominant cost is merge-lag duplicate fan-out (~64% of cycles))	phase1-run-002.json · phase1-run-002.md
`checklist-evidence`	2026-06-10	Contributor-readiness checklist W5 (items (a)–(f) adversarially verified; body of the v1.0.0 release)	checklist-evidence.md
`gate-a-redteam-002`	2026-06-10	Gate A statement-binding red team (Stage D, ADR-011): 3 adversarial bypass vectors against the binding gate + 1 honest control, real PRs #124–#127; 3/3 attacks blocked at the regenerated binding obligation, 0 survivors, control passes gate-a — the round-001 statement-fidelity gap is now gated	gate-a-redteam-002.md
`phase2-run-001`	2026-06-10	First Phase-2 target run (Stage E): 1 prover agent (e-alpha) against the seeded, mathlib-absent target `nicomachus-sum-cubes` (Nicomachus Σk³=(Σk)²). EXIT METRIC MET — first mathlib-absent lemma proved; target reached via the DIRECT path in 1 cycle (no decomposition), PR #133 merged to main (`79c6f2c`), binding obligation held (gate-a statement-binding step green). Decomposition (Stage C) available but not exercised; subs 0/0; 0 open PRs remaining	phase2-run-001.json · phase2-run-001.md
`phase2-run-002`	2026-06-10	Phase-2 hard-target run (Stage E) built to TEST decompose → recompose: 4 prover agents (h1–h4) on `not-prime-pow-four-add-four`, `platonic-schlafli-core`, `alternating-sum-naturals`, `sum-range-pow-four-closed-form`. HEADLINE NOT DEMONSTRATED — decompose → recompose did NOT fire on any target (0 decomposed, 0 subs, 0 `decomposed` events, 0 decomposition PRs). 1/4 proved (`sum-range-pow-four-closed-form`, PR #140, DIRECT leaf, gate-a green, binding held); 3/4 not-reached (h1/h2/h3 stalled mid-direct-attempt before the budget-exhaustion → decompose branch could trigger). 0 open PRs. Stall is BEFORE decomposition, not a sub-failure or a broken recompose	phase2-run-002.json · phase2-run-002.md

Machine-record JSON schema

Top-level fields of phase0-run-NNN.json:

Field	Type	Definition
`run_id`	string	Unique id of the metric run, `phase0-run-NNN`.
`date`	string (ISO-8601 date)	Date the trial ran (UTC).
`agents`	array of `{id, model, role}`	Agent ids and the model each ran on (`role` is `supervisor-driven` or `sweep`).
`metrics`	object	The computed metrics — see field table below.
`planted_pairs`	array of `{pair, shas, sha_equal, note}`	Result per planted paraphrase pair (identical meaning, different English).
`reap`	object `{run_url, removal_commit, reaped, kept, unparsable}`	Reaper evidence: Actions run URL and the `reap:` commit on the `claims` branch.
`pr_numbers`	array of int	All trial PR numbers (titles `tr(...)` / `converge(...)`).
`kill_criterion`	object `{metric, threshold, value, tripped}`	The fidelity kill criterion (SPEC-003-D) and whether this run tripped it.
`notes`	array of string	Caveats, anomalies, and observability gaps — anything partially unobservable is declared here, never silently estimated.

`metrics` fields

Field	Type	Definition
`claim_attempts`	int	Count of `claimed` events plus `collision` events across all agents (incl. the sweep agent), from the agents’ `metrics.jsonl` telemetry.
`collisions`	int	Count of explicit `collision` events (an agent found the goal’s live-claim cap already filled and withdrew).
`collision_rate`	number	`collisions / claim_attempts` (0 if no attempts).
`translations_merged`	int	Merged trial PRs titled `tr(...)`.
`prs_total`	int	All trial PRs (`tr(...)` + `converge(...)`).
`gate_b_failures_on_first_attempt`	int	Trial PRs whose `gate-b` required check failed on its first attempt.
`validator_pass_rate`	number	(trial PRs whose `gate-b` check passed on first attempt) / (trial PRs total).
`goals_matched`	int	Goals whose final record is `status≜translated` (independent translations converged).
`goals_flagged`	int	Goals whose final record is `status≜flagged` (translations diverged after normalization).
`fp_rate`	number	`flagged / (matched + flagged)`. All Phase-0 goals are elementary known-true statements, so any flag is a false positive by construction.
`paraphrase_pairs_sha_equal`	int	Of the planted paraphrase pairs, how many ended with byte-equal final shas (secondary fidelity signal, not part of `fp_rate`).
`coordination_errors`	int	Malformed artifacts rejected by Gate B + double-claims beyond the GB014 cap + protocol violations visible in branch histories. Expected 0; each instance is described in `notes`.

Kill criterion

fp_rate >= 0.20 trips the fidelity kill criterion (SPEC-003-D: “flag FP ≥ 20 % on identical-meaning pairs”). If tripped, the response is ADR-008: drop to single translation plus Lean back-translation.

Phase-1 prove-cycle `metrics` fields

Prove-cycle runs (phase1-run-NNN.json) record proof production rather than translation, so they carry a different metrics block (Phase-0 translation fields do not apply):

Field	Type	Definition
`claim_attempts`	int	Count of `claimed` events (+ any `collision` events) across all provers’ `metrics.jsonl`. Where a prover’s jsonl is unobservable, its reported cycle count is folded in and the split is declared in `notes`.
`collisions`	int	Explicit `collision` telemetry events (cap-full claim withdrawal). Release-branch optimistic-concurrency races that the loser resolved by rebase-and-retry are not collisions.
`collision_rate`	number	`collisions / claim_attempts` (0 if no attempts).
`proofs_attempted`	int	Distinct `(goal, agent)` prove claims that ran a proof (re-claims of the same pair collapse to one).
`proofs_merged`	int	Merged prove PRs (titled `prove(<goal>): <name> by <agent>`), from `gh` ground truth.
`merge_rate`	number	`proofs_merged / proofs_attempted`.
`prove_failures`	int	`prove-failed` events — claude could not produce a locally-passing proof within the attempt budget. Not a soundness rejection unless `notes` says so.
`coordination_errors`	int	Gate-B rejections + double-claims beyond the live-claim cap + protocol violations in branch histories. Expected 0; each instance described in `notes`.
`gate_a_failures_on_merged_path`	int	Merged prove PRs whose `gate-a` required check was RED (proof passed the agent’s local `lake build`+`axiom_audit` but CI disagreed). 0 means local verify mirrored CI; any >0 is a false-confidence finding described in `notes`.

This site is open source. Improve this page.