Machine record: phase0-run-001.json · Schema: METRICS.md
Three supervised agents — trial-alpha (sonnet), trial-bravo (sonnet), trial-charlie
(haiku) — each ran 12 cycles of the Phase-0 loop (swarm/agent.sh: claim → translate → diff →
PR → release) against the 10-goal known-true backlog, deliberately racing each other. A fourth
agent, trial-omega (sonnet), then ran a 3-cycle convergence sweep to mop up overlapping
translation PRs, exercising the sweep fix merged in
PR #8 (SPEC-007-A) immediately before the trial.
A dead agent’s claim (trial-dead on nat-le-trans) was planted in advance to give the reaper
something real to reap. All three supervisors stopped at the 12-cycle cap with work still
visible, not at quiescence.
| Metric | Value | Interpretation |
|---|---|---|
| claim_attempts | 39 | 38 claimed + 1 collision event across all four agents. |
| collisions | 1 | trial-bravo found nat-le-trans’s live-claim cap full and withdrew per contract — the designed behaviour. |
| collision_rate | 0.0256 | Explicit cap-full withdrawals were rare; most contention resolved at the git layer (see caveat below). |
| translations_merged | 24 | Every tr(...) PR that was opened, merged — no translation was lost. |
| prs_total | 38 | PRs #9–#46, 24 tr(...) + 14 converge(...); all 38 merged autonomously. |
| gate_b_failures_on_first_attempt | 0 | No trial artifact was ever rejected by the validator. |
| validator_pass_rate | 1.00 | 38/38 trial PRs green on gate-b first attempt (88/88 gate-b runs succeeded; no re-run attempts exist). |
| goals_matched | 7 | Independent translations converged byte-equal after normalization. |
| goals_flagged | 1 | nat-le-refl — the trial’s single fidelity flag. |
| fp_rate | 0.125 | 1 flag / 8 decided goals; all goals are known-true, so the flag is a false positive by construction. |
| paraphrase_pairs_sha_equal | 1 of 3 | One pair byte-equal, one pair is the flag itself, one pair indeterminate (see Fidelity). |
| coordination_errors | 0 | No Gate B rejections, no claims beyond the GB014 cap, no protocol violations in any branch history. |

Kill criterion: not tripped. fp_rate = 0.125 < 0.20 (SPEC-003-D: “flag FP ≥ 20 % on identical-meaning pairs”). Statement-fidelity diffing survives Phase 0; ADR-008 (single translation + Lean back-translation fallback) stays on the shelf.
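
The derived rates in the table follow from four raw counts, and the kill-criterion check is one inclusive comparison. The sketch below reproduces that arithmetic. The variable and function names are illustrative assumptions, not the field names defined in METRICS.md; only the numbers and the 20 % threshold come from this report.

```python
# Raw counts copied from the metrics table in this report.
claimed = 38
collisions = 1
goals_matched = 7
goals_flagged = 1

# SPEC-003-D threshold as quoted in this report: flag FP >= 20 %.
KILL_THRESHOLD = 0.20

claim_attempts = claimed + collisions          # 39
collision_rate = collisions / claim_attempts   # cap-full withdrawals only
decided_goals = goals_matched + goals_flagged  # 8
fp_rate = goals_flagged / decided_goals        # every goal is known-true


def kill_tripped(rate: float, threshold: float = KILL_THRESHOLD) -> bool:
    """Inclusive comparison: a rate exactly at the threshold trips it."""
    return rate >= threshold


print(claim_attempts)            # 39
print(round(collision_rate, 4))  # 0.0256
print(fp_rate)                   # 0.125
print(kill_tripped(fp_rate))     # False
```

Because the comparison is inclusive, a strict rate of exactly 0.20 would count as tripped under this reading.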

[…] collision event on nat-le-trans at 03:07:35Z, after trial-charlie (03:07:27Z, commit 1938793) and trial-alpha (03:07:33Z, commit 07cd367) filled the 2-live-claim cap. Bravo withdrew and claimed nat-leq-self instead, exactly the contract’s collision ⇒ next-goal recovery.

[…] claim:, 39 release:, 1 reap:) never shows more than 2 concurrent live (unexpired) claims per goal, and never two live claims by one agent. First-push-wins held: claim-push losers either rebased and won the next slot or withdrew.

metrics.jsonl has no event for claim-push races resolved by rebase-and-retry (alpha 2, bravo 4, charlie several, including one 3-attempt claim); those are visible only in supervisor stderr. collision_rate therefore measures cap-full withdrawals, not raw git-level contention, which was substantially higher.

[…] claims/nat-le-trans.trial-dead.aisp, expired 4502 s), kept 0, unparsable [].

[…] claims: 09ba7efe4f377603d81df285e70b8092c99e5852 (unsorry-reaper[bot], “reap: 1 expired claim(s)”, deletes only the planted file).

[…] reap: commit in the history; every other claim removal is a release: by the owning agent. An earlier dispatch run (27245383354, 00:44:22Z, pre-expiry) correctly kept the claim. No cron-triggered runs appeared during the window: both reaper runs were workflow_dispatch, so cron-schedule reliability on this repo is unverified by this trial.

| Pair | Result |
|---|---|
| nat-mul-comm ↔ nat-product-order | Byte-equal final shas (ea25d3b6…): both English phrasings normalized to ∀x,y∈ℕ: x·y ≡ y·x. Correct fidelity — and an accidental proof that the backlog holds a semantic duplicate (Phase-1 dedup item). |
| nat-le-refl ↔ nat-leq-self | nat-leq-self matched (bdfe3dd8…); nat-le-refl flagged (sha ∅) — the run’s one false positive (PR #19 still merged; flag-don’t-block worked as designed). |
| nat-add-zero ↔ nat-zero-identity-add | Indeterminate. nat-add-zero matched (84f38b99…) but nat-zero-identity-add ended the trial status≜open with only trial-omega’s translation (PR #45) — omega cannot converge against itself, so there is no final sha to compare. Scored not-equal under the strict metric, but the comparison is unobservable for this run, not failed. |
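
The three outcomes in the table above (byte-equal, flagged, indeterminate) can be told apart mechanically from each goal's end-of-trial state. The following is a minimal sketch of that scoring, assuming a hypothetical `GoalResult` record with a status string and an optional final sha; neither the record shape nor the status vocabulary is taken from the trial's actual schema. The shas are the truncated prefixes quoted in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoalResult:
    """Hypothetical end-of-trial state for one goal."""
    status: str          # assumed values: "translated", "flagged", "open"
    sha: Optional[str]   # agreed final sha, or None when there is none


def score_pair(a: GoalResult, b: GoalResult) -> str:
    """Classify a paraphrase pair from the two goals' final states."""
    if "flagged" in (a.status, b.status):
        return "flagged"
    if a.sha is None or b.sha is None:
        return "indeterminate"   # nothing to compare, which is not a failure
    return "equal" if a.sha == b.sha else "not-equal"


# Truncated sha prefixes as quoted in this report.
pairs = {
    "nat-mul-comm / nat-product-order": (
        GoalResult("translated", "ea25d3b6"),
        GoalResult("translated", "ea25d3b6"),
    ),
    "nat-le-refl / nat-leq-self": (
        GoalResult("flagged", None),
        GoalResult("translated", "bdfe3dd8"),
    ),
    "nat-add-zero / nat-zero-identity-add": (
        GoalResult("translated", "84f38b99"),
        GoalResult("open", None),
    ),
}

results = {name: score_pair(a, b) for name, (a, b) in pairs.items()}
strict_equal = sum(1 for outcome in results.values() if outcome == "equal")

for name, outcome in results.items():
    print(f"{name}: {outcome}")
print(f"paraphrase_pairs_sha_equal: {strict_equal} of {len(pairs)}")
```

Keeping "indeterminate" separate from "not-equal" preserves the distinction drawn in the table: the strict metric counts only "equal", while the report can still state that a comparison was unobservable rather than failed.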

[…] git push is rejected non-fast-forward and the cycle exits 1. Every retry recovered, nothing was lost or corrupted, but each occurrence burns a cycle and double-counts claimed/translated/released telemetry without a pr-opened. Worth a fix before Phase 1.

[…] translations:"3"). The cap bounds concurrency, not total translation count; acceptable for Phase 0, worth tightening if duplicate effort matters at Phase-1 scale.

[…] converged line; only one convergence happened.

[…] 09ba7efe, job-summary report), and no live claim was ever touched. Caveat carried forward: the trial only observed workflow_dispatch runs; observing a cron-triggered reap remains open.

[…] unparsable=[]).

[…] metrics.jsonl; failure counts come from supervisor exit-code reports.

The observation above froze at the supervisors’ cycle caps. The run was completed immediately afterwards; this addendum records the final state. The original numbers are left untouched as the at-observation record.
A fifth identity, trial-delta (sonnet), ran the two leftover second translations: PR #48 (nat-zero-identity-add, converged in-PR) and PR #49 (nat-zero-lt-succ → translated).
One delta cycle reproduced anomaly 1 (branch-reuse push rejection, retry recovered) —
consistent with the trial’s failure pattern.
Final decided state before adjudication: 8 translated + 2 flagged (nat-le-refl, nat-zero-identity-add), a strict FP rate of 2/10 = 0.20, exactly at the kill-criterion boundary. Review of all flagged translation pairs found each pair identical up to α-renaming, apart from one mechanical difference that was the root cause of both flags: a redundant parenthesis wrap around the binder body (∀x∈ℕ:x≤x vs ∀n∈ℕ:(n≤n)).
Resolution (PR #50): normalization step 5, redundant-paren elimination restricted to provably meaning-preserving groups (application parens P(x) are never collapsed; 17 new tests, including the exact trial pairs). Sha-drift audit: all 8 previously matched goals unchanged. Both flagged pairs re-diff to MATCH, and both goals resolved as translated with agreed shas.
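
To illustrate why the two flagged forms hash differently before the fix and identically after it, here is a toy normalizer. It is a sketch under stated assumptions, not the PR #50 implementation: it handles only a single leading ∀ binder, removes only a paren group that wraps the entire binder body, renames bound variables to hypothetical canonical names `v0`, `v1`, …, and uses SHA-256 as a stand-in content address.

```python
import hashlib
import re

# Toy grammar (assumption): "∀<names>∈<domain>:<body>", one binder only.
BINDER = re.compile(r"^∀([^∈]+)∈([^:]+):(.*)$")


def wraps_whole(body: str) -> bool:
    """True only if the first '(' is closed by the very last character."""
    if not (body.startswith("(") and body.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i == len(body) - 1
    return False  # unbalanced input: leave it alone


def strip_outer_parens(body: str) -> str:
    """Drop parens around the whole body; application parens P(x) stay."""
    while wraps_whole(body):
        body = body[1:-1]
    return body


def normalize(stmt: str) -> str:
    stmt = "".join(stmt.split())  # ignore whitespace
    m = BINDER.match(stmt)
    if m is None:
        return stmt
    names, domain, body = m.groups()
    # Alpha-renaming: bound names become v0, v1, ... in binder order.
    mapping = {name: f"v{i}" for i, name in enumerate(names.split(","))}
    body = strip_outer_parens(body)
    body = re.sub(r"\w+", lambda t: mapping.get(t.group(0), t.group(0)), body)
    return f"∀{','.join(mapping.values())}∈{domain}:{body}"


def content_sha(stmt: str) -> str:
    return hashlib.sha256(normalize(stmt).encode("utf-8")).hexdigest()


a, b = "∀x∈ℕ:x≤x", "∀n∈ℕ:(n≤n)"
print(normalize(a))                      # ∀v0∈ℕ:v0≤v0
print(content_sha(a) == content_sha(b))  # True
print(normalize("∀x∈ℕ:P(x)"))            # ∀v0∈ℕ:P(v0)
```

The whole-body check is deliberately conservative: a body such as (a)∧(b) starts and ends with parentheses but is left unchanged, because its first group closes before the end.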

| Item | Final state |
|---|---|
| Goals decided | 10/10 translated, 0 flagged, 0 open |
| Post-fix fidelity FP rate | 0/10 |
| Paraphrase pairs byte-equal | 3/3 (ea25d3b6…, bdfe3dd8…, 84f38b99…): every pair of independently worded English statements converged to identical content addresses |
| Kill criterion | Not tripped: at-boundary reading rendered moot by the root-cause fix; ADR-008 fallback stays shelved |
| Anomalies 1/2/4 | Fix in flight: re-entrant cycle-state handling for agent.sh (branch-name uniqueness + hard-reset claims worktree per cycle) |