| Field | Value |
|---|---|
| Decision ID | ADR-017 |
| Initiative | unsorry Phase 3 — operational resilience |
| Proposed By | unsorry maintainers |
| Date | 2026-06-11 |
| Status | Accepted |
In the context of an agent loop that is deliberately fail-fast — it exits 3 on infrastructure failure (ADR-016) and exits 0 on an empty pool even while the tree is still in flight (open PRs awaiting gates, blocked parents awaiting the unblock sweep) — leaving “keep the tree moving” to a human-attended orchestrator,
facing a day of production incidents in which (a) every outage required a human to notice, diagnose, and relaunch; (b) an agent re-claimed a goal whose prove PR was already open (the claim is released at PR-open, so the claims branch cannot see in-flight work), producing the #168 demote that silently conflicted the real proof PR #166 — and a conflicted PR gets no CI runs, so its armed auto-merge waits forever without a single red signal; and (c) a claim-race window let two agents prove the same leaf and open duplicate PRs (#184/#185),
we decided for a supervisor wrapper (swarm/supervise.sh) owning the retry loop — exponential backoff on infra exits (300s ×2 capped 3600s), short retry on cycle failures, in-flight waits on empty-pool-with-open-scope, terminating only when the scoped goal tree is fully proved — plus scope-limited PR hygiene on every wait (close duplicate prove PRs keeping the oldest; loudly flag CONFLICTING PRs), plus an agent-side claim guard that skips any candidate with an open prove PR,
and neglected auto-resolving conflicted PRs (goal-record conflicts can need semantic judgement — the supervisor shouts, the maintainer resolves), fixing the claim-race window itself (the interleave is not yet reproducible; the PR-level dedupe bounds its cost to one redundant run), and moving supervision into CI cron (the supervisor needs the local clones, caches, and claude CLI),
to achieve a swarm where one command drives a goal tree to closure across quota outages, merge latency, and queue races — outages cost wall-clock, not maintainer attention,
accepting that the supervisor adds a second script to keep shellcheck-clean and self-tested, PR hygiene depends on best-effort gh calls (it degrades to logging when the API is down), and the claim guard spends one API call per candidate per pass.
Builds directly on ADR-016: the agent now stops cleanly on infrastructure failure; this ADR makes something restart it intelligently. The supervisor’s whole policy is a pure function (next_action) and the tree-closure test (scope_closed) reads only goal records — both hermetically tested. The duplicate-PR closer encodes exactly the #184/#185 resolution a maintainer performed by hand.
Pros: one command to closure; policy pure and testable; uses existing local environment; hygiene encodes proven manual fixes. Cons: still a local process (dies with the machine — the orchestrator restarts it, but that is one level up, not zero).
A scheduled workflow that relaunches agents. Rejected: CI runners have neither the mathlib caches, the clones, nor the claude CLI session; granting them all three rebuilds the local environment in a worse place.
Fold backoff into agent.sh. Rejected: fail-fast agents are the ADR-016 design — a sleeping agent holds claims and is indistinguishable from a hang; separating loop policy (supervisor) from cycle mechanics (agent) keeps both testable.
| Relationship | ADR ID | Title | Notes | |————–|——–|——-|——-| | Depends On | ADR-016 | Infrastructure-Failure Guard | Exit 3 is the supervisor’s backoff signal | | Amends | ADR-007 | Agent Identity and Budgets | Selection skips goals with open prove PRs | | Relates To | ADR-004 | Claims Branch | Claim release at PR-open is why the guard is needed |
| Reference ID | Title | Type | Location | |————–|——-|——|———-| | REF-1 | SPEC-017-A — Supervisor and in-flight guard | Specification | specs/SPEC-017-A-Swarm-Supervisor.md |
| Status | Approver | Date | |——–|———-|——| | Proposed | unsorry maintainers | 2026-06-11 | | Accepted | unsorry maintainers | 2026-06-11 |