unsorry

ADR-016: Infrastructure-Failure Guard

Field Value
Decision ID ADR-016
Initiative unsorry Phase 3 — operational correctness
Proposed By unsorry maintainers
Date 2026-06-11
Status Accepted

WH(Y) Decision Statement

In the context of an agent loop whose failure path treats every failed claude call as evidence about the goal — exhausted budget → decompose, unusable split → demote −10 (below τ_v the goal leaves the pool), facing two production incidents on 2026-06-11 in which a CLI quota outage made every call die in ~1 minute, and the loop — unable to tell “the model tried and failed” from “the model never ran” — demoted every open leaf of the active tree below the viability threshold, emptied its own pool, exited “no claimable goal”, and twice required maintainer affinity-restore PRs (#165, and the #168–#181 cleanup), we decided for classifying a failed proof-surface call as an infrastructure failure when it died in under UNSORRY_FASTFAIL seconds (default 240 — a real attempt must at least read the goal and run a build) and a follow-up health probe on the cheap model also fails; on that classification the cycle aborts with no prove-failed event, no decomposition, no demote, the claim is released, and the agent exits with a distinct code 3 so the orchestrator knows to reschedule rather than diagnose goals, and neglected retry-with-backoff inside the loop (an outage measured in hours would idle a worktree and mislead the orchestrator; a clean exit hands timing to whoever can see the clock), pre-claim health probes on every cycle (cost on the healthy path for a rare event), and distinguishing quota from auth/network failures (identical handling either way; the probe answers “can the CLI run”, which is all the queue needs), to achieve a queue whose affinity and decomposition state only ever encode model evidence about goals, never infrastructure weather — an outage now costs wall-clock, not state repair, accepting that a genuinely-broken call that fails fast while the cheap model happens to be healthy still counts as a real attempt (conservative: queue penalties stay possible), the probe spends one cheap-model call per fast failure, and a wall-timeout followed by a quota death is still recorded as a real attempt (the model had its chance).

Context

The demote path (ADR-010) and the decomposition fallback (ADR-009) both assume the budget was actually spent on the goal. The 05:48Z and 10:43Z outages violated that assumption at scale: 8 and 9 spurious demotes respectively, plus duplicate-demote PR churn from two agents failing in parallel, plus a stalled Thread-A tree each time. The fix lives at the only place that can tell the difference — the call site, with the duration and a probe in hand. Soundness is untouched: this changes what the queue learns from failures, never what the gates accept.

Options Considered

Option 1: Fast-fail + health-probe classification, clean exit 3 (Selected)

Pros: zero cost on the healthy path; pure-function classifier is hermetically testable; the orchestrator gets an unambiguous signal; queue state stays meaningful. Cons: conservative misclassification possible (fast real failures with a healthy CLI remain “real”); one cheap probe per fast failure.

Option 2: In-loop retry with backoff (Rejected)

Sleep and retry until the CLI recovers. Rejected: outages here are quota windows measured in hours; a sleeping loop holds its claim worktree, emits nothing, and looks identical to a hang from outside.

Option 3: Pre-claim health probe every cycle (Rejected)

Probe before claiming. Rejected: pays on every healthy cycle to catch a rare event, and a probe that passes at claim time says nothing about a death 20 minutes into the attempt.

Dependencies

| Relationship | ADR ID | Title | Notes | |————–|——–|——-|——-| | Amends | ADR-009 | Goal Decomposition | Decompose fallback skipped on infra failure | | Amends | ADR-010 | Affinity-Gap Selection | Demote requires a real attempt | | Relates To | ADR-015 | Progressive Effort Escalation | Ladder attempts are individually classified |

References

| Reference ID | Title | Type | Location | |————–|——-|——|———-| | REF-1 | SPEC-016-A — Infrastructure-failure guard | Specification | specs/SPEC-016-A-Infrastructure-Failure-Guard.md | | REF-2 | Incident timeline | Metrics | ../metrics/phase3-run-001.md (when landed) |

Status History

| Status | Approver | Date | |——–|———-|——| | Proposed | unsorry maintainers | 2026-06-11 | | Accepted | unsorry maintainers | 2026-06-11 |