pi-until-done

Pi extension that brings Hermes Agent's /goal (Ralph loop with judge) to Pi as /until-done. Pi self-judges every turn, runs verifyCommand to confirm done, and routes all CI/CD through mise across 18 language profiles.

Package details

← Back

extensionskillprompt

Install pi-until-done from npm and Pi will load the resources declared by the package manifest.

npm repo home report

$ pi install npm:pi-until-done

Package: pi-until-done
Version: 0.2.2
Published: May 5, 2026
Downloads: not available
Author: kirensrinivasan
License: MIT
Types: extension, skill, prompt
Size: 1.8 MB
Dependencies: 1 dependency · 4 peers

Pi manifest JSON

{
  "extensions": [
    "./extensions/until-done.ts"
  ],
  "skills": [
    "./skills"
  ],
  "prompts": [
    "./prompts"
  ],
  "image": "https://raw.githubusercontent.com/srinitude/pi-until-done/main/assets/preview.png"
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-until-done

pi-until-done preview

A Pi extension that brings Hermes Agent's /goal ("the Ralph loop with a judge") to Pi as /until-done. Every until_done_complete is gated by a cross-model LLM judge by default — the standard fix for Ralph-loop oscillation, where the executor talks itself into a premature "done." Pick a different model than the executor at setup time (judgeModel: { provider, modelId }) and the judge LLM has to agree before the goal transitions to done. If you genuinely have no second model available, opt back into same-model self-judge with sameModelJudge: true — until_done_set refuses without one of those two. Uses every Pi extension primitive that the goal-pursuit loop needs, and coexists cleanly with every other extension.

Pi's own philosophy (from srinitude/pi-config): minimal core, extensible edges, deterministic, inspectable, preserve developer agency. This extension hews to that line. It composes; it does not override. State lives in session entries. Every completion is judged by default — cross-model (different model than the executor) is required unless the contract explicitly opts into same-model self-judge. No system-prompt replacement, no side-database, no hidden state, no silent path past the judge.

Install

The package is on npm: https://www.npmjs.com/package/pi-until-done.

Through Pi (recommended)

pi install npm:pi-until-done                              # from npm (recommended)
pi install github:srinitude/pi-until-done                 # from git
pi install /path/to/pi-until-done                         # local install
pi -e /path/to/pi-until-done/extensions/until-done.ts     # try without installing

The package manifest declares all four pi.dev resource types (pi.extensions, pi.skills, pi.prompts, pi.image), so a single pi install wires up everything.

Directly via your package manager

bun add pi-until-done           # bun
npm install pi-until-done       # npm
pnpm add pi-until-done          # pnpm
yarn add pi-until-done          # yarn
deno add npm:pi-until-done      # deno

The runtime entrypoint is extensions/until-done.ts. No tools to install separately — every CI command routes through mise, which the extension assumes is already on your PATH.

Requirements

Pi >= 0.x (pi --version)
Bun >= 1.2 (the runtime extensions load through)
mise on PATH (used for every CI/CD invocation)

Use

/until-done finish migrating auth tests to Vitest

Pi runs a PHASE 0 brainstorm — refines the goal type (ticket vs. exploratory), inventories accessible surfaces (logs, metrics, staging URLs, flame graphs, sandboxes), and nails down the verifyCommand. Sharp goals terminate cleanly; vague goals burn turns.
Pi drafts a contract — outcome, done-criteria, verifyCommand (auto-wrapped with mise exec -- if not already mise-routed), ask-before list, decision style, goalType, surfaces, startPhase — and shows it to you.
You approve via the dialog (or /until-done autopilot to skip).
Pi calls until_done_set + until_done_plan and starts working in TDD-first mode: ANALYSIS → BOOTSTRAP → RED → GREEN → REFACTOR → CLEANUP (per pi-config).
After every turn, Pi self-judges. If done, it calls until_done_complete with quoted output of the verifyCommand as evidence. If blocked, until_done_block. Phase transitions go through until_done_progress({phase}). After complete, Pi calls until_done_distill to compile the journey into a PRD at .until-done/distilled.md.
When the budget (default 20 turns) is exhausted, the loop pauses and tells you exactly how to resume.
Anything you type at any point preempts the loop. For non-preempting side-questions, use /until-done ask <question>.

The status line shows the live phase glyph:

◷ analysis — reading code
⚙ bootstrap — validating infra
✗ red — failing test exists
✓ green — test passes
↺ refactor — cleanup of structure
⌫ cleanup — strip debug prints / scratch files before complete
· none — research/doc goal

Subcommands

Command	Purpose
`/until-done <intent>`	Start setup for a new goal
`/until-done status`	One-line current state
`/until-done detail`	Full contract overlay
`/until-done tasks`	Print the live YAML task list
`/until-done plan`	Show `.until-done/tasks.yaml` location
`/until-done northstar`	Print the locked goal contract
`/until-done replan-log`	Show every replan and its reason
`/until-done pause`	Halt continuation, keep state
`/until-done resume`	Resume + reset budget
`/until-done cancel`	Clear the goal
`/until-done budget <n>`	Change turn budget (1..20000; >500 prompts a confirm)
`/until-done ask <question>`	Side question — does not preempt the loop
`/until-done autopilot`	Toggle skipping the contract dialog for future setups
`/until-done judge`	Show the session-default judge mode
`/until-done judge <provider>/<modelId>`	Set a cross-model judge default (recommended; e.g. `anthropic/claude-opus-4-7`)
`/until-done judge same`	Set same-model self-judge as the default
`/until-done judge clear`	Unset the judge default; future setups must specify `judgeModel` or `sameModelJudge: true` per goal
`/until-done help`	Show this list

Plus: --until-done "<intent>" CLI flag and Ctrl+Shift+G shortcut to redraw the status widget.

Tools (8)

Tool	Purpose
`until_done_set`	Lock the North Star contract after user approval
`until_done_plan`	Provide the TDD-first task list (called once after `set`)
`until_done_replan`	Mid-execution restructuring — insert/remove/replace/split/merge/reorder
`until_done_task_update`	Patch a single task — status, learnings, gotchas, context
`until_done_progress`	Record a one-line progress note + optional phase transition
`until_done_complete`	Declare done — requires quoted `verifyCommand` output. If the contract opted into a `judgeModel`, the executor's claim is verified by that model before the goal transitions to `done`.
`until_done_block`	Pause with a question for the user
`until_done_distill`	After complete: compile the journey into a PRD at `.until-done/distilled.md`

Cross-model judge (default-on, required)

Every until_done_complete is gated by a strict-JSON judge LLM call. until_done_set requires you to pick a judge mode up front:

Cross-model (default, recommended): set judgeModel: { provider, modelId } to a model different from the executor. The judge sees only the goal, done-criteria, verifyCommand, and the executor's cited evidence — no executor history to bias it. Cross-vendor pairs (Anthropic + OpenAI), or same-family different-size pairs (Sonnet executor / Opus judge, GPT-5 executor / GPT-5-mini judge), both work.
Same-model self-judge: set sameModelJudge: true to use the active executor model with a fresh, completion-focused context. Use only when no second model is available — it's strictly weaker than cross-model for Ralph-loop convergence.
Neither set: until_done_set refuses with judge_unspecified. There is no silent path past the judge.

The /until-done judge slash command lets the user pre-configure a session-level default so the LLM doesn't have to specify on every goal. /until-done judge anthropic/claude-opus-4-7 configures cross-model; /until-done judge same configures same-model; /until-done judge clear unsets. Per-goal until_done_set arguments always win over the user default.

Verdict semantics:

Verdict done → executor's claim approved; status → done; judge's reason appended as evidence.
Verdict continue → completion refused; judge's reason appended as evidence; loop stays active and the executor must address the gap with stronger evidence.
Judge unavailable / unparseable → fail-open with a warning evidence line so judge-infra glitches don't block legitimate completion.

This closes the only convergence gap in the Ralph loop: a single model evaluating its own work has a documented tendency to talk itself into premature "done." Cross-model judge breaks that loop because the judge has no commitment to the executor's previous turns.

Pi primitive coverage matrix

The brief was: use every Pi primitive that the goal-pursuit loop needs, and have Pi call the shots wherever the active LLM can decide better than the extension. Each row below maps a primitive to how /until-done uses it. Lines marked no-op are intentionally inert — exercising a hook for its own sake would violate Pi philosophy. Primitives that don't serve the loop (e.g. provider registration, editor-text mutation) are explicitly listed as "not used" below.

Hook events (28/29 subscribed; one declarative)

Event	Mode	Why
`resources_discover`	declarative	Companion `skills/` and `prompts/` paths are declared via `package.json#pi.skills` and `package.json#pi.prompts` so Pi's package manager handles discovery (relative paths in a runtime hook resolve against cwd, not the extension dir, so the declarative form is the only correct path)
`session_start`	active	Reconstruct goal state from custom entries; honor `--until-done` flag; warn if `@qhn/pi-goal` is also installed
`session_before_switch`	active	Confirm before leaving an active goal
`session_before_fork`	active	Three-way choice: carry/leave/cancel the fork
`session_before_compact`	not subscribed	`SessionBeforeCompactResult` has no `customInstructions` slot — Pi reads compaction's customInstructions from outside the hook, so mutation is a no-op. Goal context is preserved via `session_compact` (re-anchor as a `CustomMessageEntry` in LLM context) and the per-turn system-prompt reminder via `before_agent_start`
`session_compact`	active	Re-anchor by emitting a `verdict` state event AND a `custom_message` containing recent evidence, learnings, and the current task — so the next turn's LLM context retains them past the compaction summary
`session_before_tree`	observed	Pi handles snapshotting; nothing to gate
`session_tree`	active	Full state reconstruction from new branch (todo.ts pattern)
`session_shutdown`	active	Clear status + widget keys cleanly
`context`	no-op	Pi philosophy: don't mutate LLM messages
`before_provider_request`	observed	Telemetry counter
`after_provider_response`	observed	Telemetry counter
`before_agent_start`	active	Append (never replace) a goal reminder block to the system prompt
`agent_start`	active	Reset per-iteration counters; set working-message to "pursuing: …"
`agent_end`	active	THE HEURISTIC JUDGE STEP: budget check, spin-guard, user-driven-turn detection, CI on stop, clean-end nudge, queue continuation. (LLM-based cross-model judge fires inside `until_done_complete`, not here.)
`turn_start`	active	Refresh status line
`turn_end`	active	Capture last assistant text snapshot
`message_start`	observed	Reserved hook
`message_update`	observed	Live status (rate-limited 500ms)
`message_end`	active	Capture finalized assistant text
`tool_execution_start`	observed	Tool-start counter
`tool_execution_update`	observed	Pi handles streaming UI
`tool_execution_end`	observed	Tool-end counter
`model_select`	observed	Telemetry only — judge model is whichever is active
`thinking_level_select`	observed	Telemetry counter
`tool_call`	active	POLICY GATE: enforce ask-before list against `bash`; tally progress signals per built-in tool
`tool_result`	observed	Reserved for future progress detection
`user_bash`	observed	Counter only — user-driven activity is allowed but doesn't count toward goal progress
`input`	active	Mark `userMessagedThisTurn = true` for interactive-source input only (extension-source `pi.sendUserMessage` calls don't flag it), so `agent_end` skips auto-continuation only when the human has actually spoken

Built-in tool coverage (7/7 enumerated)

Tool	How `/until-done` reasons about it
`read`	weak progress signal (+1) — investigation
`bash`	progress signal (+2) AND policy gate against ask-before
`edit`	strong progress signal (+3) — real change
`write`	strong progress signal (+3) — real change
`grep`	weak progress signal (+1) — search
`find`	weak progress signal (+1) — search
`ls`	weak progress signal (+1) — search

If progressSignalsThisTurn === 0 at agent_end, /until-done enters blocked with reason "spin guard" — the model literally did nothing useful that turn.

Other Pi primitives addressed

Primitive	Where
`pi.registerCommand`	`/until-done` with subcommand autocomplete
`pi.registerTool`	All 8 tools: `until_done_set`, `until_done_plan`, `until_done_replan`, `until_done_task_update`, `until_done_progress`, `until_done_complete`, `until_done_block`, `until_done_distill`
`pi.registerFlag`	`--until-done <text>`
`pi.registerShortcut`	`Ctrl+Shift+G` toggles the contract widget
`pi.appendEntry`	Persists `until-done.state` events (load/save)
`pi.sendUserMessage`	Continuation prompts + setup interview
`pi.sendMessage`	Re-anchors goal context as a `CustomMessageEntry` after compaction (with `display:false`)
`pi.getCommands`	Detects `@qhn/pi-goal` collisions
`pi.getFlag`	Reads `--until-done` value
`ctx.ui.confirm/select/input/editor`	Setup confirmation, fork choice, ask-before, cancel
`ctx.ui.notify`	Status messages
`ctx.ui.setStatus`	Footer status line
`ctx.ui.setWidget`	Above-editor widget with full contract
`ctx.ui.setTitle`	Terminal title during pursuit
`ctx.ui.setWorkingMessage`	"pursuing: …" during streaming
`ctx.ui.custom`	Full contract overlay (`/until-done detail`)
`ctx.ui.theme.fg`	All UI color uses theme tokens
`ctx.sessionManager.getBranch`	State reconstruction from JSONL entries
`ctx.waitForIdle`	Setup flow waits for the assistant before opening confirm
`ctx.signal`	Threaded into CI subprocesses so user `Esc` aborts in-flight checks. Also threaded into the cross-model judge LLM call so Esc cancels in-flight verdict requests.
`ctx.modelRegistry`	Used by the cross-model judge to resolve the configured judge model and its auth (api key + headers).
`pi-ai`'s `complete()`	Used by the cross-model judge for a one-shot LLM call against the judge model — kept out of Pi's session so it doesn't pollute the executor's context.
Skills (`skills/until-done/SKILL.md`)	Loaded on demand to teach Pi the contract & tool protocol
Prompts (`prompts/`)	Reserved for future prompt templates; `/until-done` itself is an extension command (which takes precedence over template names anyway)

Not used (intentional): pi.registerProvider/unregisterProvider (extension does not register providers in production — judges resolve through the user's existing model registry; the test harness uses pi.registerProvider to wire in a faux judge for deterministic tests), pi.setActiveTools (would silently disable user tools — a Pi-philosophy violation), ctx.compact/fork/navigateTree/ switchSession/newSession (those replace user state and must stay user-initiated), pi.exec (CI runs through its own Bun.spawn to thread ctx.signal), pi.events and pi.setSessionName/ setLabel/setModel/setThinkingLevel (no value when Pi already drives those decisions), and most of the editor-mutating ctx.ui.* surface (editor, setEditorText, pasteToEditor, setHeader/setFooter, etc. — would fight the user for the input box). The extension intentionally leaves these on the table.

North Star + dynamic task list

The brief was: a fixed criterion to guide the entire process to a clean end, but a task list that can be edited mid-flight when reality diverges. /until-done separates the two:

	Locked at `until_done_set`	Mutable mid-execution
`goal`	✓	✗
`doneCriteria`	✓	✗
`verifyCommand`	✓	✗
`askBefore` boundaries	✓	✗
`decisionStyle`	✓	✗
Task list (insert/remove/split/merge/reorder/replace)	✗	via `until_done_replan`
Per-task: validationSteps, ciCommands, styleguideRules, guardrails	✗	via `until_done_task_update`
Per-task: status, learnings, gotchas, context refs	✗	via `until_done_task_update`
`phase`	✗	via `until_done_progress`
`maxTurns`	✗	via `/until-done budget <n>`

The North Star (top block) is the fixed reference point. Pi can change how it gets there but never where it's going. The only way to change the North Star is /until-done cancel followed by a new setup — by design, this requires fresh user approval.

Replan operations (`until_done_replan`)

Op	Use when
`insert`	A new sub-task surfaced (insertAfter optional)
`remove`	A planned task is moot (must be `pending`/`blocked`; `done` is immutable)
`replace`	A pending task was specced wrong
`split`	One task is actually 2+ tasks
`merge`	Two+ tasks collapse into one
`reorder`	Dependencies need adjusting

Every replan requires a non-empty reason which is appended to affected tasks' learnings and to /until-done replan-log. Cycles are rejected. The whole batch validates atomically — if one op is illegal, none apply.

Live YAML on disk

After until_done_plan and every until_done_task_update / until_done_replan, the extension rewrites .until-done/tasks.yaml in the project root so humans can read the current state without opening the TUI:

generated: 2026-05-04T12:34:56.000Z
goalId: ud-abc123
goal: finish migrating auth tests to Vitest
doneCriteria: bun test exits 0 with all auth specs green
verifyCommand: bun test
phase: green
askBefore: [git push]
budget: { used: 7, max: 20 }
currentTaskId: T-005
tasks:
  - id: T-001
    title: Bootstrap Vitest config
    phase: bootstrap
    status: done
    dependencies: []
    blocks: [T-002]
    prerequisites: []
    validationSteps:
      - cat vitest.config.ts
      - bun test --version
    ciCommands: [bun test]
    styleguideRules: []
    guardrails: ["no new top-level deps without confirmation"]
    learnings: ["replan: discovered tsconfig conflict"]
    gotchas: ["forgot to update tsconfig include"]
    context:
      - path: package.json
        why: read existing test script
  - ...

Clean-end guarantee

When every planned task is done (or skipped) but Pi hasn't called until_done_complete, the extension sends Pi exactly one structured reminder per cycle:

All planned tasks are marked done. Two paths from here, pick one:

Run <verifyCommand>. If it passes, call until_done_complete.

If residual work surfaced, call until_done_replan with reason residual_work_discovered. Do not invent new work outside the plan.

After two such reminders, the loop pauses and yields to the user. The turn budget remains the absolute backstop.

Per-turn principle injection

Every turn, before_agent_start appends (never replaces) a composite reminder block to the system prompt. Setup and the loop continuation tick include the same blocks. Sources, in injection order:

TDD discipline — RED → GREEN → REFACTOR → CLEANUP.
Verifiability discipline — do not accept proxy signals; treat uncertainty as not achieved; quote command output as evidence.
pi-config principles (extensions/lib/strings/principles/):
- Bootstrap mandate (the 8 automation-foundation items)
- Performance mandate (any unnecessary slowdown is a defect)
- Capability injection + test model (no internals, no shared state)
- Definition of done (stricter — both validation suites + parity)
- Working style (declare phase, never claim unverified)
Mise-first CLI policy — every shell command via mise run or mise exec --. verifyCommand auto-wrapped on until_done_set.
Structural constraints — applies to every language Pi generates in: ≤3 nesting depth, ≤30 LOC per construct, ≤200 LOC per file.
Plan management + tool flow — when to call until_done_replan, until_done_task_update, until_done_complete, until_done_block.

TDD-first discipline (from pi-config)

/until-done enforces the pi-config operating contract end-to-end:

Phases are explicit and tracked. Pi declares phase: "analysis"|"bootstrap"|"red"|"green"|"refactor"|"none" via until_done_progress and the extension renders it live in the status line.
No GREEN without RED. The contract requires a failing test before any production change for code-shipping goals. The SKILL.md loaded in-session enforces this; the system-prompt reminder repeats it every turn.
Done = verifyCommand passes. until_done_complete requires evidence that quotes the command output. Speculative completion is refused.
Performance is a defect when there's a safe gain. REFACTOR encourages it.
No claims about unverified state. The skill bans pretending tests, guarantees, or context exist when they have not been verified.
Structural constraints. Nesting ≤ 3, construct ≤ 30 LOC, file ≤ 200 LOC, single responsibility per construct.

How `/until-done` differs from `@qhn/pi-goal` and Hermes `/goal`

	`@qhn/pi-goal`	Hermes `/goal`	`/until-done`
Setup flow	User-led interview	None — judge asks each turn	Pi-led interview
Judge	None — model self-decides	Auxiliary model judge call	Self-judge via tools by default; opt-in cross-model judge gates `until_done_complete`
State storage	Pi session entries	SessionDB.state_meta	Pi session entries
Hook coverage	1–2 events	n/a (Hermes-internal)	28/29 events subscribed (one declarative)
Conflict-safe	yes	n/a	yes (auto-detects qhn/pi-goal)
System-prompt mutation	none	none	append-only

If both @qhn/pi-goal and pi-until-done are installed, the user sees a one-time notice at session_start and can pick whichever they prefer per session. Tool/command/event keys are namespaced until-done.* and until_done_* to avoid collisions with anything else in the package ecosystem.

Edge cases the implementation handles

Extension loaded mid-session → state reconstructs from existing custom entries; if none, no-op.
Compaction during a goal → on session_compact, the extension emits a CustomMessageEntry (LLM-bound, display:false) containing the goal headline, verifyCommand, current task, recent evidence and recent learnings — so the next turn's LLM context retains them past the compaction summary. The per-turn system-prompt reminder via before_agent_start keeps the locked North Star in front of the model on every subsequent call.
Fork during a goal → user picks via select dialog.
Switch session during a goal → confirm dialog protects against accidental loss.
Branch via /tree → state fully rebuilt from new branch (matches the todo.ts reference pattern).
User interjects mid-loop → input hook flags userMessagedThisTurn (interactive-source only — extension-driven pi.sendUserMessage calls don't trigger it); agent_end consults the flag, persists a verdict: continue state event with reason "user-driven turn", and skips auto-continuation. Flag resets in agent_end after consumption (not in agent_start, which would race with the input hook).
Model produces no tools/text → progressSignalsThisTurn === 0 triggers blocked with spin-guard reason, prevents tight loop.
Turn budget exhausted → auto-pause with explicit /until-done resume instructions (Hermes parity).
Pi calls until_done_complete falsely → user can /until-done resume to challenge it; new evidence required.
Goal already exists during setup → select dialog: replace / keep / cancel.
RPC / print mode (no UI) → ctx.hasUI checks degrade gracefully; setWidget skipped, notify still fires, custom overlay falls back to JSON dump.
Provider/model switch mid-goal → no judge re-binding required because the active model itself is the judge.
Thinking level change mid-goal → tracked but doesn't affect state.
Compaction over contract → contract is one of the first entries on the branch; reconstruction walks from root.
Goal text with special chars → rendered through theme tokens, no shell expansion.
--until-done flag at startup → triggers /until-done <text> via sendUserMessage exactly once at startup.
@qhn/pi-goal also installed → coexistence notice; commands do not collide because /goal and /until-done are different names.
Ask-before timeout (no human at terminal) → 30s timeout on confirm; on dismiss, the tool call is blocked with user denied.
Hard ceiling 20000 turns → cmdBudget rejects values >20000; values >500 prompt a confirm dialog (the "go to lunch" threshold) so users opt into the spend / wall-clock cost explicitly.
Goal cancelled mid-streaming → state transitions to cleared; next agent_end short-circuits via state.status !== "active" guard.
Approval dialog times out → confirm resolves to false; goal is cleared.
Multiple goals attempted → until_done_set rejects with goal_exists.
Tool called before approval → until_done_set rejects with not_confirmed.
Skill discovery → declared via package.json#pi.skills and pi.prompts so Pi's package manager resolves paths against the extension root, not the user's cwd.
Session shutdown → status + widget keys cleared; entries persist on disk for the next pi -c.

Verifying

Every CI/CD operation runs through mise. Once you've installed deps:

cd pi-until-done
mise install              # installs bun + node per mise.toml
mise run install-deps     # installs bun deps (idempotent)
mise run check            # fast: typecheck + lint + format (parallel)
mise run ci               # full: typecheck + lint + format + compile + test + build
mise run release-ready    # release-readiness suite (parity check + surface presence)

Then in a Pi project:

pi -e ./extensions/until-done.ts
/until-done finish migrating auth tests to Vitest

Security

/until-done is an autonomous-loop extension. By default it runs Pi's built-in tools (read, bash, edit, write, grep, find, ls) on the active model's behalf each turn. You should know:

Filesystem writes are unrestricted unless you list specific commands in the contract's askBefore[]. Ask-before triggers a confirm dialog for any matching bash invocation. Examples: git push, destructive sql, rm, terraform apply.
Network calls are whatever the model decides to make via bash. The extension itself makes no network calls.
Credentials are never read or stored by the extension. Pi's session state is the only thing it persists, and it lives in your local Pi session entries (JSONL).
CI commands run through mise exec -- against your project's mise config. mise tasks ls --json is the only direct invocation the extension makes for discovery; nothing else.
No system-prompt replacement — the extension only appends to the system prompt via before_agent_start, so other extensions' rules still apply.
No background side effects — no daemons, no hidden state, no uploads, no telemetry. State is auditable in the JSONL session log.
Hard turn budget ceiling of 20000 with a confirm dialog above 500; default is 20. Spin guard, clean-end nudge, CI failure, user input, and /until-done pause all preempt regardless of budget.

For vulnerability disclosure see SECURITY.md.

Cross-platform CI

The GitHub workflow runs the full suite on macos-latest, ubuntu-latest, and windows-latest in parallel via a matrix strategy. The release-readiness job runs the same matrix on main and on dispatch.

Upstream Pi watcher (auto-merge gated on CI + CodeRabbit)

Two workflows cooperate to keep this extension current with upstream pi-mono with no manual ceremony when the upgrade is clean — and zero auto-merges when it's not:

.github/workflows/upstream-watch.yml (daily 06:13 UTC + manual dispatch) — checks npm for new releases of @mariozechner/pi-coding-agent and @mariozechner/pi-ai, refreshes bun.lock if a bump is available, runs mise run ci against the bumped versions in-workflow, and opens / updates a PR (upstream/pi-bump) with coderabbitai requested as reviewer and the auto-merge label.
.github/workflows/upstream-pi-merge-gate.yml — fires on every PR review submission and every CI workflow completion. It evaluates the PR's latest head SHA against two explicit gates:
- CI is green — every required check run on the head SHA reports success (or skipped), and at least three checks have completed (the OS matrix). This excludes the gate's own runs to avoid deadlocking on itself.
- CodeRabbit approved — the most recent review from coderabbitai[bot] is APPROVED.
Only when both gates are green on the same SHA does the workflow call gh pr merge --squash --delete-branch. Either gate failing keeps the PR open. The gate re-evaluates on every push, every CI completion, and every review submission, so push-fixes converge automatically.

This is an explicit gate in workflow source, not GitHub's native auto-merge. Branch protection settings are unnecessary for the gate to work, though they remain a useful defense-in-depth.

Repo prerequisites:

UPSTREAM_PAT repo secret — a personal access token with repo + workflow scope. Without it, PRs created by upstream-watch.yml use GITHUB_TOKEN and no downstream workflows fire on the PR (GitHub's recursive-workflow safeguard). The merge gate would then never see CI results and would never merge. The workflows fall back to GITHUB_TOKEN when the secret is absent, so the PR opens but holds indefinitely until CI is re-triggered manually.
CodeRabbit installed on the repo (https://app.coderabbit.ai). The workflow's reviewers: coderabbitai request invites it explicitly; CodeRabbit posts an approving review when its analysis is clean.

When upstream introduces a real API break, the in-workflow CI fails, the PR-triggered CI matrix fails, the merge gate refuses to merge — fix the extension code drift before merging rather than pinning back.

Tests live in tests/ and run against a real Pi runtime (via createAgentSessionRuntime from @mariozechner/pi-coding-agent) with deterministic LLM responses supplied by registerFauxProvider from @mariozechner/pi-ai. No hand-rolled ExtensionAPI mocks. Real fs (temp dirs), real subprocesses (Bun.spawn), real AbortSignal abort propagation. The harness lives in tests/helpers/.

Path	Covers
`tests/integration/harness-smoke.test.ts`	Runtime boots; extension binds; tools and command register
`tests/integration/goal-flow.test.ts`	Full happy-path E2E: setup → set → plan → progress → task_update → complete → distill, with real `tasks.yaml` + `distilled.md` on disk
`tests/tools/lifecycle.test.ts`	`until_done_set/complete/block/progress` status guards + state transitions
`tests/tools/judge.test.ts`	Cross-model judge: schema round-trip, judge-approves, judge-rejects, malformed JSON fail-open
`tests/tools/plan.test.ts`	`until_done_plan` dependency validation, `tasks.yaml` write
`tests/tools/replan.test.ts`	`until_done_replan` cycle detection, done-immutability, replanLog growth
`tests/tools/task-update-distill.test.ts`	Task patches, cursor advancement, `distilled.md` write
`tests/commands/router.test.ts`	Subcommand dispatch — autopilot toggle, "status report on the migration" → setup, budget numeric vs. text disambiguation
`tests/commands/setup.test.ts`	Contract dialog approve/reject, autopilot skips dialog, replace-vs-keep selector
`tests/commands/control.test.ts`	Pause/resume/cancel/budget — including resume-from-done with confirm and cleanEndPrompts reset
`tests/commands/info-ask.test.ts`	Status / detail / tasks / northstar / replan-log / ask side-question
`tests/hooks/agent-hooks.test.ts`	`agent_start`/`agent_end` — `userMessagedThisTurn` race fix (#1), budget exhaustion path
`tests/hooks/before-agent-start.test.ts`	System-prompt reminder block — augments only when active, includes verifyCommand line
`tests/hooks/session-hooks.test.ts`	`session_start` reconstruction, `session_compact` re-anchor (`CustomMessageEntry`), `session_shutdown` cleanup
`tests/hooks/tools-hook.test.ts`	Ask-before — UI-on confirm, no-UI block, scoring, `until_done_*` exclusion (#20)
`tests/ci/discovery.test.ts`	`discoverChecks` with real fs scenarios — Java vs Kotlin Gradle, PYTHON vs PYTHON_UV, ROBLOX vs LUAU, content-sniff disambiguation
`tests/ci/runner.test.ts`	`runOne` against real subprocesses — `true`/`false`/timeout/AbortSignal abort/output truncation marker
`tests/mise.test.ts`	`routeThroughMise` / `isMiseCommand` semantics
`tests/profiles/{bun,pnpm,npm,yarn,deno}.test.ts`	Each TypeScript runtime profile shape
`tests/platform/os.test.ts`	macOS/Linux/Windows path + line-ending neutrality
`tests/platform/discovery.test.ts`	All profiles use POSIX-style markers; mise as sole entry point
`tests/build-smoke.ts`	The runtime entrypoint resolves on every supported OS

Run them with mise run test. The full suite (mise run ci) covers typecheck + lint + format + compile + test + build and finishes in under 5 seconds locally.

Contributing

This project is open source under MIT. Contributions welcome.

Read AGENTS.md — the project's pi-config-derived contract
Read CONTRIBUTING.md — dev setup, TDD flow, PR rules
No AI co-authorship trailers in commits or PRs (project policy enshrined in CONTRIBUTING.md)
See CODE_OF_CONDUCT.md for community standards
For security issues: SECURITY.md (do not file public issues)
Changelog: CHANGELOG.md

PR template at .github/PULL_REQUEST_TEMPLATE.md. Issue templates under .github/ISSUE_TEMPLATE/.

Sources

Hermes Agent goals doc: https://hermes-agent.nousresearch.com/docs/user-guide/features/goals
Hermes Agent goals source: hermes_cli/goals.py in https://github.com/nousresearch/hermes-agent
Pi extension API: packages/coding-agent/src/core/extensions/types.ts in https://github.com/badlogic/pi-mono
Pi philosophy: https://github.com/srinitude/pi-config
Pi extensions doc: https://pi.dev/docs/latest/extensions

License: MIT.

pi-until-done

Install

Through Pi (recommended)

Directly via your package manager

Requirements

Use

Subcommands

Tools (8)

Cross-model judge (default-on, required)

Pi primitive coverage matrix

Hook events (28/29 subscribed; one declarative)

Built-in tool coverage (7/7 enumerated)

Other Pi primitives addressed

North Star + dynamic task list

Replan operations (until_done_replan)

Live YAML on disk

Clean-end guarantee

Per-turn principle injection

TDD-first discipline (from pi-config)

How /until-done differs from @qhn/pi-goal and Hermes /goal

Edge cases the implementation handles

Verifying

Security

Cross-platform CI

Upstream Pi watcher (auto-merge gated on CI + CodeRabbit)

Contributing

Sources

Replan operations (`until_done_replan`)

How `/until-done` differs from `@qhn/pi-goal` and Hermes `/goal`