@latent-variable/pi-terminal-bench
Self-contained benchmark suite for Pi. Runs QuixBugs and other coding tasks locally — no Docker, no Python frameworks, no external dependencies.
Package details
Install @latent-variable/pi-terminal-bench from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:@latent-variable/pi-terminal-bench
- Package: @latent-variable/pi-terminal-bench
- Version: 1.0.7
- Published: Apr 17, 2026
- Downloads: 785/mo · 23/wk
- Author: latent-variable
- License: MIT
- Types: extension
- Size: 332.3 KB
- Dependencies: 0 dependencies · 1 peer
Pi manifest JSON
{
"extensions": [
"./src/index.ts"
]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-terminal-bench
68 coding tasks for pi. No Docker, no frameworks, no API keys — just the pi CLI and Python 3.10+. You watch the agent work in real time.
Install
pi install /path/to/pi-terminal-bench
Then restart pi or run /reload.
Requirements
- pi CLI
- Python 3.10+ and bash
- Optional: numpy, pandas, sympy, word2number. A handful of ported Terminal-Bench tasks use these. The verify scripts auto-install them on demand, so you usually don't have to care.
Runs against any model pi has configured — local (OMLX, LM Studio, Ollama) or remote (Anthropic, OpenAI). Defaults to your active model; append provider/model to any command to override.
Commands
| Command | What it does |
|---|---|
| `/bench-list [filter]` | List tasks. `filter` matches name, category, or tag |
| `/bench-run <task\|category\|all>` | Run one task, a whole category, or everything |
| `/bench-results [N]` | Recent runs. With `N`, per-task detail for run `N` |
| `/bench-doctor` | Check prerequisites |
| `/bench-cleanup` | Kill stray benchmark processes |
Tasks — 68 across 11 categories
| Category | Count | Command | What it tests |
|---|---|---|---|
| QuixBugs | 40 | `/bench-run quixbugs` | Single-line Python bug fixes (upstream) |
| Terminal-Bench ports | 8 | `/bench-run terminal-bench` | Tasks ported from Terminal-Bench, Docker-free |
| Hard | 7 | `/bench-run hard` | Multi-step algorithms, parsing, concurrency |
| Long Context | 6 | `/bench-run long-context` | Multi-file refactors, test generation, API migrations |
| Code Generation | 3 | `/bench-run codegen` | Build CLIs, REST APIs, state machines from a spec |
| Performance | 2 | `/bench-run perf` | Optimize O(n²) code |
| Security | 2 | `/bench-run security` | Fix SQL injection and path traversal |
| File Operations | 2 | `/bench-run file-operations` | Read/write/transform files |
| Mathematics | 2 | `/bench-run math` | Symbolic math, arithmetic puzzles |
| Games | 2 | `/bench-run games` | Game-logic and puzzle solvers |
| Data Science | 1 | `/bench-run data-science` | pandas ETL |
| Debugging | 1 | `/bench-run debugging` | Fix a diverging ML training loop |
Run /bench-list <category> to see individual task names.
Example
/bench-run quixbugs-python-bitcount # one task
/bench-run hard # one category
/bench-run quixbugs anthropic/claude-sonnet-4-20250514 # override model
/bench-run all # everything
/bench-results # past runs
/bench-results 1 # per-task detail
Results are written as JSON to ~/.pi/agent/pi-terminal-bench/results/.
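Since results land on disk as JSON, they can be inspected outside pi. The following is a minimal sketch, not part of the package: it assumes the files in that directory end in `.json`, and because the result schema is not documented here, it prints only each file's top-level keys. `load_results` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Directory named in the README; the file layout inside it is an assumption.
RESULTS_DIR = Path.home() / ".pi" / "agent" / "pi-terminal-bench" / "results"


def load_results(results_dir: Path = RESULTS_DIR) -> list[tuple[Path, object]]:
    """Return (path, parsed JSON) pairs, sorted by modification time."""
    if not results_dir.is_dir():
        return []
    files = sorted(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    loaded = []
    for path in files:
        try:
            loaded.append((path, json.loads(path.read_text(encoding="utf-8"))))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
    return loaded


if __name__ == "__main__":
    for path, data in load_results():
        # Show only the top-level shape, since the schema is not specified.
        shape = sorted(data) if isinstance(data, dict) else type(data).__name__
        print(path.name, shape)
```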
Adding tasks
Drop a JSON file in tasks/:
{
"name": "my-task",
"description": "What this tests",
"instruction": "What the agent sees",
"setup_files": { "buggy.py": "...", "test.py": "..." },
"verify": "cd $BENCH_WORK_DIR && python3 test.py",
"timeout": 180000,
"tags": ["custom"]
}
$BENCH_WORK_DIR is replaced with the task's workspace. verify passes iff exit code is 0. Keep verifies fast (< 30s), deterministic, and scoped to the workspace.
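Task files with multi-line `setup_files` are tedious to write by hand, so generating them from a script may help. The sketch below uses only the fields shown in the example above. The helper names `make_task` and `write_task`, the `<name>.json` file naming, and the sample `fix-add` task are all illustrative assumptions. The `timeout` value of 180000 lines up with the 180s default, which suggests milliseconds.

```python
import json
from pathlib import Path


def make_task(name: str, description: str, instruction: str,
              setup_files: dict[str, str], test_file: str = "test.py",
              timeout_ms: int = 180_000,
              tags: tuple[str, ...] = ("custom",)) -> dict:
    """Build a task definition using only the fields from the README example."""
    if test_file not in setup_files:
        raise ValueError(f"{test_file!r} must be one of the setup files")
    return {
        "name": name,
        "description": description,
        "instruction": instruction,
        "setup_files": setup_files,
        # The runner substitutes $BENCH_WORK_DIR; exit code 0 means PASS.
        "verify": f"cd $BENCH_WORK_DIR && python3 {test_file}",
        "timeout": timeout_ms,
        "tags": list(tags),
    }


def write_task(task: dict, tasks_dir: Path) -> Path:
    """Write the task as <name>.json (the file naming is an assumption)."""
    tasks_dir.mkdir(parents=True, exist_ok=True)
    path = tasks_dir / f"{task['name']}.json"
    path.write_text(json.dumps(task, indent=2) + "\n", encoding="utf-8")
    return path


# Hypothetical task: test.py exits non-zero until add() is fixed.
example = make_task(
    name="fix-add",
    description="Fix an off-by-one bug in add()",
    instruction="buggy.py contains a bug. Fix it so that test.py passes.",
    setup_files={
        "buggy.py": "def add(a, b):\n    return a + b + 1\n",
        "test.py": "from buggy import add\nassert add(2, 3) == 5\n",
    },
)
```

Calling `write_task(example, Path("tasks"))` from the package root would then place the file where the runner looks for it. A failing `assert` in `test.py` raises and makes Python exit non-zero, which is all the verify step needs.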
Safety
Every task runs in an isolated temp directory with a pi-bench. prefix ($TMPDIR/pi-bench.XXXXXX). After each task — pass, fail, or abort — the runner kills lingering processes (including descendants reparented to launchd) and removes the workspace. Active workspaces are persisted to ~/.pi/agent/pi-terminal-bench/active-workdirs.txt, so /bench-cleanup can sweep orphans from crashed sessions.
Every cleanup path is scoped strictly to paths matching pi-bench. — Homebrew, Xcode, git, and any other tool's temp files are untouchable.
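The prefix-scoped cleanup described above can be illustrated with a short sketch. This is not the package's implementation: the function names are hypothetical, process killing is omitted, and the check here is deliberately stricter than a plain prefix match because it also requires the directory to sit directly under the system temp directory.

```python
import shutil
import tempfile
from pathlib import Path

PREFIX = "pi-bench."


def create_workspace() -> Path:
    """Create an isolated workspace, e.g. $TMPDIR/pi-bench.XXXXXX."""
    return Path(tempfile.mkdtemp(prefix=PREFIX))


def is_bench_workspace(path: Path) -> bool:
    """True only for entries directly under the temp dir that carry the prefix."""
    resolved = path.resolve()
    tmp_root = Path(tempfile.gettempdir()).resolve()
    return resolved.parent == tmp_root and resolved.name.startswith(PREFIX)


def remove_workspace(path: Path) -> bool:
    """Delete the directory if, and only if, it passes the prefix check."""
    if not is_bench_workspace(path) or not path.is_dir():
        return False  # refuse to touch anything outside the pi-bench. scope
    shutil.rmtree(path, ignore_errors=True)
    return True
```

Resolving both paths before comparing them keeps the check reliable on systems where the temp directory is reached through a symlink.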
Timeouts
Each task has a timeout (default 180s; harder tasks use 240s or 360s). If a command hangs, the agent gets a 2× extended window to recover with a steer message explaining the timeout. If the agent makes no file changes, the task is recorded as FAIL — never a false PASS.
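The verdict rule in the last sentence can be sketched as follows. This is an illustration under stated assumptions rather than the runner's actual code: it detects "no file changes" by comparing content hashes taken before and after the agent runs, and it bounds the verify command with a hypothetical 30-second limit. The names `snapshot` and `verdict` are invented for the example, and the extended recovery window is not modeled.

```python
import hashlib
import shlex
import subprocess
from pathlib import Path


def snapshot(workdir: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(workdir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(workdir.rglob("*")) if p.is_file()
    }


def verdict(verify: str, workdir: Path, before: dict[str, str],
            timeout_s: float = 30.0) -> str:
    """Return "PASS" or "FAIL" for a finished agent run."""
    # An untouched workspace is a FAIL, even if verify would exit 0.
    if snapshot(workdir) == before:
        return "FAIL"
    # Substitute the workspace path, quoted so unusual paths stay one argument.
    command = verify.replace("$BENCH_WORK_DIR", shlex.quote(str(workdir)))
    try:
        done = subprocess.run(
            ["bash", "-c", command], cwd=workdir, timeout=timeout_s,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "FAIL"  # a hung verify never counts as a pass
    return "PASS" if done.returncode == 0 else "FAIL"
```

Taking the snapshot before the agent starts and comparing it before verify runs means files that verify itself creates (such as `__pycache__`) cannot be mistaken for agent edits.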
Contributing
PRs welcome. Terminal-Bench has 241 Docker-based tasks; we've ported a subset that runs without Docker and will expand over time.