@latent-variable/pi-terminal-bench

Self-contained benchmark suite for Pi. Runs QuixBugs and other coding tasks locally — no Docker, no Python frameworks, no external dependencies.

Package details

Install @latent-variable/pi-terminal-bench from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:@latent-variable/pi-terminal-bench

Package: @latent-variable/pi-terminal-bench
Version: 1.0.7
Published: Apr 17, 2026
Downloads: 785/mo · 23/wk
Author: latent-variable
License: MIT
Types: extension
Size: 332.3 KB
Dependencies: 0 dependencies · 1 peer

Pi manifest JSON
{
  "extensions": [
    "./src/index.ts"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-terminal-bench

68 coding tasks for pi. No Docker, no frameworks, no API keys — just the pi CLI and Python 3.10+. You watch the agent work in real time.

Install

pi install /path/to/pi-terminal-bench

Then restart pi or run /reload.

Requirements

  • pi CLI
  • Python 3.10+ and bash
  • Optional: numpy, pandas, sympy, word2number. A handful of ported Terminal-Bench tasks use these; the verify scripts install them on demand, so you rarely need to install them yourself.
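
To see ahead of time which of the requirements above are present, a short standalone check can help. This is a minimal sketch, not the implementation of /bench-doctor: it assumes the CLI is on PATH under the name pi, and it only tests whether the optional modules are importable from the interpreter that runs it.

```python
import importlib.util
import shutil
import sys

# Optional modules named in the requirements list above.
OPTIONAL_MODULES = ["numpy", "pandas", "sympy", "word2number"]


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites are visible from this interpreter."""
    report = {
        "python>=3.10": sys.version_info >= (3, 10),
        "bash": shutil.which("bash") is not None,
        # Assumption: the pi CLI is installed as an executable named "pi".
        "pi": shutil.which("pi") is not None,
    }
    for name in OPTIONAL_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_prerequisites().items():
    print(f"{'ok' if ok else 'missing':8}{item}")
```

A missing optional module is not necessarily a problem, since the verify scripts install those on demand.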

Runs against any model pi has configured — local (OMLX, LM Studio, Ollama) or remote (Anthropic, OpenAI). Defaults to your active model; append provider/model to any command to override.

Commands

Command                          What it does
/bench-list [filter]             List tasks. filter matches name, category, or tag
/bench-run <task|category|all>   Run one task, a whole category, or everything
/bench-results [N]               Recent runs. With N, per-task detail for run N
/bench-doctor                    Check prerequisites
/bench-cleanup                   Kill stray benchmark processes

Tasks — 68 across 11 categories

Category              Count  Command                     What it tests
QuixBugs              40     /bench-run quixbugs         Single-line Python bug fixes (upstream)
Terminal-Bench ports  8      /bench-run terminal-bench   Tasks ported from Terminal-Bench, Docker-free
Hard                  7      /bench-run hard             Multi-step algorithms, parsing, concurrency
Long Context          6      /bench-run long-context     Multi-file refactors, test generation, API migrations
Code Generation       3      /bench-run codegen          Build CLIs, REST APIs, state machines from a spec
Performance           2      /bench-run perf             Optimize O(n²) code
Security              2      /bench-run security         Fix SQL injection and path traversal
File Operations       2      /bench-run file-operations  Read/write/transform files
Mathematics           2      /bench-run math             Symbolic math, arithmetic puzzles
Games                 2      /bench-run games            Game-logic and puzzle solvers
Data Science          1      /bench-run data-science     pandas ETL
Debugging             1      /bench-run debugging        Fix a diverging ML training loop

Run /bench-list <category> to see individual task names.

Example

/bench-run quixbugs-python-bitcount                         # one task
/bench-run hard                                             # one category
/bench-run quixbugs anthropic/claude-sonnet-4-20250514      # override model
/bench-run all                                              # everything
/bench-results                                              # past runs
/bench-results 1                                            # per-task detail

Results are written as JSON to ~/.pi/agent/pi-terminal-bench/results/.
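
The result files can also be inspected outside pi. The sketch below lists them newest first and prints only the top-level shape of each one, because the result schema is not documented here. It assumes the files carry a .json extension; the helper name load_results is illustrative and not part of the package.

```python
import json
from pathlib import Path

# Results directory named in the README.
RESULTS_DIR = Path.home() / ".pi" / "agent" / "pi-terminal-bench" / "results"


def load_results(results_dir: Path = RESULTS_DIR) -> list[tuple[Path, object]]:
    """Return (path, parsed JSON) pairs, newest file first."""
    if not results_dir.is_dir():
        return []
    # Assumption: result files use a .json extension.
    files = sorted(
        results_dir.glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    loaded = []
    for path in files:
        try:
            loaded.append((path, json.loads(path.read_text(encoding="utf-8"))))
        except json.JSONDecodeError:
            # Skip files that are partially written or not valid JSON.
            continue
    return loaded


for path, data in load_results():
    # Show only the top-level keys (or the type, for non-object JSON).
    shape = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(path.name, shape)
```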

Adding tasks

Drop a JSON file in tasks/:

{
  "name": "my-task",
  "description": "What this tests",
  "instruction": "What the agent sees",
  "setup_files": { "buggy.py": "...", "test.py": "..." },
  "verify": "cd $BENCH_WORK_DIR && python3 test.py",
  "timeout": 180000,
  "tags": ["custom"]
}

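The runner replaces $BENCH_WORK_DIR with the path of the task's workspace. verify passes if and only if it exits with code 0. The example timeout of 180000 matches the 180s default, so the value is presumably in milliseconds. Keep verify commands fast (< 30s), deterministic, and scoped to the workspace.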
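
Before adding a task, it can be worth confirming that its verify command fails on the unmodified files and passes once the fix is applied. The following is a rough sketch of such a dry run, under several assumptions: save_task and dry_run_verify are hypothetical helpers, the file is named after the task, and $BENCH_WORK_DIR is expanded by plain text substitution, which may differ from what the runner actually does.

```python
import json
import shlex
import subprocess
import tempfile
from pathlib import Path

task = {
    "name": "my-task",
    "description": "Fix an off-by-one bug in add()",
    "instruction": "buggy.py has a bug in add(). Fix it so test.py passes.",
    "setup_files": {
        "buggy.py": "def add(a, b):\n    return a + b + 1\n",
        "test.py": "from buggy import add\nassert add(2, 3) == 5\n",
    },
    "verify": "cd $BENCH_WORK_DIR && python3 test.py",
    "timeout": 180000,
    "tags": ["custom"],
}


def save_task(task: dict, tasks_dir: Path) -> Path:
    """Write the task as tasks_dir/<name>.json (file naming is an assumption)."""
    tasks_dir.mkdir(parents=True, exist_ok=True)
    path = tasks_dir / f"{task['name']}.json"
    path.write_text(json.dumps(task, indent=2) + "\n", encoding="utf-8")
    return path


def dry_run_verify(task: dict) -> int:
    """Write setup_files to a temp dir, run verify there, return its exit code."""
    with tempfile.TemporaryDirectory(prefix="task-dry-run.") as work_dir:
        for rel_path, content in task["setup_files"].items():
            target = Path(work_dir) / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
        # Assumption: plain text substitution; the real runner may differ.
        command = task["verify"].replace("$BENCH_WORK_DIR", shlex.quote(work_dir))
        # 30 s mirrors the "keep verifies fast" guidance above.
        result = subprocess.run(["bash", "-c", command], capture_output=True, timeout=30)
        return result.returncode
```

Calling dry_run_verify(task) on the definition above should return a non-zero code, because buggy.py still contains the bug. A task whose verify already passes before the agent touches anything cannot distinguish success from failure. Once the check behaves as expected, save_task(task, Path("tasks")) writes the file.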

Safety

Every task runs in an isolated temp directory with a pi-bench. prefix ($TMPDIR/pi-bench.XXXXXX). After each task — pass, fail, or abort — the runner kills lingering processes (including descendants reparented to launchd) and removes the workspace. Active workspaces are persisted to ~/.pi/agent/pi-terminal-bench/active-workdirs.txt, so /bench-cleanup can sweep orphans from crashed sessions.

Cleanup only ever touches paths matching the pi-bench. prefix; temp files belonging to Homebrew, Xcode, git, or any other tool are left alone.
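
The prefix-scoping idea can be illustrated in a few lines. This is a sketch of the general technique, not the extension's own code (which is TypeScript): the function names are hypothetical, and the rule that a workspace must be a direct child of the system temp directory is an assumption added for the example.

```python
import shutil
import tempfile
from pathlib import Path

PREFIX = "pi-bench."


def make_workspace() -> Path:
    """Create a fresh workspace directory named pi-bench.<random> under the temp dir."""
    return Path(tempfile.mkdtemp(prefix=PREFIX))


def is_bench_workspace(path: Path) -> bool:
    """True only for a direct child of the temp dir whose name has the prefix."""
    # resolve() follows symlinks, so a link named pi-bench.x that points
    # elsewhere is judged by its real location and name.
    resolved = path.resolve()
    temp_root = Path(tempfile.gettempdir()).resolve()
    return resolved.parent == temp_root and resolved.name.startswith(PREFIX)


def remove_workspace(path: Path) -> bool:
    """Delete path if it is a benchmark workspace; refuse anything else."""
    if not is_bench_workspace(path):
        return False
    shutil.rmtree(path, ignore_errors=True)
    return True
```

Checking the resolved path before every delete means a bug elsewhere in the runner cannot turn a cleanup call into the removal of an unrelated directory.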

Timeouts

Each task has a timeout (default 180s; harder tasks use 240s or 360s). If a command hangs, the agent receives a steer message explaining the timeout and a 2× extended window in which to recover. If the agent makes no file changes, the task is recorded as FAIL, never as a false PASS.
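
The no-changes rule can be pictured as a comparison of workspace snapshots taken before and after the agent runs. How the runner actually detects changes is not described here, so the content hashing below is an assumption, and snapshot and outcome are hypothetical names.

```python
import hashlib
from pathlib import Path


def snapshot(work_dir: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(work_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(work_dir.rglob("*"))
        if p.is_file()
    }


def outcome(before: dict[str, str], after: dict[str, str], verify_exit_code: int) -> str:
    """PASS needs both a changed workspace and a verify exit code of 0."""
    if before == after:
        # The agent made no file changes, so a passing verify is not trusted.
        return "FAIL"
    return "PASS" if verify_exit_code == 0 else "FAIL"
```

With this rule, a task whose verify happens to exit 0 on the untouched setup files is still recorded as FAIL.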

Contributing

PRs welcome. Terminal-Bench has 241 Docker-based tasks; we've ported a subset that runs without Docker and will expand over time.