pi-bench

LLM benchmark toolkit for the pi coding agent. Probes every available model with real streaming API calls and ranks them by latency, cost, and output quality. Provides a curated model chain and a blacklist for smart model selection in pi-recap and other extensions.

Packages

extension

Package details

Install pi-bench from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:pi-bench
Package: pi-bench
Version: 0.2.5
Published: May 13, 2026
Downloads: 650/mo · 650/wk
Author: ffrappo
License: MIT
Types: extension
Size: 46.2 KB
Dependencies: 0 · 2 peers

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-bench

pi-bench banner

The LLM benchmark toolkit for the pi coding agent.

Find the fastest, cheapest LLM models among all registered providers.

Probes every available model with a real stream() call using a representative prompt, then ranks by latency, cost, and output quality. Designed to feed smart model selection into pi-recap and other pi extensions.
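Conceptually, a single probe measures two timestamps over one streamed response. A minimal sketch follows; the Provider shape and stream() signature here are assumptions for illustration, not pi's actual API.

// Illustrative probe: the Provider interface and stream() signature are
// assumptions for this sketch, not pi's real provider API.
interface Provider {
  stream(modelId: string, prompt: string): AsyncIterable<string>;
}

async function probeModel(provider: Provider, modelId: string, prompt: string) {
  const start = Date.now();
  let firstByteMs: number | undefined;
  let text = "";
  for await (const chunk of provider.stream(modelId, prompt)) {
    if (firstByteMs === undefined) firstByteMs = Date.now() - start; // time to first token
    text += chunk;
  }
  return { modelId, firstByteMs, completeMs: Date.now() - start, text };
}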

Features

  • Universal provider loading — discovers and loads all pi extensions (Alibaba, Kimi, etc.) the same way pi does
  • Real probes — fires actual streaming API calls, measures time-to-first-byte and completion
  • Quality scoring — classifies responses as ok / multi-sentence / refusal / question / empty (a rough sketch follows this list)
  • Cost aware — calculates per-call cost in USD using model pricing
  • 30s hard timeout — if the full run doesn't finish in time, the incremental CSV already contains every probe that completed
  • Per-provider concurrency — 8 parallel probes per provider to saturate throughput
  • Standalone or extension — runs as CLI script or as a pi slash command (/bench)
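The quality labels map to simple response heuristics. A rough sketch of what such a classifier can look like (illustrative heuristics only; the actual rules live in bench.mts):

type Quality = "ok" | "multi-sentence" | "refusal" | "question" | "empty";

// Illustrative heuristics for the quality labels above.
function classifyQuality(text: string): Quality {
  const t = text.trim();
  if (t.length === 0) return "empty";
  if (/\b(cannot|can't|won't|unable to)\b/i.test(t)) return "refusal";
  if (t.endsWith("?")) return "question";
  if ((t.match(/[.!?]/g) ?? []).length > 1) return "multi-sentence";
  return "ok";
}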

Usage

As a pi extension

Install into pi's extensions directory:

git clone https://github.com/fornace/pi-bench.git ~/.pi/agent/extensions/pi-bench

Then run inside pi:

/bench

Results are saved to bench-results-v6.csv in the extension directory.

Standalone CLI

cd ~/.pi/agent/extensions/pi-bench
npx -y -p tsx tsx bench.mts

With custom output directory:

npx -y -p tsx tsx bench.mts --output-dir /tmp/bench-output

Programmatic

import { runBench, printTable } from "./bench.mts";

const { results, csvPath, stats } = await runBench({
  outputDir: "/tmp/bench",
  timeoutMs: 30000,
  concurrency: 8,
});

console.log(printTable(results));
console.log(`Probed ${stats.final} models → ${csvPath}`);

Output

CSV (bench-results-v6.csv)

Column           Description
rank             Position in latency ranking (ok models only)
id               Model ID
provider         Provider name (alibaba-cloud, google-vertex, etc.)
api              API type (anthropic-messages, google-vertex, etc.)
family           Model family tag (flash, turbo, plus, max, pro, etc.)
t_first_byte_ms  Time to first token in ms
t_complete_ms    Time to completion in ms
output_tokens    Tokens generated
cost_usd         Estimated cost in USD
status           ok / timeout / error:... / empty
quality          ok / multi-sentence / refusal / question / empty
sample           First 60 chars of the response
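Consumers that want the raw data can read the CSV directly. A minimal reader sketch, assuming fields contain no embedded commas (the sample column may need more careful handling):

import { readFileSync } from "node:fs";

// Naive reader for bench-results-v6.csv; assumes no quoted or escaped fields.
function readBenchCsv(csvPath: string): Record<string, string>[] {
  const [header, ...rows] = readFileSync(csvPath, "utf8").trim().split("\n");
  const cols = header.split(",");
  return rows.map((row) => {
    const cells = row.split(",");
    return Object.fromEntries(cols.map((col, i) => [col, cells[i] ?? ""]));
  });
}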

Candidates file (bench-candidates.txt)

Lists all models that passed the filter, plus dropped models with reasons.

Configuration

Tunables (in bench.mts)

Constant                  Default  Description
PER_CALL_TIMEOUT_MS       4000     Max time per individual probe
TOTAL_RUN_TIMEOUT_MS      30000    Hard cap for the entire bench run
CONCURRENCY_PER_PROVIDER  8        Parallel probes per provider
BATCH_GAP_MS              200      Delay between probe batches

Filter

Models are filtered to text-capable candidates only. Blocklisted fragments: embed, audio, tts, whisper, transcribe, dall-e, dalle, imagen, stable-diffusion, midjourney, moderation, guard.
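In code terms the filter is essentially a substring check against that blocklist. A sketch of the idea (the real filter in bench.mts may apply additional rules):

// Fragments that mark non-text models (embeddings, audio, image, moderation).
const BLOCKED_FRAGMENTS = [
  "embed", "audio", "tts", "whisper", "transcribe",
  "dall-e", "dalle", "imagen", "stable-diffusion", "midjourney",
  "moderation", "guard",
];

// Keep only model IDs that contain none of the blocked fragments.
function isTextCandidate(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return !BLOCKED_FRAGMENTS.some((fragment) => id.includes(fragment));
}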

Typical Results

RANK  FB      TOTAL   COST         FAMILY   PROVIDER           ID
1     349ms   589ms   ~$0          plus     alibaba-cloud      qwen-vl-plus
2     436ms   620ms   ~$0          plus     alibaba-cloud      qwen-plus-2025-09-11
3     421ms   679ms   ~$0          flash    alibaba-cloud      qwen-flash
4     427ms   717ms   ~$0          turbo    alibaba-cloud      qwen-turbo
5     488ms   719ms   ~$0          plus     alibaba-cloud      qwen-vl-plus-2025-05-07

Top models are typically Alibaba Cloud Qwen variants at sub-700ms latency and ~$0 cost.

Headless mode — using pi-bench from other plugins

pi-bench is designed to be consumed by other pi extensions. There are three integration patterns:

Static imports (no runtime)

Import curated data directly from the package — no benchmark run needed:

import { CURATED_CHAIN, BLACKLIST_SEED } from "pi-bench";

// CURATED_CHAIN: ordered list of fast/cheap model IDs, ranked by latest bench
// BLACKLIST_SEED: known-bad models (404s, refusals, empty responses)

pi-recap uses this for its model picker chain. When you run a new benchmark, pi-bench updates CURATED_CHAIN and pi-recap picks up the new winners automatically — no config changes needed.
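A consumer can walk the chain and skip anything seeded as bad. A minimal sketch, assuming BLACKLIST_SEED is a plain list of model IDs and that the consumer supplies its own list of registered models:

import { CURATED_CHAIN, BLACKLIST_SEED } from "pi-bench";

// Pick the first curated model that is actually registered and not known-bad.
// `registeredIds` is the caller's own list of available model IDs (assumed here).
function pickFastModel(registeredIds: string[]): string | undefined {
  const blacklist = new Set(BLACKLIST_SEED);
  return CURATED_CHAIN.find((id) => registeredIds.includes(id) && !blacklist.has(id));
}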

Benchmark UI component

Reuse the interactive model selector from your own extension:

import { showBenchmarkUI } from "pi-bench/ui.js";

// csvPath points to bench-results-v6.csv
const picked = await showBenchmarkUI(ctx, csvPath, "Pick a model");

This renders a scrollable, filterable SelectList with all benched models ranked by latency. Returns the selected model ID. Used by pi-recap's /recap → model: ... menu.

Finding the benchmark data directory

The CSV lives in the pi-bench extension directory. Resolve it at runtime:

import { fileURLToPath } from "node:url";
import * as path from "node:path";

const benchDir = path.dirname(fileURLToPath(import.meta.resolve("pi-bench/package.json")));
const csvPath = path.join(benchDir, "bench-results-v6.csv");

Headless vs UI mode

When pi-bench runs as a slash command (/bench), it detects whether a TUI is available via ctx.hasUI. Without a TUI (headless mode), results are printed to the console. With a TUI, the interactive selector is shown. The same benchmark subprocess runs in both cases — only the output display changes.
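In code, the display branching looks roughly like this. An illustrative sketch, not the actual command implementation; the ctx shape is assumed and runBench/printTable are imported as in the programmatic example above:

import { runBench, printTable } from "./bench.mts";
import { showBenchmarkUI } from "pi-bench/ui.js";

// Rough shape of the /bench display branching: same benchmark run,
// interactive selector with a TUI, plain table without one.
async function runBenchCommand(ctx: { hasUI: boolean }) {
  const { results, csvPath } = await runBench({
    outputDir: "/tmp/bench",
    timeoutMs: 30000,
    concurrency: 8,
  });
  if (ctx.hasUI) {
    // Interactive: let the user pick from the ranked list.
    return showBenchmarkUI(ctx, csvPath, "Pick a model");
  }
  // Headless: just print the ranked table.
  console.log(printTable(results));
  return undefined;
}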

License

MIT

From the same author

By Francesco Frapporti at Fornace.

  • pi-recap — Always-visible session recap panel for pi. Uses pi-bench data to pick the fastest summarization model.
  • pi-banana — Generate and edit images inside pi using Google Nano Banana. Banner images for all these packages were created with pi-banana.
  • pi-alibaba-models — Complete Alibaba provider for pi: Qwen, DeepSeek, Kimi, GLM, MiniMax with native thinking levels.
  • pi-notte-theme — Notte: a true-dark pi theme where darkness has color and text glows like terminal phosphor.