@tryinget/pi-evalset-lab
pi extension for fixed-task-set eval runs and prompt/system comparisons
Package details
Install @tryinget/pi-evalset-lab from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:@tryinget/pi-evalset-lab
- Package: @tryinget/pi-evalset-lab
- Version: 0.2.0
- Published: May 14, 2026
- Downloads: 79/mo · 8/wk
- Author: tryinget
- License: SEE LICENSE IN LICENSE
- Types: extension, prompt
- Size: 135.2 KB
- Dependencies: 0 dependencies · 2 peers
Pi manifest JSON
{
  "extensions": [
    "./extensions/evalset.ts"
  ],
  "prompts": [
    "./prompts"
  ]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
summary: "Overview and quickstart for @tryinget/pi-evalset-lab."
read_when:
  - "Starting work in this package workspace."
  - "Using /evalset run or /evalset compare."
system4d:
  container: "Monorepo package for a pi fixed-task-set evaluation extension."
  compass: "Keep prompt/system comparisons small, reproducible, and easy to inspect."
  engine: "Define dataset -> run or compare variants -> export JSON/HTML report -> review deltas."
  fog: "Model/provider nondeterminism can make brittle checks noisy."
@tryinget/pi-evalset-lab
Monorepo package for fixed-task-set eval workflows in Pi (/evalset run|compare) with reproducible JSON reports and static HTML export.
- Workspace path: packages/pi-evalset-lab
- Release component key: pi-evalset-lab
- Former legacy standalone source: ~/programming/pi-extensions/pi-evalset-lab
- Canonical package status: canonicalized here; the legacy repo was archived to ~/programming/pi-extensions/pi-evalset-lab-final-archive.tar.gz and removed after validation.
- Session-history migration: no legacy Pi session-history directory existed for the old path, so relocation was recorded as skip-no-history.
Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.
Runtime dependencies and packaged files
This package expects Pi host runtime APIs and declares them as peerDependencies:
- @mariozechner/pi-coding-agent
- @mariozechner/pi-ai
The npm package uses a files whitelist so required runtime artifacts are explicitly included:
- extensions/evalset.ts
- prompts/
- examples/ (sample datasets + sample report UI)
- scripts/export-evalset-report-html.mjs
Quickstart
Install package dependencies for local validation:
cd packages/pi-evalset-lab
npm install
npm run check
Install into Pi from the package directory containing package.json:
pi install /absolute/path/to/pi-extensions/packages/pi-evalset-lab
# then in Pi: /reload
For ad hoc source testing from this package directory:
pi -e ./extensions/evalset.ts
evalset command
/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset is a Pi slash command, not a shell executable.
Interactive mode:
pi -e ./extensions/evalset.ts
# then inside Pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
Non-interactive mode:
pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
Interactive sessions use Pi UI hooks (ctx.ui) for status/notify updates. Non-interactive -p mode skips those UI calls when ctx.hasUI === false.
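The guard described here can be sketched as follows. This is a minimal illustration, not the extension's actual code: only `ctx.ui` and `ctx.hasUI` are named in this README, while the `UiHooks` and `UiContext` shapes, the `setStatus` and `notify` method names, and the `reportProgress` helper are assumptions made for the example.

```typescript
// Hypothetical shapes: only `ui` and `hasUI` are named in this README.
interface UiHooks {
  setStatus(text: string): void; // assumed method name
  notify(text: string): void; // assumed method name
}

interface UiContext {
  hasUI: boolean;
  ui: UiHooks;
}

// Push a status update to the UI when one exists.
// Returns true if the UI was called and false if the call was skipped.
function reportProgress(ctx: UiContext, message: string): boolean {
  if (ctx.hasUI === false) {
    return false; // non-interactive run: skip UI calls entirely
  }
  ctx.ui.setStatus(message);
  return true;
}
```

Keeping the check in one helper means the eval loop itself does not need to know whether it is running interactively.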
Included datasets and sample output
- examples/fixed-task-set.json: tiny smoke set (3 cases)
- examples/fixed-task-set-v2.json: larger first pass set
- examples/fixed-task-set-v3.json: less brittle checks (recommended)
- examples/evalset-compare-sample-embedded.html: self-contained report UI with embedded compare JSON
- examples/evalset-compare-sample.png: screenshot preview of that HTML report
- examples/system-baseline.txt and examples/system-candidate.txt: compare inputs
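To illustrate what "less brittle checks" can mean in practice, the sketch below contrasts an exact-match check with a normalized phrase check. The function names and check styles are hypothetical and do not describe the dataset schema used by the example files.

```typescript
// Hypothetical check helpers; not the dataset schema shipped in examples/.

// Brittle: passes only when the output equals the expected text exactly.
function exactMatch(output: string, expected: string): boolean {
  return output === expected;
}

// Collapse whitespace and case so cosmetic differences do not matter.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

// Less brittle: passes when every required phrase appears after normalization.
function containsAll(output: string, required: string[]): boolean {
  const haystack = normalize(output);
  return required.every((phrase) => haystack.includes(normalize(phrase)));
}
```

For a model output such as `"The answer is  42.\n"`, `exactMatch(output, "the answer is 42")` fails on casing, spacing, and punctuation, while `containsAll(output, ["answer is 42"])` passes. Tolerant checks of this kind tend to be less noisy under model or provider nondeterminism.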
Preview: see examples/evalset-compare-sample.png.
Reports are written to the path given by --out <path> when it is provided; otherwise they are written to .evalset/reports/*.json under the current project directory.
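The output rule can be sketched as a small resolver. The default `<mode>-<dataset>-<timestamp>.json` naming is inferred from the example paths in the HTML export section of this README rather than confirmed, the use of UTC is an assumption, and `resolveReportPath` and its helpers are hypothetical names.

```typescript
// Hypothetical helper; the default file naming is inferred, not confirmed.
type Mode = "run" | "compare";

// Format a Date as YYYYMMDDTHHMMSS in UTC (the time zone is an assumption).
function compactTimestamp(date: Date): string {
  // toISOString() yields e.g. "2026-05-14T09:30:05.000Z".
  return date.toISOString().slice(0, 19).replace(/[-:]/g, "");
}

// Strip directories and a trailing ".json" from the dataset path.
function datasetStem(datasetPath: string): string {
  const base = datasetPath.split(/[\\/]/).pop() ?? datasetPath;
  return base.replace(/\.json$/i, "");
}

// An explicit --out value wins; otherwise fall back to .evalset/reports/.
function resolveReportPath(
  mode: Mode,
  datasetPath: string,
  now: Date,
  out?: string,
): string {
  if (out !== undefined && out !== "") {
    return out;
  }
  const name = `${mode}-${datasetStem(datasetPath)}-${compactTimestamp(now)}.json`;
  return `.evalset/reports/${name}`;
}
```

Passing `now` as a parameter keeps the function deterministic, which makes the naming rule easy to test.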
Each report includes run identity metadata (runId, datasetHash, casesHash, and variant hashes). Session messages keep lightweight report metadata only, not full report bodies.
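One way to derive such identity fields is to hash a canonical serialization of each input, so the same dataset and system prompts always produce the same hashes. The sketch below is an illustration only: the hash function (a 32-bit FNV-1a variant over UTF-16 code units, chosen to keep the example dependency-free), the key-sorted serialization, the `{ cases: [...] }` dataset shape, and every name except `datasetHash` and `casesHash` are assumptions, not the package's implementation. `runId` is omitted because its derivation is not described here.

```typescript
// Serialize JSON-compatible values with object keys sorted,
// so key order does not change the resulting hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
// A stand-in for whatever hash the package really uses.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

interface RunIdentity {
  datasetHash: string;
  casesHash: string;
  variantHashes: Record<string, string>; // assumed field name
}

// Hypothetical dataset shape: an object with a `cases` array.
function buildIdentity(
  dataset: { cases: unknown[] },
  variants: Record<string, string>,
): RunIdentity {
  const variantHashes: Record<string, string> = {};
  for (const name of Object.keys(variants)) {
    variantHashes[name] = fnv1a32(variants[name]);
  }
  return {
    datasetHash: fnv1a32(canonicalJson(dataset)),
    casesHash: fnv1a32(canonicalJson(dataset.cases)),
    variantHashes,
  };
}
```

With hashes like these, two reports can be compared only when their `casesHash` values match, which guards against reading a delta across different task sets.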
Export report JSON to static HTML
npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"
Script: scripts/export-evalset-report-html.mjs
Validation and release checks
Package-local validation:
npm run check
npm run release:check:quick
Monorepo-scoped validation:
cd ../..
bash ./scripts/package-quality-gate.sh ci packages/pi-evalset-lab
node ./scripts/release-components.mjs validate
Release metadata is root-managed through x-pi-template.releaseConfigMode=component and component key pi-evalset-lab.
The scoped package @tryinget/pi-evalset-lab is the canonical npm identity for future releases. The old unscoped pi-evalset-lab@0.2.0 package remains historical registry state, not the canonical development target.
Optional core hooks (future, not required)
This extension works today without Pi core changes. Optional hardening could include stable agent-level lineage IDs, explicit reproducibility metadata in pi-ai, shared provider payload hashing, or a headless agent-eval API for tool-heavy/full agent-loop benchmark runs.