@tryinget/pi-evalset-lab

pi extension for fixed-task-set eval runs and prompt/system comparisons

Install @tryinget/pi-evalset-lab from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:@tryinget/pi-evalset-lab

Package: @tryinget/pi-evalset-lab
Version: 0.3.0
Published: Jul 13, 2026
Downloads: 204/mo · 186/wk
Author: tryinget
License: SEE LICENSE IN LICENSE
Types: extension, prompt
Size: 139.2 KB
Dependencies: 0 dependencies · 2 peers

Pi manifest JSON

{
  "extensions": [
    "./extensions/evalset.ts"
  ],
  "prompts": [
    "./prompts"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

summary: "Overview and quickstart for @tryinget/pi-evalset-lab."
read_when:
  - "Starting work in this package workspace."
  - "Using /evalset run or /evalset compare."
system4d:
  container: "Monorepo package for a pi fixed-task-set evaluation extension."
  compass: "Keep prompt/system comparisons small, reproducible, and easy to inspect."
  engine: "Define dataset -> run or compare variants -> export JSON/HTML report -> review deltas."
  fog: "Model/provider nondeterminism can make brittle checks noisy."

@tryinget/pi-evalset-lab

Monorepo package for fixed-task-set eval workflows in Pi (/evalset run|compare) with reproducible JSON reports and static HTML export.

Workspace path: packages/pi-evalset-lab
Release component key: pi-evalset-lab
Former legacy standalone source: ~/programming/pi-extensions/pi-evalset-lab
Canonical package status: canonicalized here; the former ~/programming/pi-extensions/ workspace was later deleted at operator request.
Session-history migration: no legacy Pi session-history directory existed for the old package path, so relocation was recorded as skip-no-history.

Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.

Runtime dependencies and packaged files

This package expects Pi host runtime APIs and declares them as peerDependencies:

@earendil-works/pi-coding-agent
@earendil-works/pi-ai

The npm package uses a files whitelist so required runtime artifacts are explicitly included:

extensions/evalset.ts
prompts/
examples/ (sample datasets + sample report UI)
scripts/export-evalset-report-html.mjs

Quickstart

Install package dependencies for local validation:

cd packages/pi-evalset-lab
npm install
npm run check

Install into Pi from the package directory containing package.json:

pi install /absolute/path/to/pi-extensions/packages/pi-evalset-lab
# then in Pi: /reload

For ad hoc source testing from this package directory:

pi -e ./extensions/evalset.ts

evalset command

/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]

/evalset is a Pi slash command, not a shell executable.
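
To make the option grammar above concrete, here is a minimal sketch of how the documented /evalset run flags could be parsed from already-tokenized input. It is not the extension's actual parser: the names RunOptions and parseRunArgs are hypothetical, and quoting (for example a --system-text value containing spaces) is assumed to be handled before this step.

```typescript
// Hypothetical sketch: NOT the extension's actual parser.
// Only the flag names come from the documented `/evalset run` syntax.
interface RunOptions {
  dataset: string;
  systemFile?: string;
  systemText?: string;
  variant?: string;
  maxCases?: number;
  temperature?: number;
  out?: string;
}

// Every documented flag takes exactly one value.
const STRING_FLAGS = new Set(["--system-file", "--system-text", "--variant", "--out"]);
const NUMBER_FLAGS = new Set(["--max-cases", "--temperature"]);

function parseRunArgs(tokens: string[]): RunOptions {
  const queue = [...tokens];
  const positional: string[] = [];
  const flags = new Map<string, string>();

  while (queue.length > 0) {
    const token = queue.shift() as string;
    if (STRING_FLAGS.has(token) || NUMBER_FLAGS.has(token)) {
      const value = queue.shift(); // consume the flag's value
      if (value === undefined) throw new Error(`Missing value for ${token}`);
      flags.set(token, value);
    } else if (token.startsWith("--")) {
      throw new Error(`Unknown flag: ${token}`);
    } else {
      positional.push(token);
    }
  }

  const [dataset] = positional;
  if (dataset === undefined || positional.length !== 1) {
    throw new Error("Expected exactly one <dataset.json> argument");
  }

  // Convert a numeric flag, rejecting values that are not finite numbers.
  const toNumber = (flag: string): number | undefined => {
    const raw = flags.get(flag);
    if (raw === undefined) return undefined;
    const parsed = Number(raw);
    if (!Number.isFinite(parsed)) throw new Error(`${flag} expects a number, got "${raw}"`);
    return parsed;
  };

  return {
    dataset,
    systemFile: flags.get("--system-file"),
    systemText: flags.get("--system-text"),
    variant: flags.get("--variant"),
    maxCases: toNumber("--max-cases"),
    temperature: toNumber("--temperature"),
    out: flags.get("--out"),
  };
}
```

Rejecting unknown flags instead of ignoring them is a design choice of this sketch: a mistyped --max-case would otherwise silently run the full dataset.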

Interactive mode:

pi -e ./extensions/evalset.ts
# then inside Pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt

Non-interactive mode:

pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"

Interactive sessions use Pi UI hooks (ctx.ui) for status/notify updates. In non-interactive -p mode, ctx.hasUI === false and those UI calls are skipped.
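
The guard described above can be sketched as follows. The ContextLike and UiLike shapes are hypothetical stand-ins: only ctx.ui and ctx.hasUI are named in this README, and the real Pi context type and its notify signature may differ.

```typescript
// Hypothetical shapes for illustration; the real Pi context is richer and may differ.
interface UiLike {
  notify(message: string): void;
}

interface ContextLike {
  hasUI: boolean;
  ui?: UiLike;
}

// Send a progress update through the UI when one exists.
// Returns true if the update was shown, false if it was skipped (headless mode).
function reportProgress(ctx: ContextLike, message: string): boolean {
  if (!ctx.hasUI || !ctx.ui) return false;
  ctx.ui.notify(message);
  return true;
}
```

Routing every UI call through one guarded helper keeps the eval logic identical in both modes, so a headless -p run exercises the same code path as an interactive one.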

Included datasets and sample output

examples/fixed-task-set.json — tiny smoke set (3 cases)
examples/fixed-task-set-v2.json — larger first pass set
examples/fixed-task-set-v3.json — less brittle checks (recommended)
examples/evalset-compare-sample-embedded.html — self-contained report UI with embedded compare JSON
examples/evalset-compare-sample.png — screenshot preview of that HTML report
examples/system-baseline.txt and examples/system-candidate.txt — compare inputs

Preview: Evalset compare sample screenshot (examples/evalset-compare-sample.png)

Reports are written to the path given by --out <path> when provided; otherwise they go to .evalset/reports/*.json under the current project directory.
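
As an illustration of that rule, the sketch below resolves a report path. The function names are hypothetical, and several details are assumptions rather than documented behavior: the default file name pattern is inferred from the export examples in this README (mode, dataset name, YYYYMMDDTHHMMSS timestamp), the timestamp is taken in UTC, and a relative --out is resolved against the project directory.

```typescript
import * as path from "node:path";

type Mode = "run" | "compare";

// Format a Date as YYYYMMDDTHHMMSS. Using UTC is an assumption of this sketch.
function compactTimestamp(date: Date): string {
  // "2026-07-13T08:09:10.000Z" -> "20260713T080910"
  return date.toISOString().replace(/[-:]/g, "").slice(0, 15);
}

// Hypothetical helper: an explicit --out wins, otherwise use the default reports directory.
function resolveReportPath(
  mode: Mode,
  datasetPath: string,
  projectDir: string,
  out?: string,
  now: Date = new Date(),
): string {
  if (out !== undefined) return path.resolve(projectDir, out);
  const datasetName = path.basename(datasetPath, path.extname(datasetPath));
  const fileName = `${mode}-${datasetName}-${compactTimestamp(now)}.json`;
  return path.join(projectDir, ".evalset", "reports", fileName);
}
```

Passing the clock in as a parameter keeps the helper deterministic, which makes the naming rule easy to check in a unit test.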

Each report includes run identity metadata (runId, datasetHash, casesHash, and variant hashes). Session messages keep lightweight report metadata only, not full report bodies.
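
One way such identity hashes can be derived is sketched below. This is an assumption-laden illustration, not the extension's implementation: the README does not state the hash algorithm (SHA-256 is used here), the dataset is assumed to expose a cases array, and stableStringify, sha256, and runIdentity are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Serialize JSON-compatible values with sorted object keys,
// so equal content hashes equally regardless of key order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

interface RunIdentity {
  datasetHash: string;
  casesHash: string;
  variantHash: string;
}

// Assumed dataset shape: an object with a `cases` array (hypothetical).
function runIdentity(dataset: { cases: unknown[] }, systemPrompt: string): RunIdentity {
  return {
    datasetHash: sha256(stableStringify(dataset)),
    casesHash: sha256(stableStringify(dataset.cases)),
    variantHash: sha256(systemPrompt),
  };
}
```

With content hashes like these, two reports can be treated as comparable only when their casesHash values match, and a changed variantHash shows that the system prompt differed between runs.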

Export report JSON to static HTML

npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"

Script: scripts/export-evalset-report-html.mjs
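
For readers curious how a report can be made self-contained, here is a minimal sketch of embedding report JSON in a static HTML page. It does not reproduce scripts/export-evalset-report-html.mjs; renderReportHtml, escapeHtml, and the report-data element id are hypothetical.

```typescript
// Escape text for safe use in HTML element content and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical renderer: embeds the report as a JSON data block inside the page.
function renderReportHtml(report: unknown, title: string): string {
  // Replace "<" with its JSON escape so report text containing "</script>"
  // cannot close the script element early; JSON.parse restores the character.
  const payload = JSON.stringify(report).replace(/</g, "\\u003c");
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><title>${escapeHtml(title)}</title></head>`,
    "<body>",
    `<script id="report-data" type="application/json">${payload}</script>`,
    "</body></html>",
  ].join("\n");
}
```

Because the data travels inside the page, the resulting file can be opened or shared without a server, at the cost of a larger file for big reports.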

Validation and release checks

Package-local validation:

npm run check
npm run release:check:quick

Monorepo-scoped validation:

cd ../..
bash ./scripts/package-quality-gate.sh ci packages/pi-evalset-lab
node ./scripts/release-components.mjs validate

Release metadata is root-managed through x-pi-template.releaseConfigMode=component and component key pi-evalset-lab.

The scoped package @tryinget/pi-evalset-lab is the canonical npm identity for future releases. The old unscoped pi-evalset-lab@0.2.0 package remains historical registry state, not the canonical development target.

Optional core hooks (future, not required)

This extension works today without Pi core changes. Optional hardening could include stable agent-level lineage IDs, explicit reproducibility metadata in pi-ai, shared provider payload hashing, or a headless agent-eval API for tool-heavy/full agent-loop benchmark runs.