pi-autocontext-lean-verify
Experimental Pi package for Lean-verified proof repair using autocontext and Pi.
Package details
Install pi-autocontext-lean-verify from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-autocontext-lean-verify- Package
pi-autocontext-lean-verify- Version
0.1.18- Published
- Jun 2, 2026
- Downloads
- 2,084/mo · 78/wk
- Author
- jayscambler
- License
- Apache-2.0
- Types
- extension, skill, prompt
- Size
- 913.6 KB
- Dependencies
- 3 dependencies · 0 peers
Pi manifest JSON
{
"extensions": [
"./extensions"
],
"skills": [
"./skills"
],
"prompts": [
"./prompts"
]
}Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-autocontext-lean-verify
Experimental Pi package for verifier-backed Lean proof repair using autocontext and Pi.
Working product framing:
A Pi plugin for verifier-backed formal proof repair using autocontext and Lean.
This is not a general theorem prover and should not be described as “autocontext proves theorems.” Pi/autocontext proposes and repairs proof bodies; Lean is the correctness oracle.
Status
- Published npm package. First trusted-publisher/provenance release:
pi-autocontext-lean-verify@0.1.4. - Tracks Linear issue
AC-731. - License: Apache-2.0.
- Default harness root in the standalone package: bundled
harness/directory. AUTOCONTEXT_FORMAL_ROOTcan override the harness root for local experiments.
Install
From npm:
pi install npm:pi-autocontext-lean-verify
From a local checkout:
pi install ./pi-autocontext-lean-verify
Or test a local checkout for one run without installing:
pi -e ./pi-autocontext-lean-verify
If the package is moved away from the harness, set:
export AUTOCONTEXT_FORMAL_ROOT=/path/to/support/formal-proof-lean-pilot
The extension looks for Lean at:
$LEAN
$AUTOCONTEXT_LEAN
/tmp/autocontext-elan-home/bin/lean
Autocontext is the repair engine and Lean is the oracle. Repair runs invoke the Python autocontext CLI on demand through:
uvx --python 3.12 --from autocontext==0.5.1 autoctx improve ...
action="preflight" checks uvx and autocontext==0.5.1 / autoctx improve availability so this dependency is explicit rather than a decorative package import.
Long-running run/benchmark/attribution actions default their result directories to a short temp path under the system temp directory (pi-autocontext-lean-verify/...) (or $AUTOCONTEXT_LEAN_VERIFY_RESULTS_ROOT) so npm-loaded extensions do not create overly long Pi session paths.
Provided resources
- Extension tool:
autocontext_lean_verify - Skill:
lean-verify - Prompt template:
lean-verify.md - Fixture group registry:
fixture_groups.json - Domain model, validation matrix, proof-transfer benchmark, and repo split notes:
docs/DOMAIN.md,docs/VALIDATION.md,docs/PROOF_TRANSFER_BENCHMARK.md,docs/REPO_SPLIT.md
Tool actions
Preflight
Checks harness paths, Lean, Python, manifest, and seed playbook.
{
"action": "preflight"
}
Setup
Runs preflight and a minimal Lean smoke proof (add_zero_right). Use this after first installing or loading the package.
{
"action": "setup"
}
Run
Runs run_playbook_transfer.py through a named mode. Use either explicit fixtures or a fixtureGroup.
{
"action": "run",
"mode": "pre_repair_hint",
"fixtureGroup": "broader",
"maxAttempts": 2,
"rounds": 2,
"timeoutSeconds": 60
}
Negative-control recovery
Use maxAttempts=3 for the expanded negative-control group because maxAttempts=2 showed provider variance while maxAttempts=3 recovered all six fixtures in the latest probe.
{
"action": "run",
"mode": "pre_repair_hint",
"fixtureGroup": "negative_controls",
"maxAttempts": 3,
"rounds": 2,
"timeoutSeconds": 60
}
Benchmark
Runs the packaged proof-transfer benchmark: seeded autocontext/Pi vs direct Pi repair-loop on the same fixed Lean templates. The default fixture group is challenge_v3_generalization; pregeneration and synthetic hint candidates are disabled.
{
"action": "benchmark",
"fixtureGroup": "challenge_v3_generalization",
"maxAttempts": 2,
"rounds": 2,
"timeoutSeconds": 120
}
From a checkout, the same benchmark can be launched without Pi:
npm run benchmark:v3
npm run benchmark:v4
npm run benchmark:v5
npm run benchmark:v6
npm run benchmark:v7
npm run benchmark:v8
npm run benchmark:v9
npm run benchmark:v10
npm run benchmark:v11
npm run benchmark:v12
npm run benchmark:v13
npm run benchmark:v14
npm run benchmark:v15
npm run benchmark:v16
npm run benchmark:v17
npm run benchmark:v18
npm run benchmark:v19
npm run benchmark:v20
npm run benchmark:v21
Attribution
Runs seeded autocontext, unseeded isolated autocontext, and direct Pi repair-loop for attribution. The default fixture group is challenge_v5_attribution.
{
"action": "attribution",
"fixtureGroup": "challenge_v5_attribution",
"maxAttempts": 2,
"rounds": 2,
"timeoutSeconds": 120
}
From a checkout:
npm run benchmark:v5:attribution
npm run benchmark:v6
npm run benchmark:v7
npm run benchmark:v8
npm run benchmark:v9
npm run benchmark:v10
npm run benchmark:v11
npm run benchmark:v12
npm run benchmark:v13
npm run benchmark:v14
npm run benchmark:v15
npm run benchmark:v16
npm run benchmark:v17
npm run benchmark:v18
npm run benchmark:v19
npm run benchmark:v20
npm run benchmark:v21
Summarize
Reads a saved run summary.
{
"action": "summarize",
"runRoot": "results/20260506T_broader_fixture_baseline_comparison"
}
Fixture groups
| Group | Fixtures |
|---|---|
smoke |
add_zero_right |
broader |
Seven broader fixtures covering Nat distribution, Bool/List recursion, list helpers, tree helpers, and reverse/append. |
heldout |
Seven original held-out transfer fixtures. |
combined |
broader + heldout (14 fixtures). |
negative_controls |
Six fixtures: distribution, multiplication associativity/commutativity, addition cancellation, and filter length bound negative controls. |
challenge_v2_no_helper |
Three no-expected-proof fixtures requiring local helper lemma discovery. |
challenge_v3_generalization |
Four no-expected-proof theorem-generalization fixtures requiring accumulator/multi-helper proof plans. |
challenge_transfer |
All v2/v3 challenge fixtures (7 fixtures). |
challenge_v4_count |
Four no-expected-proof count/list/tree fixtures. |
challenge_v5_attribution |
Four no-expected-proof fixtures for seeded/unseeded/direct attribution. |
challenge_v5_tree_tally |
The hard v5 tree/tally mirror fixture. |
challenge_v6_frontier |
Four no-expected-proof frontier fixtures where nested tree/partition composition starts separating seeded/unseeded/direct. |
challenge_v7_frontier |
Four no-expected-proof frontier-plus fixtures where seeded autocontext becomes unstable (2/4) while unseeded/direct are 0/4. |
challenge_v8_diagnostics |
Four no-expected-proof diagnostic fixtures isolating the v7 partition-heavy accumulator failures into smaller raw-tree, keep/drop, and list reassembly components. |
challenge_v9_composition_gradient |
Six no-expected-proof composition-gradient fixtures splitting keep/drop partition-through-mirror into scalar and paired length/sum obligations. |
challenge_v10_stats_reification |
Six no-expected-proof statsAcc-reification fixtures isolating tuple normalization, metric extensionality, count metrics, and stats equality from metric hypotheses. |
challenge_v11_metric_composition |
Six no-expected-proof metric-composition fixtures combining keep/drop metric bundles with statsAcc extensionality boundaries. |
challenge_v12_simultaneous_metrics |
Seven no-expected-proof simultaneous-metric fixtures isolating keep/drop count-only, length-only, sum-only, pair, and triple metric bundles. |
challenge_v13_decomposition_order |
Six no-expected-proof fixtures isolating v12 triple-bundle ordering, append/filter sub-lemmas, tree-only metrics, and conjunction reassembly. |
challenge_v14_metric_order_permutations |
Six no-expected-proof fixtures covering all metric-order permutations for the full simultaneous keep/drop triple bundle. |
challenge_v15_proof_shape_hints |
Four no-expected-proof fixtures exposing metric-bundle, side-bundle, append/filter, and reassembly hypotheses for the hard length/count/sum order. |
challenge_v16_compact_reassembly_hints |
Four no-expected-proof fixtures replacing the large v15 reassembly hypothesis with compact generic/named packer hypotheses. |
challenge_v17_proof_plan_hints |
Four no-expected-proof fixtures adding harmless True proof-plan skeleton names to v16 compact packer shapes. |
challenge_v18_prompt_only_skeleton_hints |
Two no-expected-proof fixtures reusing v16 theorem statements while supplying detailed skeleton names only through seeded playbook context. |
challenge_v19_bare_skeleton_names |
Ablation over the two clean v18 fixtures using only bare skeleton-name bullets in the seeded playbook. |
challenge_v20_description_only_skeleton |
Ablation over the two clean v18 fixtures using compact skeleton descriptions without plan_... labels in the seeded playbook. |
challenge_v21_neutral_anchor_skeleton |
Ablation over the two clean v18 fixtures using neutral step anchors plus compact descriptions without plan_... labels in the seeded playbook. |
challenge_v22_code_like_anchor_skeleton |
Ablation over the two clean v18 fixtures using compact descriptions plus code-like snake_case anchors without the earlier detailed-plan prefix. |
challenge_v23_exact_labels_without_plan_prefix |
Ablation over the two clean v18 fixtures using the exact v18 detailed labels with only the prefix removed. |
challenge_v24_plan_prefix_generic_labels |
Ablation over the two clean v18 fixtures using compact descriptions plus generic plan_ labels that differ from v18 exact labels. |
challenge_v25_step_prefix_exact_labels |
Ablation over the two clean v18 fixtures using compact descriptions plus step_ prefixed exact v18 label roots. |
challenge_extended_transfer |
All v2/v3/v4/v5/v6/v7/v8/v9/v10/v11/v12/v13/v14/v15/v16/v17/v18 challenge fixtures (78 fixtures). |
Modes
| Mode | Behavior |
|---|---|
seeded_pregenerate |
Use verified playbook to pregenerate a proof, then Lean-check. |
pre_repair_hint |
No pregeneration; try verifier-checked candidates before primary Pi repair. |
post_repair_hint |
No pregeneration; run primary Pi repair before hint candidates. |
structured_retry |
No pregeneration; structured alternate Pi retry without hint candidates. |
Current strongest repair-only mode:
pre_repair_hint
It maps to:
--no-pregenerate
--structured-alternate-retry
--structured-hint-candidates
--pre-repair-hint-candidates
--max-attempts 2
--rounds 2
--timeout 60
Frontier benchmark highlights
- v6 frontier: seeded
4/4, unseeded3/4, direct2/4; hardest pattern was nested tree mirror + partition/count/length invariants. - v7 frontier-plus: seeded
2/4, unseeded0/4, direct0/4; repeated seeded-failed partition-heavy fixtures stayed0/6. - v8 diagnostics: decomposing v7 gives seeded a clear edge but not full stability across repeats (
6/8seeded,1/8unseeded,0/8direct); raw stats and list-only reassembly are stable, while keep/drop partition-through-mirror remains stochastic. - v9 composition gradient: seeded solves all scalar/pair length+sum partition-through-mirror obligations (
6/6), while unseeded solves1/6and direct0/6; the remaining frontier points towardstatsAcctriple reification/composition rather than length/sum invariants themselves. - v10 stats reification: seeded solves the full layer (
6/6), unseeded solves core reification/count metrics (3/6), and direct solves only drop-count (1/6); the frontier is composing tuple normalization with metric equalities in the keep/drop tree-mirror setting. - v11 metric composition: seeded solves
5/6, unseeded0/6, direct1/6; the remaining seeded miss is the combined keep/drop metric-bundle proof, while applying provided metric bundles tostatsAccis comparatively easy. - v12 simultaneous metrics: seeded solves all single-metric and two-metric simultaneous keep/drop bundles (
6/7full suite); isolated repeats show the full count+length+sum bundle is stochastic for seeded (1/3across observations), while unseeded/direct have not solved it. - v13 decomposition/order: seeded solves
5/6, unseeded0/6, direct1/6; raw tree metrics, append/filter sub-lemmas, supplied append/filter hypotheses, side-grouped bundles, and conjunction reassembly all pass seeded. The only seeded miss is reordered full simultaneous metric grouping (sum, thencount, thenlength), sharpening the frontier to conjunction ordering/grouping sensitivity. - v14 metric-order permutations: seeded solves
5/6, unseeded/direct0/6; canonical count/length/sum and v13's failed sum/count/length order both solve seeded, while length/count/sum misses. Focused seeded repeats put the v13 miss at2/3and the v14 miss at2/3, confirming stochastic/non-canonical conjunction-order sensitivity rather than a simple fixed order failure. A later focused budget probe showedmaxAttempts=3with the same 120s Pi timeout produced only timeout-empty repairs (0/3for each miss). Raising timeout to 240s with the originalmaxAttempts=2recovers proofs but does not fully stabilize the frontier (2/3focused repeats for each miss;3/4each including the earlier timeout-240 smoke). - v15 proof-shape hints: seeded solves
3/4, unseeded0/4, direct2/4; metric/side reassembly is easy for direct, append/filter hypotheses let seeded solve the hard length/count/sum tree-induction shape, and the explicit reassembly-hyp variant misses in the full run. Focused repeats show append/filter hypotheses are stable seeded (3/3,4/4including original), while the larger explicit reassembly-hyp variant remains unstable (1/3,1/4including original). The frontier is not final conjunction assembly alone; compact helper hypotheses help more than large higher-order proof-shape hints. - v16 compact reassembly hints: seeded solves
2/4, unseeded/direct0/4; compact top-level Prop and generic Nat scalar packers solve in the full suite. The two first-run compact-packer misses each solve2/3on focused repeat (2/4including original), so compact packers reduce the v15 higher-order-hypothesis drag but do not remove stochastic induction/simp proof-search variance. Timeout 240 improves named-metric repeats to3/3but pair+top remains2/3, with severe wall-clock blowups despite Pi timeout messages, so it is not a clean stabilization knob. - v17 proof-plan skeleton hints: seeded solves
4/4, unseeded0/4, direct1/4; harmlessTruenames expose the intended induction/by-cases/simp/add-normalization skeleton. Detailed plan names are stable on repeat (3/3,4/4including original for both packer shapes), while coarse plan names remain stochastic (2/3,3/4including original). This points to detailed skeleton naming as the missing prompt structure. - v18 prompt-only skeleton hints: two clean v16 theorem-shape fixtures remove the v17
Trueplan hypotheses and move the same detailed skeleton names into the seeded playbook only. Seeded solves2/2, direct remains0/2, and seeded focused repeats are stable (3/3for both shapes,4/4including original). Unseeded is no longer zero (1/2attribution; repeats1/3pair+top and1/3named metric), so prompt-only skeleton context acts as a stabilizer rather than an exclusive capability source. - v19 bare skeleton-name ablation: reuses the two clean v18 fixtures and swaps descriptive skeleton guidance for bare semantic labels only. Seeded solves
2/2in the full attribution run, but repeats are unstable (2/3pair+top,1/3named metric;3/4and2/4including original). Unseeded is also stochastic (0/3pair+top,1/3named metric after a2/2full run), and direct remains0/2. Bare semantic labels alone do not reproduce v18 descriptive-context stabilization. - v20 description-only skeleton ablation: reuses the two clean v18 fixtures and keeps compact natural-language skeleton descriptions while removing
plan_...labels. Seeded solves2/2in the full attribution run, unseeded/direct solve0/2, but seeded repeats are only2/3for each fixture (3/4including original) because one repeat returns timeout-empty repairs for both fixtures. Descriptions alone are capable and stronger than bare labels on the named-metric shape, but they do not fully reproduce v18's descriptions+labels stabilization. - v21 neutral-anchor skeleton ablation: reuses the two clean v18 fixtures and keeps compact descriptions with neutral natural-language step anchors instead of code-like
plan_...labels. Seeded solves only1/2in the full attribution run, unseeded also solves1/2, and direct remains0/2. The first seeded repeat drops to0/2with severe wall-clock blowups; a partial second repeat fails pair+top before being curtailed. Neutral anchors do not close the v20-to-v18 stability gap and appear worse than description-only context in these probes. - v22 code-like anchor skeleton ablation: reuses the two clean v18 fixtures and combines compact descriptions with snake_case step anchors that deliberately avoid the earlier detailed-plan prefix. Seeded solves
2/2in attribution while unseeded/direct solve0/2, but seeded repeats are2/3for each fixture (3/4including attribution), matching v20-style partial stability rather than v18's stable3/3. - v23 exact-label-minus-prefix ablation: reuses the two clean v18 fixtures and preserves v18's exact detailed labels with only the prefix removed. The first partial attribution attempt had seeded
1/2; focused seeded retries were2/2,0/2,2/2(2/3per fixture), so removing only the prefix still fails to reproduce v18 stability. - v24 generic plan-prefix ablation: reuses the two clean v18 fixtures and keeps the
plan_prefix while changing to shorter generic labels. Focused seeded repeats were1/2,1/2,2/2(2/3per fixture), so genericplan_labels do not reproduce v18 stability. - v25 step-prefix exact-label ablation: reuses the two clean v18 fixtures, preserves the exact v18 semantic label roots, and replaces the original prefix with
step_. Focused seeded repeats were1/2,0/2,2/2(pair+top2/3, named metric1/3), so alternate-prefixed exact roots do not reproduce v18 stability.
See docs/V6_FRONTIER_REPORT.md, docs/V7_FRONTIER_REPORT.md, docs/V8_DIAGNOSTIC_REPORT.md, docs/V9_COMPOSITION_GRADIENT_REPORT.md, docs/V10_STATS_REIFICATION_REPORT.md, docs/V11_METRIC_COMPOSITION_REPORT.md, docs/V12_SIMULTANEOUS_METRICS_REPORT.md, docs/V13_DECOMPOSITION_ORDER_REPORT.md, docs/V14_METRIC_ORDER_PERMUTATIONS_REPORT.md, docs/V15_PROOF_SHAPE_HINTS_REPORT.md, docs/V16_COMPACT_REASSEMBLY_HINTS_REPORT.md, docs/V17_PROOF_PLAN_HINTS_REPORT.md, docs/V18_PROMPT_ONLY_SKELETON_HINTS_REPORT.md, docs/V19_BARE_SKELETON_NAMES_REPORT.md, docs/V20_DESCRIPTION_ONLY_SKELETON_REPORT.md, docs/V21_NEUTRAL_ANCHOR_SKELETON_REPORT.md, docs/V22_CODE_LIKE_ANCHOR_SKELETON_REPORT.md, docs/V23_EXACT_LABELS_WITHOUT_PLAN_PREFIX_REPORT.md, docs/V24_PLAN_PREFIX_GENERIC_LABELS_REPORT.md, and docs/V25_STEP_PREFIX_EXACT_LABELS_REPORT.md.
Guardrails
The underlying harness requires:
- fixed theorem templates
- proof-body-only candidates
- no
sorry - no
admit - no new axioms
- no
unsafe - no theorem weakening
- no new imports
- Lean verification before success is counted
Mathlib-aware lemma retrieval (opt-in)
By default the harness operates in an Init-only profile (no Mathlib). For
Mathlib-backed targets, harness/mathlib_retrieval.py surfaces ranked candidate
library lemmas (name, signature, one-line doc) into the repair prompt so the
model is shown real declarations instead of guessing names that may have been
renamed or never existed.
Enable it by pointing at a declaration index generated from the pinned Mathlib
revision (a JSON list of {"name", "signature", "doc"} records):
export AUTOCONTEXT_MATHLIB_INDEX=/path/to/mathlib_index.json
# or: python3 harness/prove_with_autocontext.py --mathlib-index /path/to/mathlib_index.json ...
When an index is configured the run switches to the lean-mathlib harness
profile: retrieved candidates are injected into the repair prompt and Mathlib
imports are permitted. When it is unset or missing, behaviour is unchanged
(Init-only, no retrieval, byte-identical repair prompt).
Because the index is a dump of the pinned revision, suggested names resolve in
that revision. Retrieval is advisory only: Lean remains the correctness oracle,
and nothing here asserts a candidate is applicable or a proof correct. Ranking is
keyword-based today; embedding-based semantic search and live #check validation
against the lake environment are documented follow-ups. Tracks issue #1.
Local validation
The local package has been smoke-tested through Pi:
pi -e ./pi-autocontext-lean-verify --no-session --no-builtin-tools --tools autocontext_lean_verify -p 'Use autocontext_lean_verify with action="preflight"'
Observed preflight summary:
Lean version: 4.29.1
Python: 3.14.2
uvx: uv-tool-uvx 0.6.17
Autocontext runtime: autocontext==0.5.1 via uvx autoctx improve (ok)
Fixture count: 130
Fixture groups: smoke=1, broader=7, heldout=7, combined=14, negative_controls=6, challenge_v2_no_helper=3, challenge_v3_generalization=4, challenge_transfer=7, challenge_v4_count=4, challenge_v5_attribution=4, challenge_v5_tree_tally=1, challenge_v6_frontier=4, challenge_v7_frontier=4, challenge_v8_diagnostics=4, challenge_v9_composition_gradient=6, challenge_v10_stats_reification=6, challenge_v11_metric_composition=6, challenge_v12_simultaneous_metrics=7, challenge_v13_decomposition_order=6, challenge_v14_metric_order_permutations=6, challenge_v15_proof_shape_hints=4, challenge_v16_compact_reassembly_hints=4, challenge_v17_proof_plan_hints=4, challenge_v18_prompt_only_skeleton_hints=2, challenge_v19_bare_skeleton_names=2, challenge_v20_description_only_skeleton=2, challenge_v21_neutral_anchor_skeleton=2, challenge_v22_code_like_anchor_skeleton=2, challenge_v23_exact_labels_without_plan_prefix=2, challenge_v24_plan_prefix_generic_labels=2, challenge_v25_step_prefix_exact_labels=2, challenge_extended_transfer=78
A zero-Pi-call run smoke test was completed:
results/20260506T_pi_package_wrapper_smoke_add_zero
1 / 1 proved
Pi calls: 0
Lean attempts: 1
A nontrivial package-driven run was also completed on a distribution negative-control fixture:
results/20260506T_pi_package_wrapper_nontrivial_mul_add
1 / 1 proved
Pi calls: 1
Lean attempts: 6
Pre-repair hint generated/passed/used: 3/0/0
This confirms that Lean-rejected synthetic candidates fall back to primary Pi repair through the package wrapper.
The expanded negative-control group now contains six fixtures. In no-pregeneration probes, all generated synthetic hint candidates were Lean-rejected (45 / 45 failed across three runs). With maxAttempts=3, primary Pi repair recovered 6 / 6 fixtures after those candidate rejections.
Post-registry package-driven validation also passed through the autocontext_lean_verify tool:
results/20260506T_pi_package_registry_setup_smoke
1 / 1 proved, Pi calls: 0, Lean attempts: 1
results/20260506T_pi_package_registry_negative_controls_attempts3
6 / 6 proved, Pi calls: 7, Lean attempts: 33, pre-repair hints: 15/0/15/0
results/20260506T_pi_package_registry_combined_seeded
14 / 14 proved, Pi calls: 14, Lean attempts: 17, pre-repair hints: 2/1/1
The npm-published trusted-publisher release was smoke-tested from the registry with:
pi -e npm:pi-autocontext-lean-verify@0.1.4 --no-context-files --no-builtin-tools --tools autocontext_lean_verify -p 'Call autocontext_lean_verify with action="setup"'
Observed npm-installed setup result:
1 / 1 proved
Pi calls: 0
Lean verifier attempts: 1
Package source: npm:pi-autocontext-lean-verify@0.1.4
npm pack --dry-run --json includes runtime package assets and the bundled harness, while excluding tests and generated result artifacts.
Proof-transfer benchmark evidence
Version 0.1.6 added packaged v2/v3 no-expected-proof challenge fixtures and a reproducible benchmark command. Version 0.1.7 adds v4/v5 challenge fixtures plus attribution benchmarking. The local v3 repeated stability probe produced:
seeded autocontext: 12 / 12 proofs across 3 trials
direct Pi repair-loop: 7 / 12 proofs across 3 trials
hard-plus subset: seeded autocontext 6 / 6 vs direct Pi 1 / 6
The v4/v5 follow-up produced:
v4 seeded autocontext: 12 / 12 across 3 repeats
v4 direct Pi repair-loop: 3 / 12 across 3 repeats
v5 seeded autocontext: 4 / 4 full set; hard tree/tally 3 / 3
v5 unseeded isolated: 3 / 4 full set; hard tree/tally 1 / 3
v5 direct Pi repair-loop: 1 / 4 full set; hard tree/tally 0 / 3
See docs/PROOF_TRANSFER_BENCHMARK.md for fixture groups, commands, and interpretation.
Release
Version 0.1.0 was the initial experimental npm release. Version 0.1.4 was the first successful GitHub OIDC trusted-publisher release with npm provenance attestations. Version 0.1.6 promotes the v2/v3 challenge fixtures and proof-transfer benchmark runner. Version 0.1.7 promotes the v4/v5 fixtures plus attribution benchmarking. Version 0.1.8 fixes default benchmark run roots for npm-loaded extensions to avoid Pi session ENAMETOOLONG failures. Version 0.1.9 promotes the v6/v7 frontier fixtures and reports. Version 0.1.10 promotes the v8-v12 diagnostic/frontier fixtures and reports. Version 0.1.11 promotes v13/v14 order-sensitivity fixtures, timeout/process-group attribution robustness, and budget-vs-timeout frontier evidence. Version 0.1.12 promotes v15 proof-shape hint diagnostics and focused stability evidence for compact helper hypotheses versus large explicit reassembly hints. Version 0.1.13 promotes v16/v17 diagnostics: compact reassembly packers, timeout-240 caveats, and detailed proof-plan skeleton hints. Version 0.1.14 promotes v18/v19/v20 prompt-only ablations over detailed descriptions, bare labels, and description-only skeleton context. Version 0.1.15 promotes v21 neutral-anchor skeleton evidence and hardens subprocess process-group timeout cleanup plus short temp run roots for run/benchmark/attribution actions. Version 0.1.16 promotes v22-v25 prompt-transfer ablations and the second timeout pipe-leak hardening patch. Version 0.1.17 updates the runtime pin to autocontext==0.5.1, inheriting upstream Python/TypeScript Pi CLI timeout cleanup hardening. Unpublished attempts 0.1.1–0.1.3 do not exist in the npm registry.
0.1.11 was published through GitHub trusted publishing from tag v0.1.11 (workflow 25753534079), had two npm provenance attestations, and was registry-smoked with setup 1/1 plus a v14 timeout-240 attribution contrast: seeded 1/1, unseeded 0/1, direct 0/1.
0.1.12 was published through GitHub trusted publishing from tag v0.1.12 (workflow 25761371690), has two npm provenance attestations, and was registry-smoked with setup 1/1 over 120 fixtures plus a focused v15 append/filter-hypothesis attribution contrast: seeded 1/1, unseeded 0/1, direct 0/1.
0.1.13 was published through GitHub trusted publishing from tag v0.1.13 (workflow 25774592204), has two npm provenance attestations, and was registry-smoked with setup 1/1 over 128 fixtures plus a focused v17 detailed proof-plan attribution contrast: seeded 1/1, unseeded 0/1, direct 0/1.
0.1.14 was published through GitHub trusted publishing from tag v0.1.14 (workflow 25813274000), has two npm provenance attestations, and was registry-smoked with setup 1/1 over 130 fixtures plus a focused v20 description-only seeded timeout-240 smoke: seeded 1/1.
0.1.15 was published through GitHub trusted publishing from tag v0.1.15 (workflow 25838541595), has two npm provenance attestations, and includes the process-group timeout/short-run-root hardening required before v22 prompt-transfer stress tests. Registry smoke covered preflight, setup 1/1, default action="run" short temp-root behavior, and packaged process_utils.py process-group cleanup.
0.1.16 was published through GitHub trusted publishing from tag v0.1.16 (workflow 25875599303), has two npm provenance attestations, and includes v22 code-like snake_case anchor fixtures, v23 exact-label-minus-prefix fixtures, v24 generic plan-prefix fixtures, v25 step-prefix exact-label fixtures, their reports, and a second timeout hardening patch that bounds post-kill waits and force-closes leaked subprocess pipes. Registry smoke covered preflight, setup 1/1, default action="run" short temp-root behavior, and packaged process_utils.py leaked-pipe cleanup. The combined v22-v25 evidence indicates v18's original full prefixed label form remains the strongest stability signal.
0.1.17 was published through GitHub trusted publishing from tag v0.1.17 (workflow 25887654100), is npm latest, has two npm provenance attestations, and updates the explicit autocontext runtime dependency from autocontext==0.4.8 to autocontext==0.5.1 so Pi-provider repair runs use the upstream hardened PiCLIRuntime timeout/process cleanup. Registry smoke covered preflight (autocontext==0.5.1, 130 fixtures) plus a bounded Pi/Lean sanity run on mul_add_distrib_right: 1/1 proved with final proof exact Nat.mul_add c a b, 1 Pi call, and 3 Lean verifier attempts.
Future releases use GitHub's npm trusted publisher workflow on version tags:
.github/workflows/publish.yml
on: push tags v*
environment: publish-pi-autocontext-lean-verify
npm publish --provenance --access public
Before cutting the next release:
- keep CI green,
- bump
package.jsonversion, - push tag
v<version>, - approve the
publish-pi-autocontext-lean-verifyenvironment gate, - verify npm publication and provenance attestations, then create/update GitHub release notes.