pi-autoresearch-vkf
Autoresearch with verifiable long-term scientific memory. A pi extension that gathers literature, stores it as VKF claims, runs experiments, and writes verified results back to a git-native knowledge bundle so future runs build on what was learned instead
Package details
Install pi-autoresearch-vkf from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-autoresearch-vkf- Package
pi-autoresearch-vkf- Version
0.6.0- Published
- Jun 28, 2026
- Downloads
- not available
- Author
- ericjahns
- License
- MIT
- Types
- extension, skill
- Size
- 187.4 KB
- Dependencies
- 0 dependencies · 4 peers
Pi manifest JSON
{
"extensions": [
"./extensions"
],
"skills": [
"./skills"
]
}Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-autoresearch-vkf
Autoresearch that remembers — and can prove what it learned.
A pi extension that turns a blind optimization loop into a self-improving researcher with verifiable long-term memory. It gathers frontier literature, distills it into structured claims, verifies them, runs experiments, and writes the results back to a git-native knowledge bundle — so the next run builds on what was learned instead of rediscovering the obvious.
The memory layer is VKF
(Verifiable Knowledge Format): markdown + YAML knowledge objects with provenance,
evidence, confidence, and a trust lifecycle, gated by the real vkf CLI.
Why
A plain autoresearch loop tries an idea, measures it, keeps wins, reverts regressions — and forgets everything. It can't say where a good idea came from, what it already tried, or whether a win was real. This extension adds the missing layer:
RAG agent: retrieve papers → try idea → forget context
pi-autoresearch-vkf:
retrieve → extract claims → verify → store
→ hypothesize → test → update belief
→ avoid repeated failures → improve future search
The novelty isn't "autoresearch + RAG." It's that the agent's scientific memory is verifiable, lifecycle-managed, and auditable.
Install
pi install npm:pi-autoresearch-vkf
# or, from a local checkout:
pi install file:/path/to/pi-autoresearch-vkf
Requirements
| Dependency | For | Required? |
|---|---|---|
vkf CLI |
Trust gating — validation, graph, freshness, permission checks | Recommended (memory still works without it; validation is skipped) |
Web tools (WebSearch / WebFetch) |
Ingesting new knowledge from the literature | Recommended — the ingestion path |
vkfCLI — the extension finds it automatically inside a conda env namedVKF, or set$PI_AUTORESEARCH_VKFto thevkfexecutable.
Knowledge sources (how ingestion works)
The extension stores and reasons over knowledge; it does not fetch papers
itself. Gathering is done by the host agent through the autoresearch-vkf-knowledge-gather skill,
using the agent's built-in WebSearch + WebFetch against free, openly
accessible databases — no API keys, no paid services, no MCP setup:
- arXiv (
arxiv.org,export.arxiv.org/api) - Semantic Scholar (
api.semanticscholar.orgGraph API) - OpenAlex (
api.openalex.org) - Crossref (
api.crossref.org) - GitHub / docs / benchmark reports / blogs for implementation hints
The agent reads sources and calls remember_claim to persist each finding as a
VKF card. If the host has no web tools, you can still ingest by pasting papers /
PDFs / findings for the agent to extract, or by seeding claims from the agent's
own knowledge (marked low-reliability until verified).
Usage
In a project you want to optimize:
optimize the test suite runtime, using the research literature and remembering what works
The autoresearch-vkf skill drives it: confirm goal/metric/command → init →
gather literature → extract & verify claims → loop (recall → experiment →
write-back) → report. All state lives in one self-contained .autoresearch-vkf/
folder at the project root, so work survives restarts and context resets.
How it works
goal ─► recall_memory ─► gather literature ─► remember_claim (candidates)
│ │
│ verify_claim ──► trusted claims
▼ │
autoresearch-vkf-hypothesis-loop: recall ─► pick idea ─► vkf_run_experiment ─► vkf_log_experiment
│ │
│ writes experiment card back to memory,
│ updates the claim's belief & lifecycle
▼
autoresearch-vkf-research-report (paper → claim → hypothesis → patch → metric Δ → memory update)
One self-contained workspace
Everything the package owns lives under a single namespaced .autoresearch-vkf/
directory, so it never collides with other tools and is obvious at a glance:
| Layer | Folder | Lifetime |
|---|---|---|
| Session | .autoresearch-vkf/session/ |
this run — goal, experiment log, measure script, dashboards (safe to gitignore) |
| Project memory | .autoresearch-vkf/memory/ |
persists across runs — the VKF bundle (meant to be committed) |
| Global memory | ~/.autoresearch-vkf/memory/ |
persists across projects — trusted knowledge promoted from any repo |
The memory lifecycle
Every card carries a trust state. Agents propose; promotion is explicit and
audited (a VKF transaction is written for each change). The vision's states map
directly onto VKF status + a lifecycle directory:
| Memory state | VKF status | Directory |
|---|---|---|
candidate |
draft |
staging/ |
source_verified |
active |
verified/ |
locally_tested / replicated |
verified |
verified/ |
contradicted |
disputed |
deprecated/ |
deprecated |
deprecated |
deprecated/ |
retired |
retracted |
deprecated/ |
Only source_verified+ drives serious hypotheses; only locally_tested+ strongly
steers experiments. This — plus the staging area and the citation-checking
verifier — is the defense against memory poisoning.
Tools
| Tool | What it does |
|---|---|
init_research |
Scaffold the .autoresearch-vkf/ workspace (session + memory VKF bundle). |
remember_claim |
Stage a literature-derived candidate claim (+ its source paper). |
verify_claim |
Advance/downgrade a card's trust lifecycle (audited). |
recall_memory |
Query memory (project / global / both): trusted claims, candidates, prior experiments, negatives, conflicts. |
score_ideas |
Rank untested ideas by EV × feasibility × evidence × novelty × info_gain ÷ cost. |
find_contradictions |
Mine memory for tensions between claims — each a seed for a novel hypothesis. |
find_transfers |
Cross-domain mechanism search: same how, different where. |
vkf_run_experiment |
Run the measurement command; capture METRIC name=value. |
vkf_log_experiment |
Record a result, write it back to memory, update belief & lifecycle. |
promote_to_global |
Copy a trusted card into the cross-project global memory. |
export_dashboard |
Write browser dashboards: a live progress page + the vkf html idea-lineage graph. |
research_status |
Show session experiments + memory lifecycle. |
Skills
| Skill | Role |
|---|---|
autoresearch-vkf |
Orchestrator / spine — the entry point. |
autoresearch-vkf-knowledge-gather |
Find candidate techniques via WebSearch/WebFetch (arXiv / Semantic Scholar / OpenAlex / GitHub). |
autoresearch-vkf-claim-extract |
Distill sources into reusable claim cards. |
autoresearch-vkf-claim-verify |
Check citations & codebase fit — the trust layer. |
autoresearch-vkf-contradiction-miner |
Turn tensions in memory into novel hypotheses. |
autoresearch-vkf-cross-domain-transfer |
Import a mechanism from another field. |
autoresearch-vkf-idea-tournament |
Multi-perspective debate to pick the 2–3 ideas worth testing. |
autoresearch-vkf-hypothesis-loop |
Pick the next idea and run the smallest falsifying experiment. |
autoresearch-vkf-research-report |
The auditable lineage report. |
The .autoresearch-vkf/ workspace
.autoresearch-vkf/
session/ # ephemeral per-run state (config, experiment log, dashboards)
memory/ # the durable VKF knowledge bundle:
vkf.bundle.yaml # profile 1 (governed); 2 (verified) once evidence lands
staging/ # candidates (status: draft)
verified/ # source-/locally-verified, replicated
deprecated/ # contradicted / retired
transactions/ # one record per promote/demote/write-back
The memory/ bundle is just markdown — human-readable, version-controllable, and
auditable. Run vkf validate .autoresearch-vkf/memory, vkf graph,
vkf freshness, or vkf html over it any time.
Benchmark
Does verifiable memory + novelty scoring + synthesis actually search better than a
blind loop? npm run bench runs both policies over deterministic, ground-truth
idea-environments — driving ours through the real scoring.ts and synthesis.ts
— and reports the difference. See benchmark/README.md for
exactly what is and isn't simulated.
Mean over 500 seeds per scenario. "Standard" = blind loop (EV-greedy, no durable memory, no synthesis). "Ours" = VKF memory + novelty scoring + contradiction synthesis, driven through the real scoring/synthesis modules.
Tiny-LM validation loss (budget 10)
| Metric | Standard | Ours |
|---|---|---|
| Best improvement (higher better) | 0.035 | 0.130 |
| Unique mechanisms tried | 7.8 | 10.0 |
| Wasted (repeat) experiments | 2.2 | 0.0 |
| Dead-ends retried | 1.4 | 1.0 |
| Synthesized ideas discovered | 0.0 | 1.0 |
| Found optimum (rate) | 0% | 100% |
Inference latency (budget 8)
| Metric | Standard | Ours |
|---|---|---|
| Best improvement (higher better) | 0.043 | 0.150 |
| Unique mechanisms tried | 6.3 | 8.0 |
| Wasted (repeat) experiments | 1.7 | 0.0 |
| Dead-ends retried | 1.7 | 1.0 |
| Synthesized ideas discovered | 0.0 | 1.0 |
| Found optimum (rate) | 0% | 100% |
The global optimum in each scenario is a synthesized idea a blind loop can't construct, so it reaches it 0% of the time; ours gets both parents tried (memory + novelty), then synthesis unlocks the combo.
Watching progress
Three live views, in increasing detail:
Widget (always on, above the editor) — win/loss counts, best metric, memory state tally; refreshes after every tool call.
Fullscreen overlay — press Ctrl+G (or call
research_status) for the full experiment list, memory lifecycle, and verified claims.Browser dashboards —
export_dashboardwrites two self-contained pages to.autoresearch-vkf/session/:progress.html— metric-over-time chart, experiment timeline, and memory lifecycle; auto-refreshes so an open tab tracks the run live.dashboard.html— the interactive idea-lineage graph (paper → claim → experiment, with conflict/derived-from edges), generated byvkf html.
open .autoresearch-vkf/session/progress.html # watch progress as it goes open .autoresearch-vkf/session/dashboard.html # explore the knowledge lineage
Configuration
PI_AUTORESEARCH_VKF— path to thevkfexecutable (overrides auto-detection).PI_AUTORESEARCH_VKF_CONDA_ENV— conda env to findvkfin (defaultVKF).PI_AUTORESEARCH_GLOBAL_ROOT— root for the global cross-project memory (default~, i.e. the bundle lives at~/.autoresearch-vkf/memory/).PI_AUTORESEARCH_SHORTCUT— key for the fullscreen dashboard (defaultctrl+g; set tononeto disable).
Development
npm install
npm run typecheck # tsc --noEmit
npm test # node --experimental-strip-types --test tests/*.test.mjs
npm run bench # standard autoresearch vs ours
npm test requires a Node 22+ build with TypeScript stripping support (the same
requirement pi has for loading .ts extensions). On a Node built without it, run
the tests through a loader instead, e.g. node --import tsx --test tests/*.test.mjs.
Publishing
The package ships its .ts extensions and .md skills as-is (pi loads them
directly — no build step). The files whitelist publishes only extensions/,
skills/, and the docs; prepublishOnly runs typecheck as a gate.
Two ways to release:
- Tagged CI release (recommended). Add an npm Automation token as the repo
secret
NPM_TOKEN, then bump the version and push a matching tag — thepublish.ymlworkflow publishes with provenance:npm version patch # or minor/major — updates package.json + makes a tag git push --follow-tags - Manual.
npm login, then:npm publish --access public # prepublishOnly runs typecheck first
Verify what will ship first with npm pack --dry-run.
Roadmap
All four planned phases are in: the lean MVP (Phase 1), the novelty scorer
(Phase 2), the hypothesis-synthesis layer (Phase 3 — find_contradictions,
find_transfers, autoresearch-vkf-idea-tournament), and global cross-project memory + the
benchmark (Phase 4).
Possible next steps:
- End-to-end live benchmark — a real LLM agent on real repos with human novelty ratings (the controlled harness here isolates the search policy).
- Bundle profile 2 — attach reproduction
verificationblocks to experiment cards so memory validates at the strictverifiedprofile.
(Knowledge ingestion via WebSearch/WebFetch against free databases (arXiv,
Semantic Scholar, OpenAlex, Crossref) is built in — see
Knowledge sources.)
License
MIT