pi-reason-harness

Recursive self-improving reasoning harness for pi — iterate, verify, improve. Builds task-specific reasoning strategies on top of any LLM.

Packages

Package details

extensionskill

Install pi-reason-harness from npm and Pi will load the resources declared by the package manifest.

npm report

$ pi install npm:pi-reason-harness

Package: pi-reason-harness
Version: 1.0.1
Published: Jun 15, 2026
Downloads: not available
Author: monotykamary
License: MIT
Types: extension, skill
Size: 348.3 KB
Dependencies: 0 dependencies · 3 peers

Pi manifest JSON

{
  "extensions": [
    "./extensions"
  ],
  "skills": [
    "./skills"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

🤯 pi-reason-harness

Recursive self-improving reasoning harness for pi

20-layer meta-system that discovers, adapts, evolves, transfers, and validates strategies autonomously.

Builds task-specific reasoning strategies on top of any LLM by running iterative solve-verify-feedback loops with multi-expert ensembling, voting, and a 20-layer meta-system that discovers, adapts, evolves, transfers, and validates strategies autonomously.

JS-exclusive — LLM calls go through pi's native LLM infrastructure (@earendil-works/pi-ai). Code sandbox uses Node's vm module. Zero Python dependency.

How It Works

The core insight (from first-principles analysis of SOTA reasoning systems): LLMs are knowledge stores that require intelligent probing strategies to extract reliable answers. The harness layer (open-source) iteratively generates, verifies, and refines. The meta-system layer (proprietary, rebuilt here) discovers and evolves the strategies themselves.

The 20-Layer Meta-System

Layer	Name	What it does
0	Problem Critic	Inspects problems, proposes targeted deltas to proven templates (not writing from scratch)
1	Strategy Library	Persistent store of proven strategies with ROI + quality metrics
2	Meta-Rule Engine	Extracts cross-strategy principles that compound over time
3	Model Router	Thompson sampling for intelligent model selection per category
4	Budget Bandit	Early stopping, budget reallocation, re-exploration when stuck
5	Auto-Trigger	Self-improvement runs automatically (on success rate drops, new categories, periodic)
6	Recursive Harness Generation	Generates entire solve approach configurations (the "solver of solvers")
7	Ensemble Diversification	Each expert uses a fundamentally different approach strategy
8	Sub-problem Decomposition	Break hard problems into sub-problems, solve independently, combine
9	Budget Optimization	Marginal ROI estimation, reallocate iterations to high-ROI experts
10	Cross-Domain Transfer	Transfer proven strategies across analogous categories automatically
11	Confidence-Weighted Voting	Weight votes by self-assessed quality, not just output match
12	Progressive Difficulty	Train on easiest examples first, build up to harder ones
13	Auto-Transfer	Automatically transfer strategies when new categories are encountered
14	Per-Problem Prompt Synthesis	Generate + validate specialized prompts for novel problem types
15	Meta-Meta Level	Harness-of-harnesses — generate new approach types from performance data
16	Gradient-Based Budget Optimization	Trajectory-based improvement estimation with finite-difference gradients
17	Recursive Meta-Meta Nesting	Meta-harnesses feed back into solve; recursive evolution of underperformers
18	Multi-Model Decomposition	Route sub-questions to different models in parallel based on strengths
19	Per-Iteration Prompt Adaptation	Evolve the solver prompt mid-solve based on failure patterns
20	ARC-AGI Benchmark Integration	Validate against real ARC-AGI-2 challenges with scoring

Layer 6: Recursive Harness Generation — the "solver of solvers"

The biggest gap with Poetiq: their open-source code shows ONE harness configuration with fixed prompts. Their blog results prove they generate MULTIPLE different configurations per problem type.

A HarnessSpec defines a complete solve approach:

Approach type: code-sandbox, decomposition, chain-of-questions, analogy, counter-factual, exhaustive-search, code-direct
Solver/feedback prompts: Full templates with $$problem$$ placeholders
Config overrides: Temperature, iterations, reasoning level
Decomposition config: Max sub-problems, depth, combine strategy
Validation data: Score on held-out data, production stats

The system generates multiple specs per problem, validates them, and evolves them over time.

Layer 7: Ensemble Diversification

Instead of N experts with the same prompt (just different seeds/models), each expert uses a fundamentally different approach:

Expert	Approach	When to use
1	code-sandbox	Grid/array problems — generate code, execute, verify
2	decomposition	Complex problems — break into sub-problems
3	analogy	Hard problems — solve simpler version first
4	chain-of-questions	Knowledge tasks — hierarchical probing
5	counter-factual	Stubborn problems — generate wrong solutions, invert
6	exhaustive-search	Small search spaces — enumerate, filter

Each approach has its own specialized prompt template.

Layer 8: Sub-problem Decomposition

For hard problems that resist direct solving, the decomposer breaks them into independent sub-problems:

LLM analyzes the problem and proposes 2-4 sub-problems
Each sub-problem is solved independently
Sub-solutions are combined (sequentially, in parallel, or hierarchically)

Example: "Rotate 90° clockwise" → Sub-problem 1: "Transpose the grid" → Sub-problem 2: "Reverse each row"

Layer 9: Budget Optimization via Marginal ROI

Not just "stop when stuck" but "spend where ROI is highest":

Estimate marginal ROI per expert based on recent improvement rate
Reallocate remaining iterations to experts with highest expected improvement
Phase-based execution: run all experts for half-iterations, then reallocate

Layer 10-13: Cross-Domain Transfer + Auto-Transfer

When a strategy works in one domain, the system automatically transfers it to analogous domains:

Category similarity map: grid-transformation ↔ pattern-completion ↔ spatial-reasoning
Transfer adapts domain-specific parts while keeping universal insights
Auto-triggered when a new category is encountered with no existing strategies
Creates both a strategy entry and a harness spec for the new category

Layer 11: Confidence-Weighted Voting

Voting is weighted by self-assessed quality:

Solutions that pass in fewer iterations count more (efficiency bonus)
Solutions with high soft scores count more (partial accuracy bonus)
Failed solutions grouped by output similarity, ranked by total confidence

Layer 12: Progressive Difficulty

Training examples are ordered from easiest to hardest:

Difficulty proxy: grid size + unique value count + input/output asymmetry
The solver sees simpler patterns first, building up to complex ones
Mirrors Poetiq's per-iteration shuffle but with intelligence

Layer 14: Per-Problem Prompt Synthesis

For truly novel problems where no proven strategy exists, the system synthesizes specialized prompts:

Computes a problem fingerprint based on structural features (grid size, unique values, operation type)
If a validated synthesized prompt matches the fingerprint, uses it instead of the generic template
If no match, generates a new specialized prompt via LLM and validates it on training data
Prompts with validation score > 0.5 are persisted for future use
Fingerprint-based matching allows cross-problem generalization

Layer 15: Meta-Meta Level — Harness-of-Harnesses

The biggest architectural gap with Poetiq: their system doesn't just generate strategies, it generates new types of harness approaches. Our meta-meta level:

Analyzes performance data across all harness specs and meta-harnesses
Uses LLM to propose NEW approach types that combine strengths of successful ones
Each meta-harness has a name, description, solver prompt, config overrides, and rationale
Meta-harnesses can evolve (generation counter, parent lineage) like strategies
Example: "Decomposed-Sandbox-Synthesis" — combines decomposition's cognitive offloading with code-sandbox's deterministic verification

Layer 16: Gradient-Based Budget Optimization

Replaces simple proportional reallocation with finite-difference gradient estimation:

Estimates dScore/dIteration (improvement rate) using a 5-point window
Estimates d²Score/dIteration² (acceleration/deceleration)
Predicts expected next score: current + gradient + 0.5 × acceleration
Allocates iterations proportional to (expected improvement × confidence)
Detects when an expert should switch approaches: stuck (gradient ≈ 0) + decelerating (acceleration < 0)

Layer 17: Recursive Meta-Meta Nesting

Meta-harnesses don't just get generated — they feed back into solve:

selectMetaHarnessExpertConfig() assigns the best meta-harness to one expert in the ensemble
Meta-harness performance is tracked (useCount, avgScore, successCount)
Underperforming meta-harnesses (useCount ≥ 2, avgScore < 0.5, generation < 3) are recursively evolved via recursiveMetaEvolve()
This creates a true recursive loop: solve → generate meta-harness → use in solve → evolve if underperforming → repeat

Layer 18: Multi-Model Decomposition

When multiple models are available, the system routes sub-questions to the best-suited model:

decomposeAndRoute(): LLM analyzes the problem and assigns sub-problems to models based on heuristic strengths
Model strength heuristics: Anthropic (complex reasoning, code), OpenAI (math, creative), Google (multimodal), Groq (fast), Wafer (reasoning), DeepSeek (code, math)
Dependency tracking: sub-problems can depend on previous results
solveRoutedDecomposition(): solves each sub-problem with its assigned model, combines results
Triggered automatically in solve when models.length > 1 and useMeta=true

Layer 19: Per-Iteration Prompt Adaptation

The solver prompt adapts mid-solve based on failure patterns:

After 3 consecutive failed iterations (score < 0.5), adaptPromptMidSolve() is called
LLM analyzes the failure trajectory and suggests a prompt modification
Three adaptation types: pre-insert (add before problem), anti-pattern (warn after problem), section-replace (replace a named section)
applyIterationAdaptation() modifies the prompt for subsequent iterations
The adaptation persists within the expert's solve loop

Layer 20: ARC-AGI Benchmark Integration

The system can validate against real ARC-AGI-2 challenges:

loadArcChallenges(): loads challenges from ARC-AGI JSON files
runArcBenchmark(): runs the harness on a batch of challenges with budget limits
Re-verifies test outputs against ground truth (when available)
Computes: solved, partial solved, avg best score, total cost, total time
CLI: pi-reason-harness arc-benchmark --data-path ... --max-challenges 5
Benchmark results with wafer/GLM-5.1: 5/7 unique challenges solved (71%), 1 near-miss (0.97), cost ~$0.04/challenge

Layers 0-5: Core Meta-System

These were implemented in the previous iteration and remain the foundation:

Layer 0: Critique, Don't Create — The critic receives proven templates and proposes targeted deltas (insertions, anti-patterns, examples). This is code review, not writing from zero.
Layer 2: Meta-Rules Compound — When a child strategy outperforms its parent, generalizable principles are extracted and applied to other categories.
Layer 3: Thompson Sampling — Beta(α,β) sampling with Laplace smoothing picks the best model per category.
Layer 4: Budget Bandit — Early stopping, re-exploration when all experts fail.
Layer 5: Auto-Trigger — Runs automatically on success rate drops, new categories, and every 5th problem.

The Harness Layer

Below the meta-system, the harness implements the iterative solve loop with Poetiq-parity features:

Iterative solve-verify-feedback loops — Generate code, sandbox-execute, build detailed feedback
Multi-expert ensembling — Parallel experts with diverse approaches
Confidence-weighted voting — Group by output, rank by confidence
Poetiq-parity feedback — Element-by-element diff grids, shape mismatch detection
Poetiq-parity formatting — <Diagram> text with Fisher-Yates shuffle
Self-audit verification — LLM checks its own answers
Budget tracking — Per-problem cost/time limits

Task Types

Type	Strategy	Verification
`code-reasoning`	Generate JavaScript code → sandbox execute → verify against examples → feedback loop	Sandbox (default) or external
`knowledge-extraction`	Chain-of-questions probing → self-audit → confidence bucketing	Self-audit (recommended)
`hybrid`	Decide per-problem: code or direct answer → verify → feedback	Any method

Approach Types (for ensemble diversification)

Approach	Description	Best for
`code-sandbox`	Generate JS code, execute in sandbox, verify output	Grid/array transformations
`code-direct`	Generate code, extract answer without execution	Computation-heavy
`decomposition`	Break into sub-problems, solve each, combine	Multi-step problems
`chain-of-questions`	Hierarchical probing from broad to specific	Knowledge questions
`analogy`	Solve simpler version first, then scale up	Hard spatial problems
`counter-factual`	Generate wrong solutions, analyze failures, invert	Stubborn problems
`exhaustive-search`	Enumerate possibilities, filter by constraints	Small search spaces

Persistent Data

The meta-system persists across server restarts at ~/.pi-reason-harness/:

File	Contents
`strategies.json`	Strategy library with ROI, quality metrics, lineage
`meta-rules.json`	Cross-strategy principles with validation stats
`model-routes.json`	Per model×category routing stats
`harness-specs.json`	Complete harness specifications per category×approach
`synthesized-prompts.json`	Per-problem-type specialized prompts with validation
`meta-harnesses.json`	Generated approach types with evolution lineage

Quick Start

# Initialize a reasoning session
pi-reason-harness init --name "ARC solver" --type code-reasoning \
  --models '["anthropic/claude-sonnet-4-5","openai/gpt-4o"]' --num-experts 3

# Solve with the full 13-layer meta-system pipeline
pi-reason-harness solve --meta --problem "Transform the grid..." \
  --train-inputs '[[1,2],[3,4]]' \
  --train-outputs '[[4,3],[2,1]]' \
  --test-inputs '[[5,6]]'

# Analyze a problem without solving
pi-reason-harness meta-analyze --problem "Rotate a 2x2 grid 90 degrees clockwise"

# Decompose a hard problem into sub-problems
pi-reason-harness decompose --problem "Rotate a 3x3 grid 90 degrees clockwise. Input: [[1,2,3],[4,5,6],[7,8,9]]"

# Check harness specs
pi-reason-harness harness-specs

# Evolve the worst-performing spec
pi-reason-harness evolve-harness

# Check the strategy library
pi-reason-harness strategies

# Transfer a strategy from grid-transformation to pattern-completion
pi-reason-harness transfer --source-category grid-transformation --target-category pattern-completion

# Check meta-rules
pi-reason-harness meta-rules

# Check model routing stats
pi-reason-harness model-routes

Architecture

┌───────────────────────────────────────────────────────────┐
│  META-SYSTEM V3 (16 layers — the proprietary layer)       │
│                                                           │
│  Layer 0: Problem Critic (critique-don't-create)          │
│  Layer 1: Strategy Library (ROI + quality metrics)        │
│  Layer 2: Meta-Rule Engine (cross-strategy principles)    │
│  Layer 3: Model Router (Thompson sampling)                │
│  Layer 4: Budget Bandit (early stopping + re-explore)     │
│  Layer 5: Auto-Trigger (self-improving loop)              │
│  Layer 6: Recursive Harness Generation (solver-of-solvers)│
│  Layer 7: Ensemble Diversification (different approaches) │
│  Layer 8: Sub-problem Decomposition (break & combine)     │
│  Layer 9: Budget Optimization (marginal ROI realloc)      │
│  Layer 10: Cross-Domain Transfer (analogous categories)   │
│  Layer 11: Confidence-Weighted Voting (quality-ranked)    │
│  Layer 12: Progressive Difficulty (easiest-first)         │
│  Layer 13: Auto-Transfer (new category handling)          │
│  Layer 14: Per-Problem Prompt Synthesis (novel types)     │
│  Layer 15: Meta-Meta Level (harness-of-harnesses)         │
│  Layer 16: Gradient-Based Budget Optimization             │
│  Layer 17: Recursive Meta-Meta Nesting (harness↔solve)    │
│  Layer 18: Multi-Model Decomposition (model routing)      │
│  Layer 19: Per-Iteration Prompt Adaptation (mid-solve)    │
│  Layer 20: ARC-AGI Benchmark Integration (validation)     │
└───────────────────────┬───────────────────────────────────┘
                        │ generates (with deltas + rules + specs)
                        ▼
┌─────────────────────────────────────────────────────────┐
│  HARNESS (iterative solve-verify-feedback)              │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────┐     │
│  │ Expert 1    │  │ Expert 2     │  │ Expert N    │     │
│  │ code-sandbox│  │ decomposition│  │ analogy     │     │
│  │ (pi-ai      │  │ (pi-ai       │  │ (pi-ai      │     │
│  │  LLM call   │  │  LLM call    │  │  LLM call   │     │
│  │  + sandbox  │  │  + sub-      │  │  + analogy  │     │
│  │  + verify   │  │  solve       │  │  + verify   │     │
│  │  + feedback │  │  + combine   │  │  + feedback │     │
│  └────┬────────┘  └─┬────────────┘  └──┬──────────┘     │
│       └─────────────┼──────────────────┘                │
│                     ▼                                   │
│     CONFIDENCE-WEIGHTED VOTING                          │
│     (group by output, rank by confidence)               │
│                     │                                   │
│                     ▼                                   │
│     LEARN + ADAPT + EVOLVE + TRANSFER                   │
│     (update strategies, extract rules, evolve specs,    │
│      transfer to new categories, auto-improve)          │
└─────────────────────────────────────────────────────────┘

CLI Reference

Command	Description
`init`	Initialize session with task config, models, verification
`solve`	Run iterative solve-verify-feedback loop
`status`	Show session state, budget, learned adaptations
`results`	Show iteration results
`learn`	Inspect strategy adaptations
`reset-learn`	Clear learned strategies
`clear`	Clear session
`meta-analyze`	Analyze a problem with the critic (no solving)
`meta-improve`	Manually trigger strategy evolution + rule extraction
`strategies`	List strategy library with ROI + quality metrics
`meta-rules`	List meta-rules with validation stats
`model-routes`	List model routing stats per model×category
`harness-specs`	List harness specifications with validation + production stats
`evolve-harness`	Evolve the worst-performing harness spec
`transfer`	Transfer strategy from one category to another
`decompose`	Decompose a problem into sub-problems
`synth-prompts`	List synthesized prompts with validation stats
`meta-harnesses`	List meta-harnesses (generated approach types)
`generate-meta-harness`	Generate a new approach type from performance data
`arc-benchmark`	Run ARC-AGI benchmark validation against real challenges
`route-decompose`	Decompose a problem across multiple models

init flags

--name, --type, --models, --num-experts, --verification, --verify-command, --max-cost, --max-time

solve flags

--problem, --train-inputs, --train-outputs, --test-inputs, --meta / -m

transfer flags

--source-category, --target-category

decompose flags

--problem

LLM Integration

The harness uses @earendil-works/pi-ai for all LLM calls. Models are specified in provider/model format (e.g., anthropic/claude-sonnet-4-5, openai/gpt-4o). API keys are resolved from the same environment variables pi uses:

ANTHROPIC_API_KEY — Anthropic models
OPENAI_API_KEY — OpenAI models
GEMINI_API_KEY — Google models
GROQ_API_KEY — Groq models
WAFER_API_KEY — Wafer Pass models (GLM-5.1, Qwen3.5-397B-A17B)
etc.

Custom Providers

The harness also supports custom providers (like Wafer Pass) that aren't in pi-ai's built-in model registry. Custom providers use direct OpenAI-compatible API calls. Currently supported:

Provider	Base URL	Models	Notes
`wafer`	`https://pass.wafer.ai/v1`	`GLM-5.1`, `Qwen3.5-397B-A17B`	Reasoning models with `reasoning_content` field

To add a new custom provider, add it to the CUSTOM_PROVIDERS map in server.ts.

No additional setup required — if pi can call the model, so can the harness.

Tests

npm test

104 tests covering: vm sandbox, formatProblem, arrayDiff, buildDetailedFeedback, PromptDelta application, budget bandit, Thompson sampling, meta-rule engine, prompt quality metrics, harness specs, ensemble diversification, budget optimization, cross-domain transfer, confidence-weighted voting, progressive difficulty, decomposition, problem fingerprinting, synthesized prompts, meta-harnesses, gradient estimation, approach switching, recursive meta-meta nesting, multi-model decomposition routing, per-iteration prompt adaptation, ARC-AGI benchmark.

License

MIT