pi-reason-harness

Recursive self-improving reasoning harness for pi — iterate, verify, improve. Builds task-specific reasoning strategies on top of any LLM.

Package details

Install pi-reason-harness from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:pi-reason-harness

Package: pi-reason-harness
Version: 1.0.1
Published: Jun 15, 2026
Downloads: not available
Author: monotykamary
License: MIT
Types: extension, skill
Size: 348.3 KB
Dependencies: 0 dependencies · 3 peers

Pi manifest JSON
{
  "extensions": [
    "./extensions"
  ],
  "skills": [
    "./skills"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

🤯 pi-reason-harness

Recursive self-improving reasoning harness for pi

20-layer meta-system that discovers, adapts, evolves, transfers, and validates strategies autonomously.

Builds task-specific reasoning strategies on top of any LLM by running iterative solve-verify-feedback loops with multi-expert ensembling, voting, and a 20-layer meta-system that discovers, adapts, evolves, transfers, and validates strategies autonomously.

JS-exclusive — LLM calls go through pi's native LLM infrastructure (@earendil-works/pi-ai). Code sandbox uses Node's vm module. Zero Python dependency.
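
As a rough illustration of sandboxed execution with Node's vm module, the sketch below runs a generated solver against one input grid. The helper name `runInSandbox` and the convention that generated code defines a function called `transform` are assumptions made for this example, not the package's actual interface. Note that Node's own documentation states the vm module is not a security mechanism, so this limits accidents rather than hostile code.

```typescript
import * as vm from "node:vm";

// Hypothetical helper: run generated solver code against one input grid.
// Assumes the generated code defines a function named `transform`.
function runInSandbox(code: string, input: number[][], timeoutMs = 1000): unknown {
  // The context exposes only a deep copy of the input (no require, process, or fs).
  const context = vm.createContext({ input: JSON.parse(JSON.stringify(input)) });
  // The timeout aborts synchronous runaway loops in the evaluated script.
  return vm.runInContext(`${code}\ntransform(input);`, context, { timeout: timeoutMs });
}

const candidate = "function transform(grid) { return grid.map(row => row.slice().reverse()); }";
const output = runInSandbox(candidate, [[1, 2], [3, 4]]);
console.log(JSON.stringify(output)); // → [[2,1],[4,3]]
```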

How It Works

The core insight (from first-principles analysis of SOTA reasoning systems): LLMs are knowledge stores that require intelligent probing strategies to extract reliable answers. The harness layer (open-source) iteratively generates, verifies, and refines. The meta-system layer (proprietary, rebuilt here) discovers and evolves the strategies themselves.

The 20-Layer Meta-System

Layer | Name | What it does
0 | Problem Critic | Inspects problems, proposes targeted deltas to proven templates (not writing from scratch)
1 | Strategy Library | Persistent store of proven strategies with ROI + quality metrics
2 | Meta-Rule Engine | Extracts cross-strategy principles that compound over time
3 | Model Router | Thompson sampling for intelligent model selection per category
4 | Budget Bandit | Early stopping, budget reallocation, re-exploration when stuck
5 | Auto-Trigger | Self-improvement runs automatically (on success rate drops, new categories, periodic)
6 | Recursive Harness Generation | Generates entire solve approach configurations (the "solver of solvers")
7 | Ensemble Diversification | Each expert uses a fundamentally different approach strategy
8 | Sub-problem Decomposition | Break hard problems into sub-problems, solve independently, combine
9 | Budget Optimization | Marginal ROI estimation, reallocate iterations to high-ROI experts
10 | Cross-Domain Transfer | Transfer proven strategies across analogous categories automatically
11 | Confidence-Weighted Voting | Weight votes by self-assessed quality, not just output match
12 | Progressive Difficulty | Train on easiest examples first, build up to harder ones
13 | Auto-Transfer | Automatically transfer strategies when new categories are encountered
14 | Per-Problem Prompt Synthesis | Generate + validate specialized prompts for novel problem types
15 | Meta-Meta Level | Harness-of-harnesses: generate new approach types from performance data
16 | Gradient-Based Budget Optimization | Trajectory-based improvement estimation with finite-difference gradients
17 | Recursive Meta-Meta Nesting | Meta-harnesses feed back into solve; recursive evolution of underperformers
18 | Multi-Model Decomposition | Route sub-questions to different models in parallel based on strengths
19 | Per-Iteration Prompt Adaptation | Evolve the solver prompt mid-solve based on failure patterns
20 | ARC-AGI Benchmark Integration | Validate against real ARC-AGI-2 challenges with scoring

Layer 6: Recursive Harness Generation — the "solver of solvers"

The biggest gap with Poetiq: their open-source code shows ONE harness configuration with fixed prompts. Their blog results prove they generate MULTIPLE different configurations per problem type.

A HarnessSpec defines a complete solve approach:

  • Approach type: code-sandbox, decomposition, chain-of-questions, analogy, counter-factual, exhaustive-search, code-direct
  • Solver/feedback prompts: Full templates with $$problem$$ placeholders
  • Config overrides: Temperature, iterations, reasoning level
  • Decomposition config: Max sub-problems, depth, combine strategy
  • Validation data: Score on held-out data, production stats

The system generates multiple specs per problem, validates them, and evolves them over time.
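
To make the bullet list above concrete, here is one possible shape for a spec, written as a sketch. Every field name, the `reasoning` levels, and the `renderPrompt` helper are assumptions for illustration; only the approach type names and the `$$problem$$` placeholder come from the description above.

```typescript
// Hypothetical shape; field names are illustrative, not the package's real types.
type ApproachType =
  | "code-sandbox" | "decomposition" | "chain-of-questions" | "analogy"
  | "counter-factual" | "exhaustive-search" | "code-direct";

interface HarnessSpec {
  approach: ApproachType;
  solverPrompt: string;     // template containing $$problem$$
  feedbackPrompt: string;
  config: { temperature: number; maxIterations: number; reasoning: "low" | "medium" | "high" };
  decomposition?: { maxSubProblems: number; maxDepth: number; combine: "sequential" | "parallel" | "hierarchical" };
  validationScore?: number; // score on held-out data, assumed to lie in 0..1
}

// Fill every $$problem$$ placeholder in a template.
function renderPrompt(template: string, problem: string): string {
  // split/join treats both strings literally, so "$" sequences in the
  // problem text are never interpreted as replacement patterns.
  return template.split("$$problem$$").join(problem);
}

const spec: HarnessSpec = {
  approach: "code-sandbox",
  solverPrompt: "Write a JavaScript function for this task:\n$$problem$$",
  feedbackPrompt: "Your previous attempt failed on:\n$$problem$$",
  config: { temperature: 0.7, maxIterations: 10, reasoning: "medium" },
};
console.log(renderPrompt(spec.solverPrompt, "Rotate the grid 90 degrees clockwise"));
```

Using split/join rather than `String.prototype.replace` is a deliberate choice here: problem statements can contain sequences such as `$&` that `replace` would expand.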

Layer 7: Ensemble Diversification

Instead of N experts with the same prompt (just different seeds/models), each expert uses a fundamentally different approach:

Expert | Approach | When to use
1 | code-sandbox | Grid/array problems: generate code, execute, verify
2 | decomposition | Complex problems: break into sub-problems
3 | analogy | Hard problems: solve simpler version first
4 | chain-of-questions | Knowledge tasks: hierarchical probing
5 | counter-factual | Stubborn problems: generate wrong solutions, invert
6 | exhaustive-search | Small search spaces: enumerate, filter

Each approach has its own specialized prompt template.

Layer 8: Sub-problem Decomposition

For hard problems that resist direct solving, the decomposer breaks them into independent sub-problems:

  1. LLM analyzes the problem and proposes 2-4 sub-problems
  2. Each sub-problem is solved independently
  3. Sub-solutions are combined (sequentially, in parallel, or hierarchically)

Example: "Rotate 90° clockwise" → Sub-problem 1: "Transpose the grid" → Sub-problem 2: "Reverse each row"
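
The rotation example can be checked directly. The sketch below solves each sub-problem as its own function and combines them sequentially; the function names are chosen for this example.

```typescript
type Grid = number[][];

// Sub-problem 1: swap rows and columns.
function transpose(grid: Grid): Grid {
  return grid[0].map((_, col) => grid.map(row => row[col]));
}

// Sub-problem 2: mirror each row left to right.
function reverseRows(grid: Grid): Grid {
  return grid.map(row => [...row].reverse());
}

// Sequential combination of the two sub-solutions.
function rotateClockwise(grid: Grid): Grid {
  return reverseRows(transpose(grid));
}

console.log(JSON.stringify(rotateClockwise([[1, 2, 3], [4, 5, 6], [7, 8, 9]])));
// → [[7,4,1],[8,5,2],[9,6,3]]
```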

Layer 9: Budget Optimization via Marginal ROI

Not just "stop when stuck" but "spend where ROI is highest":

  • Estimate marginal ROI per expert based on recent improvement rate
  • Reallocate remaining iterations to experts with highest expected improvement
  • Phase-based execution: run all experts for half-iterations, then reallocate
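
A minimal sketch of the reallocation step, assuming each expert records its best score after every iteration of the first phase. The ROI definition (average gain per iteration over a short window), the even-split fallback, and all names are assumptions for illustration rather than the harness's actual formula.

```typescript
interface ExpertProgress { id: string; scores: number[]; } // best score after each iteration so far

// Marginal ROI: average score gain per iteration over the last `window` steps.
function marginalRoi(scores: number[], window = 3): number {
  if (scores.length < 2) return 0;
  const start = Math.max(0, scores.length - 1 - window);
  const steps = scores.length - 1 - start;
  return Math.max(0, (scores[scores.length - 1] - scores[start]) / steps);
}

// Split `remaining` iterations proportionally to ROI (largest-remainder rounding).
function reallocate(experts: ExpertProgress[], remaining: number): Map<string, number> {
  const rois = experts.map(e => marginalRoi(e.scores));
  const total = rois.reduce((a, b) => a + b, 0);
  // If nobody is improving, fall back to an even split.
  const shares = rois.map(r => (total > 0 ? r / total : 1 / experts.length) * remaining);
  const alloc = shares.map(s => Math.floor(s));
  let left = remaining - alloc.reduce((a, b) => a + b, 0);
  // Hand leftover iterations to the largest fractional parts first.
  const order = shares.map((s, i) => [s - Math.floor(s), i] as const).sort((a, b) => b[0] - a[0]);
  for (const [, i] of order) {
    if (left-- <= 0) break;
    alloc[i]++;
  }
  return new Map(experts.map((e, i) => [e.id, alloc[i]] as [string, number]));
}

const phaseOne: ExpertProgress[] = [
  { id: "code-sandbox", scores: [0.2, 0.4, 0.6] },  // still improving
  { id: "analogy", scores: [0.5, 0.5, 0.5] },       // flat
  { id: "decomposition", scores: [0.1, 0.2, 0.2] }, // slowing down
];
console.log(reallocate(phaseOne, 10));
```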

Layers 10 and 13: Cross-Domain Transfer + Auto-Transfer

When a strategy works in one domain, the system automatically transfers it to analogous domains:

  • Category similarity map: grid-transformation ↔ pattern-completion ↔ spatial-reasoning
  • Transfer adapts domain-specific parts while keeping universal insights
  • Auto-triggered when a new category is encountered with no existing strategies
  • Creates both a strategy entry and a harness spec for the new category
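
One way to picture the similarity map is as an undirected graph that is searched outward from the new category. In the sketch below, the edge list contains only the chain named above; treating similarity as transitive and using breadth-first search are assumptions, and `findTransferSource` is a hypothetical name.

```typescript
// Undirected similarity edges; only the chain named above is included.
const SIMILAR: [string, string][] = [
  ["grid-transformation", "pattern-completion"],
  ["pattern-completion", "spatial-reasoning"],
];

// Breadth-first search for the closest category that already has strategies.
function findTransferSource(target: string, hasStrategies: (c: string) => boolean): string | undefined {
  const queue = [target];
  const seen = new Set<string>([target]);
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (current !== target && hasStrategies(current)) return current;
    for (const [a, b] of SIMILAR) {
      const next = a === current ? b : b === current ? a : undefined;
      if (next !== undefined && !seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return undefined; // no analogous category has anything to transfer
}

console.log(findTransferSource("spatial-reasoning", c => c === "grid-transformation"));
// → grid-transformation
```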

Layer 11: Confidence-Weighted Voting

Voting is weighted by self-assessed quality:

  • Solutions that pass in fewer iterations count more (efficiency bonus)
  • Solutions with high soft scores count more (partial accuracy bonus)
  • Failed solutions grouped by output similarity, ranked by total confidence
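
The weighting rules above could be combined along the following lines. The exact confidence formula is an assumption, and grouping here uses exact equality of the serialized output, which is a simplification of "output similarity".

```typescript
interface Candidate { output: unknown; passed: boolean; iterations: number; softScore: number; }

// Hypothetical confidence: passing solutions get an efficiency bonus for
// finishing early; every solution gets credit for its soft score.
function confidence(c: Candidate, maxIterations: number): number {
  const efficiency = c.passed ? 1 + (maxIterations - c.iterations) / maxIterations : 0;
  return efficiency + c.softScore;
}

// Group candidates by serialized output and rank groups by total confidence.
function vote(cands: Candidate[], maxIterations: number): { output: unknown; total: number }[] {
  const groups = new Map<string, { output: unknown; total: number }>();
  for (const c of cands) {
    const key = JSON.stringify(c.output);
    const g = groups.get(key) ?? { output: c.output, total: 0 };
    g.total += confidence(c, maxIterations);
    groups.set(key, g);
  }
  return Array.from(groups.values()).sort((a, b) => b.total - a.total);
}

const ranked = vote(
  [
    { output: [[1]], passed: false, iterations: 10, softScore: 0.4 },
    { output: [[1]], passed: false, iterations: 10, softScore: 0.4 },
    { output: [[2]], passed: true, iterations: 2, softScore: 1.0 },
  ],
  10,
);
console.log(JSON.stringify(ranked[0].output)); // → [[2]]
```

In this example two failed experts agree on `[[1]]`, yet the single passing expert still wins because its confidence outweighs their combined soft scores.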

Layer 12: Progressive Difficulty

Training examples are ordered from easiest to hardest:

  • Difficulty proxy: grid size + unique value count + input/output asymmetry
  • The solver sees simpler patterns first, building up to complex ones
  • Plays the same role as Poetiq's per-iteration shuffle, but orders examples by difficulty instead of randomly
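
A sketch of the ordering step, assuming the three signals in the difficulty proxy are simply added without weights. The weighting and the function names are assumptions; the harness may combine the signals differently.

```typescript
type Grid = number[][];
interface Example { input: Grid; output: Grid; }

const cells = (g: Grid): number => g.reduce((n, row) => n + row.length, 0);

const uniqueValues = (g: Grid): number => {
  const seen = new Set<number>();
  for (const row of g) for (const v of row) seen.add(v);
  return seen.size;
};

// Hypothetical proxy: unweighted sum of the three signals named above.
function difficulty(ex: Example): number {
  const size = cells(ex.input) + cells(ex.output);
  const variety = uniqueValues(ex.input) + uniqueValues(ex.output);
  const asymmetry = Math.abs(cells(ex.input) - cells(ex.output));
  return size + variety + asymmetry;
}

// Easiest first; the original array is left untouched.
function orderByDifficulty(examples: Example[]): Example[] {
  return [...examples].sort((a, b) => difficulty(a) - difficulty(b));
}

const small: Example = { input: [[1]], output: [[1]] };
const large: Example = { input: [[1, 2, 3], [4, 5, 6]], output: [[6, 5, 4]] };
console.log(orderByDifficulty([large, small]).map(difficulty));
```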

Layer 14: Per-Problem Prompt Synthesis

For truly novel problems where no proven strategy exists, the system synthesizes specialized prompts:

  • Computes a problem fingerprint based on structural features (grid size, unique values, operation type)
  • If a validated synthesized prompt matches the fingerprint, uses it instead of the generic template
  • If no match, generates a new specialized prompt via LLM and validates it on training data
  • Prompts with validation score > 0.5 are persisted for future use
  • Fingerprint-based matching allows cross-problem generalization
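
The fingerprint idea can be sketched as a short key built from bucketed features. The bucket boundaries, the key format, and the `remember` helper are assumptions for illustration; only the 0.5 persistence threshold comes from the list above.

```typescript
type Grid = number[][];

// Hypothetical fingerprint: bucketed structural features joined into a key.
function fingerprint(input: Grid, output: Grid): string {
  const area = input.length * (input[0]?.length ?? 0);
  const sizeBucket = area <= 9 ? "S" : area <= 100 ? "M" : "L";
  const values = new Set<number>();
  for (const row of input) for (const v of row) values.add(v);
  const sameShape = input.length === output.length && input[0]?.length === output[0]?.length;
  return [sizeBucket, `v${Math.min(values.size, 10)}`, sameShape ? "same-shape" : "reshape"].join("|");
}

// Store of validated prompts keyed by fingerprint.
const promptStore = new Map<string, { prompt: string; score: number }>();

// Persist a synthesized prompt only when its validation score exceeds 0.5.
function remember(fp: string, prompt: string, score: number): boolean {
  if (score <= 0.5) return false;
  promptStore.set(fp, { prompt, score });
  return true;
}

const fp = fingerprint([[1, 2], [3, 4]], [[4, 3], [2, 1]]);
console.log(fp); // → S|v4|same-shape
```

Coarse buckets are what allow two different problems to share a fingerprint and therefore reuse a prompt.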

Layer 15: Meta-Meta Level — Harness-of-Harnesses

The biggest architectural gap with Poetiq: their system doesn't just generate strategies, it generates new types of harness approaches. Our meta-meta level:

  • Analyzes performance data across all harness specs and meta-harnesses
  • Uses LLM to propose NEW approach types that combine strengths of successful ones
  • Each meta-harness has a name, description, solver prompt, config overrides, and rationale
  • Meta-harnesses can evolve (generation counter, parent lineage) like strategies
  • Example: "Decomposed-Sandbox-Synthesis" — combines decomposition's cognitive offloading with code-sandbox's deterministic verification

Layer 16: Gradient-Based Budget Optimization

Replaces simple proportional reallocation with finite-difference gradient estimation:

  • Estimates dScore/dIteration (improvement rate) using a 5-point window
  • Estimates d²Score/dIteration² (acceleration/deceleration)
  • Predicts expected next score: current + gradient + 0.5 × acceleration
  • Allocates iterations proportional to (expected improvement × confidence)
  • Detects when an expert should switch approaches: stuck (gradient ≈ 0) + decelerating (acceleration < 0)
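
The estimates above amount to a second-order Taylor step with a step size of one iteration. The sketch below assumes the gradient is the average first difference over the last five scores and the acceleration is the second difference of the last three; the harness's exact stencils and the `eps` threshold are not specified in the source.

```typescript
// Average first difference over the last `window` points (5 by default).
function gradient(scores: number[], window = 5): number {
  const w = Math.min(window, scores.length);
  if (w < 2) return 0;
  return (scores[scores.length - 1] - scores[scores.length - w]) / (w - 1);
}

// Second difference on the last three points.
function acceleration(scores: number[]): number {
  const n = scores.length;
  if (n < 3) return 0;
  return scores[n - 1] - 2 * scores[n - 2] + scores[n - 3];
}

// Expected next score: current + gradient + 0.5 × acceleration.
function predictNext(scores: number[]): number {
  const current = scores.length > 0 ? scores[scores.length - 1] : 0;
  return current + gradient(scores) + 0.5 * acceleration(scores);
}

// Switch approaches when progress is flat and slowing down.
function shouldSwitch(scores: number[], eps = 0.01): boolean {
  return Math.abs(gradient(scores)) < eps && acceleration(scores) < 0;
}

console.log(predictNext([0.1, 0.2, 0.3, 0.4, 0.5]).toFixed(2)); // → 0.60
console.log(shouldSwitch([0.5, 0.5, 0.5, 0.6, 0.5]));            // → true
```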

Layer 17: Recursive Meta-Meta Nesting

Meta-harnesses don't just get generated — they feed back into solve:

  • selectMetaHarnessExpertConfig() assigns the best meta-harness to one expert in the ensemble
  • Meta-harness performance is tracked (useCount, avgScore, successCount)
  • Underperforming meta-harnesses (useCount ≥ 2, avgScore < 0.5, generation < 3) are recursively evolved via recursiveMetaEvolve()
  • This creates a true recursive loop: solve → generate meta-harness → use in solve → evolve if underperforming → repeat
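
The underperformance rule is a simple predicate over the tracked counters. In this sketch the thresholds come from the list above, while the record shape and the running-average update in `recordUse` are assumptions.

```typescript
interface MetaHarness {
  name: string;
  useCount: number;
  avgScore: number;
  successCount: number;
  generation: number;
}

// Thresholds from the list above: used at least twice, average score
// below 0.5, and fewer than 3 generations of evolution so far.
function needsEvolution(m: MetaHarness): boolean {
  return m.useCount >= 2 && m.avgScore < 0.5 && m.generation < 3;
}

// Running-average update after one use (assumed bookkeeping).
function recordUse(m: MetaHarness, score: number, solved: boolean): MetaHarness {
  const useCount = m.useCount + 1;
  return {
    ...m,
    useCount,
    avgScore: m.avgScore + (score - m.avgScore) / useCount,
    successCount: m.successCount + (solved ? 1 : 0),
  };
}

let harness: MetaHarness = { name: "Decomposed-Sandbox-Synthesis", useCount: 0, avgScore: 0, successCount: 0, generation: 0 };
harness = recordUse(harness, 0.2, false);
harness = recordUse(harness, 0.4, false);
console.log(needsEvolution(harness)); // → true
```

The generation cap keeps the loop from evolving the same lineage indefinitely.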

Layer 18: Multi-Model Decomposition

When multiple models are available, the system routes sub-questions to the best-suited model:

  • decomposeAndRoute(): LLM analyzes the problem and assigns sub-problems to models based on heuristic strengths
  • Model strength heuristics: Anthropic (complex reasoning, code), OpenAI (math, creative), Google (multimodal), Groq (fast), Wafer (reasoning), DeepSeek (code, math)
  • Dependency tracking: sub-problems can depend on previous results
  • solveRoutedDecomposition(): solves each sub-problem with its assigned model, combines results
  • Triggered automatically in solve when models.length > 1 and useMeta=true

Layer 19: Per-Iteration Prompt Adaptation

The solver prompt adapts mid-solve based on failure patterns:

  • After 3 consecutive failed iterations (score < 0.5), adaptPromptMidSolve() is called
  • LLM analyzes the failure trajectory and suggests a prompt modification
  • Three adaptation types: pre-insert (add before problem), anti-pattern (warn after problem), section-replace (replace a named section)
  • applyIterationAdaptation() modifies the prompt for subsequent iterations
  • The adaptation persists within the expert's solve loop
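
A sketch of the trigger and the three adaptation types. The prompt layout (sections introduced by `## <name>` lines, a `$$problem$$` placeholder) and the function names `shouldAdapt` and `applyAdaptation` are assumptions for this example; they are not the signatures of adaptPromptMidSolve() or applyIterationAdaptation().

```typescript
type Adaptation =
  | { kind: "pre-insert"; text: string }
  | { kind: "anti-pattern"; text: string }
  | { kind: "section-replace"; section: string; text: string };

// True when the last `n` scores are all below the threshold.
function shouldAdapt(scores: number[], n = 3, threshold = 0.5): boolean {
  return scores.length >= n && scores.slice(-n).every(s => s < threshold);
}

// Hypothetical prompt layout: "## <name>" section headers and a $$problem$$ placeholder.
function applyAdaptation(prompt: string, a: Adaptation): string {
  const placeholder = "$$problem$$";
  switch (a.kind) {
    case "pre-insert": // guidance goes before the problem
      return prompt.split(placeholder).join(a.text + "\n" + placeholder);
    case "anti-pattern": // warning goes after the problem
      return prompt.split(placeholder).join(placeholder + "\nAvoid: " + a.text);
    case "section-replace": {
      const lines = prompt.split("\n");
      const start = lines.indexOf("## " + a.section);
      if (start === -1) return prompt; // unknown section: leave the prompt unchanged
      let end = lines.findIndex((l, i) => i > start && l.startsWith("## "));
      if (end === -1) end = lines.length;
      return [...lines.slice(0, start + 1), a.text, ...lines.slice(end)].join("\n");
    }
  }
}

const basePrompt = "## Rules\nold rule\n## Task\n$$problem$$";
if (shouldAdapt([0.6, 0.2, 0.3, 0.1])) {
  console.log(applyAdaptation(basePrompt, { kind: "section-replace", section: "Rules", text: "new rule" }));
}
```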

Layer 20: ARC-AGI Benchmark Integration

The system can validate against real ARC-AGI-2 challenges:

  • loadArcChallenges(): loads challenges from ARC-AGI JSON files
  • runArcBenchmark(): runs the harness on a batch of challenges with budget limits
  • Re-verifies test outputs against ground truth (when available)
  • Computes: solved, partial solved, avg best score, total cost, total time
  • CLI: pi-reason-harness arc-benchmark --data-path ... --max-challenges 5
  • Benchmark results with wafer/GLM-5.1: 5/7 unique challenges solved (71%), 1 near-miss (0.97), cost ~$0.04/challenge
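
For orientation, an ARC task file is a JSON object with `train` and `test` arrays of input/output grid pairs. The sketch below shows one plausible way to score a prediction and summarize a batch; `softScore` and `summarize` are hypothetical helpers, and the cell-accuracy metric is an assumption rather than the scoring used by runArcBenchmark().

```typescript
type Grid = number[][];
interface ArcPair { input: Grid; output?: Grid; } // test outputs may be withheld
interface ArcTask { train: ArcPair[]; test: ArcPair[]; }

// Fraction of matching cells; 0 when the shapes differ.
function softScore(expected: Grid, actual: Grid): number {
  if (expected.length !== actual.length) return 0;
  let match = 0;
  let total = 0;
  for (let r = 0; r < expected.length; r++) {
    if (expected[r].length !== actual[r].length) return 0;
    for (let c = 0; c < expected[r].length; c++) {
      total++;
      if (expected[r][c] === actual[r][c]) match++;
    }
  }
  return total === 0 ? 0 : match / total;
}

// Batch summary: exact matches count as solved, anything in between as partial.
function summarize(bestScores: number[]): { solved: number; partial: number; avgBest: number } {
  const solved = bestScores.filter(s => s === 1).length;
  const partial = bestScores.filter(s => s > 0 && s < 1).length;
  const avgBest = bestScores.length > 0 ? bestScores.reduce((a, b) => a + b, 0) / bestScores.length : 0;
  return { solved, partial, avgBest };
}

const task: ArcTask = JSON.parse('{"train":[{"input":[[1,2]],"output":[[2,1]]}],"test":[{"input":[[3,4]],"output":[[4,3]]}]}');
console.log(softScore(task.test[0].output!, [[4, 3]])); // → 1
```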

Layers 0-5: Core Meta-System

These were implemented in the previous iteration and remain the foundation:

  • Layer 0: Critique, Don't Create — The critic receives proven templates and proposes targeted deltas (insertions, anti-patterns, examples). This is code review, not writing from zero.
  • Layer 2: Meta-Rules Compound — When a child strategy outperforms its parent, generalizable principles are extracted and applied to other categories.
  • Layer 3: Thompson Sampling — Beta(α,β) sampling with Laplace smoothing picks the best model per category.
  • Layer 4: Budget Bandit — Early stopping, re-exploration when all experts fail.
  • Layer 5: Auto-Trigger — Runs automatically on success rate drops, new categories, and every 5th problem.

The Harness Layer

Below the meta-system, the harness implements the iterative solve loop with Poetiq-parity features:

  1. Iterative solve-verify-feedback loops — Generate code, sandbox-execute, build detailed feedback
  2. Multi-expert ensembling — Parallel experts with diverse approaches
  3. Confidence-weighted voting — Group by output, rank by confidence
  4. Poetiq-parity feedback — Element-by-element diff grids, shape mismatch detection
  5. Poetiq-parity formatting: <Diagram> text with Fisher-Yates shuffle
  6. Self-audit verification — LLM checks its own answers
  7. Budget tracking — Per-problem cost/time limits
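
Two of the items above are easy to show in isolation: the Fisher-Yates shuffle and an element-by-element diff with shape mismatch detection. Both functions are sketches with made-up names and an assumed feedback format (`actual/expected` for differing cells); they are not the package's arrayDiff or buildDetailedFeedback, and grids are assumed to be rectangular.

```typescript
type Grid = number[][];

// Fisher-Yates shuffle applied to a copy, so the input is not mutated.
// Every permutation is equally likely when `rng` is uniform on [0, 1).
function shuffle<T>(items: T[], rng: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Element-by-element feedback: report a shape mismatch, else mark differing
// cells as "actual/expected" and leave matching cells as plain values.
function diffGrid(expected: Grid, actual: Grid): string {
  const shape = (g: Grid): string => `${g.length}x${g[0]?.length ?? 0}`;
  if (shape(expected) !== shape(actual)) {
    return `Shape mismatch: expected ${shape(expected)}, got ${shape(actual)}`;
  }
  return expected
    .map((row, r) => row.map((v, c) => (actual[r][c] === v ? `${v}` : `${actual[r][c]}/${v}`)).join(" "))
    .join("\n");
}

console.log(diffGrid([[1, 2], [3, 4]], [[1, 2], [3, 5]]));
// → 1 2
//   3 5/4
```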

Task Types

Type | Strategy | Verification
code-reasoning | Generate JavaScript code → sandbox execute → verify against examples → feedback loop | Sandbox (default) or external
knowledge-extraction | Chain-of-questions probing → self-audit → confidence bucketing | Self-audit (recommended)
hybrid | Decide per-problem: code or direct answer → verify → feedback | Any method

Approach Types (for ensemble diversification)

Approach | Description | Best for
code-sandbox | Generate JS code, execute in sandbox, verify output | Grid/array transformations
code-direct | Generate code, extract answer without execution | Computation-heavy
decomposition | Break into sub-problems, solve each, combine | Multi-step problems
chain-of-questions | Hierarchical probing from broad to specific | Knowledge questions
analogy | Solve simpler version first, then scale up | Hard spatial problems
counter-factual | Generate wrong solutions, analyze failures, invert | Stubborn problems
exhaustive-search | Enumerate possibilities, filter by constraints | Small search spaces

Persistent Data

The meta-system persists across server restarts at ~/.pi-reason-harness/:

File | Contents
strategies.json | Strategy library with ROI, quality metrics, lineage
meta-rules.json | Cross-strategy principles with validation stats
model-routes.json | Per model×category routing stats
harness-specs.json | Complete harness specifications per category×approach
synthesized-prompts.json | Per-problem-type specialized prompts with validation
meta-harnesses.json | Generated approach types with evolution lineage

Quick Start

# Initialize a reasoning session
pi-reason-harness init --name "ARC solver" --type code-reasoning \
  --models '["anthropic/claude-sonnet-4-5","openai/gpt-4o"]' --num-experts 3

# Solve with the full meta-system pipeline
pi-reason-harness solve --meta --problem "Transform the grid..." \
  --train-inputs '[[1,2],[3,4]]' \
  --train-outputs '[[4,3],[2,1]]' \
  --test-inputs '[[5,6]]'

# Analyze a problem without solving
pi-reason-harness meta-analyze --problem "Rotate a 2x2 grid 90 degrees clockwise"

# Decompose a hard problem into sub-problems
pi-reason-harness decompose --problem "Rotate a 3x3 grid 90 degrees clockwise. Input: [[1,2,3],[4,5,6],[7,8,9]]"

# Check harness specs
pi-reason-harness harness-specs

# Evolve the worst-performing spec
pi-reason-harness evolve-harness

# Check the strategy library
pi-reason-harness strategies

# Transfer a strategy from grid-transformation to pattern-completion
pi-reason-harness transfer --source-category grid-transformation --target-category pattern-completion

# Check meta-rules
pi-reason-harness meta-rules

# Check model routing stats
pi-reason-harness model-routes

Architecture

┌───────────────────────────────────────────────────────────┐
│  META-SYSTEM V3 (layers 0-20, the proprietary layer)      │
│                                                           │
│  Layer 0: Problem Critic (critique-don't-create)          │
│  Layer 1: Strategy Library (ROI + quality metrics)        │
│  Layer 2: Meta-Rule Engine (cross-strategy principles)    │
│  Layer 3: Model Router (Thompson sampling)                │
│  Layer 4: Budget Bandit (early stopping + re-explore)     │
│  Layer 5: Auto-Trigger (self-improving loop)              │
│  Layer 6: Recursive Harness Generation (solver-of-solvers)│
│  Layer 7: Ensemble Diversification (different approaches) │
│  Layer 8: Sub-problem Decomposition (break & combine)     │
│  Layer 9: Budget Optimization (marginal ROI realloc)      │
│  Layer 10: Cross-Domain Transfer (analogous categories)   │
│  Layer 11: Confidence-Weighted Voting (quality-ranked)    │
│  Layer 12: Progressive Difficulty (easiest-first)         │
│  Layer 13: Auto-Transfer (new category handling)          │
│  Layer 14: Per-Problem Prompt Synthesis (novel types)     │
│  Layer 15: Meta-Meta Level (harness-of-harnesses)         │
│  Layer 16: Gradient-Based Budget Optimization             │
│  Layer 17: Recursive Meta-Meta Nesting (harness↔solve)    │
│  Layer 18: Multi-Model Decomposition (model routing)      │
│  Layer 19: Per-Iteration Prompt Adaptation (mid-solve)    │
│  Layer 20: ARC-AGI Benchmark Integration (validation)     │
└───────────────────────┬───────────────────────────────────┘
                        │ generates (with deltas + rules + specs)
                        ▼
┌─────────────────────────────────────────────────────────┐
│  HARNESS (iterative solve-verify-feedback)              │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────┐     │
│  │ Expert 1    │  │ Expert 2     │  │ Expert N    │     │
│  │ code-sandbox│  │ decomposition│  │ analogy     │     │
│  │ (pi-ai      │  │ (pi-ai       │  │ (pi-ai      │     │
│  │  LLM call   │  │  LLM call    │  │  LLM call   │     │
│  │  + sandbox  │  │  + sub-      │  │  + analogy  │     │
│  │  + verify   │  │  solve       │  │  + verify   │     │
│  │  + feedback │  │  + combine   │  │  + feedback │     │
│  └────┬────────┘  └─┬────────────┘  └──┬──────────┘     │
│       └─────────────┼──────────────────┘                │
│                     ▼                                   │
│     CONFIDENCE-WEIGHTED VOTING                          │
│     (group by output, rank by confidence)               │
│                     │                                   │
│                     ▼                                   │
│     LEARN + ADAPT + EVOLVE + TRANSFER                   │
│     (update strategies, extract rules, evolve specs,    │
│      transfer to new categories, auto-improve)          │
└─────────────────────────────────────────────────────────┘

CLI Reference

Command | Description
init | Initialize session with task config, models, verification
solve | Run iterative solve-verify-feedback loop
status | Show session state, budget, learned adaptations
results | Show iteration results
learn | Inspect strategy adaptations
reset-learn | Clear learned strategies
clear | Clear session
meta-analyze | Analyze a problem with the critic (no solving)
meta-improve | Manually trigger strategy evolution + rule extraction
strategies | List strategy library with ROI + quality metrics
meta-rules | List meta-rules with validation stats
model-routes | List model routing stats per model×category
harness-specs | List harness specifications with validation + production stats
evolve-harness | Evolve the worst-performing harness spec
transfer | Transfer strategy from one category to another
decompose | Decompose a problem into sub-problems
synth-prompts | List synthesized prompts with validation stats
meta-harnesses | List meta-harnesses (generated approach types)
generate-meta-harness | Generate a new approach type from performance data
arc-benchmark | Run ARC-AGI benchmark validation against real challenges
route-decompose | Decompose a problem across multiple models

init flags

--name, --type, --models, --num-experts, --verification, --verify-command, --max-cost, --max-time

solve flags

--problem, --train-inputs, --train-outputs, --test-inputs, --meta / -m

transfer flags

--source-category, --target-category

decompose flags

--problem

LLM Integration

The harness uses @earendil-works/pi-ai for all LLM calls. Models are specified in provider/model format (e.g., anthropic/claude-sonnet-4-5, openai/gpt-4o). API keys are resolved from the same environment variables pi uses:

  • ANTHROPIC_API_KEY — Anthropic models
  • OPENAI_API_KEY — OpenAI models
  • GEMINI_API_KEY — Google models
  • GROQ_API_KEY — Groq models
  • WAFER_API_KEY — Wafer Pass models (GLM-5.1, Qwen3.5-397B-A17B)
  • etc.
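
As a small illustration of the provider/model format, the sketch below splits a model id and looks up the matching variable name. The lookup table repeats only the variables listed above; the provider keys `google` and `groq` are assumptions about how those providers are spelled, and neither function is part of pi-ai.

```typescript
// Env var names from the list above. The "google" and "groq" keys are
// assumed spellings; other providers are deliberately left out.
const API_KEY_ENV: Record<string, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  google: "GEMINI_API_KEY",
  groq: "GROQ_API_KEY",
  wafer: "WAFER_API_KEY",
};

// Split "provider/model" at the first slash; model names may contain more slashes.
function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected provider/model, got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}

// Name of the variable that would need to be set, or undefined if unknown here.
function requiredEnvVar(id: string): string | undefined {
  return API_KEY_ENV[parseModelId(id).provider];
}

console.log(requiredEnvVar("anthropic/claude-sonnet-4-5")); // → ANTHROPIC_API_KEY
```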

Custom Providers

The harness also supports custom providers (like Wafer Pass) that aren't in pi-ai's built-in model registry. Custom providers use direct OpenAI-compatible API calls. Currently supported:

Provider | Base URL | Models | Notes
wafer | https://pass.wafer.ai/v1 | GLM-5.1, Qwen3.5-397B-A17B | Reasoning models with reasoning_content field

To add a new custom provider, add it to the CUSTOM_PROVIDERS map in server.ts.

No additional setup required — if pi can call the model, so can the harness.

Tests

npm test

104 tests covering: vm sandbox, formatProblem, arrayDiff, buildDetailedFeedback, PromptDelta application, budget bandit, Thompson sampling, meta-rule engine, prompt quality metrics, harness specs, ensemble diversification, budget optimization, cross-domain transfer, confidence-weighted voting, progressive difficulty, decomposition, problem fingerprinting, synthesized prompts, meta-harnesses, gradient estimation, approach switching, recursive meta-meta nesting, multi-model decomposition routing, per-iteration prompt adaptation, ARC-AGI benchmark.

License

MIT