pi-cache-optimizer

Improve Pi prompt/KV cache hit rates with stable prompts, OpenAI-compatible cache keys, proxy compat warnings, and footer cache stats.

Packages

Package details

extension

Install pi-cache-optimizer from npm and Pi will load the resources declared by the package manifest.

npm repo home report

$ pi install npm:pi-cache-optimizer

Package: pi-cache-optimizer
Version: 2.6.25
Published: Jul 27, 2026
Downloads: 4,672/mo · 1,259/wk
Author: freescheme
License: MIT
Types: extension
Size: 373.6 KB
Dependencies: 0 dependencies · 1 peer

Pi manifest JSON

{
  "extensions": [
    "./index.ts"
  ],
  "image": "https://img.shields.io/badge/Pi-Cache%20Optimizer-4A90D9"
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

Pi Cache Optimizer

中文说明

Pi extension for improving provider-side KV / prompt cache hit rates. It keeps stable prompt content near the front, adds a conservative OpenAI-compatible prompt_cache_key fallback, warns about common proxy cache-routing gaps, and shows read-only footer cache stats.

Renamed from pi-deepseek-cache-optimizer. Existing footer counters migrate automatically. The extension does not touch Pi's models.json during normal operation (default: ~/.pi/agent/models.json; custom agent dirs use PI_CODING_AGENT_DIR); only /cache-optimizer fix can edit it, and only after an explicit interactive preview + confirmation with an automatic timestamped backup.

What it does
Install
Commands
Persistent opt-out
OpenAI-compatible proxy setup
Adaptive thinking models
Auto-repair with /cache-optimizer fix
Footer stats
For router / virtual-channel extension authors
Uninstall
Verify effect
License

What it does

Reorders uniquely identifiable stable system-prompt content before dynamic context. If the same candidate appears more than once (for example, quoted inside dynamic context), it is left unchanged to avoid removing the wrong occurrence.
Compresses Pi skill listings and strips session-overview churn.
Requests long cache retention when Pi/provider compat supports it.
Adds a session-id prompt_cache_key fallback for openai-completions / openai-responses payloads when no effective key exists, including Pi's built-in llama.cpp provider as Pi 0.82 core does.
Warns once for third-party OpenAI-compatible proxies missing cache/session-affinity compat flags.
Detects adaptive-thinking compat for Claude (opus-4.6+, sonnet-4.6+ including Sonnet 5, fable-5+) and Kimi Coding K3 / kimi-for-coding custom channels.
Shows restart-persistent provider/model footer stats for supported model families.
Supports optional router-extension integration through versioned global protocols (Symbol.for("pi.routing.registry.v1") and Symbol.for("pi.cache.hints.v1")) without importing router packages.

Caching is provider-side and best-effort. Third-party proxies and router extensions can still hide cache usage, reject unsupported parameters, or route requests across multiple upstreams.

Install

pi install npm:pi-cache-optimizer

If you previously installed the old package:

pi remove npm:pi-deepseek-cache-optimizer && pi install npm:pi-cache-optimizer

Run /reload in Pi after install/update/remove so extension hooks refresh.

On Pi 0.79.7 and newer, pi update updates Pi itself only. To update installed Pi packages such as this extension, run pi update --extensions (packages only) or pi update --all (Pi + packages).

Commands

Command	Effect
`/cache-optimizer`	Interactive menu when UI supports it; otherwise prints help and current state.
`/cache-optimizer enable`	Enables runtime optimizations for the current Pi process, resets local footer stats, and starts a fresh “enabled” measurement.
`/cache-optimizer disable`	Disables optimization for the current Pi process, resets local footer stats, and keeps collecting footer stats in disabled comparison mode. Run `/reload` or restart Pi to return to startup behavior.
`/cache-optimizer doctor`	Shows active model/provider/API/base URL/compat plus low-hit diagnosis.
`/cache-optimizer compat`	Shows copyable compat advice for the active model, if applicable.
`/cache-optimizer stats`	Shows today's local provider/model counters and recent trend for the active model.
`/cache-optimizer reset`	Resets local footer stats for the active provider/model; upstream provider cache is not modified.
`/cache-optimizer fix`	Auto-repairs safe compat issues for the active model (adaptive thinking, DeepSeek reasoning, OpenAI proxy session affinity). Shows preview + risk warning, requires confirmation. Only modifies `models.json` after explicit user approval.

enable / disable are current-process switches. For a persistent opt-out, use environment variables below.

Persistent opt-out

Env var	Effect
`PI_CACHE_OPTIMIZER_NO_PROMPT_REWRITE=1`	Disable prompt mutations only; footer stats and cache-key fallback remain active.
`PI_CACHE_OPTIMIZER_NO_SKILL_COMPRESSION=1`	Keep Pi's verbose skill XML.
`PI_CACHE_OPTIMIZER_NO_OPENAI_CACHE_KEY=1`	Disable the OpenAI-compatible `prompt_cache_key` fallback. Preferred explicit opt-out.
`PI_CACHE_OPTIMIZER_OPENAI_CACHE_KEY=0`	Disable the same fallback via the legacy inverse switch. Values `0`, `false`, `no`, or `off` disable it.

OpenAI-compatible proxy setup

Third-party openai-completions proxies (LiteLLM / OneAPI / NewAPI / OpenRouter-like channels) often route one session across multiple upstream backends. That splits provider-side prompt caches.

Pi 0.81+ also has a built-in llama.cpp provider using an OpenAI-shaped transport. Pi 0.82 core generates a session prompt_cache_key for it when cache retention is enabled, so this extension preserves that key and may add the same conservative fallback when missing. The built-in provider's explicit compat fingerprint is excluded from generic proxy routing/session-affinity advice, but a custom or overridden provider that merely reuses the id llama.cpp is treated like any other OpenAI-compatible channel. prompt_cache_retention remains subject to the normal safety rule: keep it only for official OpenAI or an explicit effective supportsLongCacheRetention: true opt-in in models.json; otherwise strip it before sending. Effective values follow Pi's precedence: modelOverrides[modelId].compat first, then the matching models[].compat, then provider-level compat. An explicit false at a higher layer overrides true below it.

For real proxies, start with session affinity:

{
  "providers": {
    "your-provider-id": {
      "api": "openai-completions",
      "baseUrl": "https://example.com/v1",
      "apiKey": "env:YOUR_API_KEY",
      "compat": {
        "sendSessionAffinityHeaders": true
      },
      "models": [
        { "id": "gpt-5.5", "name": "GPT-5.5" }
      ]
    }
  }
}

Notes:

sendSessionAffinityHeaders: true is the safe default when your proxy supports sticky routing.
supportsLongCacheRetention: true is optional. Add it only when the endpoint explicitly supports OpenAI long prompt cache retention.
If you see 400 Unsupported parameter: prompt_cache_retention, remove/avoid supportsLongCacheRetention for that channel. Keep sendSessionAffinityHeaders if supported.
Use /cache-optimizer compat or /cache-optimizer doctor to see model-specific advice.
For DeepSeek models, the Pi Mono guidance expects compat.requiresReasoningContentOnAssistantMessages: true and compat.thinkingFormat: "deepseek" alongside cache/session-affinity flags when the endpoint supports them.
This extension's doctor and compat commands only advise; they do not modify models.json.

Anthropic cache TTL compatibility

Anthropic processes cache breakpoints in tools → system → messages order and rejects a ttl: "1h" breakpoint that appears after a 5-minute breakpoint. An ephemeral cache_control without ttl uses the default 5-minute retention.

For every anthropic-messages channel, the extension checks the final serialized payload and immediately downgrades a visibly invalid short-to-long transition. Legal long-only and 1h → 5m payloads remain unchanged, including third-party endpoints that genuinely support 1-hour retention.

Some proxies rewrite or insert hidden 5-minute breakpoints after Pi's request hook. If a provider returns Anthropic's explicit TTL-ordering error, the extension records a process-local provider/model fallback so the next subsequent request uses the default 5-minute TTL. This error is a non-retryable HTTP 400 in Pi 0.82.1, so the extension does not claim that Pi's automatic retry will rerun the failed turn; if another layer retries it, that retry also uses the fallback. /cache-optimizer doctor reports the fallback and /cache-optimizer fix can persist a model-scoped supportsLongCacheRetention: false with the normal confirmation/backup flow. Other 400 errors and prompt-too-long failures do not activate this fallback. The runtime observation lasts until the current process exits and survives extension reloads within that process.

Adaptive thinking models

Claude models from opus-4.6 / sonnet-4.6 (including Sonnet 5) / fable-5 onwards require forceAdaptiveThinking: true in compat. Kimi Coding K3 (k3) and kimi-for-coding also use adaptive thinking and need allowEmptySignature: true so replayed empty-signature thinking blocks remain valid. Without the required compat, Pi may send a legacy thinking payload or replay thinking incorrectly.

Pi's built-in catalog already sets this flag for official models. Custom channels in models.json that override these models must include the flag:

{
  "providers": {
    "your-claude-channel": {
      "api": "anthropic-messages",
      "baseUrl": "https://...",
      "apiKey": "env:YOUR_KEY",
      "compat": {
        "forceAdaptiveThinking": true
      },
      "models": [
        { "id": "claude-sonnet-5", "name": "Claude Sonnet 5" }
      ]
    }
  }
}

Or use model-level override:

{
  "providers": {
    "your-claude-channel": {
      "modelOverrides": {
        "claude-sonnet-5": {
          "compat": {
            "forceAdaptiveThinking": true
          }
        }
      }
    }
  }
}

For Kimi Coding K3 custom channels, use model-level compat when the provider contains mixed models:

{
  "providers": {
    "your-kimi-coding-channel": {
      "modelOverrides": {
        "k3": {
          "compat": {
            "forceAdaptiveThinking": true,
            "allowEmptySignature": true
          }
        }
      }
    }
  }
}

Pi 0.80.9+ already includes Kimi K3 in built-in Kimi Coding, Moonshot AI / China, OpenRouter, and Vercel AI Gateway catalogs. The Moonshot/OpenRouter variants use their OpenAI-compatible transport and continue through the normal Kimi footer/proxy path; the adaptive compat above applies only to anthropic-messages Kimi Coding channels.

/cache-optimizer doctor and /cache-optimizer compat detect missing flags and show copyable JSON.

Auto-repair with `/cache-optimizer fix`

v2.6.0+ adds a fix subcommand that can auto-repair safe compat issues:

Adaptive thinking (forceAdaptiveThinking: true; Kimi Coding K3 / kimi-for-coding also allowEmptySignature: true)
DeepSeek Pi Mono reasoning compat (thinkingFormat: "deepseek", requiresReasoningContentOnAssistantMessages: true)
OpenAI-compatible proxy session affinity (sendSessionAffinityHeaders: true for openai-completions). Pi 0.80.7+ controls openai-responses header shape with sessionAffinityFormat and auto-detects its default; this extension no longer writes the removed sendSessionIdHeader field.

Scope: only the currently active model. Other channels require switching models and running fix again.

Safety:

Shows full preview of changes (file path, edit location, JSON to write, risks)
Warns: ① changes affect all sessions using that channel, ② automatic backup created at models.json.backup-cache-optimizer-<timestamp>, ③ Pi reload required
Uses comment-preserving surgical editor — existing comments, indentation, and existing key order are preserved
Requires explicit user confirmation (interactive prompt or ui.select)
Writes atomically (temp + rename); self-validates after write
Falls back to manual guidance if JSONC scanner cannot confidently locate the target

Existing modelOverrides[modelId] entries have Pi's highest precedence, so fix repairs them directly. For built-in or API-login models without a custom models[] entry, fix creates a compat-only modelOverrides entry instead of inventing a custom model definition. Self-validation checks the effective three-layer compat result, so a lower-level edit shadowed by an override is rejected.

Non-interactive mode: refuses to write; shows manual edit guidance instead.

Run: /cache-optimizer fix when the active model has detected compat issues. The command shows "nothing to fix" when compat is already complete.

Channels without a `models.json` provider entry

Some Pi channels may be available even when there is no provider block in Pi's agent models.json yet (default: ~/.pi/agent/models.json; with PI_CODING_AGENT_DIR, use $PI_CODING_AGENT_DIR/models.json). Keep existing authentication as-is and do not copy credentials, tokens, or API keys. Add only cache/routing compatibility overrides in models.json.

Provider-level minimal override:

{
  "providers": {
    "your-provider-id": {
      "compat": {
        "sendSessionAffinityHeaders": true
      }
    }
  }
}

If only one model should change, use modelOverrides:

{
  "providers": {
    "your-provider-id": {
      "modelOverrides": {
        "gpt-5.5": {
          "compat": {
            "sendSessionAffinityHeaders": true
          }
        }
      }
    }
  }
}

Footer stats

Stats are read-only local counters stored in Pi's agent directory (default: ~/.pi/agent/pi-cache-optimizer-stats.json; custom agent dirs use PI_CODING_AGENT_DIR) and keyed by provider/model, so the same channel/model keeps today's footer counters after a Pi process or terminal restart. The file also keeps hashed session buckets for migration/reload bookkeeping. It contains only dates and numeric counters — no API keys, prompts, payloads, headers, responses, or model output.

Pi 0.79+ also includes a built-in footer CH marker for the latest prompt cache hit rate. This extension complements that marker with persisted provider/model counters plus proxy compat diagnostics.

Example footer:

OpenAI cache 3/10 · 0.002M/0.005M tok (40%) ⚠️ compat

Format: <label> <hit requests>/<total requests> · <cached input tokens>/<total input tokens> tok (<token hit rate>). Some adapters may also append · write <tokens> tok, and runtime diagnostics may append ⚠️ compat or ⚠️ integrity.

Supported footer labels include: DS, Claude, OpenAI, Gemini, Kimi, Qwen, GLM, MiniMax, Mimo, Hunyuan, Mistral, Grok, Llama, Nemotron, Cohere, Yi, Doubao, ERNIE, Baichuan, StepFun, Spark, InternLM, Gemma, Phi, Jamba, Solar, Sonar, Nova, Reka, Falcon, DBRX, MPT, StableLM, Aquila, EXAONE, HyperCLOVA, Luminous, Hermes, Granite, Arctic, Pangu, SenseNova, Zhinao, MiniCPM, XVERSE, Orion, OpenChat, Vicuna, Wizard, Zephyr, Dolphin, OpenOrca, Starling, BLOOM, RWKV, and Aya.

Adapter selection uses only model id/name (plus assistant message model/name on message end). Generic OpenAI-shaped APIs are not treated as OpenAI-family unless the model id/name matches a supported family.

For router / virtual-channel extension authors

If your Pi extension provides a virtual routing provider (for example router/auto, router/smart, or a profile/channel that forwards to a real upstream), this extension can show cache stats for the real upstream provider/model instead of the virtual shell. Integration is optional, versioned, and does not require importing this package.

Minimum integration: final assistant message metadata

For seamless final cache-stat attribution, relay the real upstream identity on completed assistant messages:

{
  role: "assistant",
  provider: "anthropic",              // real upstream provider
  responseModel: "claude-opus-4-8",   // or model: "..."
  api: "anthropic-messages",          // upstream Pi API id when known
  usage: {
    input: 1200,       // Pi-normalized uncached input tokens, if available
    cacheRead: 8000,   // tokens read from provider prompt cache
    cacheWrite: 500,   // tokens newly written to provider prompt cache
  },
}

message_end treats these assistant-message fields as authoritative. If provider + model/responseModel + cache usage are present, stats update the upstream bucket even when the active model is still router/auto. If upstream usage does not expose cache fields, leave them absent/zero; this extension will not fake cache hits.

Optional: live route registry for pre-response UX

Final message metadata is enough for post-response stats. For pre-response flows — footer display before the first response, /cache-optimizer doctor, /cache-optimizer compat, /cache-optimizer reset, and OpenAI-compatible prompt_cache_key fallback — register a live route adapter under Symbol.for("pi.routing.registry.v1").

Protocol shape:

type PiRouteSnapshot = {
  virtualProvider: string;
  virtualModelId: string;
  provider: string;
  modelId: string;
  api?: string;
  canonicalModelId?: string;
  routeLabel?: string;
  status?: "planned" | "trying" | "selected" | "success" | "failed";
  sessionIdHash?: string;
  requestId?: string;
  timestamp: number;
};

type PiRouterAdapterV1 = {
  virtualProvider: string;
  resolveActiveRoute(
    virtualModelId: string,
    hint?: { sessionIdHash?: string; requestId?: string },
  ): PiRouteSnapshot | undefined;
  resolveCandidateRoutes?(virtualModelId: string): PiRouteSnapshot[];
  subscribe?(listener: (event: PiRouteSnapshot) => void): () => void;
};

Registration pattern:

const ROUTING = Symbol.for("pi.routing.registry.v1");
const registry = (globalThis as Record<symbol, unknown>)[ROUTING] as
  | { version: 1; registerRouter(adapter: PiRouterAdapterV1): () => void }
  | undefined;

registry?.registerRouter({
  virtualProvider: "router",
  resolveActiveRoute(virtualModelId, hint) {
    return {
      virtualProvider: "router",
      virtualModelId,
      provider: "deepseek",
      modelId: "deepseek-v4",
      api: "openai-completions",
      sessionIdHash: hint?.sessionIdHash,
      timestamp: Date.now(),
    };
  },
});

Do not overwrite an existing registry. If your extension loads before this optimizer, retry registration on session_start or create the same V1 registry shape only if no registry exists.

Optional: query-scoped cache hints

Routers that forward to an inner Pi request path can read query-scoped hints from Symbol.for("pi.cache.hints.v1"):

const CACHE_HINTS = Symbol.for("pi.cache.hints.v1");
const hints = (globalThis as Record<symbol, any>)[CACHE_HINTS]?.getHints?.({
  sessionIdHash,
  virtualProvider: "router",
  virtualModelId: "auto",
  upstreamProvider: "deepseek",
  upstreamModelId: "deepseek-v4",
  api: "openai-completions",
});

When the query matches the current session/route, hints may contain systemPrompt, promptCacheKey, and cacheRetention: "long". Treat these as advisory and sensitive: do not log them, do not expose prompt text, and do not overwrite an existing request-level prompt_cache_key / promptCacheKey.

Security and correctness rules

Do not import pi-cache-optimizer; use Symbol.for(...) discovery only.
Do not expose API keys, prompts, payloads, headers, response bodies, or model output in route snapshots or logs.
Use assistant-message metadata for final attribution; live registry data is advisory and may be stale by response time.
Preserve truthful usage. Missing cache usage should show as 0/under-reported, not as synthetic hits.

Uninstall

pi remove npm:pi-cache-optimizer

Then run /reload or restart Pi. Optional local stats cleanup (if you use PI_CODING_AGENT_DIR, delete the same files from that directory instead):

Platform	Delete local stats files
Linux / macOS / WSL	`rm -f ~/.pi/agent/pi-cache-optimizer-stats.json ~/.pi/agent/deepseek-cache-optimizer-stats.json`
Windows PowerShell	`Remove-Item -Force "$env:USERPROFILE\.pi\agent\pi-cache-optimizer-stats.json", "$env:USERPROFILE\.pi\agent\deepseek-cache-optimizer-stats.json" -ErrorAction SilentlyContinue`
Windows Command Prompt	`del /f /q "%USERPROFILE%\.pi\agent\pi-cache-optimizer-stats.json" "%USERPROFILE%\.pi\agent\deepseek-cache-optimizer-stats.json" 2>nul`

Do not delete models.json during cleanup; it contains your Pi model/provider configuration and is not owned by this package.

Verify effect

Select a model whose provider exposes cache usage.
Send several similar turns in the same Pi session.
Watch the footer or run /cache-optimizer stats.
For third-party proxies, also run /cache-optimizer doctor and confirm sticky routing / session affinity on the proxy side.

License

MIT