pi-minimax-m3-caching-fix
MiniMax-M3 on the OpenAI-compatible endpoint with passive caching. Wraps the built-in openai-completions streamSimple driver to clean duplicated and inline <think>…</think> thinking in flight, mirroring the upstream skipThinkingBlock compat flag for pi-ai
Package details
Install pi-minimax-m3-caching-fix from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-minimax-m3-caching-fix
- Package: pi-minimax-m3-caching-fix
- Version: 0.2.0
- Published: Jun 12, 2026
- Downloads: 141/mo · 141/wk
- Author: frugally3683
- License: MIT
- Types: extension
- Size: 27.9 KB
- Dependencies: 0 dependencies · 2 peers
Pi manifest JSON
{
  "extensions": [
    "./index.ts"
  ]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-minimax-m3
A standalone pi extension that fixes two issues with the built-in MiniMax-M3 integration:
- Silent over-billing on the Anthropic-compatible endpoint. M3's /anthropic/v1/messages endpoint ignores cache_control markers, so every turn was billed at the full input price ($0.60/Mtok) instead of the cache-read price ($0.12/Mtok). M3 does support passive/automatic prompt caching on its OpenAI-compatible endpoint (/v1/chat/completions).
- Duplicated thinking in the response. M3 emits thinking content twice: once in reasoning_content (consumed by pi as a thinking block) and once in content wrapped in <think>…</think> markers (which would otherwise appear inside the visible text).
This extension registers two new providers — minimax-m3-clean and
minimax-cn-m3-clean — that route MiniMax-M3 to the OpenAI-compatible
endpoint so passive caching works. The thinking cleanup is performed
during the stream by wrapping the built-in openai-completions
streamSimple driver and rewriting the event stream in flight: duplicated
thinking from M3's reasoning_content / reasoning field alternation is
suppressed, <think>…</think> spans are filtered out of text deltas (and
their inner content is routed to a real thinking block when no reasoning
fields were streamed), and text_start is deferred until the first
non-whitespace character.
It mirrors the upstream fix in
pi-mono@b85b91c9
("route MiniMax-M3 to openai-completions for passive caching") so users can
get the fix on any pi version without waiting for an upstream release.
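The suppression of re-streamed reasoning can be pictured with a small prefix comparison. The sketch below is not the extension's code: the class name ReasoningDeduperSketch, its push method, and the exact replay behaviour it assumes (a field switch replays the earlier reasoning from the start before continuing) are illustrative assumptions.

```typescript
// Hypothetical sketch: suppress reasoning that is replayed after a field switch.
class ReasoningDeduperSketch {
  private emitted = "";                // reasoning already forwarded downstream
  private current = "";                // text accumulated from the active field
  private field: string | null = null; // field name seen on the last delta

  /** Feed one reasoning delta; returns only the part not emitted before. */
  push(field: string, delta: string): string {
    if (field !== this.field) {
      // Assumption: after a switch the model replays reasoning from the start.
      this.field = field;
      this.current = "";
    }
    this.current += delta;

    // Still replaying text that was already emitted: forward nothing.
    if (this.emitted.startsWith(this.current)) return "";

    // The replay has caught up and moved past the emitted text.
    if (this.current.startsWith(this.emitted)) {
      const fresh = this.current.slice(this.emitted.length);
      this.emitted = this.current;
      return fresh;
    }

    // Diverged from what was emitted: treat the delta as new reasoning.
    this.emitted += delta;
    this.current = this.emitted;
    return delta;
  }
}

const dedupe = new ReasoningDeduperSketch();
const emittedParts = [
  dedupe.push("reasoning_content", "abc"),
  dedupe.push("reasoning_content", "def"),
  dedupe.push("reasoning", "abcd"), // replay of already-emitted text
  dedupe.push("reasoning", "efgh"), // only "gh" is new
];
```

Because every returned fragment is appended to one running string, a consumer can keep a single thinking block open instead of starting a new one at each field switch.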
Install
From npm:
pi install npm:pi-minimax-m3-caching-fix
From a git checkout (latest, or pinned):
pi install git:github.com/rwese/pi-minimax-m3-caching-fix
pi install git:github.com/rwese/pi-minimax-m3-caching-fix@v0.2.0
For local development from a clone:
git clone https://github.com/rwese/pi-minimax-m3-caching-fix
pi install ./pi-minimax-m3-caching-fix
The extension reuses the env vars you already have for the built-in minimax
provider — no new credentials required:
| Provider | Env var | Endpoint |
|---|---|---|
| minimax-m3-clean | MINIMAX_API_KEY | https://api.minimax.io/v1 |
| minimax-cn-m3-clean | MINIMAX_CN_API_KEY | https://api.minimaxi.com/v1 |
Quickstart (for the impatient)
# 1. Make sure your MiniMax API key is exported
export MINIMAX_API_KEY="sk-..."
# 2. Install the extension
pi install npm:pi-minimax-m3-caching-fix
# 3. Restart any running pi session, then start one
pi
# 4. Inside pi, switch the model
/model
# pick: minimax-m3-clean / MiniMax-M3 (clean)
# 5. Verify caching — look at the footer or session log
# Turn 1: ~99% cache miss (system prompt being written to cache)
# Turn 2+: ~99% cache read (system prompt being reused)
That's it. No new credentials, no config file, no restart of the upstream
minimax provider. Just pick the right model in /model and the rest
happens automatically.
Use
- Run pi.
- Open the model picker with /model.
- Pick minimax-m3-clean / MiniMax-M3 (clean) for the global endpoint or minimax-cn-m3-clean / MiniMax-M3 (clean — CN) for the China endpoint.
- Send a prompt. The first turn is a cache miss; subsequent turns of the same session show a CH (cache hit rate) in the footer as the system prompt gets reused.
In the session log, the usage object on each assistant message shows the
cache reads. For example, a 3-turn session looks like:
| Turn | input | cacheRead | Hit rate |
|---|---|---|---|
| 1 | 8932 | 114 | 1% |
| 2 | 128 | 8946 | 99% |
| 3 | 128 | 8946 | 99% |
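The hit rates in the table are consistent with cacheRead / (input + cacheRead), assuming input counts only uncached tokens. The sketch below reproduces that arithmetic and the input-side cost at the prices quoted in this README. The Usage interface and both function names are illustrative assumptions, not pi's actual types.

```typescript
// Hypothetical shape: only the two usage fields shown in the table.
interface Usage {
  input: number;     // uncached input tokens, billed at the full input price
  cacheRead: number; // input tokens served from the prompt cache
}

// USD per million tokens, as quoted in this README.
const INPUT_PRICE = 0.6;
const CACHE_READ_PRICE = 0.12;

/** Share of prompt tokens served from cache, rounded to a whole percent. */
function hitRatePercent(u: Usage): number {
  const total = u.input + u.cacheRead;
  return total === 0 ? 0 : Math.round((u.cacheRead / total) * 100);
}

/** Input-side cost of one turn in USD; output tokens are not included. */
function inputCostUsd(u: Usage): number {
  return (u.input * INPUT_PRICE + u.cacheRead * CACHE_READ_PRICE) / 1_000_000;
}

const turns: Usage[] = [
  { input: 8932, cacheRead: 114 },
  { input: 128, cacheRead: 8946 },
  { input: 128, cacheRead: 8946 },
];

for (const [i, u] of turns.entries()) {
  console.log(`turn ${i + 1}: ${hitRatePercent(u)}% hit, $${inputCostUsd(u).toFixed(6)} input cost`);
}
```

Under these assumptions, turn 2 costs about $0.00115 on the input side, against about $0.00544 if all 9074 prompt tokens were billed at the full input price.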
Why a separate provider (not overriding the built-in)
pi.registerProvider(name, { models }) replaces every model registered
for that provider, so overriding the built-in minimax provider breaks it in one of two ways:
- Override minimax with baseUrl only: this lumps M2.x onto the OpenAI-compatible endpoint too, breaking M2.x.
- Override minimax with new models: this wipes M2.x from the registry.
So this extension registers new provider names (minimax-m3-clean,
minimax-cn-m3-clean) that don't collide with minimax or
minimax-cn. Users opt in by switching the model in /model. The built-in
minimax / MiniMax-M3 model is still listed — pick the one with
"(clean)" in the name.
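The trade-off can be illustrated with a toy registry that follows the replace semantics described above. This is a model for reasoning only, not pi's implementation: ToyRegistry, ProviderEntry, the endpoint labels, and the placeholder model id MiniMax-M2.x are all assumptions made for the example.

```typescript
// Toy model of "registering replaces the model list"; not pi's real API.
interface ProviderEntry {
  endpoint: string; // which API flavor requests are sent to
  models: string[]; // model ids listed under this provider
}

class ToyRegistry {
  private readonly providers = new Map<string, ProviderEntry>();

  register(name: string, patch: Partial<ProviderEntry>): void {
    const prev = this.providers.get(name);
    this.providers.set(name, {
      endpoint: patch.endpoint ?? prev?.endpoint ?? "",
      // A supplied model list replaces the previous one wholesale.
      models: patch.models ?? prev?.models ?? [],
    });
  }

  get(name: string): ProviderEntry | undefined {
    return this.providers.get(name);
  }
}

function builtIn(): ToyRegistry {
  const registry = new ToyRegistry();
  registry.register("minimax", {
    endpoint: "anthropic-compatible",
    models: ["MiniMax-M2.x", "MiniMax-M3"], // placeholder ids
  });
  return registry;
}

// Option 1: change only the endpoint. M2.x is dragged onto it as well.
const endpointOnly = builtIn();
endpointOnly.register("minimax", { endpoint: "openai-compatible" });

// Option 2: supply a new model list. M2.x disappears from the provider.
const modelsOnly = builtIn();
modelsOnly.register("minimax", { endpoint: "openai-compatible", models: ["MiniMax-M3"] });

// A new provider name leaves the built-in entry untouched.
const separate = builtIn();
separate.register("minimax-m3-clean", { endpoint: "openai-compatible", models: ["MiniMax-M3"] });
```

Only the third variant leaves the original minimax entry as it was, which is why opting in through a separate provider name is the least invasive choice.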
Limitations
- Two MiniMax-M3 entries in /model. The built-in (broken, billing at full input price) and the extension's (clean) both appear. Pick the one with (clean) in the name.
- Requires both env vars for both providers to show. pi only lists providers that have auth configured. If you only have MINIMAX_API_KEY, only minimax-m3-clean shows up; set MINIMAX_CN_API_KEY (even to a dummy value) to also see minimax-cn-m3-clean.
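The second limitation amounts to a filter over the provider table. The sketch below assumes that "auth configured" means the matching environment variable is set to a non-empty string; the function visibleProviders is hypothetical and does not mirror pi's internals.

```typescript
// Provider-to-env-var mapping copied from the table in this README.
const PROVIDER_ENV: Record<string, string> = {
  "minimax-m3-clean": "MINIMAX_API_KEY",
  "minimax-cn-m3-clean": "MINIMAX_CN_API_KEY",
};

/** Providers whose env var is set to a non-empty value (assumed rule). */
function visibleProviders(env: Record<string, string | undefined>): string[] {
  return Object.entries(PROVIDER_ENV)
    .filter(([, envVar]) => (env[envVar] ?? "") !== "")
    .map(([provider]) => provider);
}

const globalOnly = visibleProviders({ MINIMAX_API_KEY: "sk-..." });
const withDummyCn = visibleProviders({ MINIMAX_API_KEY: "sk-...", MINIMAX_CN_API_KEY: "dummy" });
```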
How the fix works
The extension does two things:
1. Routes M3 to /v1/chat/completions by registering the two new providers under a custom api id (the provider name) so the wrapper below only intercepts these models. The model metadata mirrors packages/ai/src/models.generated.ts from the upstream fix: input: ["text", "image"], reasoning: true, cost $0.6 / $2.4 / $0.12 per million tokens, 1M-token context window, 512K max output.
2. Cleans M3's thinking in the stream wrapper. The wrapper sits in front of the built-in openai-completions streamSimple driver and rewrites events as they arrive:
   - All driver thinking blocks are merged into ONE thinking block. M3 re-streams the same reasoning when it switches between reasoning_content and reasoning fields, which would otherwise start a new (truncated) thinking block on every field switch. The wrapper dedupes by prefix and emits only the new portion of reasoning.
   - A ThinkScanner filters <think>…</think> spans from text deltas in real time and holds back bytes that look like the start of a tag so markers split across deltas are classified correctly. If the model never streamed reasoning fields, the captured inner content is routed to a real thinking block instead of being dropped; otherwise it's a duplicate of the reasoning fields and is discarded.
   - text_start is deferred until the first non-whitespace character so empty / whitespace-only text blocks are not rendered.

   This is the same effect as the upstream compat.skipThinkingBlock flag, but applied in the stream wrapper because the user's installed @earendil-works/pi-ai (0.79.1) predates that compat field. When a future pi-ai release includes skipThinkingBlock, the wrapper becomes a thin pass-through and can be deleted.
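The hold-back behaviour of the tag scanner is the subtle part, so a minimal sketch may help. It is an independent illustration, not the extension's ThinkScanner: the class ThinkScannerSketch, the Segment type, and the helper partialTagLength are hypothetical names, and the real implementation may differ in detail.

```typescript
type Segment = { kind: "text" | "thinking"; value: string };

const OPEN = "<think>";
const CLOSE = "</think>";

/** Length of the longest suffix of buf that is a proper prefix of tag. */
function partialTagLength(buf: string, tag: string): number {
  const max = Math.min(buf.length, tag.length - 1);
  for (let n = max; n > 0; n--) {
    if (buf.endsWith(tag.slice(0, n))) return n;
  }
  return 0;
}

class ThinkScannerSketch {
  private inside = false; // true while between <think> and </think>
  private held = "";      // trailing characters that may begin a tag

  /** Feed one text delta; returns the segments that are safe to emit now. */
  push(delta: string): Segment[] {
    const out: Segment[] = [];
    let buf = this.held + delta;
    this.held = "";
    while (buf.length > 0) {
      const tag = this.inside ? CLOSE : OPEN;
      const kind = this.inside ? "thinking" : "text";
      const at = buf.indexOf(tag);
      if (at >= 0) {
        // Complete tag found: emit what precedes it, drop the tag, flip state.
        if (at > 0) out.push({ kind, value: buf.slice(0, at) });
        buf = buf.slice(at + tag.length);
        this.inside = !this.inside;
        continue;
      }
      // No complete tag: hold back a possible tag start at the very end.
      const keep = partialTagLength(buf, tag);
      if (buf.length > keep) out.push({ kind, value: buf.slice(0, buf.length - keep) });
      this.held = buf.slice(buf.length - keep);
      buf = "";
    }
    return out;
  }

  /** End of stream: held characters never completed a tag, so release them. */
  end(): Segment[] {
    const rest = this.held;
    this.held = "";
    return rest ? [{ kind: this.inside ? "thinking" : "text", value: rest }] : [];
  }
}

const scanner = new ThinkScannerSketch();
const segments = [
  ...scanner.push("Hello <th"),
  ...scanner.push("ink>secret</thi"),
  ...scanner.push("nk> world"),
  ...scanner.end(),
];
```

A caller would forward text segments to the visible output and either discard thinking segments or route them to a thinking block, depending on whether reasoning fields were already streamed.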
Removing the extension (when upstream ships the fix)
When pi-mono ships a release that includes b85b91c9 (or any release whose
models.generated.ts lists MiniMax-M3 with api: "openai-completions"
and skipThinkingBlock: true), retire the extension:
pi remove npm:pi-minimax-m3-caching-fix
The built-in minimax / MiniMax-M3 model will then route correctly out of
the box.
License
MIT — see LICENSE.
Credits
The in-flight thinking-cleanup wrapper introduced in v0.2.0 (the
ThinkScanner, the merged-thinking block, and the deferred text_start)
was contributed by Thunder Guardian (Discord: @Thunder Guardian).
Development
npm run check # tsc --noEmit using the bundled tsconfig.json
The tsconfig.json configures --skipLibCheck and --moduleResolution bundler so the type check is reproducible without depending on transitive
type packages of the user's installed pi.