@wenjinnn/pi-mimo-voice

Voice input and output for pi powered by Xiaomi MiMo V2.5 TTS

Install @wenjinnn/pi-mimo-voice from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:@wenjinnn/pi-mimo-voice

Package: @wenjinnn/pi-mimo-voice
Version: 1.1.1
Published: May 31, 2026
Downloads: not available
Author: wenjinnn
License: MIT
Types: extension
Size: 47 KB
Dependencies: 0 dependencies · 3 peers

Pi manifest JSON

{
  "extensions": [
    "./index.ts"
  ],
  "image": "https://github.com/wenjinnn/pi-mimo-voice/releases/download/v1.0.5/screenshot.png",
  "video": "https://github.com/wenjinnn/pi-mimo-voice/releases/download/v1.0.5/video.mp4"
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-mimo-voice

🇨🇳 Chinese documentation (中文文档)

Voice input (STT) and output (TTS) for pi, powered by the Xiaomi MiMo V2.5 API.

pi-mimo-voice screenshot

🔊 Please turn on sound for the demo below

https://github.com/user-attachments/assets/9a9f9984-9476-4c4e-bbeb-ae088a7d875c

Features

🎤 Speech-to-Text (STT) — Record audio from microphone and transcribe
🔊 Text-to-Speech (TTS) — Speak text aloud through speakers
🗣️ Auto-speak — Automatically read all assistant replies
🎙️ Live mode — Continuous voice conversation loop
🎛️ Interactive config — Voice, model, engine, and region settings

Installation

# Install from npm
pi install npm:@wenjinnn/pi-mimo-voice

# Or install from GitHub
pi install git:github.com/wenjinnn/pi-mimo-voice

# Or clone manually
cd ~/.pi/agent/extensions
git clone https://github.com/wenjinnn/pi-mimo-voice.git

Quick Start

# 1. Set API key (via environment variable or pi /login)
export XIAOMI_TOKEN_PLAN_CN_API_KEY="your-key"  # China region
# Or: export XIAOMI_API_KEY="your-key"          # Global

# 2. Restart pi or /reload

# 3. Try it
/speak Hello from MiMo!
/listen
/listen stop

Commands

Text-to-Speech

Command	Description
`/speak <text>`	Speak text aloud using MiMo TTS

Speech-to-Text

Command	Description
`/listen`	Start recording (manual stop with `/listen stop`)
`/listen stop`	Stop recording and transcribe
`/listen auto-stop N`	Record for N seconds then transcribe
`/listen N`	Record for N seconds (max 60) then transcribe
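The four `/listen` forms above can be told apart by their arguments alone. The following is a minimal sketch of such a parser, not the extension's actual code: the `ListenAction` type and the function names are hypothetical, and clamping values above 60 (rather than rejecting them) is an assumption.

```typescript
// Hypothetical parser for the text that follows "/listen".
// Type and function names are illustrative, not the extension's real API.
type ListenAction =
  | { kind: "start" }
  | { kind: "stop" }
  | { kind: "timed"; seconds: number }
  | { kind: "invalid"; reason: string };

const MAX_SECONDS = 60; // documented cap for "/listen N"

function parseSeconds(raw: string): ListenAction {
  if (!/^\d+$/.test(raw)) {
    return { kind: "invalid", reason: `not a whole number: ${raw}` };
  }
  const n = Number(raw);
  if (n < 1) {
    return { kind: "invalid", reason: "duration must be at least 1 second" };
  }
  // Assumption: oversized values are clamped to the cap, not rejected.
  return { kind: "timed", seconds: Math.min(n, MAX_SECONDS) };
}

function parseListenArgs(args: string): ListenAction {
  const [first, second, ...rest] = args
    .trim()
    .split(/\s+/)
    .filter((part) => part.length > 0);

  if (first === undefined) return { kind: "start" }; // bare "/listen"
  if (rest.length === 0 && second === undefined) {
    return first === "stop" ? { kind: "stop" } : parseSeconds(first);
  }
  if (rest.length === 0 && second !== undefined && first === "auto-stop") {
    return parseSeconds(second);
  }
  return { kind: "invalid", reason: `unrecognized arguments: ${args}` };
}
```

For example, `parseListenArgs("auto-stop 5")` yields a timed action of 5 seconds, and `parseListenArgs("stop")` yields the stop action.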

Auto-speak & Live Mode

Command	Description
`/auto-speak`	Toggle auto-speak (read all assistant replies). Use `/auto-speak on|off` for explicit control
`/live`	Start/stop live voice mode (auto-speak ON, begins recording)
`/live reply`	Stop recording, transcribe, and send to LLM

Live mode flow:

1. /live → recording starts
2. Speak when ready
3. /live reply → transcribes and sends
4. AI responds → auto-speak reads it → next recording starts
5. Repeat from step 2
6. /live to stop
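The live mode loop can be pictured as a small state machine. The sketch below is only a model of the flow described here: the state and event names are hypothetical and do not come from the extension's source.

```typescript
// Hypothetical model of the live mode loop; names are illustrative only.
type LiveState =
  | "idle"
  | "recording"
  | "transcribing"
  | "awaiting_reply"
  | "speaking";

type LiveEvent =
  | "live" // user ran /live
  | "live_reply" // user ran /live reply
  | "transcribed" // transcript was sent to the LLM
  | "assistant_replied" // LLM response arrived
  | "speech_finished"; // auto-speak finished reading the response

function nextLiveState(state: LiveState, event: LiveEvent): LiveState {
  // /live toggles: it starts the loop from idle and stops it from anywhere else.
  if (event === "live") return state === "idle" ? "recording" : "idle";

  switch (state) {
    case "recording":
      return event === "live_reply" ? "transcribing" : state;
    case "transcribing":
      return event === "transcribed" ? "awaiting_reply" : state;
    case "awaiting_reply":
      return event === "assistant_replied" ? "speaking" : state;
    case "speaking":
      // The next recording starts automatically after playback.
      return event === "speech_finished" ? "recording" : state;
    default:
      return state; // idle ignores everything except /live
  }
}
```

Events that do not apply to the current state are ignored, so a stray `/live reply` while idle leaves the machine idle.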

Configuration

Command	Description
`/voice-config`	Interactive settings: voice, TTS model, STT engine, API region

LLM Tools

When installed, the LLM can call these tools directly:

`mimo_tts`

Convert text to speech. Parameters:

text (required) — Text to speak
style (optional) — Style instruction (e.g., "excited", "calm", or "东北话" for a Northeastern Mandarin accent)

`mimo_stt`

Record and transcribe speech. Parameters:

duration (optional) — Recording duration in seconds (default: 10, max: 60)
auto_stop (optional) — Auto-stop when silence detected (default: false)
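The defaults and the 60-second cap for `mimo_stt` could be applied as below. This is a sketch under stated assumptions rather than the tool's real implementation: `normalizeSttParams` is a hypothetical helper, and the 1-second lower bound is an assumption not stated in this README.

```typescript
// Hypothetical helper that fills in the documented mimo_stt defaults.
interface SttParams {
  duration?: number; // seconds; documented default 10, max 60
  auto_stop?: boolean; // documented default false
}

const DEFAULT_DURATION = 10;
const MAX_DURATION = 60;
const MIN_DURATION = 1; // assumption: the README states no lower bound

function normalizeSttParams(params: SttParams = {}): Required<SttParams> {
  const requested = params.duration ?? DEFAULT_DURATION;
  // Fall back to the default for NaN or Infinity, then clamp into range.
  const duration = Number.isFinite(requested)
    ? Math.min(Math.max(requested, MIN_DURATION), MAX_DURATION)
    : DEFAULT_DURATION;
  return { duration, auto_stop: params.auto_stop ?? false };
}
```

With this helper, a call that omits both parameters records for 10 seconds with silence detection off, and a request for 300 seconds is reduced to 60.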

Voices

Preset Voices

Voice	ID	Language	Gender
Default	`mimo_default`	auto	auto
冰糖	`冰糖`	zh	female
茉莉	`茉莉`	zh	female
苏打	`苏打`	zh	male
白桦	`白桦`	zh	male
Mia	`Mia`	en	female
Chloe	`Chloe`	en	female
Milo	`Milo`	en	male
Dean	`Dean`	en	male

TTS Models

Model	Description
`mimo-v2.5-tts`	Preset voices (default)
`mimo-v2.5-tts-voicedesign`	Custom voice via text description
`mimo-v2.5-tts-voiceclone`	Clone voice from audio sample

API Configuration

The extension auto-detects the API region from your pi auth.json provider:

Provider	auth.json Key	Environment Variable	Region
Xiaomi MiMo	`xiaomi`	`XIAOMI_API_KEY`	Global
Token Plan CN	`xiaomi-token-plan-cn`	`XIAOMI_TOKEN_PLAN_CN_API_KEY`	China
Token Plan AMS	`xiaomi-token-plan-ams`	`XIAOMI_TOKEN_PLAN_AMS_API_KEY`	Amsterdam
Token Plan SGP	`xiaomi-token-plan-sgp`	`XIAOMI_TOKEN_PLAN_SGP_API_KEY`	Singapore
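The table above maps directly onto a lookup. The sketch below shows one way to find the first configured provider from environment variables; `detectFromEnv` is a hypothetical name, and checking providers in table order is an assumption, since this README does not say which provider wins when several keys are set.

```typescript
// Provider table from this README, expressed as data.
interface Provider {
  name: string;
  authKey: string; // key used in auth.json
  envVar: string; // environment variable holding the API key
  region: string;
}

const PROVIDERS: readonly Provider[] = [
  { name: "Xiaomi MiMo", authKey: "xiaomi", envVar: "XIAOMI_API_KEY", region: "Global" },
  { name: "Token Plan CN", authKey: "xiaomi-token-plan-cn", envVar: "XIAOMI_TOKEN_PLAN_CN_API_KEY", region: "China" },
  { name: "Token Plan AMS", authKey: "xiaomi-token-plan-ams", envVar: "XIAOMI_TOKEN_PLAN_AMS_API_KEY", region: "Amsterdam" },
  { name: "Token Plan SGP", authKey: "xiaomi-token-plan-sgp", envVar: "XIAOMI_TOKEN_PLAN_SGP_API_KEY", region: "Singapore" },
];

// Hypothetical helper: returns the first provider whose variable is set.
// Assumption: table order decides precedence.
function detectFromEnv(
  env: Record<string, string | undefined>,
): { provider: Provider; apiKey: string } | undefined {
  for (const provider of PROVIDERS) {
    const apiKey = env[provider.envVar]?.trim();
    if (apiKey) return { provider, apiKey };
  }
  return undefined;
}
```

In a Node.js process this would typically be called as `detectFromEnv(process.env)`.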

Setup Options

Option 1: Environment Variable

export XIAOMI_TOKEN_PLAN_CN_API_KEY="your-key"

Option 2: pi /login

/login  → Select provider → Enter API key

Option 3: auth.json

{
  "xiaomi-token-plan-cn": { "type": "api_key", "key": "your-key" }
}

See pi providers docs for more details.
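For Option 3, an entry can be validated before it is used. This is a minimal sketch that assumes only the entry shape shown above (`type` and `key`); `readApiKey` is a hypothetical helper, and it takes the file contents as a string because the file's location is not covered here.

```typescript
// Hypothetical helper: extract an API key from auth.json text.
// Only the { "type": "api_key", "key": "..." } shape shown above is assumed.
function readApiKey(authJsonText: string, providerKey: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(authJsonText);
  } catch {
    return undefined; // malformed JSON: treat as "no key configured"
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;

  const entry = (parsed as Record<string, unknown>)[providerKey];
  if (typeof entry !== "object" || entry === null) return undefined;

  const { type, key } = entry as { type?: unknown; key?: unknown };
  return type === "api_key" && typeof key === "string" && key.length > 0
    ? key
    : undefined;
}
```

Returning `undefined` for every failure case keeps the caller simple: it can fall through to the environment variable or prompt for `/login`.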

Requirements

Node.js ≥ 18
MiMo API key — configured via pi /login or environment variable (see API Configuration)
ffmpeg — for audio recording and playback (cross-platform)

Platform-Specific Audio Tools

Platform	Recording	Playback (priority order)
Linux	`parecord` (PulseAudio) → ffmpeg	`paplay` → `aplay` → `ffplay` → `mpv`
macOS	ffmpeg + avfoundation	`afplay` (built-in) → `ffplay` → `mpv`
Windows	ffmpeg + dshow	`ffplay` → `mpv`

Linux: PulseAudio is recommended (parecord/paplay). ALSA (aplay) also works for playback.

macOS: No extra tools needed — afplay is built-in, ffmpeg handles recording via avfoundation.

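Windows: Install ffmpeg and ensure it's in PATH. The default recording device is audio=麦克风 (麦克风 is Chinese for "microphone", so systems whose device has a different name will likely need an override); set the MIC_DEVICE environment variable to change it.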

⚠️ Cross-platform note: macOS and Windows support is based on ffmpeg's platform-specific audio backends (avfoundation/dshow) but has not been fully tested on real hardware. Linux is the primary tested platform. If you encounter issues on macOS or Windows, please open an issue — feedback and contributions are very welcome!
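The playback priority order in the table above amounts to "use the first player that is installed". A minimal sketch of that selection follows; `pickPlayer` and the `isAvailable` callback are hypothetical, and how the extension actually probes for installed tools is not described here.

```typescript
// Playback candidates per platform, in the priority order from the table.
type Platform = "linux" | "darwin" | "win32";

const PLAYBACK_PRIORITY: Record<Platform, readonly string[]> = {
  linux: ["paplay", "aplay", "ffplay", "mpv"],
  darwin: ["afplay", "ffplay", "mpv"],
  win32: ["ffplay", "mpv"],
};

// Hypothetical helper: returns the first candidate the caller reports as
// installed, or undefined when none is available.
function pickPlayer(
  platform: Platform,
  isAvailable: (command: string) => boolean,
): string | undefined {
  return PLAYBACK_PRIORITY[platform].find((command) => isAvailable(command));
}
```

Passing the availability check as a callback keeps the selection logic testable without touching the real PATH.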

Environment Variables

Variable	Description
`MIC_DEVICE`	Override audio recording device (all platforms)
`WHISPER_MODEL`	Path to whisper.cpp model file
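To illustrate how `MIC_DEVICE` could feed into recording, the sketch below builds an ffmpeg argument list per platform. It is an assumption-heavy illustration, not the extension's actual command line: `buildRecordArgs` is hypothetical, the Linux `pulse` input and the macOS `:0` default are guesses, and the 16 kHz mono output is chosen because whisper.cpp expects 16 kHz WAV input. Only the Windows default `audio=麦克风` comes from this README.

```typescript
type Platform = "linux" | "darwin" | "win32";

// Hypothetical helper: ffmpeg arguments for a fixed-length recording.
function buildRecordArgs(
  platform: Platform,
  seconds: number,
  outFile: string,
  micDevice?: string, // value of MIC_DEVICE, if set
): string[] {
  let input: string[];
  if (platform === "linux") {
    // Assumption: PulseAudio input with its "default" source.
    input = ["-f", "pulse", "-i", micDevice ?? "default"];
  } else if (platform === "darwin") {
    // Assumption: ":0" selects the first avfoundation audio device.
    input = ["-f", "avfoundation", "-i", micDevice ?? ":0"];
  } else {
    // Default device name documented in this README.
    input = ["-f", "dshow", "-i", micDevice ?? "audio=麦克风"];
  }
  return [
    "-y", // overwrite the output file if it exists
    ...input,
    "-t", String(seconds), // stop after the requested duration
    "-ac", "1", // mono
    "-ar", "16000", // 16 kHz sample rate
    outFile,
  ];
}
```

The resulting array can be handed to a process spawner as the arguments for `ffmpeg`; passing an array rather than a single string avoids quoting problems with device names that contain spaces.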

npm Dependencies

The following peer dependencies are provided by pi automatically:

@earendil-works/pi-coding-agent — pi extension API
@earendil-works/pi-tui — pi TUI components
typebox — type validation

Optional

whisper.cpp — for local STT (faster, no API calls)
- Set path: /voice-config → Whisper.cpp Path
- Download model: whisper-cpp-download-ggml-model base

How It Works

TTS Flow

Text → MiMo TTS API → WAV audio → platform player (paplay/aplay/afplay/ffplay/mpv)

STT Flow

Microphone → parecord/ffmpeg → WAV file → whisper.cpp or MiMo API → Text

Live Mode Flow

1. /live → Start recording
2. User speaks
3. /live reply → Stop recording → Transcribe → Send to LLM
4. LLM responds → Auto-speak reads response
5. Auto-start next recording

Feedback & Contributing

This project is actively maintained. If you have questions, bug reports, or feature requests:

🐛 Open an issue
🔀 Submit a pull request
💬 Share your experience and suggestions

Contributions of any kind are welcome!

License

MIT