@codexstar/pi-listen

Voice in + voice out for Pi CLI — hold-to-talk STT (Deepgram or 19 offline models) plus TTS (Kitten Nano, Piper, Kokoro, or Deepgram Aura)

Install @codexstar/pi-listen from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:@codexstar/pi-listen

Package: @codexstar/pi-listen
Version: 7.2.2
Published: May 1, 2026
Downloads: 820/mo · 145/wk
Author: engaze
License: MIT
Types: extension
Size: 662.1 KB
Dependencies: 0 dependencies · 2 peers

Pi manifest JSON

{
  "extensions": [
    "./extensions/voice.ts"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-listen

Hold-to-talk voice input for Pi. Cloud streaming via Deepgram or fully offline with local models.

v7.0.0: World-class TTS UX

Pick TTS models from the Speak tab in /voice-settings (no more JSON editing)
Auto-download on selection, with progress and resume-on-interrupt
Voice picker for every backend
First-run onboarding with a smart-default recommendation based on your system locale
ttsAutoSpeak: true now works: the agent's responses are spoken automatically, with code-block stripping and rate limiting
/voice-speak-info diagnostic command reports the TTS state
All v6 features: 14 local models from the 25 MB Kitten Nano up, Deepgram Aura cloud, region-strict language matching, sentence-aware chunking

See the full changelog for details.

Setup (2 minutes)

1. Install the extension

# In a regular terminal (not inside Pi)
pi install npm:@codexstar/pi-listen

2. Choose your backend

pi-listen supports two transcription backends:

	Deepgram (cloud)	Local models (offline)
How it works	Live streaming — text appears as you speak	Batch mode — transcribes after you finish recording
Setup	API key required	No API key, models auto-download on first use
Internet	Required	Not required after model download
Latency	Real-time interim results	2–10 seconds after recording stops
Languages	56+ with live streaming	Depends on model (1–57 languages)
Cost	$200 free credit (lasts 6–12 months for most developers)	Free forever

Run /voice-settings inside Pi to choose your backend and configure everything from one panel.

Option A: Deepgram (recommended for live streaming)

export DEEPGRAM_API_KEY="your-key-here"    # add to ~/.zshrc or ~/.bashrc
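
As a rough illustration of what reading the key at runtime might look like, the sketch below resolves it from an environment map. The name `resolveDeepgramKey` and the trimming behaviour are assumptions made for this example, not pi-listen's actual resolver.

```typescript
// Hypothetical helper: not part of pi-listen's published API.
type Env = Record<string, string | undefined>;

function resolveDeepgramKey(env: Env): string | undefined {
  const raw = env["DEEPGRAM_API_KEY"];
  if (raw === undefined) return undefined;
  const key = raw.trim();
  // Treat an empty or whitespace-only value as "not set".
  return key.length > 0 ? key : undefined;
}
```

In a Node process this would be called as `resolveDeepgramKey(process.env)`.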

Option B: Local models (fully offline)

No setup needed — run /voice-settings, switch backend to Local, and select a model. It downloads automatically.

Note: Local models use batch mode — they transcribe after you finish recording, not while you speak. For live streaming as you speak, use Deepgram.

3. Open Pi

On first launch, pi-listen checks your setup and tells you what's ready:

Backend configured (Deepgram key or local model)
Audio capture tool detected (sox, ffmpeg, or arecord)
If everything checks out, voice activates immediately

Audio capture

pi-listen auto-detects your audio tool. No manual install needed if you already have sox or ffmpeg.

Priority	Tool	Platforms	Install
1	SoX (`rec`)	macOS, Linux, Windows	`brew install sox` / `apt install sox` / `choco install sox`
2	ffmpeg	macOS, Linux, Windows	`brew install ffmpeg` / `apt install ffmpeg`
3	arecord	Linux only	Pre-installed (ALSA)
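
The fallback order in the table can be sketched as a first-match search. This is a minimal sketch under assumptions: `pickCaptureTool` and the `isAvailable` probe are hypothetical names, and how pi-listen actually probes for each tool is not documented here.

```typescript
// Capture tools in the priority order from the table above.
const CAPTURE_TOOLS = ["sox", "ffmpeg", "arecord"] as const;
type CaptureTool = (typeof CAPTURE_TOOLS)[number];

// `isAvailable` is a hypothetical probe (for example, a PATH lookup).
function pickCaptureTool(
  isAvailable: (tool: CaptureTool) => boolean,
): CaptureTool | undefined {
  // Array.prototype.find returns the first element that passes the probe.
  return CAPTURE_TOOLS.find((tool) => isAvailable(tool));
}
```

Passing the probe in as a function keeps the priority logic testable without touching the real system.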

Settings Panel

All configuration lives in one place: /voice-settings. Four tabs cover everything you need.

General — backend, language, scope

Toggle between Deepgram (cloud, live streaming) and Local (offline, batch mode). Change language, scope, and enable/disable voice — all with keyboard shortcuts.

Models — browse, search, install

Browse 19 models from Parakeet, Whisper, Moonshine, SenseVoice, and GigaAM. Each model shows accuracy and speed ratings (●●●●○/●●●●○), fitness badges, and download status. Fuzzy search to find models fast. Press Enter to activate and download.
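
The README does not say which fuzzy-matching algorithm the Models tab uses. One common approach is case-insensitive subsequence matching, sketched below; `fuzzyMatch` is a hypothetical name and the real ranking may differ.

```typescript
// Returns true when every character of `query` appears in `candidate`
// in the same order (case-insensitive), not necessarily adjacent.
function fuzzyMatch(query: string, candidate: string): boolean {
  const q = query.toLowerCase();
  const c = candidate.toLowerCase();
  let qi = 0;
  for (let ci = 0; ci < c.length && qi < q.length; ci++) {
    if (c[ci] === q[qi]) qi++;
  }
  return qi === q.length;
}

const models = ["Parakeet TDT v3", "Whisper Turbo", "Moonshine Base"];
// Only "Whisper Turbo" contains w, h, t, b in that order.
const hits = models.filter((name) => fuzzyMatch("whtb", name));
```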

Downloaded — manage installed models

See what's installed, total disk usage, and which model is active. Press Enter to activate, x to delete. Models from Handy are auto-detected and can be imported without re-downloading.

Device — hardware profile and dependencies

See your hardware profile (RAM, CPU, GPU), dependency status (sherpa-onnx runtime), available disk space, and total downloaded models. Model recommendations are based on this profile.

Usage

Keybindings

Action	Key	Notes
Record to editor	Hold `SPACE` (≥1.2s)	Release to finalize. Pre-records during warmup so you don't miss words.
Toggle recording	`Ctrl+Shift+V`	Works in all terminals — press to start, press again to stop.
Clear editor	`Escape` × 2	Double-tap within 500ms to clear all text.
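
The double-tap Escape rule in the table above can be sketched as a small stateful detector. This is an illustration only: `createDoubleTapDetector` is a hypothetical name, and whether the real 500ms boundary is inclusive is an assumption.

```typescript
// Double-tap window from the keybindings table.
const DOUBLE_TAP_MS = 500;

// Returns a handler that reports true on the second Escape press
// when it arrives within the window. Timestamps are in milliseconds.
function createDoubleTapDetector(windowMs: number = DOUBLE_TAP_MS) {
  let lastPress: number | undefined;
  return (now: number): boolean => {
    const isDouble = lastPress !== undefined && now - lastPress <= windowMs;
    // Reset after a double-tap so a third press starts a fresh pair.
    lastPress = isDouble ? undefined : now;
    return isDouble;
  };
}
```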

How recording works

Hold SPACE — warmup countdown appears, audio capture starts immediately (pre-recording)
Keep holding — live transcription streams into the editor (Deepgram) or audio buffers (local)
Release SPACE — recording continues for 1.5s (tail recording) to catch your last word, then finalizes
Text appears in the editor, ready to send
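
The timing in these steps can be expressed as a pure function of the key-down and key-up timestamps. This is a minimal sketch: `resolveHold` is a hypothetical name, and the assumption that a short press discards the pre-recorded audio is not confirmed by the README.

```typescript
const HOLD_THRESHOLD_MS = 1200; // minimum hold, from the keybindings table
const TAIL_MS = 1500;           // tail recording after release

type HoldResult =
  | { kind: "ignored" } // released before the threshold
  | { kind: "recorded"; captureStartMs: number; finalizeAtMs: number };

function resolveHold(pressMs: number, releaseMs: number): HoldResult {
  if (releaseMs - pressMs < HOLD_THRESHOLD_MS) {
    // Assumption: a short press is an ordinary space, so any
    // pre-recorded audio is thrown away.
    return { kind: "ignored" };
  }
  return {
    kind: "recorded",
    captureStartMs: pressMs,           // pre-recording: capture begins at key-down
    finalizeAtMs: releaseMs + TAIL_MS, // tail recording catches the last word
  };
}
```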

Commands

Command	Description
`/voice-settings`	Settings panel — backend, models, language, scope, device
`/voice-models`	Settings panel (Models tab)
`/voice-speak <text>`	Speak text out loud (TTS)
`/voice-speak-test`	Speak a sample sentence
`/voice-speak-toggle`	Enable / disable TTS
`/voice-autosubmit [on|off]`	Turn auto-submit on or off
`/voice-speak-models`	Browse / install TTS voice models
`/voice-speak-info`	Diagnose TTS state
`/voice-help`	Keyboard + command reference (or press `F1`)
`/voice test`	Full diagnostics — audio tool, mic, API key
`/voice on` / `off`	Enable or disable voice
`/voice dictate`	Continuous dictation (no key hold)
`/voice stop`	Stop active recording or dictation
`/voice history`	Recent transcriptions
`/voice`	Toggle on/off

v7.1 keyboard

While in the settings panel:

Key	Action
`← →`	switch tab
`↑ ↓`	navigate row (skips group headings)
`↵`	select / activate
`esc`	back to main / close panel
`type`	filter (search)
`bksp`	clear last search char

While an install widget or playback indicator is mounted (no overlay in front):

Key	Action
`esc`	cancel active install (most-recent first), then stop playback
`F1`	open help overlay (always available)
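
The Escape priority described above (newest install first, then playback) can be modelled as a single decision function. The types and the name `nextEscAction` are hypothetical and exist only to illustrate the ordering.

```typescript
// Hypothetical model of what Escape does when no overlay is open.
interface EscState {
  activeInstalls: string[]; // oldest first, newest last
  playing: boolean;
}

type EscAction =
  | { kind: "cancel-install"; id: string }
  | { kind: "stop-playback" }
  | { kind: "none" };

function nextEscAction(state: EscState): EscAction {
  const newest = state.activeInstalls[state.activeInstalls.length - 1];
  // Installs take priority over playback, most recent first.
  if (newest !== undefined) return { kind: "cancel-install", id: newest };
  if (state.playing) return { kind: "stop-playback" };
  return { kind: "none" };
}
```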

Local Models

19 models across 5 families. Sorted by quality — best models first.

Top picks

Model	Accuracy	Speed	Size	Languages	Notes
Parakeet TDT v3	●●●●○	●●●●○	671 MB	25 (auto-detect)	Best overall. WER 6.3%.
Parakeet TDT v2	●●●●●	●●●●○	661 MB	English	Best English. WER 6.0%.
Whisper Turbo	●●●●○	●●○○○	1.0 GB	57	Broadest language support.

Fast and lightweight

Model	Accuracy	Speed	Size	Languages	Notes
Moonshine v2 Tiny	●●○○○	●●●●●	43 MB	English	34ms latency. Raspberry Pi friendly.
Moonshine Base	●●●○○	●●●●●	287 MB	English	Handles accents well.
SenseVoice Small	●●●○○	●●●●●	228 MB	zh/en/ja/ko/yue	Best for CJK languages.

Specialist

Model	Accuracy	Speed	Size	Languages	Notes
GigaAM v3	●●●●○	●●●●○	225 MB	Russian	50% lower WER than Whisper on Russian.
Whisper Medium	●●●●○	●●●○○	946 MB	57	Good accuracy, medium speed.
Whisper Large v3	●●●●○	●○○○○	1.8 GB	57	Highest Whisper accuracy. Slow on CPU.

Plus 8 language-specialized Moonshine v2 variants for Japanese, Korean, Arabic, Chinese, Ukrainian, Vietnamese, and Spanish.

How local models work

Hold SPACE → audio captured to memory buffer
                ↓
Release SPACE → buffer sent to sherpa-onnx (in-process)
                ↓
         ONNX inference on CPU (2–10 seconds)
                ↓
         Final transcript inserted into editor
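
The "captured to memory buffer" step in the diagram can be sketched as a chunk collector that produces one contiguous array on release. This is an illustration under assumptions: `RecordingBuffer` is a hypothetical class, and `Float32Array` samples are assumed because the actual audio format is not stated.

```typescript
// Collects audio chunks while the key is held, then returns one
// contiguous buffer that a transcriber could consume on release.
class RecordingBuffer {
  private chunks: Float32Array[] = [];
  private length = 0;

  push(chunk: Float32Array): void {
    this.chunks.push(chunk);
    this.length += chunk.length;
  }

  // Concatenate all chunks into one array and reset the buffer.
  flush(): Float32Array {
    const out = new Float32Array(this.length);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    this.chunks = [];
    this.length = 0;
    return out;
  }
}
```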

Models download automatically on first use. Downloads are resumable, verified after completion, and deduplicated (no double-downloads). The settings panel shows real-time download progress with speed and ETA.
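
The README does not describe how resuming is implemented. A common technique is an HTTP Range request that starts at the number of bytes already on disk, sketched below with the hypothetical helper `resumeHeaders`.

```typescript
// Builds request headers for resuming a download from `bytesOnDisk`.
function resumeHeaders(bytesOnDisk: number): Record<string, string> {
  if (!Number.isInteger(bytesOnDisk) || bytesOnDisk < 0) {
    throw new RangeError("bytesOnDisk must be a non-negative integer");
  }
  // Nothing saved yet: request the whole file.
  if (bytesOnDisk === 0) return {};
  // Open-ended byte range: "from this offset to the end".
  return { Range: `bytes=${bytesOnDisk}-` };
}
```

A server that honours the range replies with 206 Partial Content. A plain 200 means the range was ignored, so the partial file should be discarded and the download restarted.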

Models from Handy (~/Library/Application Support/com.pais.handy/models/) are auto-detected and can be imported via symlink (zero disk duplication).

Features

Feature	Description
Dual backend	Deepgram (cloud, live streaming) or local models (offline, batch) — switch in settings
19 local models	Parakeet, Whisper, Moonshine, SenseVoice, GigaAM — with accuracy/speed ratings
Unified settings panel	One overlay panel for all configuration — `/voice-settings`
Device-aware recommendations	Scores models against your hardware. Only best-in-class models get [recommended].
Enterprise download pipeline	Pre-checks (disk, network, permissions), live progress with speed/ETA, post-verification
Handy integration	Auto-detects models from Handy app, imports via symlink
Audio fallback chain	Tries sox, ffmpeg, arecord in order
Pre-recording	Audio capture starts during warmup — you never miss the first word
Tail recording	Keeps recording 1.5s after release so your last word isn't clipped
Live streaming	Deepgram Nova 3 WebSocket — interim transcripts as you speak
56+ languages	Deepgram: 56+ with live streaming. Local: up to 57 depending on model.
Continuous dictation	`/voice dictate` for long-form input without holding keys
Typing cooldown	Space holds within 400ms of typing are ignored
Sound feedback	macOS system sounds for start, stop, and error events
Cross-platform	macOS, Windows, Linux — Kitty protocol + non-Kitty fallback

Architecture

extensions/voice.ts                 Main extension: state machine, recording, UI, settings panel
extensions/voice/config.ts          Config loading, saving, migration
extensions/voice/onboarding.ts      First-run wizard, language picker
extensions/voice/deepgram.ts        Deepgram URL builder, API key resolver
extensions/voice/local.ts           Model catalog (19 models), in-process transcription
extensions/voice/device.ts          Device profiling: RAM, GPU, CPU, container detection
extensions/voice/model-download.ts  Download manager: resume, progress, verification, Handy import
extensions/voice/sherpa-engine.ts   sherpa-onnx bindings: recognizer lifecycle, inference
extensions/voice/settings-panel.ts  Settings panel: Component interface, overlay, 4 tabs

Configuration

Settings are stored in Pi's settings files under the voice key:

Scope	Path
Global	`~/.pi/agent/settings.json`
Project	`<project>/.pi/settings.json`

{
  "voice": {
    "version": 2,
    "enabled": true,
    "language": "en",
    "backend": "local",
    "localModel": "parakeet-v3",
    "scope": "global",
    "onboarding": { "completed": true, "schemaVersion": 2 }
  }
}
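
For readers who consume this file from their own tooling, the shape above could be typed roughly as follows. The field set is inferred from the example only, and the `"deepgram"` and `"project"` values are assumptions (the example shows only `"local"` and `"global"`). `readVoiceConfig` is a hypothetical helper, not part of pi-listen.

```typescript
// Shape inferred from the example above; the real schema may have more fields.
interface VoiceConfig {
  version: number;
  enabled: boolean;
  language: string;
  backend: "deepgram" | "local"; // "deepgram" is an assumed value
  localModel?: string;
  scope: "global" | "project";   // "project" is an assumed value
  onboarding: { completed: boolean; schemaVersion: number };
}

// Extracts the `voice` key from a settings file's JSON text.
function readVoiceConfig(settingsJson: string): VoiceConfig | undefined {
  const parsed: unknown = JSON.parse(settingsJson);
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const voice = (parsed as Record<string, unknown>)["voice"];
  if (typeof voice !== "object" || voice === null) return undefined;
  const v = voice as Partial<VoiceConfig>;
  // Minimal validation: only the fields this sketch relies on.
  if (typeof v.enabled !== "boolean" || typeof v.language !== "string") return undefined;
  if (v.backend !== "deepgram" && v.backend !== "local") return undefined;
  return v as VoiceConfig;
}
```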

DEEPGRAM_API_KEY from your shell is used at runtime and is not copied back into ~/.pi/agent/settings.json. If you paste a key during onboarding, that counts as an explicit save, and the key is written to ~/.env.secrets or ~/.zshrc rather than to Pi's settings file.

Troubleshooting

Run /voice test inside Pi for full diagnostics.

Problem	Solution
"DEEPGRAM_API_KEY not set"	Get a key → `export DEEPGRAM_API_KEY="..."` in `~/.zshrc`
"No audio capture tool found"	`brew install sox` or `brew install ffmpeg`
Space doesn't activate voice	Run `/voice-settings` — voice may be disabled
Local model not transcribing	Check `/voice-settings` → Device tab for sherpa-onnx status
Download failed	Partial downloads auto-resume on retry. Check disk space in Device tab.
`dyld: Library not loaded: libsimdjson` on macOS	Homebrew Node ABI mismatch — run `brew reinstall node` or switch to version-managed Node (`mise`, `fnm`, `nvm`)

Security

Cloud STT — audio is sent to Deepgram for transcription (Deepgram backend only)
Local STT — audio never leaves your machine (local backend)
No telemetry — pi-listen does not collect or transmit usage data
API key — stored in env var or Pi settings, never logged

See SECURITY.md for vulnerability reporting.