pi-xai-voice
Pi extension for xAI voice and audio workflows
Package details
Install pi-xai-voice from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-xai-voice- Package
pi-xai-voice- Version
0.1.0- Published
- Apr 22, 2026
- Downloads
- 122/mo · 10/wk
- Author
- luxusai
- License
- MIT
- Types
- extension
- Size
- 138.8 KB
- Dependencies
- 0 dependencies · 0 peers
Pi manifest JSON
{
"extensions": [
"./index.ts"
]
}Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
Pi xAI Voice
Pi extension for xAI voice workflows.
Install
Via npm registry:
pi install pi-xai-voice
Via git:
pi install github:luxus/pi-xai-voice
Why this exists
xAI shipped dedicated Grok STT and TTS APIs here: Introducing Grok 3 Speech.
The original motivation for this project was not “build every possible voice feature inside Pi.” The main use case was Telegram and bot integrations.
For that use case, voice often feels much more natural than typing. Talking to a bot is faster, lower friction, and more conversational than constantly writing messages by hand. Once STT and TTS become good enough on price and latency, voice stops being a gimmick and starts feeling like the right interface.
This repository packages that idea in two layers:
- a reusable xAI voice layer for STT, TTS, voice listing, and realtime helpers
- Pi integrations on top, so the same APIs can also be used directly inside the editor
The Pi-specific features exist because the core voice plumbing was already useful and easy to expose here as well:
- fast voice input/output loop directly in the editor
- local playback and local mic capture for low-friction workflow
- configurable live transcript polling so you can trade responsiveness against cost
- explicit STT on/off, ghost text, shortcut mode, and quality settings instead of hardcoded defaults
So the short version is: the main reason for this project is voice-first bot usage, especially Telegram-style flows. The Pi extension features are the practical extra integrations that fell out of building that core voice layer properly.
A concrete downstream target for this work is luxus/pi-telegram. That project builds on llblab/pi-telegram, which itself is a fork of badlogic/pi-telegram.
In practice, that Telegram add-on is where the voice-first idea becomes especially compelling. It turns the bot into something closer to a spoken assistant than a text-only chat. Through that integration, you can use this voice layer to:
- switch the xAI voice/model used for spoken replies
- run a continuous voice mode where voice messages can trigger voice replies
- receive direct spoken answers as Telegram voice messages
- use xAI speech tags so spoken output sounds more human and expressive, not just flat text readout
- make the interaction feel more like talking to an assistant than typing commands into a chat box
Speech tags are short inline cues for delivery style. In the Telegram integration, the actual xAI-style tags used in source include tags like [pause], [long-pause], [laugh], [giggle], [sigh], <whisper>...</whisper>, <slow>...</slow>, and <emphasis>...</emphasis>. Examples:
We shipped it. [laugh]
[pause] I think this will work.
<whisper>this part stays quiet</whisper>
<slow><soft>Okay — let’s do this carefully.</soft></slow>
<emphasis>The build is finally green.</emphasis>
That matters for bots because it makes spoken replies feel less robotic. For Telegram voice replies in particular, this helps the assistant sound more like a real voice and less like a flat screen reader.
The tag set is intentionally constrained. The Telegram integration uses an explicit allowlist instead of arbitrary free-form tags, so spoken output stays predictable and compatible with provider behavior.
At the moment, cloning and adapting projects is often faster than waiting for upstream alignment, so this repository intentionally keeps its own fork path open. Upstream adoption would be nice, but it is not required for this extension to be useful.
Features
text_to_speech— unary/v1/tts, saves audio to temp file, optional local playback withplay: truelist_tts_voices— list available xAI voicesspeech_to_text— unary/v1/sttfrom local file or remote URLcreate_realtime_voice_client_secret— mint short-lived browser/mobile token for/v1/realtimerealtime_voice_text_turn— one-shot text roundtrip over/v1/realtime, saves returned PCM as WAVcheck_xai_voice_health— verify auth, base URL, defaults, visible models/xai-speak [text]— speak provided text, current editor text, or last assistant reply/xai-record— toggle microphone capture, transcribe, paste into editor/xai-voice-settings— configure voice defaults, STT toggle, shortcut, live transcript, polling, ghost textAlt+Mby default — editor voice shortcut; configurable in/xai-voice-settings/xai-voice-health— command alias for health check
Structure
xai-client.ts # shared HTTP client
xai-config.ts # shared config loading
xai-media-shared.ts # shared helpers/constants
xai-image.ts # copied from pi-xai-imagine
xai-video.ts # copied from pi-xai-imagine
xai-understanding.ts # copied from pi-xai-imagine
xai-voice.ts # voice-specific API implementation
local-audio.ts # local mic capture + playback helpers
voice-editor.ts # push-to-talk editor wrapper
index.ts # Pi tool registration
Config
Shared xAI namespace:
{
"xai": {
"apiKey": "xai-...",
"baseUrl": "https://api.x.ai/v1",
"voice": {
"defaultVoice": "eve",
"defaultLanguage": "en",
"ephemeralTokenSeconds": 300,
"microphoneDeviceIndex": 0,
"sttLanguage": "de",
"shortcut": "alt+m",
"shortcutMode": "push-to-talk",
"sttEnabled": true,
"liveTranscriptEnabled": true,
"liveTranscriptPollingMs": 1000,
"liveTranscriptGhostText": true
}
}
}
Config lookup order:
XAI_API_KEY./.pi/settings.json~/.pi/agent/settings.json
Notes
- xAI voice docs currently expose fixed TTS/STT/realtime endpoints — no request-level model selector used here.
realtime_voice_text_turnis smoke-test style. No live mic streaming tool yet.- Microphone shortcut uses local
ffmpegcapture on macOS via AVFoundation, then sends saved WAV into/v1/stt. - Default shortcut is
Alt+M. Shortcut and mode (push-to-talkortoggle) are configurable in/xai-voice-settings. - Push-to-talk depends on terminal key release support. Fallback:
/xai-recordor switch shortcut mode totoggle. - Playback uses local
afplayon macOS. New playback stops previous playback. - Temp audio files land under OS temp dir in
pi-xai-voice/audio/. - Voice settings can be saved per-project (
.pi/settings.json) or globally (~/.pi/agent/settings.json). - Live transcript preview, polling interval, STT enable/disable, language hint, and ghost text are configurable in
/xai-voice-settings.
Usage
Low-level runtime example:
import { XaiClient, getRequiredXaiApiKey, resolveXaiConfig } from "./xai-media.ts";
const config = resolveXaiConfig();
const { apiKey } = getRequiredXaiApiKey(config);
const client = new XaiClient({ apiKey, baseUrl: config.xai.baseUrl });
const health = await client.checkHealth();
Dev
bun install
bunx tsgo -p tsconfig.json --noEmit