pi-vision-tool
Pi Agent extension that adds a describe_image tool, letting non-multimodal models delegate image analysis to a vision-capable model (like Qwen VL)
Package details
Install pi-vision-tool from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-vision-tool
- Package: pi-vision-tool
- Version: 1.3.6
- Published: Jun 10, 2026
- Downloads: not available
- Author: xezpeleta
- License: MIT
- Types: extension
- Size: 198.1 KB
- Dependencies: 0 dependencies · 2 peers
Pi manifest JSON
{
"extensions": [
"./extensions"
]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
Pi Vision Tool
A Pi Agent extension that adds a describe_image tool, letting non-multimodal models (like DeepSeek V4 Pro, GPT-5 Codex without image support, etc.) delegate image analysis to a vision-capable model.
Features
The calling model has full control over every call, deciding what matters for each image:
| Feature | Parameter | What the model controls |
|---|---|---|
| Compression | compress | true for faster/general use, false for pixel-perfect accuracy |
| Reasoning depth | reasoning | "off" for instant answers, "high"/"xhigh" for complex analysis |
| Prompt | prompt | Free-text instruction: "describe", "extract text", "find the bug", ... |
| Image source | image_path | File path, data URL, or raw base64 |
This means the model itself decides the cost/quality tradeoff per call — no pre-configuration needed. Just like a developer chooses between a quick cat and a deep git bisect, the model picks the right tool settings for the job.
How it works
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DeepSeek Pro │────▶│ describe_image │────▶│ Qwen VL / any │
│ (no vision) │ │ (this tool) │ │ vision model │
│ │◀────│ │◀────│ │
│ "that's red" │ │ text response │ │ "it's red" │
└──────────────────┘ └──────────────────┘ └──────────────────┘
- The calling model decides it needs to understand an image
- It calls describe_image with an image path and a specific prompt
- The tool sends the image + prompt to your vision model
- The vision model's text response is returned to the calling model as a tool result
- The calling model integrates the result into its reasoning
Reasoning / extended thinking
For vision models with reasoning: true, the calling model can choose the reasoning effort per call via the reasoning parameter:
| Level | When to use |
|---|---|
| off | Simple queries: "what color is this?" |
| minimal | Quick checks: "is there an error on this screenshot?" |
| low | Basic descriptions, text extraction |
| medium | UI analysis, layout descriptions |
| high | Architecture diagrams, complex screenshots |
| xhigh | Bug hunting, multi-step visual reasoning |
When omitted, the tool uses the configured default (off by default). The calling model should decide based on task complexity — similar to how it picks compress: true/false. Read the models.md thinking level map section for per-model tuning.
Important: For non-OpenAI vision models (Qwen, llama.cpp, DeepSeek, etc.), you must set compat.thinkingFormat in models.json so the tool sends the correct parameter. Without it, the tool defaults to reasoning_effort (OpenAI format), which your provider may reject.
{
"id": "qwen3.5",
"reasoning": true,
"input": ["text", "image"],
"compat": {
"thinkingFormat": "qwen"
}
}
Supported formats:
| Format | API parameter sent | Use case |
|---|---|---|
| (default, no compat) | reasoning_effort | OpenAI, any OpenAI-compatible proxy |
| qwen | enable_thinking | Qwen via llama.cpp, vLLM, Ollama |
| qwen-chat-template | chat_template_kwargs.enable_thinking | llama-server with Qwen chat template |
| deepseek | reasoning: { effort } | DeepSeek API |
| openrouter | reasoning: { effort } | OpenRouter |
| together | reasoning: { enabled: boolean } + reasoning_effort | Together AI |
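To make the table concrete, here is a minimal TypeScript sketch of how a thinkingFormat value could translate into request fields. It is not the extension's actual code: the function name reasoningParams, the "openai" label standing in for "no compat set", and the rule that every level except "off" counts as enabled are all assumptions made for illustration.

```typescript
// Hypothetical helper illustrating the "Supported formats" table.
// "openai" stands in for the default case (no compat.thinkingFormat set).
type ThinkingFormat =
  | "openai"
  | "qwen"
  | "qwen-chat-template"
  | "deepseek"
  | "openrouter"
  | "together";

type Level = "off" | "minimal" | "low" | "medium" | "high" | "xhigh";

function reasoningParams(
  format: ThinkingFormat,
  level: Level,
): Record<string, unknown> {
  // Assumption: boolean-style formats treat any level other than "off" as enabled.
  const enabled = level !== "off";
  switch (format) {
    case "qwen":
      return { enable_thinking: enabled };
    case "qwen-chat-template":
      return { chat_template_kwargs: { enable_thinking: enabled } };
    case "deepseek":
    case "openrouter":
      return { reasoning: { effort: level } };
    case "together":
      return { reasoning: { enabled }, reasoning_effort: level };
    case "openai":
      return { reasoning_effort: level };
  }
}

console.log(JSON.stringify(reasoningParams("qwen", "high")));
console.log(JSON.stringify(reasoningParams("together", "off")));
```

Whether a real provider accepts the level string as-is (for example "off" as an effort value) depends on that provider, which is where thinkingLevelMap comes in.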
Additionally, thinkingLevelMap in models.json maps pi's level names to provider-specific values.
Use this when a provider uses non-standard level strings (e.g., Kimi K2.6 uses "none" instead of "off"):
{
"id": "Kimi-K2.6",
"reasoning": true,
"input": ["text", "image"],
"thinkingLevelMap": {
"off": "none",
"xhigh": null
}
}
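The mapping step can be sketched as follows. This is an illustration only: mapLevel is a hypothetical name, and two behaviours are assumptions rather than documented facts, namely that unmapped levels pass through unchanged and that a null entry marks a level the provider does not support.

```typescript
type LevelMap = Record<string, string | null>;

// Hypothetical helper. Assumptions: levels missing from the map pass through
// unchanged; a null entry means "unsupported" and yields undefined.
function mapLevel(level: string, map: LevelMap = {}): string | undefined {
  if (!Object.prototype.hasOwnProperty.call(map, level)) return level;
  const mapped = map[level];
  return mapped === null ? undefined : mapped;
}

const kimiMap: LevelMap = { off: "none", xhigh: null };
console.log(mapLevel("off", kimiMap)); // prints: none
console.log(mapLevel("medium", kimiMap)); // prints: medium
console.log(mapLevel("xhigh", kimiMap)); // prints: undefined
```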
Set the default reasoning level via:
/vision config reasoning-effort medium
# or via env var:
export PI_VISION_REASONING_EFFORT=medium
Installation
Via npm (recommended)
pi install npm:pi-vision-tool
This is the primary installation method and the way it's listed in the Pi package gallery.
Via git
pi install git:github.com/xezpeleta/pi-vision-tool
Via local path
pi install /path/to/pi-vision-tool
Quick test (no install)
pi -e /path/to/pi-vision-tool
Configuration
1. Add a vision model to ~/.pi/agent/models.json
{
"providers": {
"my-vision-provider": {
"baseUrl": "https://your-llm-server/v1",
"apiKey": "$VISION_API_KEY",
"api": "openai-completions",
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false
},
"models": [
{
"id": "my-vision-model",
"reasoning": true,
"input": ["text", "image"]
}
]
}
}
}
The input: ["text", "image"] field is required — it tells Pi the model supports images.
2. Set the API key in ~/.pi/agent/auth.json
{
"my-vision-provider": {
"type": "api_key",
"key": "sk-your-key-here"
}
}
3. Configure the vision model
Recommended: Use the /vision command (persistent)
In any Pi session with the extension loaded:
/vision config provider my-vision-provider
/vision config model my-vision-model
Settings are saved to ~/.pi/agent/vision-tool.json and persist across all sessions. Changes take effect immediately — no /reload or restart needed.
Run /vision with no arguments to see current configuration.
Enable / disable
/vision on
/vision off
Running /vision off disables the tool entirely: the 👁 indicator disappears from the footer and any describe_image call returns an error. Use /vision on to re-enable it. The toggle is persisted across sessions.
Legacy: Environment variables
export PI_VISION_PROVIDER=my-vision-provider
export PI_VISION_MODEL=my-vision-model
Env vars work but must be set before starting Pi and don't persist between sessions. When a config file exists, it takes priority over env vars.
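The precedence rule above can be sketched in TypeScript. The names VisionConfig and resolveVisionConfig are hypothetical, and merging field by field (rather than ignoring env vars entirely once a file exists) is an assumption; only the variable names PI_VISION_PROVIDER and PI_VISION_MODEL come from this README.

```typescript
interface VisionConfig {
  provider?: string;
  model?: string;
}

// Hypothetical resolver: values from the saved config file win over
// environment variables. Per-field fallback is an assumption.
function resolveVisionConfig(
  fileConfig: VisionConfig | undefined,
  env: Record<string, string | undefined>,
): VisionConfig {
  return {
    provider: fileConfig?.provider ?? env.PI_VISION_PROVIDER,
    model: fileConfig?.model ?? env.PI_VISION_MODEL,
  };
}

const resolved = resolveVisionConfig(
  { provider: "my-vision-provider" },
  { PI_VISION_PROVIDER: "other-provider", PI_VISION_MODEL: "my-vision-model" },
);
console.log(resolved);
```

In this example the provider comes from the config file and the model falls back to the environment variable.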
4. (Optional) Install sharp for image compression
npm install sharp
If sharp is available, images are automatically compressed before sending:
- Downscaled to 1568px max dimension (screenshots, high-res photos)
- Alpha channel stripped (RGBA → RGB)
- Lossless PNG converted to JPEG (quality 85)
This reduces payload size ~4x and speeds up responses significantly.
Without sharp, images are sent as raw bytes.
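As a rough illustration of the downscaling step, the sketch below computes the target size for a 1568px limit. fitWithin is a hypothetical helper, not the extension's code; preserving the aspect ratio, rounding to whole pixels, and never upscaling are assumptions about how the resize behaves.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale a size down so its longest side is at most maxDim, keeping the
// aspect ratio. Sizes already within the limit are returned unchanged.
function fitWithin(size: Size, maxDim = 1568): Size {
  const longest = Math.max(size.width, size.height);
  if (longest <= maxDim) return { ...size };
  const scale = maxDim / longest;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

console.log(fitWithin({ width: 3136, height: 1764 })); // { width: 1568, height: 882 }
console.log(fitWithin({ width: 800, height: 600 })); // { width: 800, height: 600 }
```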
Compression controls
| Env var | Default | Description |
|---|---|---|
| PI_VISION_MAX_DIM | 1568 | Max width/height in pixels before downscaling |
| PI_VISION_JPEG_QUALITY | 85 | JPEG quality (1-100) for converted images |
The calling model controls per-call compression via the compress parameter. Set compress: false when pixel-perfect accuracy is needed (e.g., reading coordinates or detecting small UI elements).
Usage
Once installed, any model in your session will see the describe_image tool. Just reference an image in your prompt and the model will call it automatically.
Example prompts
| What you need | How to ask |
|---|---|
| Description | "Describe everything visible in this screenshot" |
| Pixel coordinates | "Give [x,y,w,h] bounding boxes for all buttons" |
| Text extraction | "Read all visible text, preserving structure" |
| Error analysis | "What error is shown in this terminal screenshot?" |
| UI inspection | "List all interactive elements and their states" |
| Color values | "What hex color is the header bar?" |
| Layout analysis | "Describe the page layout: sidebar, main content, etc." |
| Comparison | "Compare these two screenshots — what changed?" |
For complex analysis, the calling model can set reasoning: "high":
{
"image_path": "/tmp/architecture.png",
"prompt": "Analyze this system architecture diagram in detail",
"compress": true,
"reasoning": "high"
}
Image formats
- File path: /tmp/screenshot.png, ~/Desktop/photo.jpg
- Data URL: data:image/png;base64,iVBORw0KGgo...
- Raw base64: A base64-encoded string over 100 characters
Supported formats: PNG, JPEG, GIF, WebP, BMP.
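One way the three input kinds could be told apart is sketched below. classifyImageInput is a hypothetical function; the 100-character threshold comes from the list above, while the data:image/ prefix test and the base64 alphabet check are assumptions about the detection logic.

```typescript
type ImageSource = "data-url" | "base64" | "file-path";

// Hypothetical classifier for the image_path parameter.
function classifyImageInput(input: string): ImageSource {
  if (input.startsWith("data:image/")) return "data-url";
  // Assumption: raw base64 is longer than 100 characters and uses only the
  // standard base64 alphabet with up to two "=" padding characters.
  const looksBase64 =
    input.length > 100 && /^[A-Za-z0-9+/]+={0,2}$/.test(input);
  return looksBase64 ? "base64" : "file-path";
}

console.log(classifyImageInput("/tmp/screenshot.png")); // prints: file-path
console.log(classifyImageInput("data:image/png;base64,iVBORw0KGgo")); // prints: data-url
console.log(classifyImageInput("A".repeat(120))); // prints: base64
```

A heuristic like this can misclassify a very long path that contains only letters, digits, and slashes, so a real implementation might also check whether the file exists.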
How it works (technical)
The tool:
- Resolves the vision model from Pi's model registry using ctx.modelRegistry.find()
- Resolves the API key via ctx.modelRegistry.getApiKeyAndHeaders()
- Decodes the image (file path, data URL, or raw base64)
- Optionally compresses the image (resize, strip alpha, convert to JPEG) via sharp
- Makes a direct OpenAI-compatible /chat/completions call to the vision model's base URL
- Returns the vision model's text response as the tool result
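The request-building step can be sketched as below, using the widely supported OpenAI-compatible message shape with a text part and an image_url part. buildVisionRequest and the VisionRequest type are hypothetical names, and the extension's real payload may carry additional fields (for example the reasoning parameters described earlier).

```typescript
interface VisionRequest {
  model: string;
  messages: Array<{
    role: "user";
    content: Array<
      | { type: "text"; text: string }
      | { type: "image_url"; image_url: { url: string } }
    >;
  }>;
}

// Hypothetical builder: wraps a prompt and a base64 image into a
// chat-completions body, embedding the image as a data URL.
function buildVisionRequest(
  model: string,
  prompt: string,
  imageBase64: string,
  mimeType = "image/png",
): VisionRequest {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          {
            type: "image_url",
            image_url: { url: `data:${mimeType};base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
}

const body = buildVisionRequest("my-vision-model", "Describe this image", "iVBORw0KGgo");
console.log(JSON.stringify(body, null, 2));
```

A body like this would then be sent as a JSON POST to the provider's base URL plus /chat/completions, with the resolved API key in the request headers.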
License
MIT