glm-vision
Pi extension that gives non-vision GLM models (z.ai) image understanding via GLM-4.6V
Package details
Install glm-vision from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:glm-vision
- Package: glm-vision
- Version: 1.2.0
- Published: May 25, 2026
- Downloads: 334/mo · 334/wk
- Author: eiei114
- License: MIT
- Types: extension
- Size: 45.4 KB
- Dependencies: 0 dependencies · 0 peers
Pi manifest JSON
{
  "extensions": [
    "./src/index.ts"
  ]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
glm-vision
Pi extension that gives non-vision GLM models (z.ai) image understanding by routing images through a GLM vision model.
How it works
When using a z.ai GLM text model (for example glm-5.1) and the read tool encounters one or more image files, glm-vision:
- Intercepts the image data in the order Pi provided it.
- Builds a prompt from the active preset or custom prompt.
- Sends the images together to a GLM vision model (glm-4.6v by default).
- Caches the response by image hash, prompt, and model.
- Returns a combined text description to the main model.
Image file(s) -> read tool -> [glm-vision intercepts]
-> GLM-4.6V describes Image 1, Image 2, ...
-> Combined text description -> main GLM model
This lets non-vision GLM models inspect screenshots, diagrams, scanned text, and error images through a vision-capable sibling model.
Multiple images
When a tool result contains multiple images, glm-vision sends them in their original order and asks the vision model to refer to them as Image 1, Image 2, and so on. The answer includes per-image observations plus any cross-image comparison or combined conclusion the vision model can infer.
Single-image behavior is backward compatible: one image is still described as a normal vision result, now with images: 1 in the result header.
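The labeling instruction described above could be assembled along these lines. This is a minimal sketch, not the extension's actual code: the function name `buildMultiImagePrompt` and the exact wording of the added instruction are assumptions made for illustration.

```typescript
// Hypothetical helper; neither the name nor the instruction text comes from glm-vision's source.
function buildMultiImagePrompt(basePrompt: string, imageCount: number): string {
  // A single image keeps the plain prompt, matching the backward-compatible behavior.
  if (imageCount <= 1) {
    return basePrompt;
  }
  // Labels follow the order in which the images were provided.
  const labels: string[] = [];
  for (let i = 1; i <= imageCount; i++) {
    labels.push(`Image ${i}`);
  }
  return [
    basePrompt,
    `You are given ${imageCount} images in order: ${labels.join(", ")}.`,
    "Describe each image under its label, then add any cross-image comparison or combined conclusion.",
  ].join("\n\n");
}
```

Calling `buildMultiImagePrompt("Describe the image.", 3)` yields the base prompt followed by an instruction that names Image 1, Image 2, and Image 3.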
Limits and fallback behavior
- maxImages controls how many images are sent in one vision request. Default: 4.
- If a tool result contains more than maxImages extractable images, glm-vision sends the first maxImages in order and notes the skipped count in the prompt/result header.
- If no extractable image data is present, glm-vision leaves the tool result unchanged.
- If authentication is missing or the vision request fails, glm-vision returns an error text and preserves the original image blocks so Pi can continue with the normal fallback path.
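The selection rules in this list can be sketched as follows. The content-block shapes (`ImageBlock`, `TextBlock`) and the helper `selectImages` are assumptions for illustration; Pi's real tool-result types may differ.

```typescript
// Assumed block shapes, for illustration only.
interface ImageBlock { type: "image"; data: string; mimeType: string; }
interface TextBlock { type: "text"; text: string; }
type ContentBlock = ImageBlock | TextBlock;

interface ImageSelection {
  selected: ImageBlock[]; // images to send, in original order
  skipped: number;        // how many extractable images were left out
}

// Returns null when there is nothing to describe, so the caller can leave
// the tool result unchanged.
function selectImages(blocks: ContentBlock[], maxImages: number = 4): ImageSelection | null {
  const images = blocks.filter(
    (b): b is ImageBlock => b.type === "image" && b.data.length > 0,
  );
  if (images.length === 0) {
    return null;
  }
  const selected = images.slice(0, maxImages);
  return { selected, skipped: images.length - selected.length };
}
```

With six extractable images and the default limit of 4, this returns the first four images and a skipped count of 2.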
Requirements
- A z.ai account with Coding Plan access.
- Pi with the zai provider configured and authenticated.
- A z.ai model selected in Pi when reading images. glm-vision is inactive for non-zai providers.
Installation
Via npm
pi install npm:glm-vision
Or add to .pi/settings.json:
{
  "packages": ["npm:glm-vision"]
}
From GitHub
pi install git:github.com/eiei114/glm-vision
Or add to .pi/settings.json:
{
  "packages": ["git:github.com/eiei114/glm-vision"]
}
Usage
No setup is required after installation. glm-vision runs automatically when all of these are true:
- The active Pi model uses the zai provider.
- The read tool returns image content.
- glm-vision is enabled.
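Read as a single predicate, the three activation conditions look roughly like this. `ReadContext` and `shouldIntercept` are hypothetical names; how the extension actually inspects Pi's state is not documented here.

```typescript
// Hypothetical summary of the state the extension would need to check.
interface ReadContext {
  provider: string;         // provider of the active Pi model, e.g. "zai"
  toolName: string;         // tool that produced the result
  hasImageContent: boolean; // whether the result carries image data
  enabled: boolean;         // toggled by /glm-vision on and /glm-vision off
}

// All conditions must hold; otherwise the tool result passes through untouched.
function shouldIntercept(ctx: ReadContext): boolean {
  return (
    ctx.provider === "zai" &&
    ctx.toolName === "read" &&
    ctx.hasImageContent &&
    ctx.enabled
  );
}
```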
Example prompt:
Read ./screenshots/checkout-error.png and explain what is wrong with this UI.
glm-vision replaces the raw image result with a text description such as:
[glm-vision: glm-4.6v]
The screenshot shows a checkout form with a red validation message under the card number field...
Commands
| Command | Description |
|---|---|
| /glm-vision or /glm-vision status | Show status, model, prompt mode, cache stats, and active prompt. |
| /glm-vision on | Enable image description. |
| /glm-vision off | Disable image description and forward images as-is. |
| /glm-vision check | Probe z.ai Coding Plan availability for known vision models. |
| /glm-vision check <model> | Probe a new candidate model before adding it. |
| /glm-vision glm-4.6v | Switch to GLM-4.6V (default). |
| /glm-vision glm-4.6v-flash | Switch to GLM-4.6V Flash (lighter). |
| /glm-vision glm-4.6v-flashx | Switch to GLM-4.6V FlashX (lightweight paid tier). |
| /glm-vision glm-5v-turbo | Switch to GLM-5V-Turbo (multimodal coding model). |
| /glm-vision <preset> | Switch prompt preset, e.g. /glm-vision ocr. |
| /glm-vision mode <preset> | Switch prompt preset, e.g. /glm-vision mode ui. |
| /glm-vision prompt | Show active prompt text. |
| /glm-vision prompt <text> | Save and use a custom prompt. |
| /glm-vision reset | Reset model, prompt mode, and cache settings to defaults. |
| /glm-vision cache status | Show cache status and cache file path. |
| /glm-vision cache on | Enable response cache. |
| /glm-vision cache off | Disable response cache without deleting entries. |
| /glm-vision cache clear | Clear cached responses. |
| /glm-vision cache max <n> | Set maximum cache entries and prune older entries. |
Prompt presets
| Preset | Best for | Behavior |
|---|---|---|
| default | General image understanding | Detailed description with text, code, and UI handling. |
| ocr | Screenshots, scans, documents | Exact text transcription with layout preservation. |
| ui | App or website screenshots | Layout, visual hierarchy, controls, labels, states, UX notes. |
| code | Code screenshots | Code extraction, language hints, indentation, visible errors. |
| diagram | Flowcharts, architecture diagrams | Nodes, labels, arrows, relationships, process summary. |
| brief | Quick context | 2-4 concise sentences with important visible details. |
Cache keys include the image hash, active prompt text, and model. Switching presets or models naturally creates separate cache entries.
Cache hits are visible in returned tool content:
[glm-vision: glm-4.6v, prompt=ocr, cache hit]
Fresh API calls show cache miss and are saved for later reuse when the cache is enabled.
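To make the key composition concrete, here is a minimal sketch of a cache key built from the image hash, the prompt text, and the model. The hash function is a small FNV-1a stand-in chosen to keep the example dependency-free; the hash the extension really uses, the key layout, and the names `fnv1a` and `cacheKey` are all assumptions.

```typescript
// FNV-1a (32-bit) as a stand-in hash; the extension's actual hash is not documented here.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 unsigned bits
  }
  return hash.toString(16);
}

// Hypothetical key layout: model, prompt hash, then one hash per image in order.
function cacheKey(imageData: string[], promptText: string, model: string): string {
  const imagePart = imageData.map(fnv1a).join("+");
  return [model, fnv1a(promptText), imagePart].join(":");
}
```

Because the prompt text and the model are both part of the key, reading the same image under a different preset or model produces a different key, which is why switching either one results in a separate cache entry.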
Checking Coding Plan model availability
z.ai Coding Plan availability can change as new GLM vision models roll out. Run:
/glm-vision check
The command uses your existing zai provider API key and probes the known vision-model candidates. It reports which models are currently accepted by the Coding Plan API, so maintainers can quickly decide whether MODELS and this README need an update.
To test a newly announced model before editing the extension, pass it explicitly:
/glm-vision check glm-new-vision-model
Maintainers can also run the upstream watcher outside Pi:
npm run check:upstream
That watcher reads official Z.AI sources, including https://docs.z.ai/llms.txt, the GLM-4.6V guide, and the GLM Coding Plan quick start. If ZAI_API_KEY is set, it also probes the Coding Plan API and fails when a newly accepted probe model is not yet in MODELS / this README. The included GitHub Actions workflow runs this weekly, on manual dispatch, and when model-related files change.
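The failure condition of the watcher amounts to a set difference between the models the API accepted and the models already listed. The sketch below assumes that reading; `LISTED_MODELS` mirrors the model table in this README, and `findUnlistedModels` is a hypothetical helper, not the script behind `npm run check:upstream`.

```typescript
// Models this README lists as selectable.
const LISTED_MODELS: string[] = ["glm-4.6v", "glm-4.6v-flash", "glm-4.6v-flashx", "glm-5v-turbo"];

// Hypothetical helper: accepted probe models that are not yet listed.
function findUnlistedModels(accepted: string[], listed: string[] = LISTED_MODELS): string[] {
  return accepted.filter((model) => listed.indexOf(model) === -1);
}

// A watcher built on this would fail the run when the result is non-empty.
const unlisted = findUnlistedModels(["glm-4.6v", "glm-4.5v"]);
console.log(unlisted.length > 0 ? `Update needed for: ${unlisted.join(", ")}` : "Model list is current");
```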
Available vision models
| Model | Context | Notes |
|---|---|---|
| glm-4.6v | 128K | Default. Visual reasoning + tool calling. |
| glm-4.6v-flash | 128K | Lighter and faster for simple descriptions. |
| glm-4.6v-flashx | 128K | Lightweight, faster paid option. |
| glm-5v-turbo | 200K | Multimodal coding model for harder UI/code vision tasks. |
Note: glm-4.5v is tracked as a probe candidate but not selectable until confirmed available on the z.ai Coding Plan.
Direct API vs Vision MCP Server
glm-vision keeps using the direct Z.AI HTTP API by default. That is the best fit for this package because it automatically intercepts Pi read results and returns a text description to the active GLM model without requiring the user to call a separate tool.
Z.AI also provides an official Vision MCP Server for MCP-compatible clients. It is useful when you want specialized tools such as OCR, UI screenshot analysis, technical diagram understanding, UI diff checks, image analysis, or video analysis. Use it alongside glm-vision when your client already supports MCP and you prefer explicit vision tools. Do not treat it as a replacement for glm-vision's automatic image-read interception.
See docs/decisions/0001-vision-mcp-and-model-selection.md for the decision record.
Configuration
Config is stored at ~/.pi/glm-vision.json:
{
  "model": "glm-4.6v",
  "enabled": true,
  "promptMode": "default",
  "cacheEnabled": true,
  "cacheMaxEntries": 100,
  "maxImages": 4
}
Use a custom prompt when you want a consistent style for image summaries. For example, OCR-heavy workflows can ask the vision model to transcribe all visible text before describing layout.
Custom prompts are stored as:
{
  "model": "glm-4.6v",
  "enabled": true,
  "promptMode": "custom",
  "prompt": "Describe only visible chart data and axis labels.",
  "cacheEnabled": true,
  "cacheMaxEntries": 100
}
The response cache is stored at ~/.pi/glm-vision-cache.json.
If ~/.pi or this config file is missing, glm-vision uses defaults. If the
config JSON is invalid, not an object, has invalid field types, names an
unavailable vision model, or names an unavailable prompt mode, glm-vision leaves
the original image attached and returns an actionable config warning instead of
crashing.
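The validation rules above can be sketched as a parser that either returns a usable config or a warning. Field names and defaults follow the example config in this README; everything else (`parseConfig`, `ConfigResult`, the warning wording, and treating "custom" as a valid prompt mode alongside the presets) is an assumption for illustration.

```typescript
interface VisionConfig {
  model: string;
  enabled: boolean;
  promptMode: string;
  prompt?: string;
  cacheEnabled: boolean;
  cacheMaxEntries: number;
  maxImages: number;
}

const DEFAULT_CONFIG: VisionConfig = {
  model: "glm-4.6v",
  enabled: true,
  promptMode: "default",
  cacheEnabled: true,
  cacheMaxEntries: 100,
  maxImages: 4,
};

const SELECTABLE_MODELS = ["glm-4.6v", "glm-4.6v-flash", "glm-4.6v-flashx", "glm-5v-turbo"];
const PROMPT_MODES = ["default", "ocr", "ui", "code", "diagram", "brief", "custom"];

type ConfigResult =
  | { ok: true; config: VisionConfig }
  | { ok: false; warning: string };

function fail(warning: string): ConfigResult {
  return { ok: false, warning: `glm-vision config: ${warning}` };
}

// raw is the file's text, or null when ~/.pi or the config file is missing.
function parseConfig(raw: string | null): ConfigResult {
  if (raw === null) return { ok: true, config: { ...DEFAULT_CONFIG } };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return fail("file is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return fail("top-level value must be an object");
  }

  // Missing fields fall back to defaults; present fields are type-checked below.
  const merged: Record<string, unknown> = { ...DEFAULT_CONFIG, ...(parsed as Record<string, unknown>) };
  const { model, enabled, promptMode, prompt, cacheEnabled, cacheMaxEntries, maxImages } = merged;

  if (typeof model !== "string" || SELECTABLE_MODELS.indexOf(model) === -1) {
    return fail(`unavailable vision model ${JSON.stringify(model)}`);
  }
  if (typeof promptMode !== "string" || PROMPT_MODES.indexOf(promptMode) === -1) {
    return fail(`unavailable prompt mode ${JSON.stringify(promptMode)}`);
  }
  if (typeof enabled !== "boolean" || typeof cacheEnabled !== "boolean") {
    return fail("enabled and cacheEnabled must be booleans");
  }
  if (typeof cacheMaxEntries !== "number" || typeof maxImages !== "number") {
    return fail("cacheMaxEntries and maxImages must be numbers");
  }
  const promptText = typeof prompt === "string" ? prompt : undefined;
  if (prompt !== undefined && promptText === undefined) {
    return fail("prompt must be a string");
  }
  return {
    ok: true,
    config: { model, enabled, promptMode, prompt: promptText, cacheEnabled, cacheMaxEntries, maxImages },
  };
}
```

Returning a result object rather than throwing matches the documented behavior: on a bad config the caller can keep the original image attached and surface the warning text.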
API failures and retry behavior
Z.AI requests time out after 30 seconds. Transient failures (408, 409,
425, 429, and 5xx) are retried up to 3 total attempts with exponential
backoff (500ms, then 1000ms). Authentication, model-access, invalid JSON,
and empty-response failures return clear glm-vision error messages while
preserving the original image content.
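The retry policy described above (3 total attempts, 500ms then 1000ms, only for transient statuses) can be sketched like this. The names `isTransient`, `backoffDelayMs`, and `withRetry` are hypothetical, the request is injected as a function so the sketch stays independent of any HTTP client, and the 30-second timeout is not modeled.

```typescript
const TRANSIENT_STATUSES = [408, 409, 425, 429];

// Transient means one of the listed 4xx codes or any 5xx.
function isTransient(status: number): boolean {
  return TRANSIENT_STATUSES.indexOf(status) !== -1 || (status >= 500 && status <= 599);
}

// Delay before the retry that follows failed attempt n: 500ms after the first, 1000ms after the second.
function backoffDelayMs(failedAttempt: number): number {
  return 500 * Math.pow(2, failedAttempt - 1);
}

interface AttemptResult {
  status: number;
  body: string;
}

async function withRetry(
  request: () => Promise<AttemptResult>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => { setTimeout(resolve, ms); }),
  maxAttempts: number = 3,
): Promise<AttemptResult> {
  let last = await request();
  // Retry only while the last status is transient and attempts remain.
  for (let attempt = 1; attempt < maxAttempts && isTransient(last.status); attempt++) {
    await sleep(backoffDelayMs(attempt));
    last = await request();
  }
  return last;
}
```

A non-transient status such as 401 returns after the first attempt, which fits the documented behavior of reporting authentication and model-access failures immediately instead of retrying them.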
How authentication works
glm-vision reuses the same API key that Pi uses for the zai provider. No additional API key setup is needed: if your z.ai model works in Pi, glm-vision works too.
Usage scenarios
UI screenshot review
Use when reviewing visual regressions, app states, design implementation, or accessibility issues.
Read ./screenshots/settings-page.png. Describe the layout, visible controls, error states, and anything that looks inconsistent with a modern settings page.
Good follow-up prompts:
- "Compare the described UI with our expected settings flow."
- "List likely CSS or component bugs from the screenshot."
- "Suggest regression tests that would catch this state."
OCR and text extraction
Use when an image contains logs, scanned docs, terminal output, PDFs rendered as screenshots, or handwritten notes.
Read ./captures/install-log.png. Transcribe all visible text exactly, then summarize the failure.
Tips:
- Ask for exact transcription first when accuracy matters.
- Use glm-4.6v instead of flash for dense text.
- Crop noisy screenshots before reading if the key text is small.
Diagram reading
Use when an image contains architecture diagrams, flowcharts, UML, database schemas, or whiteboards.
Read ./docs/auth-flow.png. Convert the diagram into a numbered sequence and call out every system boundary.
Good follow-up prompts:
- "Turn this into Mermaid."
- "Identify missing failure paths."
- "Map each box in the diagram to files in this repo."
Error-image diagnosis
Use when a bug report only includes a screenshot of an error, stack trace, browser console, or broken screen.
Read ./bug-reports/payment-error.jpg. Extract the exact error message, identify the failing area, and suggest the first three debugging steps.
Tips:
- Include surrounding code or logs in the same conversation after reading the image.
- Ask the model to separate observed facts from inferred causes.
- Keep original images attached to issues so maintainers can verify the generated description.
Troubleshooting
glm-vision does not run
- Confirm the active Pi model uses the zai provider.
- Run /glm-vision and confirm the status is ON.
- Confirm the file is read through the read tool and contains supported image data.
- Restart the Pi session after installing or changing packages.
no zai API key found
glm-vision reuses the same API key that Pi uses for the zai provider. If your z.ai model works in Pi, glm-vision should work too.
Fixes:
- Re-authenticate or reconfigure the zai provider in Pi.
- Start a new Pi session.
- Run /glm-vision to confirm the extension loaded.
Vision response is incomplete or misses text
- Switch to /glm-vision glm-4.6v for detailed reasoning.
- Crop the image to the relevant area.
- Increase contrast or resolution before reading the image.
- Customize ~/.pi/glm-vision.json with an OCR-focused prompt or the ocr prompt preset.
Image is forwarded instead of described
This can happen when glm-vision is disabled, the active provider is not zai, the image format is not represented as supported image content, or the vision API request fails. Error responses include the original image content when possible so the main model can still proceed.
Vision API returns an error
- Check z.ai plan access for glm-4.6v or glm-4.6v-flash.
- Try the other model with /glm-vision glm-4.6v-flash or /glm-vision glm-4.6v.
- Retry with a smaller or cropped image.
- Include the exact [glm-vision error: ...] text when filing a bug.
Release operations
Maintainer release steps, semantic versioning policy, and release note template live in RELEASE.md. User-visible changes are tracked in CHANGELOG.md.
Contributing
Use the GitHub issue templates for bug reports and feature requests. Bug reports should include Pi version, package version, OS, selected z.ai model, image type, reproduction steps, and any [glm-vision error: ...] output.
License
MIT