pi-web-fetch
Pi extension: fetch web pages via headless Chrome, extract content with trafilatura, and optionally process with an LLM
Package details
Install pi-web-fetch from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-web-fetch- Package
pi-web-fetch- Version
1.1.0- Published
- Mar 3, 2026
- Downloads
- 283/mo · 143/wk
- Author
- georgebashi
- License
- MIT
- Types
- extension
- Size
- 62.2 KB
- Dependencies
- 2 dependencies · 3 peers
Pi manifest JSON
{
"extensions": [
"./index.ts"
]
}Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-web-fetch
A pi extension that gives your agent a proper web browser. Fetches pages via headless Chrome, extracts clean content via trafilatura, and optionally distills it with an LLM.
Why not use Claude Code's built-in WebFetch?
Claude Code's WebFetch uses Turndown to convert HTML to markdown — a simple regex-based converter that can't handle JavaScript-rendered pages, doesn't strip boilerplate well, and has no extensibility. pi-web-fetch improves on this in several ways:
| Claude Code WebFetch | pi-web-fetch | |
|---|---|---|
| Rendering | Static HTTP fetch | Headless Chrome (handles SPAs, JS-rendered content) |
| Extraction | Turndown (regex HTML→md) | trafilatura (ML-based boilerplate removal) |
| Raw content | Never — prompt is mandatory | Optional — omit prompt to get full markdown |
| Batch fetching | One URL at a time | Up to 10 URLs concurrently with per-URL progress |
| Concurrency | Sequential | Browser pool with 6 parallel tabs |
| Extensibility | None | Hook system for site-specific handling |
| Smart redirects | Generic | Context-aware (e.g. GitHub URLs → gh CLI suggestions) |
Features
web_fetchtool — registered in pi's tool system, callable by the LLM- Headless Chrome via puppeteer — handles JavaScript-rendered pages, SPAs, pages behind cookie banners
- Content extraction — strips boilerplate (nav, ads, footers) using trafilatura's ML-based extraction, outputs clean markdown
- LLM processing — optionally distills page content to answer a specific question via a pi sub-agent
- Batch fetching — fetch up to 10 pages concurrently in a single tool call with per-URL status in the UI
- Browser pool — reuses a single Chrome instance with up to 6 parallel tabs, avoiding repeated browser startup overhead
- Smart large-page handling — when content exceeds ~50KB and no prompt is given, automatically generates a structured summary
- 15-minute cache — avoids redundant fetches; enables summarize-then-drill-down workflows
- Cross-host redirect detection — reported to the LLM for explicit follow-up rather than silently following
- HTTP→HTTPS auto-upgrade
- Extension hooks — customize fetch behavior for specific sites (redirect, replace HTML, transform markdown, override summarization)
- Built-in site handlers — GitHub URLs redirect to
ghCLI, Google Docs redirect to workspace tools
Prerequisites
- pi coding agent
- A Python tool runner for trafilatura (auto-detected in priority order):
- Node.js 18+
Installation
npm install pi-web-fetch
Or clone for development:
git clone https://github.com/georgebashi/pi-web-fetch
cd pi-web-fetch
npm install
Note:
npm installwill download puppeteer's bundled Chromium (~300MB). If you already have Chrome/Chromium installed, you can setPUPPETEER_EXECUTABLE_PATHto skip the download:export PUPPETEER_EXECUTABLE_PATH=/path/to/chrome
Note: The first time
web_fetchruns,uvxwill download the trafilatura package (~10MB). Subsequent runs use the cached environment and are fast.
Add to pi
Add the package to your pi settings (~/.pi/agent/settings.json):
{
"extensions": [
"pi-web-fetch"
]
}
Or point to a local checkout:
{
"extensions": [
"/path/to/pi-web-fetch"
]
}
Or use the -e flag for quick testing:
pi -e pi-web-fetch
Usage
The extension registers a web_fetch tool. The LLM will use it automatically when it needs to fetch web content.
Single URL
With a prompt (recommended):
Fetch https://docs.example.com/api and tell me what authentication methods are supported.
Without a prompt (full content):
Fetch the content of https://example.com/changelog
Batch fetching
Fetch multiple pages concurrently by asking for several URLs at once:
Read these three pages and compare their approaches to error handling:
The agent will use the pages parameter to fetch all URLs in parallel (up to 10 per call). The UI shows live per-URL progress:
● docs.python.org/3/tutorial/errors.html
◐ go.dev/blog/error-handling-and-go · fetching
◑ doc.rust-lang.org/book/ch09-00-error-... · extracting
Each URL independently transitions through: pending → fetching → extracting → summarizing → done/error. The browser pool manages concurrency automatically (6 tabs max).
Extensions
pi-web-fetch has a hook system for site-specific fetch behavior. Extensions can intercept any stage of the pipeline: before fetch, after fetch (HTML), after extraction (markdown), or at summarization time.
Built-in extensions
- GitHub redirect — matches
github.com/**, redirects toghCLI with context-aware suggestions (e.g.gh issue view 123) - Google Docs redirect — matches
docs.google.com/**, redirects to Google Workspace MCP tools
Writing extensions
Extensions are TypeScript modules with a factory function default export:
import type { WebFetchExtension } from "pi-web-fetch/types";
export default function (): WebFetchExtension {
return {
name: "my-handler",
matches: ["example.com/**"],
async beforeFetch(ctx) {
// Return a HookResult to short-circuit, or void to continue
return ctx.redirect("Use a different tool for this site.");
},
async afterFetch(ctx) {
// ctx.html — replace HTML or short-circuit
},
async afterExtract(ctx) {
// ctx.markdown — replace markdown or short-circuit
},
async summarize(ctx) {
// Override default LLM summarization
},
};
}
Extension sources (in priority order)
- Event bus — other pi extensions can register handlers via
pi.events.emit("web-fetch:register", extension) - Local — TypeScript/JS files in
~/.pi/extensions/web-fetch/(or configuredextensionsDir) - Built-in — shipped with pi-web-fetch in the
extensions/directory
Configuration
Create ~/.pi/agent/web-fetch.json to override defaults:
{
"model": "provider/model-id",
"thinkingLevel": "medium",
"extensionsDir": "~/.pi/extensions/web-fetch"
}
| Key | Default | Description |
|---|---|---|
model |
Current session model | Model for LLM content processing |
thinkingLevel |
Current session thinking level | Thinking level for the sub-agent |
extensionsDir |
~/.pi/extensions/web-fetch/ |
Directory for local extensions |
Without a config file, the extension uses whatever model and thinking level the current session is using.
Architecture
web_fetch(url, prompt?)
│
├─ URL validation & normalization (http→https, scheme check)
├─ Cache check (15-min TTL)
├─ Extension: beforeFetch hook
├─ Browser pool → Puppeteer page (networkidle2, 30s timeout)
├─ Cross-host redirect detection
├─ Extension: afterFetch hook
├─ trafilatura extraction (HTML → clean markdown)
├─ Extension: afterExtract hook
├─ Cache store
└─ Content processing:
├─ With prompt → pi sub-agent (focused extraction)
├─ Small content, no prompt → return raw markdown
└─ Large content, no prompt → pi sub-agent (structured summary)
For batch mode, each URL runs through this pipeline independently with its own browser tab. The browser pool (6 tabs max, 60s idle timeout) provides backpressure.
License
MIT