pi-scraper
Crawl, map, and structured extraction for Pi — scraper-first, Pi-native, and local-first.
Package details
Install pi-scraper from npm and Pi will load the resources declared by the package manifest.
$ pi install npm:pi-scraper
- Package: pi-scraper
- Version: 0.5.1
- Published: May 13, 2026
- Downloads: 1,061/mo · 256/wk
- Author: brandonkramercc
- License: MIT
- Types: extension, skill
- Size: 716.5 KB
- Dependencies: 15 dependencies · 0 peers
Pi manifest JSON
{
"extensions": [
"./src/index.ts"
],
"skills": [
"./skills"
]
}
Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-scraper reads known URLs and small sites. Use it to scrape, summarize one page, crawl, map URLs, diff snapshots, retrieve stored results, or extract deterministic/structured data.
Install
pi install npm:pi-scraper
Quick start
Ask naturally; Pi can choose the right web tool automatically:
Read https://example.com as markdown.
List the URLs available from https://example.com.
Crawl https://example.com, up to 25 pages.
Compare https://example.com against my homepage snapshot.
Add cacheTtlSeconds when you want opt-in fetch-cache reuse; omit it for fresh fetches.
Requirements
- Node.js >=22.19.0
- Pi >=0.74.0
- Optional Chromium binaries for mode: "browser"
Browser mode lazy-loads Playwright. Chromium is not bundled; install only if needed:
npx playwright install chromium
Set PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 when browsers are managed externally. mode: "fingerprint" uses bundled impit for Chrome-class TLS fingerprints; no extra install. Native binary is ~8 MB (one prebuild per platform). Set browserProfile: "chrome" (default) or "firefox".
Public tools
| Tool | Capability | Use it for | Contract tokens ≈ | Input overhead ≈ |
|---|---|---|---|---|
| web_scrape | Local; model only for task: "summarize" | Read one URL as markdown/text/LLM text/HTML/JSON, including raw Markdown, MDX, RST, and source docstrings. | 170 | +140 |
| web_summarize | Model/LLM; local scrape input | Summarize one URL or provided content; page-scoped only, not multi-source research. | 116 | +100 |
| web_crawl | Local; browser optional through scrape pipeline | Run/resume a breadth-first crawl, inspect crawl status by crawlId, list prior crawl metadata, or compile crawled docs into API-surface/context packages. | 181 | +158 |
| web_map | Local | Discovery-only URL inventory from robots, sitemaps, gzipped sitemaps, sitemap.xml, and llms.txt; no page-content extraction. | 58 | +67 |
| web_batch | Local; browser optional through scrape pipeline | Scrape many independent URLs with ordered per-URL success/failure results and optional context-package compilation. | 195 | +166 |
| web_diff | Local | Re-scrape, normalize, compare against unnamed, named, or tagged snapshots, and store deterministic diff metadata. | 91 | +82 |
| web_extract | Local/model depending on action | List/run deterministic extractors, inspect patterns, compile API surfaces, run selector extraction with adaptive repair, or extract via schema/prompt. | 290 | +289 |
| web_get_result | Local | Retrieve a stored response by responseId, structured job manifest by jobId, or snapshot listing by snapshotUrl. | 56 | +74 |
Token counts are approximate: Contract is the full serialized tool declaration including schema; Input overhead is the empirical Pi JSON-mode input token delta against a no-tools baseline, which includes provider serialization and hidden wrapper metadata and varies by provider/model.
Capability labels:
| Label | Meaning |
|---|---|
| Local | Runs from local HTTP/parsing/storage code without search API keys. |
| Browser optional | Uses lazy Playwright only when requested or auto-escalation justifies it. |
| Model/LLM | Needs Pi's selected model or a configured model adapter after scraping clean page text. |
Parameter quick reference
| Area | Parameters |
|---|---|
| Input | url, urls, content |
| Scrape output | mode, format, onlyMainContent, maxChars, timeoutSeconds |
| Freshness/safety | respectRobots defaults true; use refresh: true for time-sensitive facts |
| Session | sessionId only for stateful flows (cookies/login/consent/locale/cart); saveSession: true persists across reloads; clearSession: true deletes. |
| Crawl | action, maxPages, maxDepth, sameOrigin, crawlId, resume, seed, status, limit |
| Concurrency | concurrency, perHostConcurrency; HTTP politeness reacts to 429 and Retry-After |
| Context packages | compile: true on web_crawl/web_batch stores a bounded package artifact |
| API surface | extract: "api-surface" builds a local module/function tree when possible |
| Diff | snapshotName, snapshotTag, compareTag, maxSnapshotAgeSeconds |
| Extract | action, extractor, prompt, schema, sourceFormat, markers, contains, excerpts, regexes, sections, include, extractSchema |
| Retrieve | responseId, jobId, snapshotUrl, snapshotName, snapshotTag |
Examples:
{ "url": "https://example.com", "snapshotName": "homepage" }
{ "url": "https://example.com/docs", "compile": true, "extract": "api-surface" }
{
"action": "pattern",
"url": "https://raw.githubusercontent.com/vitejs/vite/main/README.md",
"sections": [
{ "name": "packages", "start": "## Packages", "end": "## Contribution" }
]
}
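The Concurrency row above says HTTP politeness reacts to 429 and Retry-After. As a rough illustration of what honoring that header involves, here is a minimal sketch; `retryAfterToMs` is a hypothetical helper and not part of pi-scraper's API. The header value can be either a number of seconds or an HTTP date.

```typescript
// Hypothetical helper: convert a Retry-After header value into a wait in ms.
// Returns null when the header is absent or cannot be parsed.
function retryAfterToMs(header: string | null, nowMs: number): number | null {
  if (header === null) return null;
  const value = header.trim();
  // Form 1: delay-seconds, a non-negative integer.
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: an HTTP-date; wait until that moment, never a negative delay.
  const when = Date.parse(value);
  if (Number.isNaN(when)) return null;
  return Math.max(0, when - nowMs);
}

const now = Date.parse("Wed, 21 Oct 2015 07:27:00 GMT");
console.log(retryAfterToMs("120", now)); // 120000
console.log(retryAfterToMs("Wed, 21 Oct 2015 07:28:00 GMT", now)); // 60000
```

A caller would typically sleep for the returned duration before retrying the same host, and fall back to its own backoff when the result is null.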
Session rule — default stateless. Use sessionId only when prior state affects later requests: cookies, consent, locale, login, cart/account/dashboard, or multi-step crawl/batch. Add saveSession: true only when state must survive later tool calls; use clearSession: true to reset.
Session example — log in once and reuse cookies across scrapes:
web_scrape({ url: "https://example.com/login", sessionId: "example", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "example" })
web_batch({ urls: ["https://example.com/page1", "https://example.com/page2"], sessionId: "example" })
{ "crawlId": "abc-123", "sessionId": "example", "saveSession": true }
Selector extraction
Extract structured content from HTML using CSS selectors, XPath, or text search. Optionally save a fingerprint of the matched element and relocate it later after page layout changes.
```text
Extract all product cards from https://example.com/products with selector .product-card
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| selector | string | (none) | CSS selector, XPath, or text to find |
| selectorType | string | "css" | "css", "xpath", or "text" |
| attribute | string | (none) | Extract a specific attribute instead of text |
| identifier | string | (selector) | Stable key for fingerprint storage |
| adaptive | boolean | false | Enable relocation when selector no longer matches |
| autoSave | boolean | false | Save fingerprint after a successful match |
| threshold | number | 0.35 | Minimum similarity score (0–1) for adaptive fallback |
| limit | number | 10 | Maximum elements to return |
Examples:
// Extract all links with href
{ "url": "https://example.com", "selector": "a", "attribute": "href", "identifier": "example-links", "autoSave": true }
// Extract product cards and save fingerprint for future layout stability
{ "url": "https://example.com", "selector": ".product-card", "identifier": "products-v1", "autoSave": true }
// Later — if the layout changes but the content stays the same
{ "url": "https://example.com", "selector": ".product-card", "identifier": "products-v1", "adaptive": true, "threshold": 0.5 }
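The threshold parameter is a minimum similarity score between the saved fingerprint and candidate elements. The README does not say how that score is computed, so the sketch below assumes a plain Jaccard overlap of feature tokens (tag name, class names, text words) purely to show how a threshold gates relocation. `Fingerprint`, `similarity`, and `relocate` are hypothetical names.

```typescript
// Hypothetical fingerprint: a few features describing one element.
type Fingerprint = { tag: string; classes: string[]; text: string };

function tokens(fp: Fingerprint): Set<string> {
  const words = fp.text.toLowerCase().split(/\s+/).filter(Boolean);
  return new Set([
    `tag:${fp.tag}`,
    ...fp.classes.map((c) => `class:${c}`),
    ...words.map((w) => `text:${w}`),
  ]);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a score between 0 and 1.
function similarity(a: Fingerprint, b: Fingerprint): number {
  const ta = tokens(a);
  const tb = tokens(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Return the best candidate at or above the threshold, or null if none qualifies.
function relocate(saved: Fingerprint, candidates: Fingerprint[], threshold = 0.35): Fingerprint | null {
  let best: Fingerprint | null = null;
  let bestScore = threshold;
  for (const c of candidates) {
    const score = similarity(saved, c);
    if (score >= bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
}

const saved: Fingerprint = { tag: "div", classes: ["product-card"], text: "Blue Widget" };
const renamed: Fingerprint = { tag: "div", classes: ["card", "item"], text: "Blue Widget" };
const unrelated: Fingerprint = { tag: "nav", classes: ["menu"], text: "Home About" };
console.log(similarity(saved, renamed)); // 0.5
console.log(relocate(saved, [unrelated, renamed], 0.5) === renamed); // true
```

A higher threshold trades recall for safety: with 0.6 in this example, the renamed card would no longer be accepted and the call would report no match.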
Scrape modes
| Mode | JavaScript support | Playwright required | Typical latency | Extraction quality | Best use case |
|---|---|---|---|---|---|
| fast | No | No | Lowest | Good for static pages | Static HTML, docs, product pages, quick link/text extraction. |
| fingerprint | No | No | Low-medium | Same parser as static path | Sites that block plain HTTP clients but do not require JavaScript. Bundled Chrome/Firefox TLS fingerprint via impit; per-hop SSRF validation owned by pi-scraper. Proxy support deferred (impit's HTTP/3 and proxy are mutually exclusive). |
| readable | No | No | Medium | Higher for articles/main content | Articles, blogs, noisy pages where Readability improves main content. |
| browser | Yes | Yes, optional/lazy | Highest | Best for rendered DOM | JavaScript-rendered pages when static/data-island recovery is insufficient. |
| auto | Only if justified | Only if escalated | Adaptive | Adaptive | Default. Starts local/static, reuses fetched HTML, tries recovery/readable/fingerprint before browser only when block/rendering signals justify it. |
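The auto row describes an escalation ladder: start cheap and move to heavier modes only when signals justify it. The README does not spell out the exact signals or ordering, so the following is only a guess at the shape of such a policy. `Signals` and `escalationPlan` are hypothetical, and the real heuristics are likely richer.

```typescript
type Mode = "fast" | "readable" | "fingerprint" | "browser";

// Hypothetical signals derived from the first static fetch.
interface Signals {
  noisy: boolean; // main content buried in page chrome
  blocked: boolean; // response looks like a bot-block page
  needsRendering: boolean; // HTML is an empty JavaScript shell
}

// Build the ordered list of modes to try, cheapest first.
function escalationPlan(s: Signals): Mode[] {
  const plan: Mode[] = ["fast"]; // always start local/static
  if (s.noisy) plan.push("readable"); // re-parse the HTML already fetched
  if (s.blocked) plan.push("fingerprint"); // retry with a browser-like TLS fingerprint
  if (s.blocked || s.needsRendering) plan.push("browser"); // last resort
  return plan;
}

console.log(escalationPlan({ noisy: false, blocked: true, needsRendering: false }));
// [ 'fast', 'fingerprint', 'browser' ]
```

The point of the ordering is cost: a page that parses cleanly in fast mode never loads Playwright at all.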
fingerprint mode notes
- Body size enforcement is incremental, not pre-check. Response body streams chunk-by-chunk through a maxBytes-bounded collector. A server lying about Content-Length cannot bypass the limit: actual bytes are counted and the upstream stream is cancelled mid-flight if exceeded. The trade-off: at least one chunk is read before a too-large response can be rejected.
- DNS rebinding has a residual TOCTOU window. impit does not expose the connected peer IP, so post-handshake validation is not possible without an upstream change (apify/impit issue tracker). We mitigate via a double DNS resolve: pi-scraper resolves at preflight, resolves again immediately before handing off to impit, and rejects with DNS_REBINDING_DETECTED if the address sets differ. The residual window is the sub-millisecond gap between the second resolve and impit's actual connect(2). Requires resolveDns: true (the default). For arbitrary user-submitted URLs where even a narrow window is unacceptable, set fingerprintTrustLevel: "untrusted" to refuse fingerprint mode entirely, or use mode: "browser" for Chromium-managed DNS pinning.
- Proxy support deferred. impit's ImpitOptions makes proxyUrl and HTTP/3 mutually exclusive, and HTTP/3 ALPN advertisement is part of the Chrome fingerprint we're impersonating. Until a per-call disableHttp3 escape hatch lands (or impit upstream supports both simultaneously), use mode: "fast" with the standard HTTP client for proxied scrapes.
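The incremental body-size rule from the first note can be illustrated with a small collector. This is a synchronous sketch under simplifying assumptions (an iterable of chunks stands in for the network stream, and a `cancel` callback stands in for cancelling the upstream); `collectBounded` is a hypothetical name, not pi-scraper's implementation.

```typescript
// Collect chunks, failing as soon as the running byte total exceeds maxBytes.
function collectBounded(
  chunks: Iterable<Uint8Array>,
  maxBytes: number,
  cancel: () => void,
): Uint8Array {
  const parts: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength; // count actual bytes, not a declared Content-Length
    if (total > maxBytes) {
      cancel(); // stop the upstream mid-flight
      throw new RangeError(`body exceeded ${maxBytes} bytes`);
    }
    parts.push(chunk);
  }
  const body = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    body.set(part, offset);
    offset += part.byteLength;
  }
  return body;
}

// Usage: a lazy source that could yield three 4-byte chunks, capped at 6 bytes.
const stats = { produced: 0, cancelCalls: 0, rejected: 0 };
function* source(): Generator<Uint8Array> {
  for (let i = 0; i < 3; i++) {
    stats.produced++;
    yield new Uint8Array(4);
  }
}
try {
  collectBounded(source(), 6, () => {
    stats.cancelCalls++;
  });
} catch (err) {
  if (err instanceof RangeError) stats.rejected++;
}
console.log(stats); // { produced: 2, cancelCalls: 1, rejected: 1 }
```

The output shows the trade-off described above: the second chunk had to be read before the limit could be detected, but the third was never requested.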
Vertical extraction
Vertical extractors return typed JSON for known sites, preferring public APIs/feeds over browser or LLM extraction.
| Extractor | Input patterns | Primary strategy | Browser/cloud/LLM requirement |
|---|---|---|---|
| github_repo | GitHub repository URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_issue | GitHub issue URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_pr | GitHub pull request URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_release | GitHub release tag URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| npm | npm package URLs | npm registry JSON | No browser; no LLM. |
| pypi | PyPI package URLs | PyPI JSON API | No browser; no LLM. |
| crates_io | crates.io crate URLs | crates.io API | No browser; no LLM. |
| docker_hub | Docker Hub repository URLs | Docker Hub repository API | No browser; no LLM. |
| huggingface_model | Hugging Face model URLs | Hugging Face public model API | No browser; no LLM. |
| huggingface_dataset | Hugging Face dataset URLs | Hugging Face public dataset API | No browser; no LLM. |
| hackernews | Hacker News item URLs | Hacker News Firebase item API | No browser; no LLM. |
| reddit | Public Reddit post URLs | Reddit structured JSON endpoint | No browser; no LLM; returns blocked/rate-limit errors instead of bot-like HTML scraping. |
| arxiv | arXiv abstract/PDF entry URLs | arXiv Atom export feed | No browser; no LLM. |
| deepwiki | DeepWiki URLs | Static HTML metadata parsing | No browser; no LLM. |
| docsite | Docs sites, MDN, GitBook, ReadTheDocs, Docusaurus | Static HTML section parsing | No browser; no LLM; returns platform with unknown fallback. |
| docstrings | Raw .ts, .js, .py, and .rs source URLs | Surface docstring parsing | No browser; no LLM; extracts documented exports without typechecking. |
web_extract modes:
- action: "list" - inspect runtime extractor declarations.
- action: "vertical" - known-site typed JSON, including docstrings.
- action: "pattern" - deterministic length, markers, contains, regex, excerpts, start/end sections, symbol include, and extractSchema presets.
- action: "selector" - CSS/XPath/text selector extraction with optional adaptive fingerprint relocation (see Selector extraction).
- extract: "api-surface" - local hierarchical module/function tree.
- action: "adhoc" - custom schema/prompt extraction; model-backed.
Reddit returns structured blocked/rate-limit errors rather than bypassing robots, auth, CAPTCHA, or anti-bot controls. Substack/Shopify are not built-ins yet because reliable machine-readable surfaces vary.
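To make "preferring public APIs/feeds" concrete, the sketch below maps a few page URLs to the corresponding public JSON endpoints. The extractor names come from the table above and the endpoints are the public GitHub REST, npm registry, PyPI JSON, and Hacker News Firebase APIs, but the routing table itself (`routes`, `route`) is a hypothetical illustration and not pi-scraper's matcher.

```typescript
type Route = {
  extractor: string;
  pattern: RegExp;
  api: (m: RegExpMatchArray) => string;
};

// Hypothetical routing table: page URL pattern -> public API URL.
const routes: Route[] = [
  {
    extractor: "github_issue",
    pattern: /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/issues\/(\d+)/,
    api: (m) => `https://api.github.com/repos/${m[1]}/${m[2]}/issues/${m[3]}`,
  },
  {
    extractor: "github_repo",
    pattern: /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/?$/,
    api: (m) => `https://api.github.com/repos/${m[1]}/${m[2]}`,
  },
  {
    extractor: "npm",
    pattern: /^https:\/\/www\.npmjs\.com\/package\/((?:@[^/]+\/)?[^/]+)/,
    // Scoped names keep the "@" but encode the slash.
    api: (m) => `https://registry.npmjs.org/${m[1].replace("/", "%2F")}`,
  },
  {
    extractor: "pypi",
    pattern: /^https:\/\/pypi\.org\/project\/([^/]+)/,
    api: (m) => `https://pypi.org/pypi/${m[1]}/json`,
  },
  {
    extractor: "hackernews",
    pattern: /^https:\/\/news\.ycombinator\.com\/item\?id=(\d+)/,
    api: (m) => `https://hacker-news.firebaseio.com/v0/item/${m[1]}.json`,
  },
];

// Return the first matching extractor and API URL, or null for unknown sites.
function route(url: string): { extractor: string; apiUrl: string } | null {
  for (const r of routes) {
    const m = url.match(r.pattern);
    if (m) return { extractor: r.extractor, apiUrl: r.api(m) };
  }
  return null;
}

console.log(route("https://github.com/vitejs/vite"));
// { extractor: 'github_repo', apiUrl: 'https://api.github.com/repos/vitejs/vite' }
```

Fetching JSON from such an endpoint avoids HTML parsing entirely, which is why these extractors need neither a browser nor an LLM.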
Storage, cache, and history
Large outputs are stored locally and returned with compact summaries plus responseId / fullOutputPath. Inline previews follow Pi defaults: 50KB or 2000 lines.
Storage uses a local SQLite metadata index plus content-addressed blobs. Cache reuse is opt-in with cacheTtlSeconds; default behavior is fresh network fetches. Use refresh: true for time-sensitive facts, web_crawl action: "list"|"status" for prior crawl freshness, and web_diff maxSnapshotAgeSeconds for stale baselines.
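The opt-in nature of cache reuse and the snapshot-age check can be expressed as two small predicates. This is a sketch only: the parameter names mirror cacheTtlSeconds and maxSnapshotAgeSeconds from the text, while the function names and the exact boundary behavior are assumptions.

```typescript
// Cache reuse is opt-in: with no TTL supplied, every fetch goes to the network.
function canReuseCached(storedAtMs: number, nowMs: number, cacheTtlSeconds?: number): boolean {
  if (cacheTtlSeconds === undefined) return false;
  return nowMs - storedAtMs <= cacheTtlSeconds * 1000;
}

// A baseline snapshot counts as stale once it is older than the allowed age.
function isStaleBaseline(snapshotAtMs: number, nowMs: number, maxSnapshotAgeSeconds?: number): boolean {
  if (maxSnapshotAgeSeconds === undefined) return false;
  return nowMs - snapshotAtMs > maxSnapshotAgeSeconds * 1000;
}

const storedAt = 1_000_000;
console.log(canReuseCached(storedAt, storedAt + 30_000)); // false (no TTL given)
console.log(canReuseCached(storedAt, storedAt + 30_000, 60)); // true
console.log(isStaleBaseline(storedAt, storedAt + 120_000, 60)); // true
```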
Persistent paths:
| Data | Path |
|---|---|
| Config | ~/.pi/scraper/config/web.json |
| SQLite index | ~/.pi/scraper/index.db |
| Payload blobs | ~/.pi/scraper/blobs/<aa>/... |
| Legacy snapshots | ~/.pi/scraper/snapshots/ |
| Legacy backups | ~/.pi/scraper/results.bak/, ~/.pi/scraper/crawl.bak/ after migration |
Safety and anti-bot scope
- SSRF/private-network protection is applied before fetches and at the HTTP connect/redirect layer.
- HTTP cookies are scoped to the response origin: Set-Cookie Domain attributes are validated against the response host (RFC 6265 §5.1.3 / §5.3 step 6) and the Path attribute follows RFC 6265 default-path semantics (§5.1.4 / §5.2.4: last valid Path wins, invalid values fall back to the request-URI directory).
- In mode: "browser", service workers are blocked, every subresource URL is re-validated through the same SSRF guard, and DNS dedup is per-page so concurrent renders sharing a session cannot bleed safety decisions.
- respectRobots defaults to true.
- Response body sizes are bounded before allocation and while streaming.
- Browser rendering is optional and lazy-loaded.
- The package may detect bot-block pages and return structured blocked/error results.
- It does not promise CAPTCHA solving, residential proxy rotation, stealth guarantees, or guaranteed access to protected sites.
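The cookie-scoping bullet cites two RFC 6265 algorithms. A minimal sketch of both follows, written from the RFC text rather than from pi-scraper's source; `domainMatches` and `defaultPath` are hypothetical names, the IP check is deliberately crude, and public-suffix rejection is out of scope here.

```typescript
// RFC 6265 §5.1.3: does the response host domain-match the cookie's Domain attribute?
function domainMatches(host: string, cookieDomain: string): boolean {
  const h = host.toLowerCase();
  const d = cookieDomain.toLowerCase().replace(/^\./, ""); // leading dot is ignored (§5.2.3)
  if (h === d) return true;
  // Suffix matching only applies to host names, never to IP addresses.
  const isIp = /^\d{1,3}(\.\d{1,3}){3}$/.test(h) || h.includes(":");
  return !isIp && h.endsWith("." + d);
}

// RFC 6265 §5.1.4: default cookie path derived from the request path.
function defaultPath(requestPath: string): string {
  if (!requestPath.startsWith("/")) return "/";
  const lastSlash = requestPath.lastIndexOf("/");
  return lastSlash === 0 ? "/" : requestPath.slice(0, lastSlash);
}

console.log(domainMatches("www.example.com", "example.com")); // true
console.log(domainMatches("badexample.com", "example.com")); // false
console.log(defaultPath("/docs/api/index.html")); // "/docs/api"
```

A Set-Cookie whose Domain fails `domainMatches` against the response host would be dropped, and an invalid or missing Path would fall back to `defaultPath` of the request.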
Packaged skill
Includes the compact web-scraping Pi skill for tool routing.
Configuration command
Use /scrape-config to inspect effective settings and persist defaults interactively or via direct arguments.
| Sub-action | What it does |
|---|---|
| (no args) | Interactive picker (falls back to status when UI unavailable) |
| status | Effective config + live adapter-resolution preview |
| model-provider <value> | Set modelProvider (auto / off / <adapter-id>) |
| scrape-mode <mode> [format] | Set scrapeMode + outputFormat |
| cache stats | Inspect response cache size and entry counts |
| cache clear | Clear response cache (confirm prompt) |
| robots on/off | Toggle respectRobots default |
| reload | Reload config from disk, clearing the in-memory cache |
The effective config is cached in memory for the session. After hand-editing ~/.pi/scraper/config/web.json, run /scrape-config reload (or restart the session) to pick up changes.
Model adapters
web_summarize and web_extract action="adhoc" need an LLM transport. When Pi has a model configured (OpenAI, Anthropic, Google, etc.), the tools use it automatically via the host context — no extra extension needed. Any Pi extension can also supply one via pi.events for cross-extension provider lending. With no adapter available, the tools return MODEL_ADAPTER_MISSING and the LLM falls back to web_scrape + summarize-in-reply.
Capabilities
| Capability | What it does | Used by |
|---|---|---|
| summarize | Page-scoped natural-language summary of scraped content. | web_summarize |
| extract | Schema- or prompt-driven structured extraction (JSON shape) from scraped content. | web_extract action="adhoc" |
Configuration
Highest layer wins:
| Layer | Mechanism | Use |
|---|---|---|
| Programmatic | options.modelAdapter (test / injected) | Direct override |
| Pi host | ctx.model (Pi's currently selected model) | Automatic when available |
| Per-call | provider param on the tool call | LLM routes a single call |
| Pi flag | --web-model-provider=auto\|<id>\|off | Per Pi session |
| Env var | PI_WEB_MODEL_PROVIDER | Shell / scripts |
| Config file | modelProvider (string or { summarize, extract }) | Persistent default |
| Default | "auto" | Out-of-box |
"auto" picks the highest-priority adapter that supports the requested capability. "off" returns MODEL_ADAPTER_MISSING and (at config level) hides the model-backed tools from Pi's tool list.
Errors: MODEL_ADAPTER_MISSING (none registered, LLM redirected to web_scrape), MODEL_ADAPTER_NOT_FOUND (explicit ID unknown — error lists known IDs), MODEL_ADAPTER_INCOMPATIBLE (ID registered but lacks the requested capability).
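The resolution rules and the three error codes can be summarized in one function. The sketch below is an interpretation of the prose, not the package's resolver: `AdapterEntry`, `Resolution`, and `resolveAdapter` are hypothetical, and treating "registered but none capable" as MODEL_ADAPTER_MISSING under "auto" is an assumption.

```typescript
type Capability = "summarize" | "extract";

interface AdapterEntry {
  id: string;
  capabilities: readonly Capability[];
  priority: number;
}

type Resolution =
  | { ok: true; id: string }
  | { ok: false; error: "MODEL_ADAPTER_MISSING" | "MODEL_ADAPTER_NOT_FOUND" | "MODEL_ADAPTER_INCOMPATIBLE" };

function resolveAdapter(registry: AdapterEntry[], provider: string, capability: Capability): Resolution {
  if (provider === "off") return { ok: false, error: "MODEL_ADAPTER_MISSING" };
  if (provider === "auto") {
    // "auto": highest-priority adapter that supports the requested capability.
    const capable = registry.filter((e) => e.capabilities.includes(capability));
    if (capable.length === 0) return { ok: false, error: "MODEL_ADAPTER_MISSING" };
    const best = capable.reduce((a, b) => (b.priority > a.priority ? b : a));
    return { ok: true, id: best.id };
  }
  // Explicit ID: it must exist and must support the capability.
  const entry = registry.find((e) => e.id === provider);
  if (!entry) return { ok: false, error: "MODEL_ADAPTER_NOT_FOUND" };
  if (!entry.capabilities.includes(capability)) return { ok: false, error: "MODEL_ADAPTER_INCOMPATIBLE" };
  return { ok: true, id: entry.id };
}

const adapters: AdapterEntry[] = [
  { id: "a", capabilities: ["summarize"], priority: 10 },
  { id: "b", capabilities: ["summarize", "extract"], priority: 50 },
];
console.log(resolveAdapter(adapters, "auto", "summarize")); // { ok: true, id: 'b' }
console.log(resolveAdapter(adapters, "a", "extract")); // { ok: false, error: 'MODEL_ADAPTER_INCOMPATIBLE' }
```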
Event protocol
| Event | Direction | Payload | Purpose |
|---|---|---|---|
| pi:model-adapter/register | provider → pi-scraper | entry (shape in the example below) | Announce availability |
| pi:model-adapter/unregister | provider → pi-scraper | { id } | Withdraw (hot-reload / dispose) |
| pi:model-adapter/discover | pi-scraper → provider | { capabilities?, minPriority? } \| {} | Ask providers to re-announce |
Adapters SHOULD honor the discover filter (capability overlap, priority >= minPriority) but MAY re-register unconditionally — pi-scraper's resolver filters by capability anyway, so the unfiltered path is harmless, just noisier.
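The three events form a simple round trip. To show the flow end to end without Pi, the sketch below uses a tiny in-memory bus as a stand-in for pi.events; `Bus`, `registry`, and the `Entry` shape are assumptions made for the demo, and only the event names are taken from the table.

```typescript
// Minimal in-memory bus standing in for pi.events (hypothetical; the real API may differ).
type Handler = (payload: unknown) => void;

class Bus {
  private handlers = new Map<string, Handler[]>();
  on(event: string, h: Handler): void {
    this.handlers.set(event, [...(this.handlers.get(event) ?? []), h]);
  }
  emit(event: string, payload: unknown): void {
    for (const h of this.handlers.get(event) ?? []) h(payload);
  }
}

interface Entry {
  id: string;
  capabilities: string[];
  priority: number;
}

const bus = new Bus();
const registry = new Map<string, Entry>();

// Consumer side (the role pi-scraper plays): track register and unregister.
bus.on("pi:model-adapter/register", (p) => {
  const e = p as Entry;
  registry.set(e.id, e);
});
bus.on("pi:model-adapter/unregister", (p) => {
  registry.delete((p as { id: string }).id);
});

// Provider side: re-announce whenever discovery is requested.
const entry: Entry = { id: "my-adapter", capabilities: ["summarize"], priority: 50 };
bus.on("pi:model-adapter/discover", () => bus.emit("pi:model-adapter/register", entry));

bus.emit("pi:model-adapter/discover", { capabilities: ["summarize"] });
console.log(registry.has("my-adapter")); // true
bus.emit("pi:model-adapter/unregister", { id: "my-adapter" });
console.log(registry.has("my-adapter")); // false
```

Because registration is keyed by id, an adapter that re-registers unconditionally on every discover simply overwrites its own entry, which is why the unfiltered path is harmless.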
Implementing an adapter
Simple — works for any single-adapter setup:
const entry = {
  id: "my-adapter",
  label: "My Adapter",
  capabilities: ["summarize"] as const, // summarize | extract
  priority: 50, // higher wins in "auto"
  adapter: {
    async run(req, signal) {
      // req.task | req.input | req.prompt | req.schema (extract only)
      // Return: { data, text?, raw?, usage? }
      // usage: { provider?, model?, inputTokens?, outputTokens?, totalTokens?, costUSD? }
      // All usage fields optional; supply what you have.
    },
  },
};
// Announce once at load, then again whenever pi-scraper asks providers to re-announce.
pi.events?.emit?.("pi:model-adapter/register", entry);
pi.events?.on?.("pi:model-adapter/discover", () => {
  pi.events?.emit?.("pi:model-adapter/register", entry);
});
Advanced — honors the discover filter (cuts re-registration noise in multi-adapter setups) and tidies up on unload:
pi.events?.on?.("pi:model-adapter/discover", (payload) => {
  const filter = ((payload as object | null) ?? {}) as {
    capabilities?: readonly string[];
    minPriority?: number;
  };
  // Skip re-registration when none of our capabilities were asked for.
  if (filter.capabilities?.length) {
    const overlap = entry.capabilities.some((c) =>
      filter.capabilities!.includes(c),
    );
    if (!overlap) return;
  }
  // Skip re-registration when our priority is below the requested minimum.
  if (
    typeof filter.minPriority === "number" &&
    entry.priority < filter.minPriority
  )
    return;
  pi.events?.emit?.("pi:model-adapter/register", entry);
});
pi.events?.emit?.("pi:model-adapter/unregister", { id: entry.id }); // on unload
web_summarize issues a filtered discover ({ capabilities: ["summarize"] }) on its first invocation when no summarize-capable adapter is registered, then caches per capability so subsequent invocations don't re-emit. web_extract action="adhoc" will adopt the same pattern.
When an adapter returns usage, web_summarize (and web_extract action="adhoc") render a compact footer in the expanded view, for example: gemini-acp · gemini-2.0-flash · 234 in · 187 out · $0.0023. Adapters supply only the fields they have; pi-scraper hides absent fields automatically. Cost is in USD and is the adapter's responsibility to compute — pi-scraper ships no pricing table.
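The footer behavior (show what is present, hide what is absent) can be sketched as a small formatter. `formatUsage` is a hypothetical function; the field names follow the usage shape listed in the adapter example, and the four-decimal cost format is inferred from the sample footer rather than documented.

```typescript
interface Usage {
  provider?: string;
  model?: string;
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
  costUSD?: number;
}

// Join only the fields the adapter supplied, separated by " · ".
function formatUsage(u: Usage): string {
  const parts: string[] = [];
  if (u.provider) parts.push(u.provider);
  if (u.model) parts.push(u.model);
  if (u.inputTokens !== undefined) parts.push(`${u.inputTokens} in`);
  if (u.outputTokens !== undefined) parts.push(`${u.outputTokens} out`);
  if (u.costUSD !== undefined) parts.push(`$${u.costUSD.toFixed(4)}`);
  return parts.join(" · ");
}

console.log(
  formatUsage({
    provider: "gemini-acp",
    model: "gemini-2.0-flash",
    inputTokens: 234,
    outputTokens: 187,
    costUSD: 0.0023,
  }),
); // gemini-acp · gemini-2.0-flash · 234 in · 187 out · $0.0023
```

The sample footer does not show totalTokens, so this sketch leaves it out of the rendered string.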
Development and release checks
Install dependencies from a checkout:
nvm use 22.19.0
npm install
Run the core checks:
npm run typecheck
npm test
npm run test:tools
npm pack --dry-run
Optional checks before a release:
npm run smoke:install
npm run audit:strict
PI_SCRAPER_LIVE=1 npm run smoke:live
Optional browser smoke:
export PLAYWRIGHT_BROWSERS_PATH="${TMPDIR:-/tmp}/pi-scraper-ms-playwright"
npx playwright install chromium
PI_SCRAPER_BROWSER=1 npm run smoke:browser