pi-scraper

Crawl, map, and structured extraction for Pi — scraper-first, Pi-native, and local-first.


Install pi-scraper from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:pi-scraper
Package: pi-scraper
Version: 0.5.1
Published: May 13, 2026
Downloads: 1,061/mo · 256/wk
Author: brandonkramercc
License: MIT
Types: extension, skill
Size: 716.5 KB
Dependencies: 15 dependencies · 0 peers
Pi manifest JSON
{
  "extensions": [
    "./src/index.ts"
  ],
  "skills": [
    "./skills"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README


pi-scraper reads known URLs and small sites. Use it to scrape, summarize one page, crawl, map URLs, diff snapshots, retrieve stored results, or extract deterministic/structured data.

Install

pi install npm:pi-scraper

Quick start

Ask naturally; Pi can choose the right web tool automatically:

Read https://example.com as markdown.
List the URLs available from https://example.com.
Crawl https://example.com, up to 25 pages.
Compare https://example.com against my homepage snapshot.

Add cacheTtlSeconds when you want opt-in fetch-cache reuse; omit it for fresh fetches.

Requirements

  • Node.js >=22.19.0
  • Pi >=0.74.0
  • Optional Chromium binaries for mode: "browser"

Browser mode lazy-loads Playwright. Chromium is not bundled; install only if needed:

npx playwright install chromium

Set PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 when browsers are managed externally.

mode: "fingerprint" uses the bundled impit library for Chrome-class TLS fingerprints and needs no extra install. Its native binary is about 8 MB (one prebuild per platform). Set browserProfile: "chrome" (default) or "firefox".

Public tools

| Tool | Capability | Use it for | Contract tokens ≈ | Input overhead ≈ |
| --- | --- | --- | --- | --- |
| web_scrape | Local; model only for task: "summarize" | Read one URL as markdown/text/LLM text/HTML/JSON, including raw Markdown, MDX, RST, and source docstrings. | 170 | +140 |
| web_summarize | Model/LLM; local scrape input | Summarize one URL or provided content; page-scoped only, not multi-source research. | 116 | +100 |
| web_crawl | Local; browser optional through scrape pipeline | Run/resume a breadth-first crawl, inspect crawl status by crawlId, list prior crawl metadata, or compile crawled docs into API-surface/context packages. | 181 | +158 |
| web_map | Local | Discovery-only URL inventory from robots, sitemaps, gzipped sitemaps, sitemap.xml, and llms.txt; no page-content extraction. | 58 | +67 |
| web_batch | Local; browser optional through scrape pipeline | Scrape many independent URLs with ordered per-URL success/failure results and optional context-package compilation. | 195 | +166 |
| web_diff | Local | Re-scrape, normalize, compare against unnamed, named, or tagged snapshots, and store deterministic diff metadata. | 91 | +82 |
| web_extract | Local/model depending on action | List/run deterministic extractors, inspect patterns, compile API surfaces, run selector extraction with adaptive repair, or extract via schema/prompt. | 290 | +289 |
| web_get_result | Local | Retrieve a stored response by responseId, structured job manifest by jobId, or snapshot listing by snapshotUrl. | 56 | +74 |

Token counts are approximate: Contract is the full serialized tool declaration including schema; Input overhead is the empirical Pi JSON-mode input token delta against a no-tools baseline, which includes provider serialization and hidden wrapper metadata and varies by provider/model.

Capability labels:

| Label | Meaning |
| --- | --- |
| Local | Runs from local HTTP/parsing/storage code without search API keys. |
| Browser optional | Uses lazy Playwright only when requested or auto-escalation justifies it. |
| Model/LLM | Needs Pi's selected model or a configured model adapter after scraping clean page text. |

Parameter quick reference

| Area | Parameters |
| --- | --- |
| Input | url, urls, content |
| Scrape output | mode, format, onlyMainContent, maxChars, timeoutSeconds |
| Freshness/safety | respectRobots defaults true; use refresh: true for time-sensitive facts |
| Session | sessionId only for stateful flows (cookies/login/consent/locale/cart); saveSession: true persists across reloads; clearSession: true deletes. |
| Crawl | action, maxPages, maxDepth, sameOrigin, crawlId, resume, seed, status, limit |
| Concurrency | concurrency, perHostConcurrency; HTTP politeness reacts to 429 and Retry-After |
| Context packages | compile: true on web_crawl/web_batch stores a bounded package artifact |
| API surface | extract: "api-surface" builds a local module/function tree when possible |
| Diff | snapshotName, snapshotTag, compareTag, maxSnapshotAgeSeconds |
| Extract | action, extractor, prompt, schema, sourceFormat, markers, contains, excerpts, regexes, sections, include, extractSchema |
| Retrieve | responseId, jobId, snapshotUrl, snapshotName, snapshotTag |

Examples:

{ "url": "https://example.com", "snapshotName": "homepage" }
{ "url": "https://example.com/docs", "compile": true, "extract": "api-surface" }
{
  "action": "pattern",
  "url": "https://raw.githubusercontent.com/vitejs/vite/main/README.md",
  "sections": [
    { "name": "packages", "start": "## Packages", "end": "## Contribution" }
  ]
}
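
To make the Crawl parameters concrete, here is a minimal sketch of how a breadth-first crawl might apply maxPages, maxDepth, and sameOrigin. It is not pi-scraper's implementation: the LinkGraph type, the crawlOrder function, and the in-memory graph that stands in for fetched pages are all hypothetical.

```typescript
// Hypothetical stand-in for fetched pages: URL -> links found on that page.
type LinkGraph = Record<string, string[]>;

interface CrawlLimits {
  maxPages: number;
  maxDepth: number;
  sameOrigin: boolean;
}

// Breadth-first order: every page at depth d is visited before any page at depth d + 1.
function crawlOrder(seed: string, graph: LinkGraph, limits: CrawlLimits): string[] {
  const seedOrigin = new URL(seed).origin;
  const visited: Record<string, boolean> = {};
  visited[seed] = true;
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < limits.maxPages) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= limits.maxDepth) continue; // do not follow links past maxDepth
    for (const link of graph[url] ?? []) {
      if (visited[link]) continue;
      if (limits.sameOrigin && new URL(link).origin !== seedOrigin) continue;
      visited[link] = true;
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return order;
}

const demoGraph: LinkGraph = {
  "https://example.com/": [
    "https://example.com/a",
    "https://example.com/b",
    "https://other.test/x",
  ],
  "https://example.com/a": ["https://example.com/c"],
  "https://example.com/b": [],
  "https://example.com/c": [],
};

const crawled = crawlOrder("https://example.com/", demoGraph, {
  maxPages: 3,
  maxDepth: 2,
  sameOrigin: true,
});
console.log(crawled);
```

With maxPages: 3 the traversal stops after the seed and its two same-origin children, so the depth-2 page is never reached even though maxDepth would allow it.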

Session rule — default stateless. Use sessionId only when prior state affects later requests: cookies, consent, locale, login, cart/account/dashboard, or multi-step crawl/batch. Add saveSession: true only when state must survive later tool calls; use clearSession: true to reset.

Session example — log in once and reuse cookies across scrapes:

web_scrape({ url: "https://example.com/login", sessionId: "example", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "example" })
web_batch({ urls: ["https://example.com/page1", "https://example.com/page2"], sessionId: "example" })
{ "crawlId": "abc-123", "sessionId": "example", "saveSession": true }

Selector extraction

Extract structured content from HTML using CSS selectors, XPath, or text search. Optionally save a fingerprint of the matched element and relocate it later after page layout changes.

Extract all product cards from https://example.com/products with selector .product-card

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| selector | string | | CSS selector, XPath, or text to find |
| selectorType | string | "css" | "css" or "xpath" or "text" |
| attribute | string | | Extract a specific attribute instead of text |
| identifier | string | (selector) | Stable key for fingerprint storage |
| adaptive | boolean | false | Enable relocation when selector no longer matches |
| autoSave | boolean | false | Save fingerprint after a successful match |
| threshold | number | 0.35 | Minimum similarity score (0–1) for adaptive fallback |
| limit | number | 10 | Maximum elements to return |

Examples:

// Extract all links with href
{ "url": "https://example.com", "selector": "a", "attribute": "href", "identifier": "example-links", "autoSave": true }

// Extract product cards and save fingerprint for future layout stability
{ "url": "https://example.com", "selector": ".product-card", "identifier": "products-v1", "autoSave": true }

// Later — if the layout changes but the content stays the same
{ "url": "https://example.com", "selector": ".product-card", "identifier": "products-v1", "adaptive": true, "threshold": 0.5 }
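
The role of threshold in the adaptive fallback can be sketched as follows. The README does not describe how pi-scraper scores similarity between a saved fingerprint and candidate elements, so this sketch assumes the scores already exist and only shows the cut-off step. The Candidate type and pickRelocated function are hypothetical names.

```typescript
// A candidate element with a precomputed similarity score in [0, 1].
// How the score is derived is outside this sketch.
interface Candidate {
  id: string;
  similarity: number;
}

// Return the best-scoring candidate at or above the threshold, or null if none qualifies.
function pickRelocated(candidates: Candidate[], threshold: number = 0.35): Candidate | null {
  let best: Candidate | null = null;
  for (const candidate of candidates) {
    if (candidate.similarity < threshold) continue; // below the minimum, never accepted
    if (best === null || candidate.similarity > best.similarity) best = candidate;
  }
  return best;
}

const scored: Candidate[] = [
  { id: "div.sidebar-item", similarity: 0.2 },
  { id: "li.card", similarity: 0.42 },
  { id: "article.item", similarity: 0.61 },
];

console.log(pickRelocated(scored));      // default threshold of 0.35
console.log(pickRelocated(scored, 0.7)); // stricter threshold, nothing qualifies
```

A higher threshold, such as the 0.5 used in the last example above, trades recall for fewer false relocations.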

Scrape modes

| Mode | JavaScript support | Playwright required | Typical latency | Extraction quality | Best use case |
| --- | --- | --- | --- | --- | --- |
| fast | No | No | Lowest | Good for static pages | Static HTML, docs, product pages, quick link/text extraction. |
| fingerprint | No | No | Low-medium | Same parser as static path | Sites that block plain HTTP clients but do not require JavaScript. Bundled Chrome/Firefox TLS fingerprint via impit; per-hop SSRF validation owned by pi-scraper. Proxy support deferred (impit's HTTP/3 and proxy are mutually exclusive). |
| readable | No | No | Medium | Higher for articles/main content | Articles, blogs, noisy pages where Readability improves main content. |
| browser | Yes | Yes, optional/lazy | Highest | Best for rendered DOM | JavaScript-rendered pages when static/data-island recovery is insufficient. |
| auto | Only if justified | Only if escalated | Adaptive | Adaptive | Default. Starts local/static, reuses fetched HTML, tries recovery/readable/fingerprint before browser only when block/rendering signals justify it. |

fingerprint mode notes

  • Body size enforcement is incremental, not pre-check. Response body streams chunk-by-chunk through a maxBytes-bounded collector. A server lying about Content-Length cannot bypass the limit — actual bytes are counted and the upstream stream is cancelled mid-flight if exceeded. The trade-off: at least one chunk is read before a too-large response can be rejected.
  • DNS rebinding has a residual TOCTOU window. impit does not expose the connected peer IP, so post-handshake validation is not possible without an upstream change (apify/impit issue tracker). We mitigate via a double DNS resolve: pi-scraper resolves at preflight, resolves again immediately before handing off to impit, and rejects with DNS_REBINDING_DETECTED if the address sets differ. The residual window is the sub-millisecond gap between the second resolve and impit's actual connect(2). Requires resolveDns: true (the default). For arbitrary user-submitted URLs where even a narrow window is unacceptable, set fingerprintTrustLevel: "untrusted" to refuse fingerprint mode entirely, or use mode: "browser" for Chromium-managed DNS pinning.
  • Proxy support deferred. impit's ImpitOptions makes proxyUrl and HTTP/3 mutually exclusive, and HTTP/3 ALPN advertisement is part of the Chrome fingerprint we're impersonating. Until a per-call disableHttp3 escape hatch lands (or impit upstream supports both simultaneously), use mode: "fast" with the standard HTTP client for proxied scrapes.
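
The incremental body-size enforcement in the first note can be illustrated with a small collector that counts bytes as they arrive. This is a sketch under assumptions, not pi-scraper's code: BoundedCollector and readBody are hypothetical names, and an array of chunks stands in for the response stream.

```typescript
// Collects chunks while enforcing a hard cap on the bytes actually received.
class BoundedCollector {
  private readonly maxBytes: number;
  private parts: Uint8Array[] = [];
  private total = 0;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  // Accept one chunk; throws as soon as the running count passes the limit.
  // A declared Content-Length plays no part in this check.
  push(chunk: Uint8Array): void {
    this.total += chunk.byteLength;
    if (this.total > this.maxBytes) {
      throw new Error("body exceeded " + this.maxBytes + " bytes");
    }
    this.parts.push(chunk);
  }

  // Concatenate the accepted chunks into one buffer.
  bytes(): Uint8Array {
    const out = new Uint8Array(this.total);
    let offset = 0;
    for (const part of this.parts) {
      out.set(part, offset);
      offset += part.byteLength;
    }
    return out;
  }
}

// Feed chunks until the collector refuses one; report how many were read.
function readBody(
  chunks: Uint8Array[],
  maxBytes: number,
): { ok: boolean; chunksRead: number; size: number } {
  const collector = new BoundedCollector(maxBytes);
  let chunksRead = 0;
  try {
    for (const chunk of chunks) {
      chunksRead++;
      collector.push(chunk);
    }
  } catch {
    // A real reader would cancel the upstream stream at this point.
    return { ok: false, chunksRead, size: 0 };
  }
  return { ok: true, chunksRead, size: collector.bytes().byteLength };
}

const fourChunks = [new Uint8Array(4), new Uint8Array(4), new Uint8Array(4), new Uint8Array(4)];
console.log(readBody(fourChunks, 10)); // rejected on the third chunk
console.log(readBody(fourChunks, 16)); // fits exactly
```

The rejected case shows the trade-off the note mentions: the offending chunk has to be read before the limit can trip, but nothing after it is consumed.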

Vertical extraction

Vertical extractors return typed JSON for known sites, preferring public APIs/feeds over browser or LLM extraction.

| Extractor | Input patterns | Primary strategy | Browser/cloud/LLM requirement |
| --- | --- | --- | --- |
| github_repo | GitHub repository URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_issue | GitHub issue URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_pr | GitHub pull request URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| github_release | GitHub release tag URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| npm | npm package URLs | npm registry JSON | No browser; no LLM. |
| pypi | PyPI package URLs | PyPI JSON API | No browser; no LLM. |
| crates_io | crates.io crate URLs | crates.io API | No browser; no LLM. |
| docker_hub | Docker Hub repository URLs | Docker Hub repository API | No browser; no LLM. |
| huggingface_model | Hugging Face model URLs | Hugging Face public model API | No browser; no LLM. |
| huggingface_dataset | Hugging Face dataset URLs | Hugging Face public dataset API | No browser; no LLM. |
| hackernews | Hacker News item URLs | Hacker News Firebase item API | No browser; no LLM. |
| reddit | Public Reddit post URLs | Reddit structured JSON endpoint | No browser; no LLM; returns blocked/rate-limit errors instead of bot-like HTML scraping. |
| arxiv | arXiv abstract/PDF entry URLs | arXiv Atom export feed | No browser; no LLM. |
| deepwiki | DeepWiki URLs | Static HTML metadata parsing | No browser; no LLM. |
| docsite | Docs sites, MDN, GitBook, ReadTheDocs, Docusaurus | Static HTML section parsing | No browser; no LLM; returns platform with unknown fallback. |
| docstrings | Raw .ts, .js, .py, and .rs source URLs | Surface docstring parsing | No browser; no LLM; extracts documented exports without typechecking. |

web_extract modes:

  • action: "list" — inspect runtime extractor declarations.
  • action: "vertical" — known-site typed JSON, including docstrings.
  • action: "pattern" — deterministic length, markers, contains, regex, excerpts, start/end sections, symbol include, and extractSchema presets.
  • action: "selector" — CSS/XPath/text selector extraction with optional adaptive fingerprint relocation (see Selector extraction).
  • extract: "api-surface" — local hierarchical module/function tree.
  • action: "adhoc" — custom schema/prompt extraction; model-backed.

Reddit returns structured blocked/rate-limit errors rather than bypassing robots, auth, CAPTCHA, or anti-bot controls. Substack/Shopify are not built-ins yet because reliable machine-readable surfaces vary.

Storage, cache, and history

Large outputs are stored locally and returned with compact summaries plus responseId / fullOutputPath. Inline previews follow Pi defaults: 50 KB or 2000 lines.

Storage uses a local SQLite metadata index plus content-addressed blobs. Cache reuse is opt-in with cacheTtlSeconds; default behavior is fresh network fetches. Use refresh: true for time-sensitive facts, web_crawl action: "list"|"status" for prior crawl freshness, and web_diff maxSnapshotAgeSeconds for stale baselines.
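
The opt-in cache rule can be written as a single decision function. This is a sketch, not the package's code: CacheEntry, FetchOptions, and canReuseCache are hypothetical names, and it assumes refresh: true takes precedence over cacheTtlSeconds when both are set, which the README implies but does not state.

```typescript
interface CacheEntry {
  storedAtMs: number; // when the cached response was stored
}

interface FetchOptions {
  cacheTtlSeconds?: number;
  refresh?: boolean;
}

// Decide whether a stored response may be reused instead of fetching again.
function canReuseCache(
  cached: CacheEntry | undefined,
  opts: FetchOptions,
  nowMs: number,
): boolean {
  if (cached === undefined) return false;               // nothing stored yet
  if (opts.refresh === true) return false;              // assumption: refresh always refetches
  if (opts.cacheTtlSeconds === undefined) return false; // opt-in: no TTL means a fresh fetch
  const ageSeconds = (nowMs - cached.storedAtMs) / 1000;
  return ageSeconds <= opts.cacheTtlSeconds;
}

const storedEntry: CacheEntry = { storedAtMs: 0 };
console.log(canReuseCache(storedEntry, {}, 30_000));                       // no TTL given
console.log(canReuseCache(storedEntry, { cacheTtlSeconds: 60 }, 30_000));  // 30 s old, TTL 60 s
console.log(canReuseCache(storedEntry, { cacheTtlSeconds: 60 }, 61_000));  // 61 s old, TTL 60 s
```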

Persistent paths:

| Data | Path |
| --- | --- |
| Config | ~/.pi/scraper/config/web.json |
| SQLite index | ~/.pi/scraper/index.db |
| Payload blobs | ~/.pi/scraper/blobs/<aa>/... |
| Legacy snapshots | ~/.pi/scraper/snapshots/ |
| Legacy backups | ~/.pi/scraper/results.bak/, ~/.pi/scraper/crawl.bak/ after migration |

Safety and anti-bot scope

  • SSRF/private-network protection is applied before fetches and at the HTTP connect/redirect layer.
  • HTTP cookies are scoped to the response origin: Set-Cookie Domain attributes are validated against the response host (RFC 6265 §5.1.3 / §5.3 step 6) and the Path attribute follows RFC 6265 default-path semantics (§5.1.4 / §5.2.4 — last valid Path wins, invalid values fall back to the request-URI directory).
  • In mode: "browser", service workers are blocked, every subresource URL is re-validated through the same SSRF guard, and DNS dedup is per-page so concurrent renders sharing a session cannot bleed safety decisions.
  • respectRobots defaults to true.
  • Response body sizes are bounded before allocation and while streaming.
  • Browser rendering is optional and lazy-loaded.
  • The package may detect bot-block pages and return structured blocked/error results.
  • It does not promise CAPTCHA solving, residential proxy rotation, stealth guarantees, or guaranteed access to protected sites.
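
The two RFC 6265 rules cited for cookie scoping are short enough to sketch directly. The functions below follow the domain-match algorithm (§5.1.3) and the default-path algorithm (§5.1.4) as written in the RFC. They are an illustration rather than pi-scraper's code, the names are hypothetical, and the IP-address check is deliberately simplified.

```typescript
// RFC 6265 §5.1.3 domain-match. Both inputs are assumed to be canonicalized
// (lowercase, no leading dot on the cookie domain).
function domainMatches(host: string, cookieDomain: string): boolean {
  if (host === cookieDomain) return true;
  // Simplified IP check: dotted IPv4 or anything containing ":" (IPv6).
  const isIp = /^\d{1,3}(\.\d{1,3}){3}$/.test(host) || host.indexOf(":") !== -1;
  if (isIp) return false; // suffix matching applies to host names only
  const suffix = "." + cookieDomain;
  return host.length > suffix.length && host.slice(-suffix.length) === suffix;
}

// RFC 6265 §5.1.4 default-path, computed from the request URI path.
function defaultPath(requestPath: string): string {
  if (requestPath === "" || requestPath.charAt(0) !== "/") return "/";
  const lastSlash = requestPath.lastIndexOf("/");
  // Only the leading slash present: the default path is "/".
  return lastSlash === 0 ? "/" : requestPath.slice(0, lastSlash);
}

console.log(domainMatches("www.example.com", "example.com"));  // subdomain of the cookie domain
console.log(domainMatches("badexample.com", "example.com"));   // suffix without a dot boundary
console.log(defaultPath("/docs/guide/intro.html"));
```

A Set-Cookie whose Domain attribute fails domainMatches against the response host is ignored under §5.3 step 6, and a missing or invalid Path attribute falls back to defaultPath of the request.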

Packaged skill

Includes the compact web-scraping Pi skill for tool routing.

Configuration command

Use /scrape-config to inspect effective settings and persist defaults interactively or via direct arguments.

| Sub-action | What it does |
| --- | --- |
| (no args) | Interactive picker (falls back to status when UI unavailable) |
| status | Effective config + live adapter-resolution preview |
| model-provider <value> | Set modelProvider (auto / off / <adapter-id>) |
| scrape-mode <mode> [format] | Set scrapeMode + outputFormat |
| cache stats | Inspect response cache size and entry counts |
| cache clear | Clear response cache (confirm prompt) |
| robots on/off | Toggle respectRobots default |
| reload | Reload config from disk, clearing the in-memory cache |

The effective config is cached in memory for the session. After hand-editing ~/.pi/scraper/config/web.json, run /scrape-config reload (or restart the session) to pick up changes.

Model adapters

web_summarize and web_extract action="adhoc" need an LLM transport. When Pi has a model configured (OpenAI, Anthropic, Google, etc.), the tools use it automatically via the host context — no extra extension needed. Any Pi extension can also supply one via pi.events for cross-extension provider lending. With no adapter available, the tools return MODEL_ADAPTER_MISSING and the LLM falls back to web_scrape + summarize-in-reply.

Capabilities

| Capability | What it does | Used by |
| --- | --- | --- |
| summarize | Page-scoped natural-language summary of scraped content. | web_summarize |
| extract | Schema- or prompt-driven structured extraction (JSON shape) from scraped content. | web_extract action="adhoc" |

Configuration

Highest layer wins:

| Layer | Mechanism | Use |
| --- | --- | --- |
| Programmatic | options.modelAdapter (test / injected) | Direct override |
| Pi host | ctx.model (Pi's currently selected model) | Automatic when available |
| Per-call | provider param on the tool call | LLM routes a single call |
| Pi flag | --web-model-provider=auto\|<id>\|off | Per Pi session |
| Env var | PI_WEB_MODEL_PROVIDER | Shell / scripts |
| Config file | modelProvider (string or { summarize, extract }) | Persistent default |
| Default | "auto" | Out-of-box |

"auto" picks the highest-priority adapter that supports the requested capability. "off" returns MODEL_ADAPTER_MISSING and (at config level) hides the model-backed tools from Pi's tool list.

Errors: MODEL_ADAPTER_MISSING (none registered, LLM redirected to web_scrape), MODEL_ADAPTER_NOT_FOUND (explicit ID unknown — error lists known IDs), MODEL_ADAPTER_INCOMPATIBLE (ID registered but lacks the requested capability).
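
How "auto", "off", and an explicit adapter ID map onto the three error codes can be sketched as a small resolver. This is one reading of the rules above, not pi-scraper's actual resolver: AdapterEntry, Resolution, and resolveAdapter are hypothetical names, and the registry is a plain array.

```typescript
type Capability = "summarize" | "extract";

interface AdapterEntry {
  id: string;
  capabilities: readonly Capability[];
  priority: number;
}

type Resolution =
  | { ok: true; adapter: AdapterEntry }
  | {
      ok: false;
      code: "MODEL_ADAPTER_MISSING" | "MODEL_ADAPTER_NOT_FOUND" | "MODEL_ADAPTER_INCOMPATIBLE";
    };

function supports(entry: AdapterEntry, capability: Capability): boolean {
  return entry.capabilities.some((c) => c === capability);
}

// provider is "auto", "off", or an explicit adapter id.
function resolveAdapter(
  registry: AdapterEntry[],
  provider: string,
  capability: Capability,
): Resolution {
  if (provider === "off") return { ok: false, code: "MODEL_ADAPTER_MISSING" };

  if (provider === "auto") {
    // Pick the highest-priority adapter that supports the capability.
    let best: AdapterEntry | undefined;
    for (const entry of registry) {
      if (!supports(entry, capability)) continue;
      if (best === undefined || entry.priority > best.priority) best = entry;
    }
    return best ? { ok: true, adapter: best } : { ok: false, code: "MODEL_ADAPTER_MISSING" };
  }

  let found: AdapterEntry | undefined;
  for (const entry of registry) {
    if (entry.id === provider) found = entry;
  }
  if (found === undefined) return { ok: false, code: "MODEL_ADAPTER_NOT_FOUND" };
  if (!supports(found, capability)) return { ok: false, code: "MODEL_ADAPTER_INCOMPATIBLE" };
  return { ok: true, adapter: found };
}

const demoRegistry: AdapterEntry[] = [
  { id: "summary-only", capabilities: ["summarize"], priority: 50 },
  { id: "full", capabilities: ["summarize", "extract"], priority: 80 },
];

console.log(resolveAdapter(demoRegistry, "auto", "summarize"));
console.log(resolveAdapter(demoRegistry, "summary-only", "extract"));
console.log(resolveAdapter(demoRegistry, "unknown-id", "summarize"));
```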

Event protocol

| Event | Direction | Payload | Purpose |
| --- | --- | --- | --- |
| pi:model-adapter/register | provider → pi-scraper | entry (shape in the example below) | Announce availability |
| pi:model-adapter/unregister | provider → pi-scraper | { id } | Withdraw (hot-reload / dispose) |
| pi:model-adapter/discover | pi-scraper → provider | { capabilities?, minPriority? } \| {} | Ask providers to re-announce |

Adapters SHOULD honor the discover filter (capability overlap, priority >= minPriority) but MAY re-register unconditionally — pi-scraper's resolver filters by capability anyway, so the unfiltered path is harmless, just noisier.

Implementing an adapter

Simple — works for any single-adapter setup:

const entry = {
  id: "my-adapter",
  label: "My Adapter",
  capabilities: ["summarize"] as const, // summarize | extract
  priority: 50, // higher wins in "auto"
  adapter: {
    async run(req, signal) {
      // req.task | req.input | req.prompt | req.schema (extract only)
      // Return: { data, text?, raw?, usage? }
      //   usage: { provider?, model?, inputTokens?, outputTokens?, totalTokens?, costUSD? }
      //   All usage fields optional — supply what you have.
    },
  },
};

pi.events?.emit?.("pi:model-adapter/register", entry);
pi.events?.on?.("pi:model-adapter/discover", () => {
  pi.events?.emit?.("pi:model-adapter/register", entry);
});

Advanced — honors the discover filter (cuts re-registration noise in multi-adapter setups) and tidies up on unload:

pi.events?.on?.("pi:model-adapter/discover", (payload) => {
  const filter = ((payload as object | null) ?? {}) as {
    capabilities?: readonly string[];
    minPriority?: number;
  };

  if (filter.capabilities?.length) {
    const overlap = entry.capabilities.some((c) =>
      filter.capabilities!.includes(c),
    );
    if (!overlap) return;
  }
  if (
    typeof filter.minPriority === "number" &&
    entry.priority < filter.minPriority
  )
    return;

  pi.events?.emit?.("pi:model-adapter/register", entry);
});

pi.events?.emit?.("pi:model-adapter/unregister", { id: entry.id }); // on unload

web_summarize issues a filtered discover ({ capabilities: ["summarize"] }) on its first invocation when no summarize-capable adapter is registered, then caches per capability so subsequent invocations don't re-emit. web_extract action="adhoc" will adopt the same pattern.

When an adapter returns usage, web_summarize (and web_extract action="adhoc") render a compact footer in the expanded view, for example: gemini-acp · gemini-2.0-flash · 234 in · 187 out · $0.0023. Adapters supply only the fields they have; pi-scraper hides absent fields automatically. Cost is in USD and is the adapter's responsibility to compute — pi-scraper ships no pricing table.
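
A footer like the one above could be assembled from whichever usage fields are present. This sketch is an assumption about the rendering, not pi-scraper's code: formatUsageFooter is a hypothetical name, four decimal places for cost is inferred from the single example, and totalTokens is left out because the example does not show it.

```typescript
// Field names follow the usage shape documented for adapters; all are optional.
interface Usage {
  provider?: string;
  model?: string;
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
  costUSD?: number;
}

// Join the fields that are present with " · "; absent fields are skipped.
function formatUsageFooter(usage: Usage): string {
  const parts: string[] = [];
  if (usage.provider) parts.push(usage.provider);
  if (usage.model) parts.push(usage.model);
  if (usage.inputTokens !== undefined) parts.push(usage.inputTokens + " in");
  if (usage.outputTokens !== undefined) parts.push(usage.outputTokens + " out");
  if (usage.costUSD !== undefined) parts.push("$" + usage.costUSD.toFixed(4));
  return parts.join(" · ");
}

console.log(
  formatUsageFooter({
    provider: "gemini-acp",
    model: "gemini-2.0-flash",
    inputTokens: 234,
    outputTokens: 187,
    costUSD: 0.0023,
  }),
);
// prints "gemini-acp · gemini-2.0-flash · 234 in · 187 out · $0.0023"
```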

Development and release checks

Install dependencies from a checkout:

nvm use 22.19.0
npm install

Run the core checks:

npm run typecheck
npm test
npm run test:tools
npm pack --dry-run

Optional checks before a release:

npm run smoke:install
npm run audit:strict
PI_SCRAPER_LIVE=1 npm run smoke:live

Optional browser smoke:

export PLAYWRIGHT_BROWSERS_PATH="${TMPDIR:-/tmp}/pi-scraper-ms-playwright"
npx playwright install chromium
PI_SCRAPER_BROWSER=1 npm run smoke:browser

License

MIT