pi-scraper

Crawl, map, and structured extraction for Pi — scraper-first, Pi-native, and local-first.

Package details

extension · skill

Install pi-scraper from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:pi-scraper
Package
pi-scraper
Version
0.2.1
Published
May 4, 2026
Downloads
317/mo · 317/wk
Author
brandonkramercc
License
MIT
Types
extension, skill
Size
379.9 KB
Dependencies
12 dependencies · 2 peers
Pi manifest JSON
{
  "extensions": [
    "./src/index.ts"
  ],
  "skills": [
    "./skills"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-scraper

Crawl, map, and structured extraction for Pi — scraper-first, Pi-native, and local-first.

pi-scraper is a Pi extension for reading web pages and small sites. It focuses on fast scraping, recursive crawling, URL/site mapping, brand extraction, content diffing, PDF text extraction, local result history, and deterministic vertical extraction.

Use it when you already have URLs and want to read, crawl, compare, or extract them. Use a companion search/research extension such as pi-gemini-acp when you need broad source discovery or multi-source synthesis first.

Install

From npm:

pi install npm:pi-scraper

Quick start

Ask naturally; Pi can choose the right web tool automatically:

Read https://example.com as markdown.
List the URLs available from https://example.com.
Crawl https://example.com, up to 25 pages.
Compare https://example.com against my homepage snapshot.

For repeated local work, Pi can opt into the fetch cache:

{ "url": "https://example.com", "cacheTtlSeconds": 3600 }

Omit cacheTtlSeconds for always-fresh behavior.

Requirements

  • Node.js >=22.19.0
  • Pi >=0.65.0
  • Optional Chromium binaries for mode: "browser"

Normal installs include the optional Playwright package but do not bundle Chromium browser binaries. Install Chromium only if you need browser rendering:

npx playwright install chromium

If optional dependencies were omitted at install time, run npm install playwright in the pi-scraper checkout/install directory first.

Managed environments that install browsers separately can set:

PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1

mode: "fingerprint" is an optional static-fetch capability. The package exposes a safe backend boundary for a no-redirect TLS/HTTP fingerprint adapter, but does not bundle a fingerprint backend by default. Without one, fingerprint mode returns structured FINGERPRINT_BACKEND_MISSING metadata; other modes continue to work.

Pi manifest

The package declares its extension entrypoint and packaged skills in package.json:

{
  "pi": {
    "extensions": ["./src/index.ts"],
    "skills": ["./skills"]
  }
}

Public tools

Tool | Capability | Use it for
web_scrape | Local; browser/fingerprint optional | Fetch and extract one URL to markdown, text, LLM text, HTML, or JSON.
web_crawl | Local; browser optional through scrape pipeline | Breadth-first crawl with depth/page limits, robots, resume state, and compact stored results.
web_map | Local | Discovery-only URL inventory from robots, sitemaps, gzipped sitemaps, sitemap.xml, and llms.txt; no page-content extraction.
web_batch | Local; browser optional through scrape pipeline | Scrape many independent URLs with ordered per-URL success/failure results.
web_brand | Local; browser optional via mode | Extract colors, fonts, logos, favicons, manifests, JSON-LD, Open Graph, and Twitter assets.
web_diff | Local | Re-scrape, normalize, compare against unnamed or named snapshots, and store deterministic diff metadata.
web_list_extractors | Local | List deterministic vertical extractors and their browser/cloud/LLM capability declarations.
web_vertical_scrape | Local/API depending on extractor | Run known-site extractors that prefer public APIs/feeds over HTML scraping.
web_extract | Model/LLM | Ad hoc schema or prompt extraction from one page after scraping clean text.
web_summarize | Model/LLM | Page-scoped summary after scraping clean page text.
web_get_result | Local storage | Retrieve full stored output by responseId, crawl status by crawlId, or diff snapshot metadata by URL/name.
web_history | Local storage | List prior local scrapes/fetches for a URL so recent stored content can be reused deliberately.
web_crawls | Local storage | List prior crawls with staleness and recommended resume/reuse/recrawl guidance.
web_search_scrapes | Local storage | Full-text stored scrape recall when runtime SQLite has FTS5; otherwise returns a clean unsupported response.

Capability labels:

Label | Meaning
Local | Runs from local HTTP/parsing/storage code without search API keys.
Browser optional | Uses lazy Playwright only when requested or when auto-escalation justifies it.
Model/LLM | Needs Pi's selected model or a configured model adapter after scraping clean page text.
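
As a quick illustration, the discovery and batch tools take the same URL-style inputs described under Common parameters below (URLs and limits here are illustrative):

web_map({ "url": "https://example.com", "maxPages": 200 })
web_batch({ "urls": ["https://example.com/pricing", "https://example.com/docs"], "format": "markdown" })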

Common parameters

Scrape-like tools

Used by web_scrape, web_batch, web_crawl, web_brand, web_diff, web_extract, and web_summarize.

Parameter | Description
url / urls | HTTP(S) URL or URLs. Private-network and unsupported schemes are blocked by default.
mode | auto, fast, fingerprint, readable, or browser. Use auto unless the user requests a path.
format | markdown, text, llm, html, or json.
include / exclude | Optional CSS selectors for content inclusion/exclusion where supported.
onlyMainContent | Prefer main/article-like content.
timeoutSeconds | Per-request timeout.
maxBytes / maxChars | Response/output bounds.
respectRobots | Defaults to true; disabling must be explicit.
headers | Optional HTTP headers.
proxy | Optional proxy for supported modes/providers.
browserProfile / osProfile | Optional browser/fingerprint profile hints.
cacheTtlSeconds | Opt-in fetch cache TTL in seconds. Omit for always-fresh behavior.
maxAgeSeconds | Hard maximum cache age before forcing a network fetch.
refresh | Bypass cache lookup while still recording a fresh fetch when caching is enabled.
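
For example, a single web_scrape call combining several of these parameters might look like this (values illustrative):

web_scrape({
  "url": "https://example.com/docs/getting-started",
  "mode": "auto",
  "format": "markdown",
  "onlyMainContent": true,
  "timeoutSeconds": 30,
  "maxChars": 20000,
  "cacheTtlSeconds": 3600
})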

Crawl and map

Parameter | Description
maxPages | Maximum pages to crawl or discover.
maxDepth | Maximum link depth from the seed URL.
sameOrigin | Defaults to same-origin crawling.
include / exclude | URL pattern filters.
concurrency / per-host options | Bound crawl work while HTTP politeness also enforces host limits.
crawlId | Resume/persist crawl state under the local SQLite crawl index.
resume | For web_crawl, resume existing crawlId state; defaults to true when available.
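
A bounded, resumable crawl of a small docs site could then look like this (the crawlId is an arbitrary illustrative label):

web_crawl({
  "url": "https://example.com/docs",
  "maxPages": 25,
  "maxDepth": 2,
  "sameOrigin": true,
  "crawlId": "example-docs",
  "resume": true
})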

Diff snapshots

web_diff compares the current normalized page content against a previous snapshot. Pass snapshotName to keep a repeatable baseline per URL:

{ "url": "https://example.com", "snapshotName": "homepage" }

Reusing the same snapshotName compares against and then replaces that named baseline. Use web_get_result({ "responseId": "..." }) to retrieve full diff details, or web_get_result({ "snapshotUrl": url, "snapshotName": "homepage" }) to inspect snapshot metadata.

Scrape modes

Mode | JavaScript support | Playwright required | Typical latency | Extraction quality | Best use case
fast | No | No | Lowest | Good for static pages | Static HTML, docs, product pages, quick link/text extraction.
fingerprint | No | No | Low-medium | Same parser as static path | Sites that block plain HTTP clients but do not require JavaScript. Requires a configured optional no-redirect fingerprint backend; proxy is rejected until equivalent SSRF guarantees exist.
readable | No | No | Medium | Higher for articles/main content | Articles, blogs, noisy pages where Readability improves main content.
browser | Yes | Yes, optional/lazy | Highest | Best for rendered DOM | JavaScript-rendered pages when static/data-island recovery is insufficient.
auto | Only if justified | Only if escalated | Adaptive | Adaptive | Default. Starts local/static, reuses fetched HTML, tries recovery/readable/fingerprint before browser only when block/rendering signals justify it.
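
When a page is known to need JavaScript rendering and Chromium is installed (see Requirements), the browser path can be requested directly instead of waiting for auto-escalation:

web_scrape({ "url": "https://example.com/app", "mode": "browser", "format": "markdown" })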

Vertical extraction

Vertical extractors return typed JSON for known sites. They prefer public APIs and feeds over browser or LLM extraction.

Extractor | Input patterns | Primary strategy | Browser/cloud/LLM requirement
github_repo | GitHub repository URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access.
github_issue | GitHub issue URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access.
github_pr | GitHub pull request URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access.
github_release | GitHub release tag URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access.
npm | npm package URLs | npm registry JSON | No browser; no LLM.
pypi | PyPI package URLs | PyPI JSON API | No browser; no LLM.
crates_io | crates.io crate URLs | crates.io API | No browser; no LLM.
docker_hub | Docker Hub repository URLs | Docker Hub repository API | No browser; no LLM.
huggingface_model | Hugging Face model URLs | Hugging Face public model API | No browser; no LLM.
huggingface_dataset | Hugging Face dataset URLs | Hugging Face public dataset API | No browser; no LLM.
hackernews | Hacker News item URLs | Hacker News Firebase item API | No browser; no LLM.
arxiv | arXiv abstract/PDF entry URLs | arXiv Atom export feed | No browser; no LLM.
deepwiki | DeepWiki URLs | Static HTML metadata parsing | No browser; no LLM.

Use web_list_extractors to inspect exact runtime declarations. Use web_extract for arbitrary pages that need a custom schema or prompt and model-backed extraction.

Reddit, Substack, and Shopify candidates are intentionally not listed as built-ins yet because their reliable machine-readable surfaces vary by community, publication, or storefront.
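
As a sketch of the intended split (URLs illustrative; web_extract's prompt/schema parameter shape is inferred from its description above, so verify with web_list_extractors and the tool schema at runtime):

web_vertical_scrape({ "url": "https://github.com/example/project" })
web_extract({ "url": "https://example.com/pricing", "prompt": "List each plan name with its monthly price." })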

Storage, cache, and history

Tool results use Pi's standard shell:

{
  content: [{ type: "text", text }],
  details: {
    url,
    finalUrl,
    status,
    mode,
    format,
    timing,
    truncated,
    fullOutputPath,
    responseId,
    data
  }
}

Large crawl, batch, diff, and scrape outputs are stored locally and returned with a compact summary plus responseId. Retrieve full content later with web_get_result.
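
For example, after a large scrape or crawl returns a compact summary, the stored output or crawl status can be pulled back by its identifier:

web_get_result({ "responseId": "..." })
web_get_result({ "crawlId": "..." })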

Inline truncation follows Pi defaults:

  • 50KB
  • 2000 lines

The storage backend uses a local SQLite metadata index plus content-addressed blob files. Cache reuse is opt-in with cacheTtlSeconds; default behavior remains fresh network fetches. Cached results include cache.cached, fetchedAt, ageSeconds, ttlSeconds, and staleness metadata when returned from the fetch cache.
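
For instance, an opt-in cached read that only reuses stored content younger than ten minutes, and a forced-fresh variant of the same call (URL illustrative):

web_scrape({ "url": "https://example.com/status", "cacheTtlSeconds": 3600, "maxAgeSeconds": 600 })
web_scrape({ "url": "https://example.com/status", "cacheTtlSeconds": 3600, "refresh": true })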

Use history tools deliberately:

  • web_history + web_get_result when existing content is recent enough.
  • web_crawls to find prior crawls and decide whether to resume, reuse, or recrawl.
  • web_search_scrapes to recall stored markdown/text when SQLite FTS5 is available.
  • refresh: true for time-sensitive questions such as prices, news, status pages, availability, or anything the user asks about “now”.

The fetch cache currently records in-memory text/buffer responses. Streamed binary downloads are saved as normal result blobs but are not reused as raw HTTP cache hits.

Persistent paths:

Data | Path
Config | ~/.pi/scraper/config/web.json
SQLite index | ~/.pi/scraper/index.db
Payload blobs | ~/.pi/scraper/blobs/<aa>/...
Legacy snapshots | ~/.pi/scraper/snapshots/
Legacy backups | ~/.pi/scraper/results.bak/, ~/.pi/scraper/crawl.bak/ (after migration)

Safety and anti-bot scope

  • SSRF/private-network protection is applied before fetches and at the HTTP connect/redirect layer.
  • respectRobots defaults to true.
  • Response body sizes are bounded before allocation and while streaming.
  • Browser rendering is optional and lazy-loaded.
  • The package may detect bot-block pages and return structured blocked/error results.
  • It does not promise CAPTCHA solving, residential proxy rotation, stealth guarantees, or guaranteed access to protected sites.

Packaged skill

This package includes a small Pi skill, web-scraping, with guidance for choosing between scrape, map, crawl, batch, brand, diff, vertical extraction, history, and page-scoped extraction tools.

Development and release checks

Install dependencies from a checkout:

nvm use 22.19.0
npm install

Run the core checks:

npm run typecheck
npm test
npm run test:tools
npm pack --dry-run

Optional checks before a release:

npm run smoke:install
npm run audit:strict
PI_SCRAPER_LIVE=1 npm run smoke:live

Benchmark suites live under bench/suites/; generated summaries and ignored JSON history live under bench/results/. See bench/README.md for the current layout and output paths.

Optional browser smoke:

export PLAYWRIGHT_BROWSERS_PATH="${TMPDIR:-/tmp}/pi-scraper-ms-playwright"
npx playwright install chromium
PI_SCRAPER_BROWSER=1 npm run smoke:browser

License

MIT