# pi-scraper
Crawl, map, and structured extraction for Pi — scraper-first, Pi-native, and local-first.
## Package details

Install pi-scraper from npm and Pi will load the resources declared by the package manifest.

```sh
$ pi install npm:pi-scraper
```

| Field | Value |
|---|---|
| Package | pi-scraper |
| Version | 0.2.1 |
| Published | May 4, 2026 |
| Downloads | 317/mo · 317/wk |
| Author | brandonkramercc |
| License | MIT |
| Types | extension, skill |
| Size | 379.9 KB |
| Dependencies | 12 dependencies · 2 peers |
### Pi manifest JSON

```json
{
  "extensions": [
    "./src/index.ts"
  ],
  "skills": [
    "./skills"
  ]
}
```

### Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
## README
pi-scraper is a Pi extension for reading web pages and small sites. It focuses on fast scraping, recursive crawling, URL/site mapping, brand extraction, content diffing, PDF text extraction, local result history, and deterministic vertical extraction.
Use it when you already have URLs and want to read, crawl, compare, or extract them. Use a companion search/research extension such as pi-gemini-acp when you need broad source discovery or multi-source synthesis first.
## Install

From npm:

```sh
pi install npm:pi-scraper
```
## Quick start

Ask naturally; Pi can choose the right web tool automatically:

- Read https://example.com as markdown.
- List the URLs available from https://example.com.
- Crawl https://example.com, up to 25 pages.
- Compare https://example.com against my homepage snapshot.

For repeated local work, Pi can opt into the fetch cache:

```json
{ "url": "https://example.com", "cacheTtlSeconds": 3600 }
```

Omit `cacheTtlSeconds` for always-fresh behavior.
## Requirements

- Node.js >=22.19.0
- Pi >=0.65.0
- Optional Chromium binaries for `mode: "browser"`

Normal installs include the optional Playwright package but do not bundle Chromium browser binaries. Install Chromium only if you need browser rendering:

```sh
npx playwright install chromium
```

If optional dependencies were omitted, first run `npm install playwright` in the pi-scraper checkout/install directory.

Managed environments that install browsers separately can set:

```sh
PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1
```
mode: "fingerprint" is an optional static-fetch capability. The package exposes a safe backend boundary for a no-redirect TLS/HTTP fingerprint adapter, but does not bundle a fingerprint backend by default. Without one, fingerprint mode returns structured FINGERPRINT_BACKEND_MISSING metadata; other modes continue to work.
## Pi manifest

The package declares its extension entrypoint and packaged skills in `package.json`:

```json
{
  "pi": {
    "extensions": ["./src/index.ts"],
    "skills": ["./skills"]
  }
}
```
## Public tools

| Tool | Capability | Use it for |
|---|---|---|
| `web_scrape` | Local; browser/fingerprint optional | Fetch and extract one URL to markdown, text, LLM text, HTML, or JSON. |
| `web_crawl` | Local; browser optional through scrape pipeline | Breadth-first crawl with depth/page limits, robots, resume state, and compact stored results. |
| `web_map` | Local | Discovery-only URL inventory from robots, sitemaps, gzipped sitemaps, sitemap.xml, and llms.txt; no page-content extraction. |
| `web_batch` | Local; browser optional through scrape pipeline | Scrape many independent URLs with ordered per-URL success/failure results. |
| `web_brand` | Local; browser optional via mode | Extract colors, fonts, logos, favicons, manifests, JSON-LD, Open Graph, and Twitter assets. |
| `web_diff` | Local | Re-scrape, normalize, compare against unnamed or named snapshots, and store deterministic diff metadata. |
| `web_list_extractors` | Local | List deterministic vertical extractors and their browser/cloud/LLM capability declarations. |
| `web_vertical_scrape` | Local/API depending on extractor | Run known-site extractors that prefer public APIs/feeds over HTML scraping. |
| `web_extract` | Model/LLM | Ad hoc schema or prompt extraction from one page after scraping clean text. |
| `web_summarize` | Model/LLM | Page-scoped summary after scraping clean page text. |
| `web_get_result` | Local storage | Retrieve full stored output by responseId, crawl status by crawlId, or diff snapshot metadata by URL/name. |
| `web_history` | Local storage | List prior local scrapes/fetches for a URL so recent stored content can be reused deliberately. |
| `web_crawls` | Local storage | List prior crawls with staleness and recommended resume/reuse/recrawl guidance. |
| `web_search_scrapes` | Local storage | Full-text stored scrape recall when runtime SQLite has FTS5; otherwise returns a clean unsupported response. |
Capability labels:
| Label | Meaning |
|---|---|
| Local | Runs from local HTTP/parsing/storage code without search API keys. |
| Browser optional | Uses lazy Playwright only when requested or auto-escalation justifies it. |
| Model/LLM | Needs Pi's selected model or a configured model adapter after scraping clean page text. |
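As an illustrative sketch, a web_batch call over several independent URLs (using the `urls` parameter documented below) could look like:

```json
{ "urls": ["https://example.com/a", "https://example.com/b", "https://example.com/c"] }
```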
## Common parameters

### Scrape-like tools

Used by web_scrape, web_batch, web_crawl, web_brand, web_diff, web_extract, and web_summarize.

| Parameter | Description |
|---|---|
| `url` / `urls` | HTTP(S) URL or URLs. Private-network and unsupported schemes are blocked by default. |
| `mode` | `auto`, `fast`, `fingerprint`, `readable`, or `browser`. Use `auto` unless the user requests a path. |
| `format` | `markdown`, `text`, `llm`, `html`, or `json`. |
| `include` / `exclude` | Optional CSS selectors for content inclusion/exclusion where supported. |
| `onlyMainContent` | Prefer main/article-like content. |
| `timeoutSeconds` | Per-request timeout. |
| `maxBytes` / `maxChars` | Response/output bounds. |
| `respectRobots` | Defaults to `true`; disabling must be explicit. |
| `headers` | Optional HTTP headers. |
| `proxy` | Optional proxy for supported modes/providers. |
| `browserProfile` / `osProfile` | Optional browser/fingerprint profile hints. |
| `cacheTtlSeconds` | Opt-in fetch cache TTL in seconds. Omit for always-fresh behavior. |
| `maxAgeSeconds` | Hard maximum cache age before forcing a network fetch. |
| `refresh` | Bypass cache lookup while still recording a fresh fetch when caching is enabled. |
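Putting these together, a typical web_scrape call built only from parameters in this table might look like (values are illustrative):

```json
{
  "url": "https://example.com/docs",
  "mode": "auto",
  "format": "markdown",
  "onlyMainContent": true,
  "timeoutSeconds": 30,
  "cacheTtlSeconds": 3600
}
```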
### Crawl and map

| Parameter | Description |
|---|---|
| `maxPages` | Maximum pages to crawl or discover. |
| `maxDepth` | Maximum link depth from the seed URL. |
| `sameOrigin` | Defaults to same-origin crawling. |
| `include` / `exclude` | URL pattern filters. |
| `concurrency` / per-host options | Bound crawl work while HTTP politeness also enforces host limits. |
| `crawlId` | Resume/persist crawl state under the local SQLite crawl index. |
| `resume` | For web_crawl, resume existing crawlId state; defaults to true when available. |
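For example, a bounded, resumable web_crawl call might look like this (the crawlId value is just an illustrative label):

```json
{
  "url": "https://example.com",
  "maxPages": 25,
  "maxDepth": 2,
  "sameOrigin": true,
  "crawlId": "example-com-docs",
  "resume": true
}
```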
### Diff snapshots

web_diff compares the current normalized page content against a previous snapshot. Pass `snapshotName` to keep a repeatable baseline per URL:

```json
{ "url": "https://example.com", "snapshotName": "homepage" }
```

Reusing the same `snapshotName` compares against and then replaces that named baseline. Use `web_get_result({ "responseId": "..." })` to retrieve full diff details, or `web_get_result({ "snapshotUrl": url, "snapshotName": "homepage" })` to inspect snapshot metadata.
## Scrape modes

| Mode | JavaScript support | Playwright required | Typical latency | Extraction quality | Best use case |
|---|---|---|---|---|---|
| `fast` | No | No | Lowest | Good for static pages | Static HTML, docs, product pages, quick link/text extraction. |
| `fingerprint` | No | No | Low-medium | Same parser as static path | Sites that block plain HTTP clients but do not require JavaScript. Requires a configured optional no-redirect fingerprint backend; proxy is rejected until equivalent SSRF guarantees exist. |
| `readable` | No | No | Medium | Higher for articles/main content | Articles, blogs, noisy pages where Readability improves main content. |
| `browser` | Yes | Yes, optional/lazy | Highest | Best for rendered DOM | JavaScript-rendered pages when static/data-island recovery is insufficient. |
| `auto` | Only if justified | Only if escalated | Adaptive | Adaptive | Default. Starts local/static, reuses fetched HTML, tries recovery/readable/fingerprint before browser only when block/rendering signals justify it. |
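When a page only renders its content client-side, you can request the browser path explicitly rather than waiting for auto to escalate (a sketch; values are illustrative):

```json
{ "url": "https://example.com/app", "mode": "browser", "format": "markdown" }
```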
## Vertical extraction

Vertical extractors return typed JSON for known sites. They prefer public APIs and feeds over browser or LLM extraction.

| Extractor | Input patterns | Primary strategy | Browser/cloud/LLM requirement |
|---|---|---|---|
| `github_repo` | GitHub repository URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| `github_issue` | GitHub issue URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| `github_pr` | GitHub pull request URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| `github_release` | GitHub release tag URLs | GitHub public REST API | No browser; no LLM; no cloud provider beyond public GitHub access. |
| `npm` | npm package URLs | npm registry JSON | No browser; no LLM. |
| `pypi` | PyPI package URLs | PyPI JSON API | No browser; no LLM. |
| `crates_io` | crates.io crate URLs | crates.io API | No browser; no LLM. |
| `docker_hub` | Docker Hub repository URLs | Docker Hub repository API | No browser; no LLM. |
| `huggingface_model` | Hugging Face model URLs | Hugging Face public model API | No browser; no LLM. |
| `huggingface_dataset` | Hugging Face dataset URLs | Hugging Face public dataset API | No browser; no LLM. |
| `hackernews` | Hacker News item URLs | Hacker News Firebase item API | No browser; no LLM. |
| `arxiv` | arXiv abstract/PDF entry URLs | arXiv Atom export feed | No browser; no LLM. |
| `deepwiki` | DeepWiki URLs | Static HTML metadata parsing | No browser; no LLM. |
Use `web_list_extractors` to inspect exact runtime declarations. Use `web_extract` for arbitrary pages that need a custom schema or prompt and model-backed extraction.

Reddit, Substack, and Shopify candidates are intentionally not listed as built-ins yet because their reliable machine-readable surfaces vary by community, publication, or storefront.
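As a sketch (assuming the input shape matches the scrape-like tools above; check `web_list_extractors` for the exact runtime declarations), a web_vertical_scrape call for the github_repo extractor could be as simple as:

```json
{ "url": "https://github.com/owner/repo" }
```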
## Storage, cache, and history

Tool results use Pi's standard shell:

```ts
{
  content: [{ type: "text", text }],
  details: {
    url,
    finalUrl,
    status,
    mode,
    format,
    timing,
    truncated,
    fullOutputPath,
    responseId,
    data
  }
}
```
Large crawl, batch, diff, and scrape outputs are stored locally and returned with a compact summary plus `responseId`. Retrieve full content later with `web_get_result`.
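For example (the responseId value is whatever an earlier tool result reported; shown as a placeholder here):

```json
{ "responseId": "..." }
```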
Inline truncation follows Pi defaults:
- 50KB
- 2000 lines
The storage backend uses a local SQLite metadata index plus content-addressed blob files. Cache reuse is opt-in with `cacheTtlSeconds`; default behavior remains fresh network fetches. Cached results include `cache.cached`, `fetchedAt`, `ageSeconds`, `ttlSeconds`, and staleness metadata when returned from the fetch cache.
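A cached hit's metadata might look roughly like this (a sketch: the nesting follows the `cache.cached` naming above, and the `stale` field stands in for whatever staleness metadata the tool actually reports):

```json
{
  "cache": {
    "cached": true,
    "fetchedAt": "2026-05-04T12:00:00Z",
    "ageSeconds": 120,
    "ttlSeconds": 3600,
    "stale": false
  }
}
```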
Use history tools deliberately:

- `web_history` + `web_get_result` when existing content is recent enough.
- `web_crawls` to find prior crawls and decide whether to resume, reuse, or recrawl.
- `web_search_scrapes` to recall stored markdown/text when SQLite FTS5 is available.
- `refresh: true` for time-sensitive questions such as prices, news, status pages, availability, or anything the user asks about "now" (see the sketch below).
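For instance, forcing a fresh fetch while still recording the result:

```json
{ "url": "https://example.com/status", "refresh": true }
```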
The fetch cache currently records in-memory text/buffer responses. Streamed binary downloads are saved as normal result blobs but are not reused as raw HTTP cache hits.
Persistent paths:
| Data | Path |
|---|---|
| Config | `~/.pi/scraper/config/web.json` |
| SQLite index | `~/.pi/scraper/index.db` |
| Payload blobs | `~/.pi/scraper/blobs/<aa>/...` |
| Legacy snapshots | `~/.pi/scraper/snapshots/` |
| Legacy backups | `~/.pi/scraper/results.bak/`, `~/.pi/scraper/crawl.bak/` after migration |
## Safety and anti-bot scope

- SSRF/private-network protection is applied before fetches and at the HTTP connect/redirect layer.
- `respectRobots` defaults to `true`.
- Response body sizes are bounded before allocation and while streaming.
- Browser rendering is optional and lazy-loaded.
- The package may detect bot-block pages and return structured blocked/error results.
- It does not promise CAPTCHA solving, residential proxy rotation, stealth guarantees, or guaranteed access to protected sites.
## Packaged skill

This package includes a small Pi skill, `web-scraping`, with guidance for choosing between scrape, map, crawl, batch, brand, diff, vertical extraction, history, and page-scoped extraction tools.
## Development and release checks

Install dependencies from a checkout:

```sh
nvm use 22.19.0
npm install
```

Run the core checks:

```sh
npm run typecheck
npm test
npm run test:tools
npm pack --dry-run
```

Optional checks before a release:

```sh
npm run smoke:install
npm run audit:strict
PI_SCRAPER_LIVE=1 npm run smoke:live
```

Benchmark suites live under `bench/suites/`; generated summaries and ignored JSON history live under `bench/results/`. See `bench/README.md` for the current layout and output paths.

Optional browser smoke:

```sh
export PLAYWRIGHT_BROWSERS_PATH="${TMPDIR:-/tmp}/pi-scraper-ms-playwright"
npx playwright install chromium
PI_SCRAPER_BROWSER=1 npm run smoke:browser
```