pi-webaio

All-in-one web tools for pi with search (Google, Brave, DDG) and fetch with headless browser AI summarization

Install pi-webaio from npm and Pi will load the resources declared by the package manifest.

$ pi install npm:pi-webaio

Package: pi-webaio
Version: 0.6.0
Published: Jun 19, 2026
Downloads: 1,426/mo · 395/wk
Author: apmantza
License: MIT
Types: extension
Size: 728.6 KB
Dependencies: 12 dependencies · 1 peer

Pi manifest JSON

{
  "extensions": [
    "./dist/index.js"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

All-in-one web access tools for pi with search, fetch, crawl, extraction, anti-bot TLS fingerprinting, intelligent resilience, RAG-ready output, and TUI rendering.

What does pi-webaio do?

pi-webaio is a pi extension that gives your agent eyes on the web. It registers six tools that let pi search, fetch, discover, and archive web content — all without API keys or paid services.

When you search, pi-webaio queries 5 engines in parallel (DuckDuckGo, Brave, Yahoo, Bing, and Google via headless Chrome). Results that show up across multiple engines rank higher — consensus is a signal of quality. When you fetch a page, it tries 14 different extraction backends in order, stripping cookie banners and anti-bot noise along the way, so you get clean markdown instead of raw HTML soup. Paywalled news sites (NYT, WaPo, FT, WSJ, etc.) can be bypassed on opt-in with a strategy chain that tries archive.org, bot-UA impersonation, and Playwright with paywall-script blocking.

Long pages are automatically AI-summarized via Google AI Mode (headless Chrome) — you get a concise overview instantly, while the full content is always saved to disk for later inspection. For sites with API-first extractors (GitHub, YouTube, npm, PyPI, crates.io, RubyGems, Packagist, pub.dev, Go, NuGet, Reddit, Hacker News, arXiv, Stack Exchange, Wikipedia, Open Library, DEV.to, SonarCloud, docs sites), pi-webaio bypasses HTML scraping entirely and pulls structured data directly.

For RAG pipelines, fetches can be returned as paragraph-bounded chunks with overlap (CJK-aware token estimation). All 6 tools ship with polished TUI rendering (real-time progress, elapsed time, phase/category badges, retry hints) and a phase-aware error system (25 failure codes × 10 fetch phases × 7 categories) that includes smart retry-timeout suggestions based on partial download progress.

It's built for agents that need to:

Research — find current information, documentation, or references
Read — pull articles, docs, GitHub repos, PDFs, or YouTube transcripts into markdown
Explore — map out a website's pages before pulling them all
Remember — cached results survive restarts and can be retrieved by URL or ID
Bypass — opt-in paywall bypass for news sites that block non-subscribers
Chunk for RAG — split fetched markdown into pre-sized chunks with optional overlap

No API keys. No subscriptions. No brittle scraping scripts. Just pi install npm:pi-webaio and go.

Installation

pi install npm:pi-webaio

Or from git:

pi install git:github.com/apmantza/pi-webaio

How AI summarization works

When you fetch a single URL with aio-webfetch, long pages are automatically summarized using Google AI Mode (via headless Chrome CDP). Here's the logic:

Short pages (under ~1800 chars) — displayed in full, no summarization needed.
Long pages — if Google Chrome is available, pi-webaio launches a headless Chrome instance, navigates to the URL, and captures the AI Mode summary.
Fallback — if Chrome is unavailable or AI Mode fails, the first ~1800 chars are shown with a note that the full file was saved to disk.
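
The three cases above can be sketched as a small decision function. This is an illustration, not the package's code: `chooseDisplay`, `Display`, and the synchronous `summarize` callback are hypothetical names (the real summarizer drives headless Chrome and would be asynchronous), and only the ~1800-character threshold comes from the text above.

```typescript
// Threshold taken from the README; everything else here is illustrative.
const PREVIEW_CHARS = 1800;

type Display =
  | { kind: "full"; text: string }
  | { kind: "summary"; text: string }
  | { kind: "truncated"; text: string; note: string };

function chooseDisplay(
  content: string,
  savedPath: string,
  // null models "Chrome is unavailable"; returning null models "AI Mode failed".
  summarize: (() => string | null) | null,
): Display {
  // Short pages are shown in full with no summarization.
  if (content.length < PREVIEW_CHARS) {
    return { kind: "full", text: content };
  }
  // Long pages: try the summarizer when one is available.
  const summary = summarize ? summarize() : null;
  if (summary) {
    return { kind: "summary", text: summary };
  }
  // Fallback: first ~1800 chars plus a pointer to the saved file.
  return {
    kind: "truncated",
    text: content.slice(0, PREVIEW_CHARS),
    note: `Full content saved to ${savedPath}`,
  };
}
```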

Summarization is automatically skipped for content that already comes from a structured pipeline:

Source	Why skipped
GitHub (repos, blobs, issues, PRs, raw files)	Clean structured data from git clone / REST API — no HTML noise to summarize
YouTube	Transcript + metadata via Innertube API — the transcript IS the content
SonarCloud	Quality metrics fetched via API — structured data in table form
npm / PyPI / crates.io / RubyGems / Packagist / pub.dev / Go / NuGet / Reddit / Hacker News / arXiv / Wikipedia / Stack Exchange / Open Library / DEV.to / GitLab / docs sites	API-first extractors return clean markdown directly

Skipping is enforced by both a content marker (> via <source>) and a URL hostname check (covers github.com, raw.githubusercontent.com, gist.github.com), so even if an extractor fails and falls through to the HTML pipeline, GitHub URLs never get AI-summarized.
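
A minimal sketch of that double check, assuming the content marker is a blockquote line beginning with `> via `. The function name `shouldSkipSummary` and the exact marker regex are assumptions; the three hostnames are the ones listed above.

```typescript
// Hostnames the README says are covered by the URL check.
const GITHUB_HOSTS = new Set([
  "github.com",
  "raw.githubusercontent.com",
  "gist.github.com",
]);

// Assumed marker shape: a line that starts with "> via " followed by a name.
const SOURCE_MARKER = /^> via \S/m;

function shouldSkipSummary(url: string, markdown: string): boolean {
  let host = "";
  try {
    host = new URL(url).hostname.toLowerCase();
  } catch {
    // Unparseable URL: rely on the content marker alone.
  }
  // Either signal is enough, so a failed extractor on a GitHub URL
  // still skips summarization via the hostname check.
  return GITHUB_HOSTS.has(host) || SOURCE_MARKER.test(markdown);
}
```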

Special pipelines: GitHub, YouTube, and more

Instead of scraping HTML, pi-webaio routes known sites through purpose-built extractors that fetch structured data directly. This means faster results, cleaner markdown, and no AI summarization of already-clean content.

GitHub

GitHub URLs are intercepted before any HTTP request and handled by a dedicated pipeline:

URL pattern	Method	What you get
`github.com/owner/repo`	`git clone` (or `gh repo clone`) + README extraction	File tree, architecture hints, full README
`github.com/owner/repo/blob/branch/path`	`raw.githubusercontent.com` fetch	Raw file content
`github.com/owner/repo/tree/branch/path`	GitHub REST API	Directory listing
`github.com/owner/repo/issues`	GitHub REST API	Issue list with states
`github.com/owner/repo/pull/123`	GitHub REST API	Single PR details
`github.com/owner/repo/commit/sha`	GitHub REST API	Commit details
`github.com/owner/repo/actions/runs/123`	GitHub REST API + logs	Run status, jobs, step results, error log excerpts
`github.com/owner/repo/commit/sha/checks/{id}/logs/{step?}`	GitHub REST API + `gh run view --log`	Check status, conclusion, annotations, log excerpt (Actions jobs); the `{step}` index resolves to a step name via the job's `steps[]` API field, then the tab-separated log is sliced to that step's section; metadata-only for external CI
`api.github.com/repos/{owner}/{repo}/actions/runs/{runId}/logs`	`gh run view --log` (via gh CLI auth)	Plain-text workflow logs. Previously returned HTTP 403 because the endpoint requires auth + redirects to a zip archive; now uses your existing `gh auth login` session for plain-text output with 302-redirect handling.
`github.com/owner/repo/security/*`	GitHub REST API	Security advisories, code scanning, Dependabot alerts
`raw.githubusercontent.com/owner/repo/branch/path`	Direct fetch + fallback to pipeline	Raw file content with source marker

All GitHub results are tagged with > via GitHub and AI summarization is skipped. Non-existent repos now return a clear "Repository not found or inaccessible" error instead of an empty directory listing.

YouTube

YouTube video URLs are matched by a vertical extractor that uses youtube-transcript-plus (Innertube API — no API key required):

Extracts metadata: title, channel, duration, views, language, tags (first 10), description
Extracts the full transcript (up to ~40K chars, truncated beyond that)
Supports youtube.com/watch?v=, youtu.be/, /shorts/, /embed/ formats
Playlist URLs are detected but not yet supported (fetch individual videos instead)
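
For illustration, the four supported URL shapes could be recognized along these lines. `extractVideoId` is a hypothetical helper, not the extractor's actual code, and it assumes video IDs are 11 characters drawn from `[A-Za-z0-9_-]`.

```typescript
// Returns the video ID for watch, youtu.be, /shorts/, and /embed/ URLs,
// or null for anything else (including playlist URLs).
function extractVideoId(input: string): string | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null;
  }
  const host = url.hostname.replace(/^(www\.|m\.)/, "");
  let candidate: string | null = null;
  if (host === "youtu.be") {
    // youtu.be/<id>
    candidate = url.pathname.slice(1).split("/")[0] ?? null;
  } else if (host === "youtube.com") {
    if (url.pathname === "/watch") {
      // youtube.com/watch?v=<id>
      candidate = url.searchParams.get("v");
    } else {
      // youtube.com/shorts/<id> or youtube.com/embed/<id>
      const match = url.pathname.match(/^\/(shorts|embed)\/([^/]+)/);
      candidate = match?.[2] ?? null;
    }
  }
  // Assumption: IDs are exactly 11 URL-safe characters.
  return candidate !== null && /^[A-Za-z0-9_-]{11}$/.test(candidate)
    ? candidate
    : null;
}
```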

SonarCloud

SonarCloud URLs (sonarcloud.io/project/...) are fetched via the SonarCloud REST API:

Security hotspots — grouped by category with severity breakdown
Issues — severity, type, file, line number, message
Overview — quality gate metrics in table form
Activity — chronological analysis events

API-first extractors (vertical registry)

Dedicated extractors handle 19 sites through public APIs. The table lists 17 of them; YouTube and SonarCloud, described above, appear to account for the other two:

Site	Extractor	API
npm	`src/verticals/npm.ts`	npm registry JSON API
PyPI	`src/verticals/pypi.ts`	PyPI JSON API
Hacker News	`src/verticals/hackernews.ts`	Firebase API
Reddit	`src/verticals/reddit.ts`	`.json` endpoint
arXiv	`src/verticals/arxiv.ts`	Atom export API
Wikipedia	`src/verticals/wikipedia.ts`	MediaWiki REST API (all editions)
Stack Exchange	`src/verticals/stackexchange.ts`	Stack Exchange API v2.3
Open Library	`src/verticals/openlibrary.ts`	Open Library REST API
DEV.to	`src/verticals/devto.ts`	DEV.to public REST API
crates.io	`src/verticals/cratesio.ts`	crates.io registry JSON API
RubyGems	`src/verticals/rubygems.ts`	RubyGems.org JSON API
Packagist	`src/verticals/packagist.ts`	Packagist JSON API (PHP)
pub.dev	`src/verticals/pubdev.ts`	pub.dev API (Dart/Flutter)
Go packages	`src/verticals/gopackages.ts`	Go module proxy (proxy.golang.org)
NuGet	`src/verticals/nuget.ts`	NuGet Search API v3
GitLab	`src/verticals/gitlab.ts`	GitLab REST API v4 (gitlab.com + self-hosted)
Docs sites	`src/verticals/docs-site.ts`	Docusaurus, GitBook, MDN, VitePress extraction

All vertical extractors tag their output with > via <name>, which automatically skips AI summarization.

Auto-escalation: when scraping gets blocked

When the normal fetch pipeline hits a bot wall (Cloudflare, Anubis, DataDome, PerimeterX, etc.), pi-webaio escalates automatically:

Fingerprint rotation — retries with alternate browser profiles (firefox_147, safari_26, edge_145)
Browser mode — last resort: renders the page with Playwright (headless Chromium)

Escalation needs no extra input. The mode parameter controls how far it goes: auto (default, escalates on detection), fast (no escalation), fingerprint (alternate browsers only), or browser (Playwright always).
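
One way to picture the four modes is as an ordered list of attempts. The sketch below is an interpretation, not the package's implementation: `planAttempts` and `Attempt` are hypothetical, `chrome_145` as the first profile is an assumption (the parameter table only says "latest"), and `fingerprint` is read here as "rotate TLS profiles but never launch Playwright".

```typescript
type Mode = "auto" | "fast" | "fingerprint" | "browser";
type Attempt =
  | { via: "tls"; profile: string }
  | { via: "playwright" };

const DEFAULT_PROFILE = "chrome_145"; // assumption: stands in for "latest"
const ALTERNATE_PROFILES = ["firefox_147", "safari_26", "edge_145"];

// Returns the attempts a fetch would make, in order, until one succeeds.
function planAttempts(mode: Mode): Attempt[] {
  const first: Attempt = { via: "tls", profile: DEFAULT_PROFILE };
  const alternates = ALTERNATE_PROFILES.map<Attempt>((profile) => ({
    via: "tls",
    profile,
  }));
  switch (mode) {
    case "fast":
      return [first]; // no escalation
    case "fingerprint":
      return [first, ...alternates]; // profile rotation only
    case "browser":
      return [{ via: "playwright" }]; // always render
    case "auto":
      return [first, ...alternates, { via: "playwright" }];
  }
}
```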

Paywall bypass: when the content itself is gated

For paywalled news sites (NYT, WSJ, FT, WaPo, The Economist, Le Monde, etc.), the bypass: true flag runs a strategy chain after the normal fetch detects a paywall. The chain tries each step in order and returns the first response that no longer contains paywall markers:

Step	Mechanism	Cost	Effectiveness
`archive`	Wayback Machine (`web.archive.org/web/2/{url}`) then `archive.ph`	~1-2s, free	~80% of articles have at least one snapshot
`ua:googlebot`	Fetch with `Googlebot/2.1` UA + no `Sec-Ch-Ua`	~500ms, free	~40% (Google News partners + soft paywalls)
`ua:bingbot`	Fetch with `Bingbot/2.0` UA	~500ms, free	Similar to Googlebot, useful for sites that whitelist both
`ua:facebookbot`	Fetch with `facebookexternalhit/1.1` UA	~500ms, free	Small share, useful for sites that whitelist FB
`referer:google`	Fetch with `Referer: https://www.google.com/`	~500ms, free	~5% of sites that check referer only
`block_js`	Playwright with `route.abort()` for 21 known paywall vendors (including Piano, Tinypass, Poool, Zephr, Sophi) + DOM override	~3-5s, needs Playwright	~60% of vendor-paywalled sites
`cookies`	Fetch with cookies dropped (rejects `cf_clearance` etc.)	~500ms, free	~10% of sites that track returning readers

Top-50 news sites have curated strategies in src/paywall-sites.ts (NYT = block_js → archive; WSJ = block_js → archive; FT = block_js → archive; unknown domain = archive → ua:googlebot → block_js). The same flags work on aio-webpull to apply bypass to every page in a pull.
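
The "first response without paywall markers wins" rule can be sketched as follows. Everything here is illustrative: `chainFor` and `runChain` are hypothetical names, the domain keys are assumed mappings for NYT, WSJ, and FT, and the attempts are synchronous callbacks for brevity where real ones would be asynchronous network calls.

```typescript
type Strategy =
  | "archive" | "ua:googlebot" | "ua:bingbot" | "ua:facebookbot"
  | "referer:google" | "block_js" | "cookies";

// Chains quoted from the README; the real table lives in src/paywall-sites.ts.
const SITE_CHAINS: Record<string, Strategy[]> = {
  "nytimes.com": ["block_js", "archive"],
  "wsj.com": ["block_js", "archive"],
  "ft.com": ["block_js", "archive"],
};
const UNKNOWN_DOMAIN_CHAIN: Strategy[] = ["archive", "ua:googlebot", "block_js"];

function chainFor(hostname: string): Strategy[] {
  const host = hostname.replace(/^www\./, "");
  return SITE_CHAINS[host] ?? UNKNOWN_DOMAIN_CHAIN;
}

// Tries each strategy in order and returns the first body that the
// detector no longer flags as paywalled, or null if none succeeds.
function runChain(
  chain: Strategy[],
  attempt: (strategy: Strategy) => string | null,
  isPaywalled: (body: string) => boolean,
): { strategy: Strategy; body: string } | null {
  for (const strategy of chain) {
    let body: string | null;
    try {
      body = attempt(strategy);
    } catch {
      body = null; // a failed attempt just moves on to the next strategy
    }
    if (body !== null && !isPaywalled(body)) {
      return { strategy, body };
    }
  }
  return null;
}
```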

The bypass engine also triggers on HTTP 403/401 responses from known paywall sites (NYT, WSJ, FT, and others that block before serving any content), not just on content-marker detection. Even when the server returns a bare 403 with no body for detectPaywall to analyze, the strategy chain still runs and can return a result such as an archive.org snapshot.

Set PI_WEBAIO_DEBUG=1 to log every bypass attempt and confidence score — useful when triaging sites that still block.

Note: The bypass flag is opt-in. A normal aio-webfetch(url) gets the regular auto-escalation pipeline. You must explicitly pass bypass: true to trigger the strategy chain — this is intentional, since paywall circumvention is a deliberate user action.

Output formats

aio-webfetch accepts a format parameter that controls what the tool returns:

Format	Behavior
`markdown`	(default) Save to disk under `os.tmpdir()/pi-webaio/<host>/<path>.md`. Return body inline.
`html`	Return raw HTML body inline. No disk write.
`text`	Return plain-text rendering of the markdown (strips headers, bold, links, code fences, HTML tags).
`json`	Return a structured JSON object with `url`, `title`, `author`, `published`, `site`, `language`, `wordCount`, `mimeType`, `content`, `rawHtml`.
`raw`	Return the original raw HTML body. No markdown conversion.

Non-markdown formats stay in memory and never touch disk, which is useful for piping into other tools or for JSON consumers that need structured data. The compile parameter is skipped automatically when any result is non-markdown.

RAG chunking

aio-webfetch accepts a chunks parameter that splits the fetched markdown into paragraph-bounded chunks for RAG pipelines:

aio-webfetch url: https://en.wikipedia.org/wiki/Node.js
            chunks: true
            maxTokens: 512
            overlapTokens: 50

chunks: true enables chunking (default: false)
maxTokens is the soft target size per chunk in tokens (default: 512)
overlapTokens is the tail-overlap from the previous chunk (default: 50)
Only applies to format: "markdown" (other formats stay in-memory; the caller can chunk them)
Token estimation is CJK-aware (counts CJK chars at 1.5x Latin weight)
The tool result includes both the original markdown and a chunks array with metadata (index, total, token count, content)

The chunks are also formatted as a readable numbered text section in the tool output for direct inspection.
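
To make the chunking rules concrete, here is a rough sketch of paragraph-bounded chunking with tail overlap. It is not the package's algorithm: the function names are hypothetical, the figure of roughly four Latin characters per token is an assumption, and only the 1.5x CJK weight and the defaults of 512 and 50 come from the description above.

```typescript
const LATIN_TOKENS_PER_CHAR = 1 / 4; // assumption: ~4 Latin chars per token
const CJK_WEIGHT = 1.5;              // ratio stated in the README
const CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

function charWeight(ch: string): number {
  return LATIN_TOKENS_PER_CHAR * (CJK.test(ch) ? CJK_WEIGHT : 1);
}

function estimateTokens(text: string): number {
  let total = 0;
  for (const ch of Array.from(text)) total += charWeight(ch);
  return Math.ceil(total);
}

// Takes roughly `tokens` worth of text from the end of `text`.
function tail(text: string, tokens: number): string {
  const chars = Array.from(text);
  let weight = 0;
  let start = chars.length;
  while (start > 0 && weight < tokens) {
    start -= 1;
    weight += charWeight(chars[start]);
  }
  return chars.slice(start).join("");
}

interface Chunk { index: number; total: number; tokens: number; content: string }

function chunkMarkdown(markdown: string, maxTokens = 512, overlapTokens = 50): Chunk[] {
  const paragraphs = markdown.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
  const bodies: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const para of paragraphs) {
    const t = estimateTokens(para);
    // Close the chunk at a paragraph boundary once the soft target would be exceeded.
    if (current.length > 0 && currentTokens + t > maxTokens) {
      bodies.push(current.join("\n\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(para);
    currentTokens += t;
  }
  if (current.length > 0) bodies.push(current.join("\n\n"));

  // Prepend the tail of the previous chunk to every chunk after the first.
  return bodies.map((body, i) => {
    const overlap = i > 0 ? tail(bodies[i - 1], overlapTokens) : "";
    const content = overlap ? `${overlap}\n\n${body}` : body;
    return { index: i, total: bodies.length, tokens: estimateTokens(content), content };
  });
}
```

Because `maxTokens` is a soft target, a single paragraph larger than the budget stays whole in this sketch rather than being split mid-paragraph.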

Error handling

aio-webfetch uses a phase-aware FetchError system with 25 failure codes × 10 fetch phases × 7 categories. Each error carries:

code (e.g. http_error, tls_error, timeout, blocked, paywall, security_blocked)
phase (e.g. connecting, loading, headers, downloading, processing)
category (e.g. network, server, security, client)
retryable (boolean — whether the agent should try again)
statusCode, downloadedBytes, contentLength, elapsedMs (for smart retry-timeout suggestions)

When a partial download is interrupted by a timeout, the suggested retry timeout is extrapolated from the download rate. Security blocks (secrets in URL, private IPs) are flagged with a clear security_blocked code instead of a generic "Could not reach server" error.

The TUI error view shows the phase + category badge and the suggested retry hint when the error is retryable.
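
The retry-timeout extrapolation can be illustrated with simple rate arithmetic. This is a sketch under stated assumptions, not the package's `suggestRetryTimeoutMs()`: the name `estimateRetryTimeoutMs`, the 1.5x safety factor, and the 120-second cap are all invented for the example. Only the field names mirror the error fields listed above.

```typescript
interface PartialProgress {
  downloadedBytes: number;
  contentLength?: number; // absent when the server sent no Content-Length
  elapsedMs: number;
}

// Projects how long the full download would take at the observed rate,
// then adds headroom. Returns null when there is nothing to extrapolate from.
function estimateRetryTimeoutMs(
  p: PartialProgress,
  safety = 1.5,      // assumption: headroom multiplier
  capMs = 120_000,   // assumption: upper bound on the suggestion
): number | null {
  if (!p.contentLength || p.downloadedBytes <= 0 || p.elapsedMs <= 0) {
    return null;
  }
  const bytesPerMs = p.downloadedBytes / p.elapsedMs;
  const projectedMs = p.contentLength / bytesPerMs;
  return Math.min(capMs, Math.ceil(projectedMs * safety));
}
```

For example, 1 MB of a 4 MB body received in 10 seconds projects to 40 seconds for the whole body, so this sketch would suggest a 60-second timeout.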

How search ranking works

When you search, pi-webaio queries 5 engines in parallel: DuckDuckGo, Brave, Yahoo, Bing, and Google (via headless Chrome). Results are scored by two signals:

Engine authority — Google (5), Bing (3), DDG (2), Brave (2), Yahoo (1)
Cross-engine consensus — +2 for each additional engine that agrees on the same URL

A result returned by all 5 engines outranks a Google-only result. Metadata (title/snippet) comes from the highest-weight engine for each URL.
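
A minimal sketch of this scoring, assuming the two signals combine as "highest authority among the engines that returned the URL, plus 2 per additional engine". The README does not spell out the exact formula, so that combination, the `rank` function, and the exact-string URL matching are assumptions; the weights are the ones listed above.

```typescript
type Engine = "google" | "bing" | "ddg" | "brave" | "yahoo";

const AUTHORITY: Record<Engine, number> = {
  google: 5, bing: 3, ddg: 2, brave: 2, yahoo: 1,
};
const CONSENSUS_BONUS = 2;

interface Hit { engine: Engine; url: string; title: string; snippet: string }
interface Ranked { url: string; title: string; snippet: string; score: number; engines: Engine[] }

function rank(hits: Hit[]): Ranked[] {
  // Group hits by URL (no normalization in this sketch).
  const byUrl = new Map<string, Hit[]>();
  for (const hit of hits) {
    const group = byUrl.get(hit.url) ?? [];
    group.push(hit);
    byUrl.set(hit.url, group);
  }
  const ranked: Ranked[] = [];
  for (const [url, group] of Array.from(byUrl.entries())) {
    const engines = Array.from(new Set(group.map((h) => h.engine)));
    // Title and snippet come from the highest-weight engine for this URL.
    const best = group.reduce((a, b) =>
      AUTHORITY[b.engine] > AUTHORITY[a.engine] ? b : a,
    );
    const score = AUTHORITY[best.engine] + CONSENSUS_BONUS * (engines.length - 1);
    ranked.push({ url, title: best.title, snippet: best.snippet, score, engines });
  }
  return ranked.sort((a, b) => b.score - a.score);
}
```

Under this reading, a URL returned by all five engines scores 5 + 4 × 2 = 13, against 5 for a Google-only result.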

Usage Examples

Search the web

Use aio-websearch to find the latest React documentation

Google search is on by default (via headless Chrome CDP). To skip it:

Use aio-websearch to search for "Rust serde" (google: false)

Fetch a single URL

Use aio-webfetch to download https://example.com/article

After fetching, use the built-in read tool to inspect the full saved file.

Fetch multiple URLs in batch

Use aio-webfetch to download these URLs:
  - https://example.com/page1
  - https://example.com/page2
  - https://example.com/page3

Fetch as JSON for structured downstream processing

Use aio-webfetch to download https://api.github.com/repos/apmantza/pi-webaio (format: "json")

Returns a structured JSON object with url, title, author, published, site, language, wordCount, mimeType, content, rawHtml. Useful for piping into other tools.

Fetch with RAG chunking

Use aio-webfetch to download https://en.wikipedia.org/wiki/Node.js (chunks: true, maxTokens: 512)

Splits the markdown into paragraph-bounded chunks with 50-token overlap. Result includes both the markdown and a chunks array.

Fetch a GitHub Actions run log

Use aio-webfetch to download https://api.github.com/repos/apmantza/pi-drykiss/actions/runs/27479618304/logs

Routes through gh run view --log (uses your existing gh auth login session) to get plain-text logs with auth + 302-redirect handling. No more HTTP 403.

Fetch with a specific browser fingerprint

Use aio-webfetch to download https://example.com (browser: "firefox_147", os: "linux")

Retrieve stored content (no re-download)

Use aio-webcontent to get the full content from https://example.com/article

Pull an entire site

Use aio-webpull to download https://docs.example.com (max: 50 pages)

Pull with URL pattern routing

Use aio-webpull to download https://example.com with routes:
  - { pattern: "*/api/*", mode: "fast" }
  - { pattern: "*/docs/*", mode: "browser" }

Routes different URL patterns to different fetcher modes. First match wins.
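
First-match-wins routing over glob patterns like the ones above might look like this. The sketch covers only `*` globs matched against the whole URL, which is an assumption about anchoring; `globToRegExp` and `resolveMode` are hypothetical names, and the substring and regex pattern forms are left out.

```typescript
interface Route { pattern: string; mode?: string }

// Converts a glob where "*" matches any run of characters into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`);
}

// Returns the mode of the first route whose pattern matches the URL,
// or the fallback when no route matches.
function resolveMode(url: string, routes: Route[], fallback = "auto"): string {
  for (const route of routes) {
    if (globToRegExp(route.pattern).test(url)) {
      return route.mode ?? fallback;
    }
  }
  return fallback;
}
```

Because the first match wins, a URL such as `https://example.com/api/docs/x` resolves to `fast` with the two routes shown above, even though it also matches the `*/docs/*` pattern.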

Pull with resume from checkpoint

Use aio-webpull to download https://docs.example.com (resume: true)

Skips pages that were already pulled (checks for existing .md files in the output directory).

Bypass a paywall (single URL)

Use aio-webfetch to download https://www.nytimes.com/2024/01/01/some-article (bypass: true)

If the normal fetch hits a paywall, pi-webaio tries archive → ua:googlebot → ua:bingbot → ua:facebookbot → referer:google → block_js → cookies in order, returning the first response that doesn't contain paywall markers.

Bypass with a custom strategy chain

Use aio-webfetch to download https://example.com/paywalled (bypass: true, bypassStrategies: ["archive", "ua:googlebot"])

Only tries Wayback Machine and Googlebot impersonation. Useful when you know a site only responds to specific strategies.

Bypass on a whole pull (every page)

Use aio-webpull to download https://www.ft.com (max: 50, bypass: true)

Applies the per-domain strategy chain to every page in the pull. NYT pages use block_js → archive; FT pages use block_js → archive; unknown sites fall through to the generic chain.

Tools

Tool	Description
`aio-websearch`	Search the web using DuckDuckGo, Brave, Yahoo, Bing, and Google in parallel (no API keys required). Returns compact results with title, URL, and snippet. Results are ranked by cross-engine consensus — URLs returned by multiple engines rank higher. 7s cap. Google runs via headless Chrome CDP (auto-launched). 10-minute cache.
`aio-webfetch`	Fetch a single URL (or batch of URLs) and convert to markdown with anti-bot TLS fingerprinting. Long content is AI-summarized via Google AI Mode; full file always saved. Detects PDFs, GitHub repos, and Next.js RSC. Supports `format: "markdown\|html\|text\|json\|raw"`, `chunks` for RAG, auto escalation, and opt-in paywall bypass.
`aio-webcontent`	Retrieve previously fetched content from session storage by URL. Returns full untruncated content — no data loss.
`aio-webmap`	Discovery-only tool — finds pages via robots.txt, sitemaps, navigation links, and llms.txt without fetching content. Returns structured URL list.
`aio-webresult`	Retrieve a previously fetched result by persistent response ID. Survives restarts. Shows recent results if ID not found.
`aio-webpull`	Pull any public website or docs site into local markdown files. Discovers pages via sitemap, navigation links, or crawling. Rewrites internal links to relative `.md` paths. Supports `routes` for per-pattern routing, `resume` for checkpoint resume, `adaptive` selectors, and `bypass` for opt-in paywall bypass on every page.

Tool Parameters

`aio-websearch`

Parameter	Type	Default	Description
`query`	`string`	—	Search query (e.g. 'React Server Components RFC')
`max`	`number`	`15`	Max results per engine. Up to 25 total after dedup across all engines.
`google`	`boolean`	`true`	Also search Google via headless Chrome CDP. Set to `false` to use only the other four engines.

`aio-webfetch`

Parameter	Type	Default	Description
`url`	`string`	—	Single URL to fetch. Use either `url` or `urls`, not both.
`urls`	`string[]`	—	Multiple URLs to fetch in parallel. Use either `url` or `urls`, not both.
`out`	`string`	auto-derived	Output file path under temp (for single url only)
`format`	`string`	`markdown`	Output format: `markdown` (saves to disk) \| `html` \| `text` \| `json` \| `raw` (all in-memory)
`chunks`	`boolean`	`false`	Split markdown into paragraph-bounded chunks for RAG. Only applies to `format: "markdown"`.
`maxTokens`	`number`	`512`	Soft target size per chunk in tokens.
`overlapTokens`	`number`	`50`	Tail-overlap from the previous chunk prepended to each chunk after the first.
`mode`	`string`	`auto`	Scrape mode: `auto` (escalates), `fast`, `fingerprint`, or `browser`
`browser`	`string`	latest	Browser profile for TLS fingerprinting. Options: `chrome_145`, `firefox_147`, `safari_26`, `edge_145`
`os`	`string`	`windows`	OS profile for fingerprinting. Options: `windows`, `macos`, `linux`, `android`, `ios`
`proxy`	`string`	—	Proxy URL (`http://user:pass@host:port` or `socks5://host:port`). Supports HTTP, HTTPS, SOCKS5.
`cacheTtlSeconds`	`number`	—	Opt-in cache TTL in seconds. Omit for fresh fetches.
`compile`	`boolean`	`false`	Compile batch results into a single context package
`prune`	`number`	—	Prune markdown to token budget (e.g. 3000)
`interactive`	`boolean`	`false`	Extract interactive elements as numbered refs
`start_index`	`number`	`0`	Return content starting from this character index (0-based). Use with `max_length` for pagination.
`max_length`	`number`	unlimited	Maximum characters to return. Use with `start_index` for pagination.
`bypass`	`boolean`	`false`	If a paywall is detected, run a strategy chain (`archive` → bot UAs → `block_js` → `cookies`) to bypass. Opt-in.
`bypassStrategies`	`string[]`	—	Custom strategy chain order. Options: `archive`, `ua:googlebot`, `ua:bingbot`, `ua:facebookbot`, `referer:google`, `block_js`, `cookies`.

`aio-webcontent`

Parameter	Type	Default	Description
`url`	`string`	—	URL of previously fetched content

`aio-webmap`

Parameter	Type	Default	Description
`url`	`string`	—	URL to discover pages for
`max`	`number`	`100`	Max URLs to discover
`browser`	`string`	latest	Browser profile for TLS fingerprinting
`os`	`string`	`windows`	OS profile for fingerprinting

`aio-webresult`

Parameter	Type	Default	Description
`id`	`string`	—	Response ID from a previous `aio-webfetch` call

`aio-webpull`

Parameter	Type	Default	Description
`url`	`string`	—	URL to pull (e.g. https://docs.example.com)
`out`	`string`	`<hostname>`	Output directory under temp
`max`	`number`	`100`	Max pages to pull
`mode`	`string`	`auto`	Scrape mode: `auto` (escalates), `fast`, `fingerprint`, or `browser`
`browser`	`string`	latest	Browser profile for TLS fingerprinting. Options: `chrome_145`, `firefox_147`, `safari_26`, `edge_145`
`os`	`string`	`windows`	OS profile for fingerprinting. Options: `windows`, `macos`, `linux`, `android`, `ios`
`proxy`	`string`	—	Proxy URL (`http://user:pass@host:port` or `socks5://host:port`). Supports HTTP, HTTPS, SOCKS5.
`compile`	`boolean`	`false`	Compile pulled pages into a single context package
`bypass`	`boolean`	`false`	If a paywall is detected on any page, run the per-domain strategy chain to bypass. Opt-in.
`resume`	`boolean`	`true`	Resume from previous pull (auto-detected from output directory). Set `false` to force a fresh pull.
`routes`	`object[]`	—	URL pattern → fetcher mode routing. Each: `{ pattern: string, mode?: string, browser?: string, os?: string, extractor?: string }`. Pattern supports substring, glob (`/docs/`), or regex (`/^\/api\//`). First match wins.
`adaptive`	`boolean`	`false`	Enable adaptive content selectors that survive site redesigns via structural DOM fingerprinting.

How it's built

Precompiled dist/ — TypeScript source compiles to dist/index.js via tsc on npm install (the prepare hook). pi loads the compiled JS directly, no jiti transpile on every startup.
TUI rendering — All 6 tools ship custom renderCall / renderResult components. Progress view shows per-item status, spinner, elapsed time, and download progress in real time. Result view shows expanded preview with responseId, format, browser/os profile, package path, chunk count, and error details. Phase + category badge for errors.
Phase-aware FetchError — 25 failure codes × 10 fetch phases × 7 categories. createFetchError() produces frozen rich error objects. classifyError() maps Node errors. buildUserFacingFetchErrorSummary() produces agent-friendly messages. suggestRetryTimeoutMs() extrapolates from partial download progress.
CI — 4 GitHub Actions jobs: lint+typecheck (with npm audit and lockfile check), test (all 11 suites), prod-install-build (simulates the real npm install --omit=dev path), install-test (ubuntu/windows/macos — tarball verification + entry-point loading). Auto-release on version tag with GitHub Release notes.
Security — 19 secret patterns (AWS, GitHub, GitLab, npm, PyPI, Slack, Stripe, Google, SendGrid, DigitalOcean, OpenAI including sk-proj-/sk-svcacct-, Anthropic, Supabase, Vercel, Cloudflare, private keys, passwords in URLs). SSRF protection via DNS resolution + RFC 1918/RFC 6598/RFC 3927 range validation. Path-traversal guard in utils.ts. Prompt injection detection. Default GitHub CodeQL scanning.
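
As an illustration of the range validation named in the Security item, the check below tests an already-resolved IPv4 address against the RFC 1918, RFC 6598, and RFC 3927 blocks. It is a sketch, not the package's guard: `isBlockedAddress` is a hypothetical name, the loopback entry is an added assumption, IPv6 is not handled, and the DNS resolution step that would come first is omitted.

```typescript
// [network base, prefix length]
const BLOCKED_CIDRS: Array<[string, number]> = [
  ["10.0.0.0", 8],     // RFC 1918
  ["172.16.0.0", 12],  // RFC 1918
  ["192.168.0.0", 16], // RFC 1918
  ["100.64.0.0", 10],  // RFC 6598 (carrier-grade NAT)
  ["169.254.0.0", 16], // RFC 3927 (link-local)
  ["127.0.0.0", 8],    // loopback (assumption: not listed in the README)
];

// Parses a dotted-quad string into an unsigned 32-bit value, or null if invalid.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function isBlockedAddress(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  // Fail closed: anything that is not a valid dotted quad is rejected here.
  if (addr === null) return true;
  return BLOCKED_CIDRS.some(([base, bits]) => {
    const start = ipv4ToInt(base)!;
    const size = 2 ** (32 - bits);
    return addr >= start && addr < start + size;
  });
}
```

Checking the resolved address rather than the hostname matters because a public-looking hostname can resolve to a private address.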

License

MIT