pi-crawl4ai

Crawl4AI extension for pi — web crawling and structured extraction

Package details

Install pi-crawl4ai from npm and pi will load the resources declared by the package manifest.

$ pi install npm:pi-crawl4ai
Package
pi-crawl4ai
Version
0.1.2
Published
Apr 24, 2026
Downloads
377/mo · 377/wk
Author
romek_rozen
License
MIT
Types
extension, prompt
Size
63.7 KB
Dependencies
0 dependencies · 4 peers
Pi manifest JSON
{
  "extensions": [
    "./extensions/crawl4ai"
  ],
  "agents": [
    "./agents"
  ],
  "prompts": [
    "./prompts"
  ]
}

Security note

Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.

README

pi-crawl4ai

A production-ready pi package that integrates Crawl4AI — an open-source LLM-friendly web crawler — as a custom tool and agent prompt.

Install

pi install npm:pi-crawl4ai

Or from git:

pi install git:github.com/romek-rozen/pi-crawl4ai

What's included

  • extensions/crawl4ai/ — the full extension source (tool, commands, renderers)
  • agents/ — specialized agent definitions for use with subagent/run
  • prompts/crawler.md — a dedicated Crawler prompt template
  • MIT license

Quick start

  1. Install Crawl4AI (inside pi):

    /crawl4ai-install
    

    This creates a Python venv and installs crawl4ai locally.

  2. Verify:

    /crawl4ai-status
    /crawl4ai-test
    
  3. Set up agents (optional, for subagent workflows):

    /crawl4ai-setup-agents
    
  4. Crawl:

    Crawl https://example.com and give me the markdown.
    

Usage

Direct tool usage

The crawl4ai tool is available to the LLM in any pi session. Just ask:

Crawl https://example.com and give me the markdown.
Crawl https://docs.example.com deeply using BFS, max 10 pages.
Extract all product names and prices from https://shop.example.com as JSON.

Prompt templates

Three prompt templates are available:

/crawl4ai https://example.com                       # general — scrape, crawl, or extract
/crawl4ai-scrape https://example.com                 # single page → clean markdown
/crawl4ai-extract https://example.com product prices # single page → structured JSON

Subagent workflows

After running /crawl4ai-setup-agents, three agents become available for the subagent tool:

Single agent:

Use crawl4ai-scrape to get the content of https://example.com
Use crawl4ai-crawl to explore https://docs.example.com with max 10 pages
Use crawl4ai-extract to get all product prices from https://shop.example.com

Parallel execution:

Run 3 crawl4ai-scrape agents in parallel: one for https://example.com, one for https://docs.example.com, one for https://blog.example.com

Chained workflow:

Use a chain: first have crawl4ai-scrape get the page at https://shop.example.com, then have crawl4ai-extract pull structured product data from {previous}

Tool parameters

  • url (required): target URL
  • output_format: markdown (default), markdown-fit, md, md-fit, json, all
  • deep_crawl: bfs, dfs, or best-first
  • max_pages: page limit for a deep crawl
  • question: natural-language question about the page
  • json_extract: LLM extraction prompt (requires an LLM provider configured in crawl4ai)
  • schema_path: JSON schema file for structured extraction (requires extraction_config)
  • extraction_config: extraction strategy config file (YAML/JSON); required with schema_path
  • browser_config / crawler_config: key=value pairs for the browser / crawler configuration
  • bypass_cache: force a fresh crawl
  • output_file: save output directly to a custom path
  • timeout: timeout in seconds (default 60)
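
The tool is normally driven by the model, so you just describe what you want in natural language; for reference, the sketch below (TypeScript, illustrative only) shows the argument shape these parameters imply. The values are placeholders, not output from a real session.

// Illustrative only: the argument shape implied by the parameter list above.
// Values are placeholders; the model usually fills these in for you.
const exampleArgs = {
  url: "https://docs.example.com",  // required
  output_format: "markdown",        // or markdown-fit, md, md-fit, json, all
  deep_crawl: "bfs",                // bfs, dfs, or best-first
  max_pages: 10,                    // cap for the deep crawl
  bypass_cache: true,               // force a fresh crawl
  timeout: 120,                     // seconds (default 60)
};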

Commands

  • /crawl4ai-install: create a local venv and install crawl4ai
  • /crawl4ai-test: run a smoke-test crawl against example.com
  • /crawl4ai-status: show the resolved binary path and version
  • /crawl4ai-clear-cache: remove the local .crawl4ai/cache and .crawl4ai/robots directories
  • /crawl4ai-setup-agents: symlink agents to ~/.pi/agent/agents/ for use with subagent/run

Agents

Three specialized agents are included for use with the subagent tool:

  • crawl4ai-scrape: scrape a single page and extract clean markdown (model: Sonnet)
  • crawl4ai-crawl: crawl multiple linked pages via BFS, DFS, or best-first (model: Sonnet)
  • crawl4ai-extract: extract structured data as JSON, via LLM or CSS/XPath strategies (model: Sonnet)

To set up agents, run inside pi:

/crawl4ai-setup-agents

This symlinks the agent definitions to ~/.pi/agent/agents/. After setup, you can use them with the subagent tool:

Use crawl4ai-scrape to get the content of https://example.com
Use crawl4ai-crawl to explore https://docs.example.com with max 10 pages
Use crawl4ai-extract to get all product prices from https://shop.example.com

JSON extraction requirements

output_format=json requires an extraction strategy:

  • LLM extraction: pass json_extract (e.g. "Extract all product prices"). Requires a configured LLM provider in Crawl4AI (run crwl once interactively or set up ~/.crawl4ai/global.yml).
  • CSS/XPath extraction: pass schema_path together with extraction_config (a sketch of a matching schema file follows at the end of this section). Example extraction config YAML:
    type: json-css
    params:
      verbose: true
    

⚠️ json output with deep_crawl is not supported by Crawl4AI. Use markdown for deep crawls.
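
For CSS/XPath extraction, Crawl4AI's JSON-CSS strategy expects a schema with a base selector and a list of named fields. The sketch below shows what a file referenced by schema_path might contain; the selectors and field names are invented for illustration, and the object is written as a TypeScript literal purely for readability (in practice it lives in a standalone JSON file).

// Hypothetical schema_path contents for the json-css strategy shown above.
// Selectors and field names are examples, not taken from a real site.
const productSchema = {
  name: "Products",
  baseSelector: "div.product",  // one match per extracted record
  fields: [
    { name: "title", selector: "h2.title", type: "text" },
    { name: "price", selector: "span.price", type: "text" },
  ],
};
// Save the object as JSON (e.g. schema.json) and pass its path via schema_path,
// alongside the extraction_config file shown above.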

Architecture

  • index.ts: entry point; wires everything into ExtensionAPI
  • types.ts: Typebox schemas and TypeScript interfaces
  • tool.ts: tool definition, validation, and execution logic
  • args.ts: maps friendly parameters to CLI flags
  • resolve.ts: binary discovery, environment variables, and output path helpers
  • runner.ts: spawns crwl with timeout, abort, and streaming
  • render.ts: custom TUI rendering
  • commands.ts: slash commands (/crawl4ai-*)
  • format.ts: legacy truncation helper
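
As a rough illustration of the args.ts responsibility, the sketch below maps a few friendly parameters onto crwl-style CLI flags. The flag names are assumptions based on the Crawl4AI CLI and may not match the actual implementation.

// Hypothetical sketch of the parameter-to-flag mapping; the real flag names live
// in args.ts and the crwl CLI, and may differ from the ones assumed here.
function toCliArgs(p: { url: string; output_format?: string; deep_crawl?: string; max_pages?: number }): string[] {
  const args = [p.url];
  if (p.output_format) args.push("-o", p.output_format);      // assumed output-format flag
  if (p.deep_crawl) args.push("--deep-crawl", p.deep_crawl);  // assumed deep-crawl flag
  if (p.max_pages != null) args.push("--max-pages", String(p.max_pages));
  return args;
}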

Development

The extension runs inside pi’s extension loader. After modifying source, reload pi and use /crawl4ai-status or /crawl4ai-test to verify.

Troubleshooting

  • "Binary not found" → run /crawl4ai-install or set CRAWL4AI_VENV=/path/to/venv
  • "No default LLM provider configured" → configure a provider in Crawl4AI before using json_extract
  • "the JSON object must be str, bytes or bytearray, not NoneType" → usually missing extraction_config when using schema_path, or the page has no matching content
  • Stale cache → run /crawl4ai-clear-cache
  • Timeouts → increase timeout param for slow sites

License

MIT — see LICENSE