pi-crawl4ai
Crawl4AI extension for pi — web crawling and structured extraction
Package details
Install pi-crawl4ai from npm and Pi will load the resources declared by the package manifest.
```
$ pi install npm:pi-crawl4ai
```

- Package: pi-crawl4ai
- Version: 0.1.2
- Published: Apr 24, 2026
- Downloads: 377/mo · 377/wk
- Author: romek_rozen
- License: MIT
- Types: extension, prompt
- Size: 63.7 KB
- Dependencies: 0 dependencies · 4 peers
Pi manifest JSON
```json
{
  "extensions": [
    "./extensions/crawl4ai"
  ],
  "agents": [
    "./agents"
  ],
  "prompts": [
    "./prompts"
  ]
}
```

Security note
Pi packages can execute code and influence agent behavior. Review the source before installing third-party packages.
README
pi-crawl4ai
A production-ready pi package that integrates Crawl4AI — an open-source LLM-friendly web crawler — as a custom tool and agent prompt.
Install

```
pi install npm:pi-crawl4ai
```

Or from git:

```
pi install git:github.com/romek-rozen/pi-crawl4ai
```
What's included
- `extensions/crawl4ai/` — the full extension source (tool, commands, renderers)
- `agents/` — specialized agent definitions for use with subagent/run
- `prompts/crawler.md` — a dedicated Crawler prompt template
- MIT license
Quick start
1. Install Crawl4AI (inside pi):

   ```
   /crawl4ai-install
   ```

   This creates a Python venv and installs `crawl4ai` locally.

2. Verify:

   ```
   /crawl4ai-status
   /crawl4ai-test
   ```

3. Set up agents (optional, for subagent workflows):

   ```
   /crawl4ai-setup-agents
   ```

4. Crawl:

   ```
   Crawl https://example.com and give me the markdown.
   ```
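If you prefer to prepare the environment by hand, `/crawl4ai-install` amounts to roughly the following shell steps (the venv path here is an assumption; the command manages its own location):

```sh
# Create a local Python venv and install crawl4ai into it
# (venv path is illustrative)
python3 -m venv .crawl4ai/venv
.crawl4ai/venv/bin/pip install crawl4ai

# One-time Crawl4AI post-install step (installs the Playwright browser)
.crawl4ai/venv/bin/crawl4ai-setup
```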
Usage
Direct tool usage
The crawl4ai tool is available to the LLM in any pi session. Just ask:
Crawl https://example.com and give me the markdown.
Crawl https://docs.example.com deeply using BFS, max 10 pages.
Extract all product names and prices from https://shop.example.com as JSON.
Prompt templates
Three prompt templates are available:
```
/crawl4ai https://example.com                          # general — scrape, crawl, or extract
/crawl4ai-scrape https://example.com                   # single page → clean markdown
/crawl4ai-extract https://example.com product prices   # single page → structured JSON
```
Subagent workflows
After running /crawl4ai-setup-agents, three agents become available for the subagent tool:
Single agent:
Use crawl4ai-scrape to get the content of https://example.com
Use crawl4ai-crawl to explore https://docs.example.com with max 10 pages
Use crawl4ai-extract to get all product prices from https://shop.example.com
Parallel execution:
Run 3 crawl4ai-scrape agents in parallel: one for https://example.com, one for https://docs.example.com, one for https://blog.example.com
Chained workflow:
Use a chain: first have crawl4ai-scrape get the page at https://shop.example.com, then have crawl4ai-extract pull structured product data from {previous}
Tool parameters
| Parameter | Description |
|---|---|
| `url` (required) | Target URL |
| `output_format` | `markdown` (default), `markdown-fit`, `md`, `md-fit`, `json`, `all` |
| `deep_crawl` | `bfs`, `dfs`, or `best-first` |
| `max_pages` | Limit for deep crawl |
| `question` | Natural-language question about the page |
| `json_extract` | LLM extraction prompt (requires an LLM provider configured in crawl4ai) |
| `schema_path` | JSON schema file for structured extraction (requires `extraction_config`) |
| `extraction_config` | Extraction strategy config file (YAML/JSON); required with `schema_path` |
| `browser_config` / `crawler_config` | Key=value pairs |
| `bypass_cache` | Force a fresh crawl |
| `output_file` | Save directly to a custom path |
| `timeout` | Seconds; default 60 |
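For example, a deep crawl limited to ten pages could be requested with arguments like these (parameter names and values come from the table above; the surrounding tool-call envelope depends on pi and is not shown):

```json
{
  "url": "https://docs.example.com",
  "output_format": "markdown",
  "deep_crawl": "bfs",
  "max_pages": 10,
  "bypass_cache": true,
  "timeout": 120
}
```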
Commands
| Command | Purpose |
|---|---|
| `/crawl4ai-install` | Create a local venv and install crawl4ai |
| `/crawl4ai-test` | Run a smoke-test crawl on example.com |
| `/crawl4ai-status` | Show the resolved binary path and version |
| `/crawl4ai-clear-cache` | Remove the local `.crawl4ai/cache` and `.crawl4ai/robots` |
| `/crawl4ai-setup-agents` | Symlink agents to `~/.pi/agent/agents/` for use with subagent/run |
Agents
Three specialized agents are included for use with the subagent tool:
| Agent | Purpose | Model |
|---|---|---|
| `crawl4ai-scrape` | Scrape a single page, extract clean markdown | Sonnet |
| `crawl4ai-crawl` | Crawl multiple linked pages (BFS/DFS/best-first) | Sonnet |
| `crawl4ai-extract` | Extract structured data as JSON (LLM or CSS/XPath) | Sonnet |
To set up agents, run inside pi:
```
/crawl4ai-setup-agents
```
This symlinks the agent definitions to ~/.pi/agent/agents/. After setup, you can use them with the subagent tool:
Use crawl4ai-scrape to get the content of https://example.com
Use crawl4ai-crawl to explore https://docs.example.com with max 10 pages
Use crawl4ai-extract to get all product prices from https://shop.example.com
JSON extraction requirements
`output_format=json` requires an extraction strategy:

- LLM extraction: pass `json_extract` (e.g. `"Extract all product prices"`). Requires a configured LLM provider in Crawl4AI (run `crwl` once interactively or set up `~/.crawl4ai/global.yml`).
- CSS/XPath extraction: pass `schema_path` + `extraction_config` (a sample schema file is sketched after this note). Example extraction config YAML:

  ```yaml
  type: json-css
  params:
    verbose: true
  ```

⚠️ `json` output with `deep_crawl` is not supported by Crawl4AI. Use `markdown` for deep crawls.
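The file behind `schema_path` is a Crawl4AI JSON-CSS extraction schema. A minimal sketch, with the selectors and field names made up for illustration:

```json
{
  "name": "products",
  "baseSelector": "div.product",
  "fields": [
    { "name": "title", "selector": "h2", "type": "text" },
    { "name": "price", "selector": ".price", "type": "text" }
  ]
}
```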
Architecture
| File | Responsibility |
|---|---|
| `index.ts` | Entry point — wires everything into `ExtensionAPI` |
| `types.ts` | TypeBox schemas + TypeScript interfaces |
| `tool.ts` | Tool definition, validation, and execution logic |
| `args.ts` | Maps friendly params to CLI flags |
| `resolve.ts` | Binary discovery, env vars, output path helpers |
| `runner.ts` | Spawns `crwl` with timeout, abort, and streaming |
| `render.ts` | Custom TUI rendering |
| `commands.ts` | Slash commands (`/crawl4ai-*`) |
| `format.ts` | Legacy truncation helper |
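For orientation, the spawn-with-timeout pattern that `runner.ts` is responsible for looks roughly like this. This is a minimal sketch using Node's `child_process` and `AbortController`; the function name and error handling are illustrative, not the extension's actual API:

```ts
import { spawn } from "node:child_process";

// Sketch: run the `crwl` binary with a timeout and abort support.
// Binary discovery itself is resolve.ts's job.
function runCrawl(binary: string, args: string[], timeoutSec: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const controller = new AbortController();
    // Abort the child process if it exceeds the timeout
    const timer = setTimeout(() => controller.abort(), timeoutSec * 1000);

    const child = spawn(binary, args, { signal: controller.signal });

    let stdout = "";
    child.stdout?.on("data", (chunk) => (stdout += chunk)); // stream output as it arrives
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err); // includes AbortError on timeout
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(stdout);
      else reject(new Error(`crwl exited with code ${code}`));
    });
  });
}
```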
Development
The extension runs inside pi’s extension loader. After modifying the source, reload pi and use `/crawl4ai-status` or `/crawl4ai-test` to verify.
Troubleshooting
- "Binary not found" → run
/crawl4ai-installor setCRAWL4AI_VENV=/path/to/venv - "No default LLM provider configured" → configure a provider in Crawl4AI before using
json_extract - "the JSON object must be str, bytes or bytearray, not NoneType" → usually missing
extraction_configwhen usingschema_path, or the page has no matching content - Stale cache → run
/crawl4ai-clear-cache - Timeouts → increase
timeoutparam for slow sites
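For instance, to point the extension at a venv you already manage before starting pi:

```sh
# Tell pi-crawl4ai where an existing Crawl4AI venv lives (path is illustrative)
export CRAWL4AI_VENV=/path/to/venv
```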
License
MIT — see LICENSE