# AnyCrawl
**Repository Path**: JonDO/AnyCrawl
## Basic Information
- **Project Name**: AnyCrawl
- **Description**: AnyCrawl is a high-performance crawling and scraping toolkit:
SERP crawling: multiple search engines, batch-friendly
Web scraping: single-page content extraction
Site crawling: full-site traversal and collection
High performance: multi-threading / multi-process
Batch tasks: reliable and efficient
AI extraction: LLM-powered structured data (JSON) extraction from pages
LLM-friendly. Easy to integrate and use.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-23
- **Last Updated**: 2025-08-26
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## 📖 Overview
AnyCrawl is a high‑performance crawling and scraping toolkit:
- **SERP crawling**: multiple search engines, batch‑friendly
- **Web scraping**: single‑page content extraction
- **Site crawling**: full‑site traversal and collection
- **High performance**: multi‑threading / multi‑process
- **Batch tasks**: reliable and efficient
- **AI extraction**: LLM‑powered structured data (JSON) extraction from pages
LLM‑friendly. Easy to integrate and use.
## 🚀 Quick Start
📖 See full docs: [Docs](https://docs.anycrawl.dev)
## 📚 Usage Examples
💡 Use the [Playground](https://anycrawl.dev/playground) to test APIs and generate code in your preferred language.
> If self‑hosting, replace `https://api.anycrawl.dev` with your own server URL.
### Web Scraping (Scrape)
#### Example
```bash
curl -X POST https://api.anycrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
"url": "https://example.com",
"engine": "cheerio"
}'
```
#### Parameters
| Parameter | Type | Description | Default |
| --------- | ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| url | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| engine | string | Scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering with modern engine), `puppeteer` (JavaScript rendering with Chrome) | cheerio |
| proxy | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | _(none)_ |
More parameters: see [Request Parameters](https://docs.anycrawl.dev/en/general/scrape#request-parameters).
#### LLM Extraction
```bash
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"schema": {
"type": "object",
"properties": {
"company_mission": { "type": "string" },
"is_open_source": { "type": "boolean" },
"employee_count": { "type": "number" }
},
"required": ["company_mission"]
}
}
}'
```
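The `json_options.schema` above is standard JSON Schema. A minimal sketch of the constraint it expresses, checking `required` keys and primitive property types with the stdlib only (a real consumer would use a full validator such as the `jsonschema` package):

```python
# Minimal, illustrative schema check: required keys must be present and
# typed properties must match. Not a complete JSON Schema validator.
TYPE_MAP = {"string": str, "boolean": bool, "number": (int, float)}

def matches_schema(data: dict, schema: dict) -> bool:
    for key in schema.get("required", []):
        if key not in data:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "is_open_source": {"type": "boolean"},
        "employee_count": {"type": "number"},
    },
    "required": ["company_mission"],
}
print(matches_schema({"company_mission": "crawl anything"}, schema))  # True
```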
### Site Crawling (Crawl)
#### Example
```bash
curl -X POST https://api.anycrawl.dev/v1/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
"url": "https://example.com",
"engine": "playwright",
"max_depth": 2,
"limit": 10,
"strategy": "same-domain"
}'
```
#### Parameters
| Parameter | Type | Description | Default |
| -------------- | ----------------- | ----------------------------------------------------------------------------------------- | ----------- |
| url | string (required) | Starting URL to crawl | - |
| engine | string | Crawling engine. Options: `cheerio`, `playwright`, `puppeteer` | cheerio |
| max_depth | number | Max depth from the start URL | 10 |
| limit | number | Max number of pages to crawl | 100 |
| strategy | enum | Scope: `all`, `same-domain`, `same-hostname`, `same-origin` | same-domain |
| include_paths | array | Only crawl paths matching these patterns | _(none)_ |
| exclude_paths | array | Skip paths matching these patterns | _(none)_ |
| scrape_options | object | Per-page scrape options (formats, timeout, json extraction, etc.), same as Scrape options | _(none)_ |
More parameters and endpoints: see [Request Parameters](https://docs.anycrawl.dev/en/general/scrape#request-parameters).
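The `strategy` values differ in how much of a candidate URL must match the start URL. The sketch below infers the semantics from the strategy names (check the docs for the exact behavior; the registrable-domain heuristic here is deliberately naive):

```python
from urllib.parse import urlsplit

def in_scope(start: str, candidate: str, strategy: str = "same-domain") -> bool:
    """Rough sketch of crawl scoping; semantics inferred from the strategy names."""
    s, c = urlsplit(start), urlsplit(candidate)
    if strategy == "all":
        return True
    if strategy == "same-origin":    # scheme + host + port must all match
        return (s.scheme, s.netloc) == (c.scheme, c.netloc)
    if strategy == "same-hostname":  # exact hostname match only
        return s.hostname == c.hostname
    if strategy == "same-domain":    # naive registrable domain: blog.example.com ~ example.com
        root = lambda h: ".".join(h.split(".")[-2:]) if h else ""
        return root(s.hostname) == root(c.hostname)
    raise ValueError(f"unknown strategy: {strategy}")

print(in_scope("https://example.com", "https://blog.example.com/post"))  # True
print(in_scope("https://example.com", "https://blog.example.com/post", "same-hostname"))  # False
```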
### Search Engine Results (SERP)
#### Example
```bash
curl -X POST https://api.anycrawl.dev/v1/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
-d '{
"query": "AnyCrawl",
"limit": 10,
"engine": "google",
"lang": "all"
}'
```
#### Parameters
| Parameter | Type | Description | Default |
| --------- | ----------------- | ---------------------------------------------------------- | ------- |
| `query` | string (required) | Search query to be executed | - |
| `engine` | string | Search engine to use. Options: `google` | google |
| `pages` | integer | Number of search result pages to retrieve | 1 |
| `lang` | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |
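The search body follows the same pattern as scrape and crawl. A small illustrative builder (not an official SDK helper) with defaults mirroring the table above:

```python
def build_search_payload(query: str, engine: str = "google",
                         pages: int = 1, lang: str = "en-US") -> dict:
    """Assemble a /v1/search request body; defaults mirror the parameter table.
    Illustrative only, not part of the AnyCrawl API surface."""
    if not query:
        raise ValueError("query is required")
    return {"query": query, "engine": engine, "pages": pages, "lang": lang}

print(build_search_payload("AnyCrawl", lang="all"))
```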
#### Supported search engines
- Google
## ❓ FAQ
1. **Can I use proxies?** Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the `proxy` request parameter (per request) or `ANYCRAWL_PROXY_URL` (self‑hosting).
2. **How do I handle JavaScript-rendered pages?** Use the `playwright` or `puppeteer` engine.
## 🤝 Contributing
We welcome contributions! See the [Contributing Guide](CONTRIBUTING.md).
## 📄 License
MIT License — see [LICENSE](LICENSE).
## 🎯 Mission
We build simple, reliable, and scalable tools for the AI ecosystem.
---
Built with ❤️ by the Any4AI team