# AnyCrawl

**Repository Path**: JonDO/AnyCrawl

## Basic Information

- **Project Name**: AnyCrawl
- **Description**: AnyCrawl is a high-performance crawling and scraping toolkit: SERP crawling (multiple search engines, batch-friendly), web scraping (single-page content extraction), site crawling (full-site traversal and collection), high performance (multi-threading/multi-process), batch tasks (reliable and efficient), and AI extraction (LLM-powered structured data (JSON) extraction from pages). LLM-friendly; easy to integrate and use.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-23
- **Last Updated**: 2025-08-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com) [![LLM Ready](https://img.shields.io/badge/LLM-Ready-blueviolet)](https://github.com/any4ai/anycrawl) [![Documentation](https://img.shields.io/badge/📖-Documentation-blue)](https://docs.anycrawl.dev)

Node.js · TypeScript · Redis

## 📖 Overview

AnyCrawl is a high‑performance crawling and scraping toolkit:

- **SERP crawling**: multiple search engines, batch‑friendly
- **Web scraping**: single‑page content extraction
- **Site crawling**: full‑site traversal and collection
- **High performance**: multi‑threading / multi‑process
- **Batch tasks**: reliable and efficient
- **AI extraction**: LLM‑powered structured data (JSON) extraction from pages

LLM‑friendly. Easy to integrate and use.

## 🚀 Quick Start

📖 See the full docs: [Docs](https://docs.anycrawl.dev)

## 📚 Usage Examples

💡 Use the [Playground](https://anycrawl.dev/playground) to test APIs and generate code in your preferred language.

> If self‑hosting, replace `https://api.anycrawl.dev` with your own server URL.

### Web Scraping (Scrape)

#### Example

```bash
curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "cheerio"
  }'
```

#### Parameters

| Parameter | Type              | Description                                                                                                                                                                          | Default  |
| --------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- |
| url       | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https://                                                                                                         | -        |
| engine    | string            | Scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering with a modern engine), `puppeteer` (JavaScript rendering with Chrome) | cheerio  |
| proxy     | string            | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port`                                                                        | _(none)_ |

More parameters: see [Request Parameters](https://docs.anycrawl.dev/en/general/scrape#request-parameters).

#### LLM Extraction

```bash
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
```
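The same request can be sent from Node.js without an SDK. Below is a minimal sketch, assuming Node.js 18+ (for the built‑in `fetch`) and an API key in an `ANYCRAWL_API_KEY` environment variable; the `scrape` helper is a local convenience, not a published client, and the result shape depends on the options you request (see the docs):

```typescript
// Minimal scrape call against the AnyCrawl HTTP API (Node.js 18+, built-in fetch).
// Assumes your API key is exported as ANYCRAWL_API_KEY.
async function scrape(url: string, engine: "cheerio" | "playwright" | "puppeteer" = "cheerio") {
    const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
        },
        body: JSON.stringify({ url, engine }),
    });
    if (!response.ok) {
        throw new Error(`Scrape failed: ${response.status} ${await response.text()}`);
    }
    return response.json(); // shape depends on the requested formats; see the docs
}

scrape("https://example.com").then((result) => console.log(result));
```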
### Site Crawling (Crawl)

#### Example

```bash
curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "playwright",
    "max_depth": 2,
    "limit": 10,
    "strategy": "same-domain"
  }'
```

#### Parameters

| Parameter      | Type              | Description                                                                                | Default     |
| -------------- | ----------------- | ------------------------------------------------------------------------------------------ | ----------- |
| url            | string (required) | Starting URL to crawl                                                                      | -           |
| engine         | string            | Crawling engine. Options: `cheerio`, `playwright`, `puppeteer`                             | cheerio     |
| max_depth      | number            | Max depth from the start URL                                                               | 10          |
| limit          | number            | Max number of pages to crawl                                                               | 100         |
| strategy       | enum              | Scope: `all`, `same-domain`, `same-hostname`, `same-origin`                                | same-domain |
| include_paths  | array             | Only crawl paths matching these patterns                                                   | _(none)_    |
| exclude_paths  | array             | Skip paths matching these patterns                                                         | _(none)_    |
| scrape_options | object            | Per-page scrape options (formats, timeout, JSON extraction, etc.), same as Scrape options  | _(none)_    |

More parameters and endpoints: see [Request Parameters](https://docs.anycrawl.dev/en/general/scrape#request-parameters).

### Search Engine Results (SERP)

#### Example

```bash
curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
```

#### Parameters

| Parameter | Type              | Description                                                   | Default |
| --------- | ----------------- | ------------------------------------------------------------- | ------- |
| `query`   | string (required) | Search query to be executed                                   | -       |
| `engine`  | string            | Search engine to use. Options: `google`                       | google  |
| `pages`   | integer           | Number of search result pages to retrieve                     | 1       |
| `lang`    | string            | Language code for search results (e.g., `en`, `zh`, `all`)    | en-US   |

#### Supported search engines

- Google
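Since the crawl and search endpoints follow the same JSON‑over‑POST pattern as scrape, one small helper can drive both. A sketch under the same assumptions as the scrape example above (Node.js 18+, `ANYCRAWL_API_KEY` set); the `anycrawl` helper is illustrative, and the payloads simply mirror the curl examples:

```typescript
// Shared POST helper for the crawl and search endpoints (Node.js 18+, built-in fetch).
// Responses are typed as `unknown` because the exact shapes are documented per endpoint.
async function anycrawl(endpoint: "crawl" | "search", payload: object): Promise<unknown> {
    const response = await fetch(`https://api.anycrawl.dev/v1/${endpoint}`, {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
        },
        body: JSON.stringify(payload),
    });
    if (!response.ok) {
        throw new Error(`${endpoint} failed: ${response.status} ${await response.text()}`);
    }
    return response.json();
}

// Crawl up to 10 same-domain pages, at most two levels deep (mirrors the curl example).
const crawl = anycrawl("crawl", {
    url: "https://example.com",
    engine: "playwright",
    max_depth: 2,
    limit: 10,
    strategy: "same-domain",
});

// One page of Google results for "AnyCrawl" (mirrors the curl example).
const search = anycrawl("search", { query: "AnyCrawl", limit: 10, engine: "google", lang: "all" });

Promise.all([crawl, search]).then(([crawlResult, searchResult]) => {
    console.log(crawlResult, searchResult);
});
```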
## ❓ FAQ

1. **Can I use proxies?**
   Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the `proxy` request parameter (per request) or `ANYCRAWL_PROXY_URL` (self‑hosting).
2. **How do I handle JavaScript‑rendered pages?**
   Use the `playwright` or `puppeteer` engine.

## 🤝 Contributing

We welcome contributions! See the [Contributing Guide](CONTRIBUTING.md).

## 📄 License

MIT License; see [LICENSE](LICENSE).

## 🎯 Mission

We build simple, reliable, and scalable tools for the AI ecosystem.

---

Built with ❤️ by the Any4AI team