# chirrup

**Repository Path**: Yi_AI/chirrup

## Basic Information

- **Project Name**: chirrup
- **Description**: An inference tool for RWKV
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-16
- **Last Updated**: 2026-01-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README
Chirrup Logo
# Chirrup

> /ˈCHirəp/ — (especially of a small bird) make repeated short high-pitched sounds; twitter.

**Chirrup** is a high-performance inference frontend for RWKV models, built on top of [Albatross](https://github.com/BlinkDL/Albatross).

---

## 📊 Performance

### November 12, 2025

| GPU Configuration   | Model | Workers | BSZ/Worker | Total Concurrent Requests | TPS per Request |
| ------------------- | ----- | ------- | ---------- | ------------------------- | --------------- |
| 4 × RTX 4090 24GB   | 7.2B  | 4       | 200        | 800                       | 16 tps          |
| 4 × Tesla V100 16GB | 7.2B  | 4       | 34         | 136                       | 17 tps          |

> **Note**: The RTX 4090 configuration is far from the GPU's processing limits, with significant optimization potential remaining.

## ✨ Features

### ✅ Implemented

- **High Performance**: Leverages the blazing-fast inference engine from [Albatross](https://github.com/BlinkDL/Albatross).
- **Continuous Batching**: Maximizes GPU utilization by dynamically batching incoming requests.
- **State Cache**: Reuses computation states for long-context inputs, significantly improving throughput as context length increases.
- **OpenAI-Compatible API**: Drop-in replacement for existing LLM workflows — no code changes needed.

### 🔜 Planned

- [ ] CUDA Graph support for reduced kernel launch overhead
- [ ] Prefill-Decode separation for optimized scheduling
- [ ] Constrained decoding (e.g., JSON schema)
- [ ] Function Calling support
- [ ] Pipeline parallelism to enable inference of even larger models

---

## 🚀 Getting Started

### 1. Download a Model

Visit the official model hub and download an RWKV-7 `g1` series model that fits your needs:

👉 [https://huggingface.co/BlinkDL/rwkv7-g1/tree/main](https://huggingface.co/BlinkDL/rwkv7-g1/tree/main)

### 2. Set Up Environment

For **best performance**, we strongly recommend using **Python 3.14t (free threading)** via `uv`.

```bash
# Clone the repository and enter it
git clone --recurse-submodules https://github.com/leonsama/chirrup.git
cd chirrup

# Create a Python 3.14t virtual environment
uv venv --python 3.14t

# Activate it
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Install Chirrup
uv pip install -e .

# Install dependencies with CUDA 12.9 support and dev tools
uv sync --extra torch-cu129 --dev
```

> 💡 You may use `torch-cu126` instead if your system requires it, or customize the PyTorch backend in `pyproject.toml`.

#### For ROCm users

If you are using a ROCm device, install the dependencies with the following commands instead:

```bash
git clone --recurse-submodules https://github.com/leonsama/chirrup.git
cd chirrup
uv venv --python 3.14t
source .venv/bin/activate
uv sync --extra dev
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4
```

---

## 🌐 Start API Service

### Quick Start

```bash
# Currently, `triton._C.libtriton` doesn't declare itself GIL-safe, but it actually works fine—so we
# manually disable the GIL with `PYTHON_GIL=0`.
PYTHON_GIL=0 uv run --frozen python -m chirrup.web_service.app --model_path /path/to/your/model
```

The service will start at **`http://127.0.0.1:8000`**, providing OpenAI-compatible API endpoints.

📖 **Detailed Documentation**: Check the [Chirrup API Documentation](./Docs/API.md) for complete command-line parameters and API interface documentation.
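As a quick smoke test of those endpoints, the sketch below queries the running service with the official `openai` Python client. The `/v1` prefix, the placeholder model name, and the dummy API key are assumptions based on common OpenAI API conventions rather than values confirmed by Chirrup; adjust them to match your deployment (see the API documentation above).

```python
# Minimal sketch: querying Chirrup's OpenAI-compatible endpoint with the `openai` client.
# Assumptions: the service listens on 127.0.0.1:8000 under the standard /v1 prefix,
# accepts any API key, and "rwkv7-g1" is a placeholder model name.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

# Stream a chat completion token by token.
stream = client.chat.completions.create(
    model="rwkv7-g1",  # placeholder; use the model name your deployment expects
    messages=[{"role": "user", "content": "Why is 42 an interesting number?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```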
---

## 🧪 Run Demos

### Stream Output (Single Request)

[**Demo:**](./test/demo_stream_output.py)

```bash
PYTHON_GIL=0 uv run --frozen test/demo_stream_output.py --model_path /path/to/your/model
```

**Code Example:**

```python
import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main(model_path: str) -> None:
    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    await engine_core.init(worker_num=1, model_config=model_config, batch_size=4)

    # Prompt: "Why is 42 an interesting number?"
    prompt = "User: 为什么 42 是一个有趣的数字?\n\nAssistant:"
    completion = engine_core.completion(prompt)

    print(prompt, end="", flush=True)
    # "token" events carry the decoded text in the third tuple field.
    async for event in completion:
        if event[0] == "token":
            print(event[2], end="", flush=True)


asyncio.run(main("/path/to/your/model"))
```

### Batch Inference (Concurrent Requests)

[**Demo:**](./test/demo_batch_output.py)

```bash
PYTHON_GIL=0 uv run --frozen test/demo_batch_output.py --model_path /path/to/your/model --batch_size 32 --task_num 512 --worker_num 4
```

**Code Example:**

```python
import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main(model_path: str) -> None:
    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    # batch_size = max_batch + 1
    await engine_core.init(worker_num=4, model_config=model_config, batch_size=33)

    # Prompts: "Why is {i} an interesting number?"
    prompts = [f"User: 为什么 {i} 是一个有趣的数字?\n\nAssistant: \n" for i in range(512)]
    results = await asyncio.gather(
        *[engine_core.completion(prompt).get_full_completion() for prompt in prompts]
    )
    print(f"Collected {len(results)} completions.")


asyncio.run(main("/path/to/your/model"))
```

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 🙏 Acknowledgments

- Thanks to [**RWKV-Vibe/rwkv_lightning**](https://github.com/RWKV-Vibe/rwkv_lightning) for inspiration and to its author **Alic** for valuable guidance.
- Thanks to **Jellyfish** for the [**continuous batching implementation**](https://github.com/BlinkDL/Albatross/pull/5) in Albatross.

---
🐦 Like a chirping bird — lightweight, fast, and always responsive.
Built with ❤️ for the RWKV ecosystem