# mini_sglang
**Repository Path**: github_zoo/mini-sglang
## Basic Information
- **Project Name**: mini_sglang
- **Description**: https://github.com/sgl-project/mini-sglang.git
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-18
- **Last Updated**: 2025-12-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Mini-SGLang
A **lightweight yet high-performance** inference framework for Large Language Models.
---
Mini-SGLang is a compact implementation of [SGLang](https://github.com/sgl-project/sglang), designed to demystify the complexities of modern LLM serving systems. At roughly **5,000 lines of Python**, it serves as both a capable inference engine and a transparent reference for researchers and developers.
## ✨ Key Features
- **High Performance**: Achieves state-of-the-art throughput and latency with advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
- **Radix Cache**: Reuses KV cache for shared prefixes across requests.
- **Chunked Prefill**: Reduces peak memory usage for long-context serving.
- **Overlap Scheduling**: Hides CPU scheduling overhead with GPU computation.
- **Tensor Parallelism**: Scales inference across multiple GPUs.
- **Optimized Kernels**: Integrates **FlashAttention** and **FlashInfer** for maximum efficiency.
- ...and more.
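The radix-cache idea above can be illustrated with a toy token-ID trie: a path from the root is a cached prefix, and a new request reuses as many cached tokens as it shares with earlier requests. This is only a sketch of the matching logic; the real implementation manages GPU KV blocks, and all names below are illustrative.

```python
class RadixNode:
    """One node per token ID; a root-to-node path is a cached prefix."""
    def __init__(self) -> None:
        self.children: dict[int, "RadixNode"] = {}


class RadixCache:
    """Toy prefix cache: measures how many tokens of a new request
    can reuse previously cached KV entries."""
    def __init__(self) -> None:
        self.root = RadixNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Cache the full token sequence so later requests can share it."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())


cache = RadixCache()
cache.insert([1, 2, 3, 4])          # first request populates the cache
shared = cache.match_prefix([1, 2, 3, 9])  # second request shares 3 tokens
```

In a real serving system the matched prefix maps to existing KV-cache blocks, so only the unmatched suffix needs a prefill pass.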
## 🚀 Quick Start
### 1. Environment Setup
We recommend using `uv` for a fast and reliable installation; it coexists with `conda` without conflict.
```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```
**Prerequisites**: Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the **NVIDIA CUDA Toolkit** is installed and that its version is compatible with your driver; `nvidia-smi` reports the highest CUDA version your driver supports.
### 2. Installation
Install Mini-SGLang from source:
```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
```
### 3. Online Serving
Launch an OpenAI-compatible API server with a single command.
```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"
# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with Tensor Parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```
Once the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client.
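For example, a chat-completions request could look like the sketch below, using only the standard library. The route and payload fields follow the OpenAI API convention; which optional fields the server accepts is an assumption, not verified here.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def send(payload: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the payload to a running server (requires a live server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request("Qwen/Qwen3-0.6B", "Hello!")
```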
### 4. Interactive Shell
Chat with your model directly in the terminal by adding the `--shell` flag.
```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

Within the shell, use the `/reset` command to clear the chat history.
## Benchmark
### Offline inference
See [bench.py](./benchmark/offline/bench.py) for more details. Set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` to run an ablation study of overlap scheduling.
Test Configuration:
- Hardware: 1xH200 GPU.
- Model: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100-1024 tokens
- Output Length: Randomly sampled between 100-1024 tokens
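The request shape described above can be reproduced with a small generator. This is a sketch of the sampling scheme, not the benchmark script itself; uniform sampling and the seed are assumptions.

```python
import random


def make_requests(
    n: int = 256, lo: int = 100, hi: int = 1024, seed: int = 0
) -> list[tuple[int, int]]:
    """Sample (input_len, output_len) pairs uniformly in [lo, hi],
    mirroring the benchmark configuration above."""
    rng = random.Random(seed)
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]


requests = make_requests()
```

Fixing the seed keeps the workload identical across runs, which matters when comparing configurations such as overlap scheduling on and off.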

### Online inference
See [bench_qwen.py](./benchmark/online/bench_qwen.py) for more details.
Test Configuration:
- Hardware: 4xH200 GPUs, connected via NVLink.
- Model: Qwen3-32B
- Dataset: [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl), replaying first 1000 requests.
Launch command:
```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive
# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
--disable-radix --port 1919 --decode-attention flashinfer
```
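Replaying the first 1000 requests of a JSONL trace like the one linked above can be sketched as follows. The loader only assumes one JSON object per line; the field names inside each record depend on the trace file and are not assumed here.

```python
import json


def load_trace(path: str, limit: int = 1000) -> list[dict]:
    """Read at most `limit` JSON records from a JSONL trace file."""
    records: list[dict] = []
    with open(path) as f:
        for line in f:
            if len(records) >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

A replay harness would then iterate over these records and issue requests at their recorded timestamps (or as fast as possible for a stress test).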

## 📚 Learn More
- **[Detailed Features](./docs/features.md)**: Explore all available features and command-line arguments.
- **[System Architecture](./docs/structures.md)**: Dive deep into the design and data flow of Mini-SGLang.