
# Mini-SGLang

A **lightweight yet high-performance** inference framework for Large Language Models.

---

Mini-SGLang is a compact implementation of [SGLang](https://github.com/sgl-project/sglang), designed to demystify the complexities of modern LLM serving systems. With a codebase of **~5,000 lines of Python**, it serves as both a capable inference engine and a transparent reference for researchers and developers.

## ✨ Key Features

- **High Performance**: Achieves state-of-the-art throughput and latency with advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
  - **Radix Cache**: Reuses KV cache for shared prefixes across requests.
  - **Chunked Prefill**: Reduces peak memory usage for long-context serving.
  - **Overlap Scheduling**: Hides CPU scheduling overhead behind GPU computation.
  - **Tensor Parallelism**: Scales inference across multiple GPUs.
  - **Optimized Kernels**: Integrates **FlashAttention** and **FlashInfer** for maximum efficiency.
  - ...

## 🚀 Quick Start

### 1. Environment Setup

We recommend using `uv` for a fast and reliable installation (note that `uv` does not conflict with `conda`).

```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```

**Prerequisites**: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure you have the **NVIDIA CUDA Toolkit** installed and that its version matches your driver's version. You can check your driver's CUDA capability with `nvidia-smi`.

### 2. Installation

Install Mini-SGLang directly from source:

```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
```

### 3. Online Serving

Launch an OpenAI-compatible API server with a single command.

```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with Tensor Parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```

Once the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client, as in the sketch below.
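As a minimal sketch, here is how you might query the server with the official `openai` Python client. The port (`30000`), the `/v1` base path, and the placeholder API key are assumptions based on the launch examples above and the usual conventions of OpenAI-compatible servers, not values documented here, so adjust them to match your setup.

```python
from openai import OpenAI

# Assumed endpoint: an OpenAI-compatible server on localhost:30000 (match your --port).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# The model name should match the one passed to `python -m minisgl --model ...`.
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```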
### 4. Interactive Shell

Chat with your model directly in the terminal by adding the `--shell` flag.

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

![shell-example](https://lmsys.org/images/blog/minisgl/shell.png)

You can also use `/reset` to clear the chat history.

## Benchmark

### Offline inference

See [bench.py](./benchmark/offline/bench.py) for more details. Set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` for an ablation study of overlap scheduling.

Test Configuration:

- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens

![offline](https://lmsys.org/images/blog/minisgl/offline.png)

### Online inference

See [bench_qwen.py](./benchmark/online/bench_qwen.py) for more details.

Test Configuration:

- Hardware: 4x H200 GPUs, connected by NVLink
- Model: Qwen3-32B
- Dataset: [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl), replaying the first 1000 requests

Launch commands:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
  --disable-radix --port 1919 --decode-attention flashinfer
```

![online](https://lmsys.org/images/blog/minisgl/online.png)

## 📚 Learn More

- **[Detailed Features](./docs/features.md)**: Explore all available features and command-line arguments.
- **[System Architecture](./docs/structures.md)**: Dive deep into the design and data flow of Mini-SGLang.