# nmoe

```
 _ __  _ __ ___   ___   ___
| '_ \| '_ ` _ \ / _ \ / _ \
| | | | | | | | | (_) |  __/
|_| |_|_| |_| |_|\___/ \___|
```

> No all-to-all. No tensor parallel. B200-only.

This repo is an opinionated Mixture-of-Experts trainer hard-targeted at NVIDIA Blackwell B200 (`sm_100a`). MoE expert parallelism is implemented via **RDEP**: direct dispatch/return using CUDA IPC (intra-node) and NVSHMEM (inter-node) instead of NCCL all-to-all collectives on the expert path.

Upstream: https://github.com/Noumena-Network/nmoe.git

## Quick start

This repository is **container-first**. The supported way to build and run is via the Dockerfiles in `docker/`.

Boot a machine with B200 GPUs and run a minimal single-GPU smoke test (`moonlet`) inside the training image:

```bash
# Build base image (Dockerfile.train expects this tag)
docker build -f docker/Dockerfile.base -t xjdr/nmoe:base .

# Build training image
docker build -f docker/Dockerfile.train -t xjdr/nmoe_train:latest .

# Run single-GPU training (mount /data for datasets, checkpoints, metrics)
docker run --gpus all -v /data:/data xjdr/nmoe_train:latest \
  python -m nmoe.train configs/moonlet.toml
```

## Multi-GPU and multi-node

Single-node (8×GPU) training:

```bash
torchrun --standalone --nproc_per_node=8 -m nmoe.train configs/moonlight.toml
```

Multi-node runs require NVSHMEM. Build the NVSHMEM-enabled image:

```bash
docker build -f docker/Dockerfile.dist -t xjdr/nmoe_dist:latest .
```

Kubernetes manifests in `k8s/` are templates for training, NVIZ, and profiling; edit hostnames, images, and storage before deploying.

## Configs

| Config | Model | Experts | GPUs | Use Case |
|--------|-------|---------|------|----------|
| `moonlet.toml` | 7B | 64 (6 active) | 1 | Single-GPU research |
| `moonlight.toml` | 16B | 64 (6 active) | 8 | Single-node RDEP |
| `dsv2.toml` | DeepSeek-V2 | 160 (6 active) | 8+ | Multi-node |
| `dsv3.toml` | DeepSeek-V3 | 256 (8 active) | 32+ | Production |

## Why RDEP

Traditional MoE training routes tokens with NCCL all-to-all, so every GPU waits on every other GPU. RDEP replaces this with direct NVSHMEM puts: each GPU writes tokens straight into the expert owner's buffer. No collective. No barrier. No waiting.

```
Source rank                      Owner rank
───────────                      ──────────
tokens ──▶ dispatch ───────────▶ symmetric buffer
           │                            │
           │ nvshmem_putmem             │
           │ + atomic slot              ▼
           │                       expert GEMM
           │                            │
output ◀── scatter ◀──────────── return
```

## Data

Training consumes pre-tokenized `.npy` shards.

**Preprocess from HuggingFace:**

```bash
python -m nmoe.data.cli prep \
  --source hf \
  --dataset HuggingFaceFW/fineweb-edu \
  --output /data/fineweb_edu \
  --name fineweb_edu
```

Two workflows:

- **Direct shards** (research): set `data_path` in config
- **Flows** (production): set `flow_mode`, `mixture_toml`, `flow_profiles_toml`

See `nmoe/data/README.md` for the full data pipeline.

## Metrics & NVIZ

Training writes:

- Experiments → SQLite (`/data/experiments.db`)
- Metrics → DuckDB (`/data/metrics/{run_id}/rank_{rank}.duckdb`)

NVIZ is the included dashboard. See `nviz/README.md`.
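For quick inspection outside NVIZ, the per-rank DuckDB files can be opened directly with the `duckdb` Python package. Here is a minimal sketch; the table name (`metrics`) and its `step`/`name`/`value` columns are assumptions for illustration, and the real schema is whatever `nmoe/metrics.py` writes, so discover it with `SHOW TABLES` first:

```python
# Minimal sketch: read training metrics straight from a rank's DuckDB file.
# ASSUMPTIONS: table "metrics" with columns step/name/value, run id "my_run".
# The actual schema is defined by nmoe/metrics.py.
import duckdb

run_id = "my_run"  # hypothetical run id
con = duckdb.connect(f"/data/metrics/{run_id}/rank_0.duckdb", read_only=True)

# Discover the actual tables before querying.
print(con.execute("SHOW TABLES").fetchall())

# Then, e.g., pull a loss curve (under the assumed schema above).
rows = con.execute(
    "SELECT step, value FROM metrics WHERE name = 'loss' ORDER BY step"
).fetchall()
con.close()
```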
## Architecture

```
nmoe/
├── train.py       # Training loop
├── model.py       # Transformer + MoE
├── moe.py         # Fused MoE autograd
├── rdep.py        # RDEP orchestration
├── checkpoint.py  # Split checkpoints
├── config.py      # TOML config
├── metrics.py     # DuckDB writer
├── csrc/          # CUDA kernels
├── data/          # Data pipeline, HYDRA
├── attention/     # MLA, DSA, SWA
└── eval/          # Evaluation hooks
```

## What's Inside

**RDEP Kernels** — Fused dispatch/return using NVSHMEM (inter-node) and IPC (intra-node). BF16 and blockscaled (FP8/NVFP4) paths.

**Grouped GEMMs** — cuBLASLt with per-expert scaling. SM100-optimized via CuTe DSL.

**Deterministic Resume** — Checkpoints include RNG state, the shard cursor, and a config fingerprint, so interrupted runs resume deterministically. A minimal sketch of the idea appears in the appendix at the end of this README.

**HYDRA** — LLM-as-judge data quality pipeline. See `nmoe/data/HYDRA.md`. This repo includes `nmoe/data/hydra_judge.pt` (a small judge-head `state_dict`); see `nmoe/data/HYDRA_JUDGE_HEAD.md`.

## Tests

The project is primarily validated via end-to-end training runs. Some Triton kernels include optional `pytest`-guarded tests inside the module (e.g. `nmoe/triton/nsa.py`, `nmoe/triton/swa.py`).

## Contributing

nmoe is intentionally narrow and opinionated: B200-only (`sm_100a`), RDEP expert parallelism, TOML configs, and no NCCL all-to-all on the MoE path. We prefer one clear way to do each supported job over many interchangeable stacks.

## Acknowledgements

This codebase borrows ideas from and interoperates with upstream ecosystems including PyTorch, Triton, NVSHMEM, CUTLASS, and the DeepSeek family of MoE architectures. See `THIRD_PARTY_NOTICES.md` for license attributions.

## Cite

```bibtex
@misc{nmoe,
  title     = {nmoe: B200-targeted MoE training with RDEP},
  year      = {2025},
  publisher = {GitHub}
}
```

## Non-Goals

- Tensor parallel (ever)
- NCCL all-to-all for MoE (ever)
- H100/A100 support
- Fallback paths

One hardware target. One distribution strategy. B200 or bust.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `sm_100a` errors | You need a B200. No workarounds. |
| NVSHMEM init fails | Use IPC mode for single-node, or check the bootstrap config |
| OOM | Reduce `batch_size` or `seq_len` |

## License

Apache-2.0. See `LICENSE`, `NOTICE`, and `THIRD_PARTY_NOTICES.md`.
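## Appendix: Deterministic resume sketch

This is a rough illustration of what deterministic resume entails, not nmoe's actual `checkpoint.py` (whose split-checkpoint format lives in the repo). The payload keys, the `shard_cursor` field, and both function names are illustrative assumptions; only the PyTorch RNG-state calls are real APIs.

```python
# Minimal sketch of deterministic-resume state capture. NOT nmoe's
# checkpoint format; payload keys and shard_cursor are assumptions.
import hashlib
import torch

def save_resume_state(path, step, shard_cursor, config_text):
    payload = {
        "step": step,
        # Data-loader position, so a resumed run reads the same tokens next.
        "shard_cursor": shard_cursor,
        # CPU and per-device CUDA RNG state, so random streams replay.
        "cpu_rng": torch.get_rng_state(),
        "cuda_rng": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else [],
        # Fingerprint of the TOML config; refuse to resume if it changed.
        "config_fingerprint": hashlib.sha256(config_text.encode()).hexdigest(),
    }
    torch.save(payload, path)

def load_resume_state(path, config_text):
    payload = torch.load(path, map_location="cpu")
    fingerprint = hashlib.sha256(config_text.encode()).hexdigest()
    if payload["config_fingerprint"] != fingerprint:
        raise RuntimeError("config changed since checkpoint; refusing to resume")
    torch.set_rng_state(payload["cpu_rng"])
    if payload["cuda_rng"] and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(payload["cuda_rng"])
    return payload["step"], payload["shard_cursor"]
```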