diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen3_vl_8b_singleNPU/qwen3_vl_8b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen3_vl_8b_singleNPU/qwen3_vl_8b_singleNPU.md
new file mode 100644
index 0000000000000000000000000000000000000000..da4c959b8be054b49e9c5b170051edd05dfaf75b
--- /dev/null
+++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen3_vl_8b_singleNPU/qwen3_vl_8b_singleNPU.md
@@ -0,0 +1,265 @@
# Single-NPU Inference (Qwen3-VL-8B-Instruct)

[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen3_vl_8b_singleNPU/qwen3_vl_8b_singleNPU.md)

This document walks through single-NPU inference with the vLLM-MindSpore plugin, using the [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) model as an example. Configure the environment by following either the [Docker Installation](#docker-installation) section below or the [Installation Guide](../../installation/installation.md#安装指南), then [download the model weights](#downloading-the-model-weights). After [setting the environment variables](#setting-environment-variables), you can run [offline inference](#offline-inference) or [online inference](#online-inference) to try out single-NPU inference.

## Docker Installation

We recommend using Docker to deploy the vLLM-MindSpore plugin environment quickly. The deployment steps are as follows:

### Building the Image

Run the following command to pull the vLLM-MindSpore plugin repository:

```bash
git clone https://gitee.com/mindspore/vllm-mindspore.git
```

Then, from the root of the cloned repository, build the image according to your accelerator type:

- For Atlas 800I A2, run

    ```bash
    bash build_image.sh
    ```

- For Atlas 300I Duo, run

    ```bash
    bash build_image.sh -a 310p
    ```

A successful build ends with output like the following:

```text
Successfully built e40bcbeae9fc
Successfully tagged vllm_ms_20250726:latest
```

Here `e40bcbeae9fc` is the image ID, and `vllm_ms_20250726:latest` is the image name and tag. You can confirm that the Docker image was created successfully with:

```bash
docker images
```

### Creating a Container

After [building the image](#building-the-image), set `DOCKER_NAME` and `IMAGE_NAME` to the container name and image name, then create the container:

```bash
export DOCKER_NAME=vllm-mindspore-container # your container name
export IMAGE_NAME=vllm_ms_20250726:latest   # your image name

docker run -itd --name=${DOCKER_NAME} --ipc=host --network=host --privileged=true \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /var/log/npu/:/usr/slog \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/vnpu.cfg:/etc/vnpu.cfg \
    --shm-size="250g" \
    ${IMAGE_NAME} \
    bash
```

On success, the container ID is returned. You can confirm that the container was created successfully with:

```bash
docker ps
```

### Entering the Container

After [creating the container](#creating-a-container), use the `DOCKER_NAME` variable defined above to enter the running container:

```bash
docker exec -it $DOCKER_NAME bash
```

## Downloading the Model Weights

You can download the model in either of two ways: [with the Python tool](#downloading-with-the-python-tool) or [with git-lfs](#downloading-with-git-lfs).

### Downloading with the Python Tool

Run the following Python script to download the [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) weights and files from the [Hugging Face community](https://huggingface.co/):

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-VL-8B-Instruct",
    local_dir="/path/to/save/Qwen3-VL-8B-Instruct",
    local_dir_use_symlinks=False
)
```

Here `local_dir` is the user-specified path where the model is saved; make sure that path has enough disk space.
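If you prefer the command line, the same snapshot can also be fetched with the `huggingface-cli download` command that ships with the `huggingface_hub` package. This is shown as an optional alternative; the target directory is the same user-chosen placeholder as above:

```bash
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir /path/to/save/Qwen3-VL-8B-Instruct
```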
### Downloading with git-lfs

Run the following command to check whether the [git-lfs](https://git-lfs.com) tool is available:

```bash
git lfs install
```

If it is available, the command returns:

```text
Git LFS initialized.
```

If the tool is not available, install [git-lfs](https://git-lfs.com) first; see the [git-lfs installation](../../../faqs/faqs.md#git-lfs安装) entry in the [FAQ](../../../faqs/faqs.md) chapter.

Once the tool is confirmed to be available, download the weights with:

```bash
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
```

## Setting Environment Variables

Taking [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) as the example, set the following environment variable to select the model backend:

```bash
# set environment variables
export VLLM_MS_MODEL_BACKEND=Native # use the Native model backend
```

Explanation of the variable:

- `VLLM_MS_MODEL_BACKEND`: the model backend to run. The models and backends currently supported by the vLLM-MindSpore plugin are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md).

You can check NPU memory usage with `npu-smi info`, and select the card used for inference with the following variable:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0
```

## Offline Inference

Once the vLLM-MindSpore plugin environment is set up, you can run offline inference with the following Python script:

```python
from PIL import Image

import vllm_mindspore  # Add this line at the top of the script.
from vllm import LLM, SamplingParams

# Prompt template with the image placeholder tokens expected by Qwen3-VL.
PROMPT_TEMPLATE = (
    "<|im_start|>user\nWhat is in the image?<|vision_start|><|image_pad|>"
    "<|vision_end|><|im_end|>\n<|im_start|>assistant\n")

image_path = "xxx.jpeg"  # path to your local image


def pil_image() -> Image.Image:
    return Image.open(image_path)


inputs = [
    {
        "prompt": PROMPT_TEMPLATE,
        "multi_modal_data": {
            "image": pil_image()
        },
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95)

# Create an LLM.
llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct")
# Generate texts from the inputs. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(inputs, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}. Generated text: {generated_text!r}")
```

If the script runs successfully, it prints one line per request in the following form, where the generated text describes the contents of your input image:

```text
Prompt: '<|im_start|>user\nWhat is in the image?<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n'. Generated text: '...'
```

## Online Inference

The vLLM-MindSpore plugin can serve online inference through the OpenAI-compatible API protocol. The following uses [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) as an example to show how to [launch the service](#launching-the-service) and [send requests](#sending-requests) to obtain online inference results.

### Launching the Service

Start the vLLM service with the following command:

```bash
vllm-mindspore serve Qwen/Qwen3-VL-8B-Instruct
```

The local path where the model weights were saved can also be passed as the model argument, as shown in the sketch after this paragraph.
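For example, reusing the placeholder download path from the [Downloading the Model Weights](#downloading-the-model-weights) section (substitute your actual directory):

```bash
vllm-mindspore serve /path/to/save/Qwen3-VL-8B-Instruct
```

When a local path is used, the service identifies the model by that path, so the `"model"` field of later requests should match it.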
If the service starts successfully, output similar to the following is printed:

```text
INFO: Started server process [6363]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

In addition, the log periodically reports the service's performance statistics, for example:

```text
Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```

### Sending Requests

Send a request with the following command, where the `prompt` field is the model input:

```bash
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen3-VL-8B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}'
```

The `"model"` field must match the model name or path used when launching the service; otherwise the request cannot be matched to the model. If the request is processed successfully, a result like the following is returned:

```text
{
    "id": "cmpl-bac2b14c726b48b9967bcfc724e7c2a8",
    "object": "text_completion",
    "created": 1748485893,
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": "trying to create a virtual environment for my Python project, but I am encountering some issues with setting up",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 2,
        "total_tokens": 22,
        "completion_tokens": 20,
        "prompt_tokens_details": null
    }
}
```
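Since Qwen3-VL-8B-Instruct is a multimodal model, a request can also carry an image through the OpenAI-compatible `/v1/chat/completions` route. The sketch below is a minimal example assuming the standard OpenAI multimodal message format; the image URL is a placeholder and should be replaced with one the server can reach:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpeg"}},
            {"type": "text", "text": "What is in the image?"}
        ]
    }],
    "max_tokens": 128
}'
```

The response has the same overall shape as the completions result above, with the generated description in `choices[0].message.content`.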