
Large Model Deployment Notes

I have recently deployed quite a few models at work. This post records how each model was deployed and the parameter configurations used, covering tools such as Ollama and vLLM.

2. Ollama Deployment

I started out deploying models with Ollama, partly because our intranet has an Ollama mirror so images can be downloaded directly, and partly because Ollama can run some smaller embedding models, which lets me vectorize small code repositories locally.

2.1 Environment Variables

export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_MODELS=/usr1/ollama/models
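
The variables must be exported in the shell that launches the server, since `ollama serve` reads them from its environment. A minimal sanity check before starting:

```shell
export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_MODELS=/usr1/ollama/models
# Both should appear in the environment the server will inherit
env | grep '^OLLAMA_' | sort
```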

2.2 Starting the Service

ollama serve

2.3 Model Management

curl -X POST http://localhost:11434/api/unload -H "Content-Type: application/json" -d '{"model": "GLM-4.5-Air"}'
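
The same pattern works for any model name; a small local helper (`unload_payload` is just a convenience for this sketch, not part of Ollama) that renders the request body:

```shell
# Hypothetical helper: render the JSON body used in the unload request above
unload_payload() { printf '{"model": "%s"}' "$1"; }
unload_payload GLM-4.5-Air
# → {"model": "GLM-4.5-Air"}
```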

3. vLLM Deployment

3.1 Qwen3-Coder-480B-A35B-Instruct-FP8

VLLM_USE_DEEP_GEMM=1 vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

3.2 Qwen3-235B-A22B-Instruct-2507-FP8

vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-expert-parallel \
  --host 0.0.0.0 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Performance data:

  • 43 GB used per card
  • 344 GB of VRAM used in total (out of 368 GB)
  • The context window is short
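
The first two bullets are consistent with each other:

```shell
# 8 cards at 43 GB each accounts for the reported total
echo "$(( 8 * 43 )) GB"   # → 344 GB
```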

3.3 Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix

vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
  --served-model-name Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --kv-cache-dtype fp8_e5m2 \
  --enable-chunked-prefill  \
  --max-num-batched-tokens 8192 \
  --enable-expert-parallel \
  --host 0.0.0.0 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Performance data:

  • 8 cards, 42 GB used per card; VRAM stays at 42 GB with max-model-len set to 16K or 65536, but rises to 44.7 GB per card at 131072
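
As a back-of-envelope reading of those numbers (ignoring fragmentation and vLLM's preallocation behavior), the jump from 65536 to 131072 costs about 2.7 GB per card, i.e. across all 8 cards:

```shell
# Extra VRAM across 8 cards when max-model-len goes from 65536 to 131072
awk 'BEGIN { printf "%.1f GB\n", (44.7 - 42.0) * 8 }'   # → 21.6 GB
```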

3.4 Qwen3-Embedding-8B

vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B \
  --served-model-name Qwen3-Embedding-8B \
  --tensor-parallel-size 8 \
  --task embedding \
  --host 0.0.0.0 \
  --port 8113 \
  --max-model-len 40000 \
  --max-num-batched-tokens 40000 \
  --max-num-seqs 40 \
  --gpu-memory-utilization 0.12

nohup vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B \
  --served-model-name Qwen3-Embedding-8B \
  --tensor-parallel-size 8 \
  --task embedding \
  --host 0.0.0.0 \
  --port 8113 \
  --max-model-len 40000 \
  --max-num-batched-tokens 40000 \
  --max-num-seqs 40 \
  --gpu-memory-utilization 0.12 > vllm_qwen3-embeding.log 2>&1 &

With `--tensor-parallel-size 8`, each card uses 22 GB of VRAM; with tensor-parallel-size 1, a single card uses 35 GB.
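
One plausible reading of these numbers: tensor parallelism splits the weights across cards but replicates per-card runtime overhead (CUDA context, activation buffers, etc.), so the total footprint grows with the TP size:

```shell
# Total VRAM at TP=8 versus TP=1 for the same model
echo "$(( 22 * 8 )) GB total at TP=8, versus 35 GB at TP=1"
```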

3.5 Qwen3-235B-A22B-GPTQ-Int4

vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-GPTQ-Int4 \
  --served-model-name Qwen3-235B-A22B-GPTQ-Int4 \
  --host 0.0.0.0 \
  --max-num-seqs 40 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --disable-log-requests \
  --trust-remote-code \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

3.6 GLM-4.6-GPTQ-Int4-Int8Mix

vllm serve \
    /usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
    --served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 64 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000

At 32768, each card uses 43 GB; at 131072 it is also roughly 43 GB of VRAM.

nohup vllm serve \
    /usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
    --served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 64 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000 >> vllm_glm4.log 2>&1 &

3.7 GLM-4.6-AWQ

vllm serve \
    /usr1/huggingface/models/GLM-4.6-AWQ \
    --served-model-name GLM-4.6-AWQ \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 64 \
    --max-model-len 202752 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000

nohup vllm serve \
    /usr1/huggingface/models/GLM-4.6-AWQ \
    --served-model-name GLM-4.6-AWQ \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 64 \
    --max-model-len 202752 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --enable-log-requests \
    --host 0.0.0.0 \
    --port 8000  >> vllm_glm4-awq.log 2>&1 &

#     --enable-log-outputs  \  enables logging of model outputs
# Add logging and performance monitoring
nohup vllm serve \
    /usr1/huggingface/models/GLM-4.6-AWQ \
    --served-model-name GLM-4.6-AWQ \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 64 \
    --max-model-len 202752 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --enable-log-requests \
    --host 127.0.0.1 \
    --port 8000  >> vllm_glm4-awq.log 2>&1 &

8 cards, 43 GB of VRAM used per card.

GLM-4.7-AWQ deployment

  1. Activate the environment
    
    source ~/glm4.7/bin/activate
    
  2. Set environment variables
    
    export VLLM_USE_DEEP_GEMM=0
    export VLLM_USE_FLASHINFER_MOE_FP16=1
    export VLLM_USE_FLASHINFER_SAMPLER=0
    export OMP_NUM_THREADS=4
    
  3. Start vLLM
    
    nohup vllm serve \
     /usr1/huggingface/models/GLM-4.7-AWQ \
     --served-model-name GLM-4.7-AWQ \
     --swap-space 16 \
     --max-num-seqs 32 \
     --max-model-len 202752 \
     --gpu-memory-utilization 0.93 \
     --tensor-parallel-size 8 \
     --enable-expert-parallel \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --trust-remote-code \
     --enable-log-requests \
     --host 127.0.0.1 \
     --port 8000  >> vllm_glm4-awq.log 2>&1 &
    

3.8 Qwen3-32B

vllm serve /usr1/huggingface/models/Qwen3-32B \
  --served-model-name Qwen3-32B \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8

GLM-4.6 must be stopped before deploying Qwen3-32B, otherwise there will not be enough VRAM.

3.9 Kimi-K2-Thinking

vllm serve /usr1/huggingface/models/Kimi-K2-Thinking \
  --served-model-name kimi-k2-thinking \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 32768 \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2

3.10 MiniMax-M2.1

# MiniMax-M2.1
# Note: the model is still served under the name GLM-4.6-AWQ so that other users can keep using the existing endpoint

SAFETENSORS_FAST_GPU=1 vllm serve /usr1/huggingface/models/MiniMax-M2-1 \
    --served-model-name GLM-4.6-AWQ MiniMax-M2.1 \
    --trust-remote-code \
    --enable_expert_parallel \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --max-num-seqs 64 \
    --max-model-len 196608 \
    --enable-log-requests \
    --gpu-memory-utilization 0.95 \
    --host 127.0.0.1 \
    --port 8000

3.11 MiniMax-M2.5

# MiniMax-M2.5
# Note: the model is still served under the name GLM-4.6-AWQ so that other users can keep using the existing endpoint
SAFETENSORS_FAST_GPU=1 vllm serve \
    /usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
    --served-model-name GLM-4.6-AWQ  \
    --enable_expert_parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-log-requests \
    --host 127.0.0.1 \
    --port 8000 
    

Run in the background:

SAFETENSORS_FAST_GPU=1 nohup vllm serve \
    /usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
    --served-model-name GLM-4.6-AWQ  \
    --enable_expert_parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-log-requests \
    --host 127.0.0.1 \
    --port 8000 >> vllm_minimax2.5-awq.log 2>&1 &

This runs successfully.

3.12 Qwen3.5-397B-A17B

# Activate the vllm environment
source vllm-qwen3.5-plus/bin/activate
# Install the vLLM nightly build and the matching transformers as described at https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ
# Load CUDA 12
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cublas/lib:$LD_LIBRARY_PATH
# Deployment command provided on that page
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=16

# Note: the model is still served under the name GLM-4.6-AWQ so that other users can keep using the existing endpoint
vllm serve \
    /usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
    --served-model-name GLM-4.6-AWQ \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 202752  \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code \
    --host 127.0.0.1 \
    --port 8000
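
`--speculative-config` takes a JSON string, and a malformed value only fails at startup; it can be checked beforehand (assuming `python3` is on the PATH):

```shell
# Validate the speculative-decoding config before passing it to vllm serve
echo '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' | python3 -m json.tool
```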

Background deployment with nohup:

nohup vllm serve \
    /usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
    --served-model-name GLM-4.6-AWQ \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 202752  \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code \
    --host 127.0.0.1 \
    --port 8000 > vllm_qwen3.5.log 2>&1 &

4. Docker Container Deployments

4.1 Open-WebUI Deployment

# Start the container
sudo docker run -d -p 0.0.0.0:3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  -e OPENAI_API_BASE_URL=http://10.44.151.54:8000/v1 \
  ghcr.io/open-webui/open-webui:main

# Stop and remove the container
docker stop open-webui
docker rm open-webui

4.2 Lobe-Chat Deployment

sudo docker run -d -p 0.0.0.0:3210:3210 \
  --name lobe-chat \
  --restart always \
  -e OPENAI_PROXY_URL=http://10.44.151.54:8111/v1 \
  -e OPENAI_MODEL_LIST="-all,+GLM-4.6-AWQ" \
  -e DEFAULT_AGENT_CONFIG="model=GLM-4.6-AWQ" \
  lobehub/lobe-chat:latest


# Stop and remove the container
sudo docker stop lobe-chat
sudo docker rm lobe-chat

This post is licensed under CC BY 4.0 by the author.