Qwen3 VL 30B-A3B vLLM recipes
helm recipes
The recipes are written for Helm (mixa3607/charts/vllm), but you can easily rewrite them for Docker.
A. QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ + vLLM 0.11.0
yaml setup
image:
registry: docker.io
repository: mixa3607/vllm-gfx906
tag: 0.11.2-rocm-6.3.3-nlzy-20260309165414
extraEnvVars:
- name: VLLM_SLEEP_WHEN_IDLE
value: "1"
- name: VLLM_USE_V1
value: "1"
- name: VLLM_USE_TRITON_AWQ
value: "1"
- name: VLLM_USE_TRITON_FLASH_ATTN
value: "True"
- name: RECIPE
value: "Setup A"
app-configuration:
vllm.yaml: |-
# basics
model: QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
served-model-name:
- Qwen3-VL-30B-A3B
- QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
async-scheduling: true
# gpus setup related
max-model-len: 64K
max-num-batched-tokens: 8192
max-num-seqs: 8
tensor-parallel-size: 2
data-parallel-size: 1
gpu-memory-utilization: 0.95
# multimodality
mm-encoder-tp-mode: data
limit-mm-per-prompt.image: 16
limit-mm-per-prompt.video: 1