[vLLM] Stuck at "Waiting for output from MQLLMEngine"

#72
by slemouellic - opened

Hello, :)

I'm trying to run vLLM in a Docker container with the model Mistral-Small-3.1-24B-Instruct-2503 on 2 GPUs, but startup hangs indefinitely, repeating this message:

DEBUG [client.py:193] Waiting for output from MQLLMEngine.

Setup

  • GPUs: 2 (A100 80GB)
  • Docker image: vllm/vllm-openai:v0.8.4

Docker command

docker run -d \
  --name vllm_service \
  --gpus '"device=0,1"' \
  --shm-size=64g \
  --env-file .env \
  -p 4445:4445 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --network bridge \
  vllm/vllm-openai:v0.8.4 \
  --host 0.0.0.0 \
  --model "mistralai/Mistral-Small-3.1-24B-Instruct-2503" \
  --gpu-memory-utilization 0.95 \
  --dtype float16 \
  --port 4445 \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --max-model-len 30800 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit_mm_per_prompt "image=10"

Environment variables

NVIDIA_VISIBLE_DEVICES=all
NCCL_MAX_SOCKET_BUFFERS=16384
PYTHONUNBUFFERED=1
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
NCCL_DEBUG=INFO

Logs

NCCL initializes properly:

  • Ranks and GPUs are assigned
  • All channels and rings are connected
  • No errors in the logs

The process loops forever with:

DEBUG [client.py:193] Waiting for output from MQLLMEngine.

The server never becomes ready.
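
(For reference, readiness can be checked from the host with the OpenAI-compatible models endpoint on the mapped port from the docker run above; it should list the served model once the engine is up:)

# From the host; lists the served model once the server is ready
curl http://localhost:4445/v1/models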

I’ve tried a bunch of different things (see environment variables above), but nothing seems to work...
Any idea what might be causing this or how to fix it?

Thanks in advance!

More logs:

INFO 04-22 02:26:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-22 02:26:30 [config.py:2832] Downcasting torch.float32 to torch.float16.
INFO 04-22 02:26:46 [config.py:689] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 04-22 02:26:48 [arg_utils.py:1731] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. 
INFO 04-22 02:26:49 [config.py:1713] Defaulting to use mp for distributed inference
WARNING 04-22 02:26:49 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-22 02:26:49 [api_server.py:246] Started engine process with PID 107
INFO 04-22 02:26:53 [__init__.py:239] Automatically detected platform cuda.
INFO 04-22 02:26:54 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='mistralai/Mistral-Small-3.1-24B-Instruct-2503', speculative_config=None, tokenizer='mistralai/Mistral-Small-3.1-24B-Instruct-2503', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=30800, download_dir=None, load_format=LoadFormat.MISTRAL, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=mistralai/Mistral-Small-3.1-24B-Instruct-2503, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
WARNING 04-22 02:26:56 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 30 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-22 02:26:57 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-22 02:26:57 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:26:59 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
fe249c782f4f:107:107 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
fe249c782f4f:145:145 [1] NCCL INFO cudaDriverVersion 12040
fe249c782f4f:145:145 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO NET/IB : No device found.
fe249c782f4f:107:107 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:107:107 [0] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:107:107 [0] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO NET/IB : No device found.
fe249c782f4f:145:145 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:145:145 [1] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO comm 0x135efc30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO comm 0x3f5d86f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO Channel 00/02 :    0   1
fe249c782f4f:145:145 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
fe249c782f4f:107:107 [0] NCCL INFO Channel 01/02 :    0   1
fe249c782f4f:145:145 [1] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:107:107 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fe249c782f4f:107:107 [0] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:145:145 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:107:107 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:145:145 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Connected all rings
fe249c782f4f:107:107 [0] NCCL INFO Connected all rings
fe249c782f4f:145:145 [1] NCCL INFO Connected all trees
fe249c782f4f:107:107 [0] NCCL INFO Connected all trees
fe249c782f4f:145:145 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:145:145 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:107:107 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
INFO 04-22 02:27:04 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_7c6fd185'), local_subscribe_addr='ipc:///tmp/dd3a6d06-f420-4cc7-8b9d-d6a3fd1115a3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-22 02:27:17 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
DEBUG 04-22 02:27:20 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:30 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:40 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:51 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:01 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:11 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:21 [client.py:193] Waiting for output from MQLLMEngine.
...

Same issue here.

It looks like vLLM is still downloading the model files. If you download the model first via the Hugging Face CLI, it won't hang there.
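
For example, something like this on the host before starting the container (assuming the huggingface_hub CLI is installed; the weights land in the same ~/.cache/huggingface directory that the docker run mounts into the container):

# Install the Hugging Face CLI if needed
pip install -U "huggingface_hub[cli]"

# Log in first if the repository requires accepting its terms
huggingface-cli login

# Pre-download the weights into the default Hugging Face cache
huggingface-cli download mistralai/Mistral-Small-3.1-24B-Instruct-2503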

Have you tried trust remote code?
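
For example (just a guess, not a confirmed fix; the engine log above shows trust_remote_code=False), a trimmed version of the same docker run with --trust-remote-code added and the other flags unchanged:

docker run -d \
  --name vllm_service \
  --gpus '"device=0,1"' \
  --shm-size=64g \
  -p 4445:4445 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.8.4 \
  --host 0.0.0.0 \
  --port 4445 \
  --model "mistralai/Mistral-Small-3.1-24B-Instruct-2503" \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral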

Frostyseppo's tip worked for me (thanks!)