AssertionError when hosting via vLLM in H20x8

#23
by O-delicious - opened

While running the container with the following docker-compose service definition:

```
image: vllm/vllm-openai:nightly-e5e9067e61600eedd4e75bd1c512ec52872916aa
container_name: kimi-k2-thinking
restart: unless-stopped
network_mode: host
ipc: host
privileged: true
shm_size: 256g
entrypoint: ["vllm",
  "serve", "/data00/Kimi-K2-Thinking",
  "--gpu-memory-utilization", "0.9",
  "--tensor-parallel-size", "8",
  "--decode-context-parallel-size", "8",
  # For compatibility test
  "--served-model-name", "Kimi-K2-Thinking",
  "--enable-auto-tool-choice",
  "--tool-call-parser", "kimi_k2",
  #"--reasoning-parser", "kimi_k2",
  "--trust-remote-code",
  "--max-num-seqs", "16",
  "--max-num-batched-tokens", "32768",
  #"--kv-cache-dtype", "fp8",
  "--swap-space", "64",
  "--disable-custom-all-reduce"
]
environment:
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  #- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  #- OMP_NUM_THREADS=90
  #- MKL_NUM_THREADS=90
  #- TORCH_NUM_THREADS=90
security_opt:
  - seccomp:unconfined
stdin_open: true
tty: true
cap_add:
  - ALL
ulimits:
  memlock: -1
  stack: 67108864
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
volumes:
  - /data00/:/data00/
  - /var/run/nvidia-topologyd/:/var/run/nvidia-topologyd/
```

I got this error:

```
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] return self._call_impl(*args, **kwargs)
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1748, in _context_parallel_compute_prefill_context
(Worker_TP5 pid=409) ERROR 11-10 09:28:06 [multiproc_executor.py:718] AssertionError
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] kv_c_normed, k_pe = reorg_kvcache(
(Worker_TP5 pid=409) ERROR 11-10 09:28:06 [multiproc_executor.py:718]
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] return forward_call(*args, **kwargs)
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1062, in reorg_kvcache
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] assert reorganized_kv_c_normed.shape[0] == sum_seq_len
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File ".2", line 5, in forward
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q, x_11, key_rot_1, output_1, 'model.layers.0.self_attn.attn'); q = x_11 = key_rot_1 = output_1 = unified_mla_attention_with_output = None
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718] AssertionError
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=411) ERROR 11-10 09:28:06 [multiproc_executor.py:718]
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in call
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] return self._op(*args, **kwargs)
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 1056, in unified_mla_attention_with_output
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] self.impl.forward(
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1947, in forward
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] output[num_decode_tokens:] = self._forward_prefill(
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1826, in _forward_prefill
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] self._context_parallel_compute_prefill_context(
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1748, in _context_parallel_compute_prefill_context
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] kv_c_normed, k_pe = reorg_kvcache(
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1062, in reorg_kvcache
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] assert reorganized_kv_c_normed.shape[0] == sum_seq_len
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718] AssertionError
(Worker_TP6 pid=410) ERROR 11-10 09:28:06 [multiproc_executor.py:718]
(EngineCore_DP0 pid=268) ERROR 11-10 09:28:06 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc6.dev226+ge5e9067e6) with config: model='/data00/Kimi-K2-Thinking', speculative_config=None, tokenizer='/data00/Kimi-K2-Thinking', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Kimi-K2-Thinking, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': True, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 32, 'local_cache_dir': None},
(EngineCore_DP0 pid=268) ERROR 11-10 09:28:06 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[],
```


I dug into the source code a bit and found that the assertion in `reorg_kvcache` is only exercised when `dcp_world_size > 1`.
Here is the related issue: https://github.com/vllm-project/vllm/issues/28411
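
For context, the failing line is a shape-consistency check on the prefill context KV cache that gets re-gathered across DCP ranks. A paraphrased sketch of the invariant (hypothetical helper name and signature for illustration, not vLLM's actual implementation):

```
import torch

def reorg_kvcache_sketch(kv_chunks: list[torch.Tensor], context_lens: list[int]) -> torch.Tensor:
    # Re-concatenate the per-DCP-rank latent KV slices for the scheduled
    # prefill requests into one contiguous tensor.
    reorganized_kv_c_normed = torch.cat(kv_chunks, dim=0)
    sum_seq_len = sum(context_lens)
    # This is the invariant that trips the AssertionError in the log above:
    # the number of gathered context tokens must equal the sum of the
    # per-request context lengths the scheduler computed.
    assert reorganized_kv_c_normed.shape[0] == sum_seq_len
    return reorganized_kv_c_normed
```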

What are the suggested and verified images/versions for serving Kimi-K2-Thinking via vLLM or SGLang?
Moonshot AI org

What if you remove DCP and use plain TP=8?

> What if you remove DCP and use plain TP=8?

@youkaichao That is not an option when running on a single H20 node. The model already takes around 70 GB of memory per GPU, and since my use case requires the maximum model length, I hit GPU OOM when DCP is not configured.
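
For a rough sense of the numbers, here is a back-of-envelope KV-cache estimate. The MLA dimensions and layer count are assumed DeepSeek-V3-style values rather than figures from this thread, so check them against the model's config.json; the max length and bf16 cache dtype come from the engine config dump above.

```
# Back-of-envelope MLA latent KV-cache sizing at the maximum model length.
# ASSUMPTIONS: DeepSeek-V3-style dims (kv_lora_rank=512, qk_rope_head_dim=64,
# 61 layers) -- verify against config.json before relying on these numbers.
KV_LORA_RANK = 512
QK_ROPE_HEAD_DIM = 64
NUM_LAYERS = 61
BYTES_PER_ELEM = 2          # bf16 KV cache (kv_cache_dtype=auto)
MAX_MODEL_LEN = 262_144     # max_seq_len from the engine config dump

bytes_per_token = (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * BYTES_PER_ELEM * NUM_LAYERS
gib_per_full_seq = bytes_per_token * MAX_MODEL_LEN / 2**30
print(f"~{bytes_per_token} B/token, ~{gib_per_full_seq:.1f} GiB per full-length sequence")
# With plain TP the MLA latent cache is replicated on every rank, so this
# whole amount must fit on each GPU next to the weights; DCP shards it
# across the 8 ranks instead, which is why dropping DCP leads to OOM here.
```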

I will try patching in https://github.com/vllm-project/vllm/pull/28526 and report back with the result.
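
For anyone else verifying a patched build, a minimal smoke test against the OpenAI-compatible endpoint could look like the sketch below. It assumes the server from the compose file above is listening on the default port 8000 with no API key configured; a long prompt is probably needed to actually exercise the context-parallel prefill path that failed here.

```
from openai import OpenAI

# Default vLLM OpenAI-compatible endpoint; adjust host/port if you changed them.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Kimi-K2-Thinking",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Give a one-sentence summary of MLA attention."}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```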
