Error installing from PR branch

#1
by DrRos - opened

got this while trying to install vllm:

ERROR: Could not find a version that satisfies the requirement xformers==0.0.33+5d4b92a5.d20251029; platform_system == "Linux" and platform_machine == "x86_64" (from vllm) (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.0.9, 0.0.10, 0.0.11, 0.0.12, 0.0.13, 0.0.16rc424, 0.0.16rc425, 0.0.16, 0.0.20, 0.0.21, 0.0.22, 0.0.22.post7, 0.0.23, 0.0.23.post1, 0.0.24, 0.0.25, 0.0.25.post1, 0.0.26.post1, 0.0.27, 0.0.27.post1, 0.0.27.post2, 0.0.28, 0.0.28.post1, 0.0.28.post2, 0.0.28.post3, 0.0.29, 0.0.29.post1, 0.0.29.post2, 0.0.29.post3, 0.0.30, 0.0.31, 0.0.31.post1, 0.0.32.post1, 0.0.32.post2, 0.0.33.dev1089, 0.0.33.dev1090)
ERROR: No matching distribution found for xformers==0.0.33+5d4b92a5.d20251029; platform_system == "Linux" and platform_machine == "x86_64"

same with uv:

(vllm.313) drros@tesla:~/vllm.313/vllm$ VLLM_USE_PRECOMPILED=1 uv pip install -e .
Using Python 3.13.6 environment at: /home/drros/vllm.313/.venv
  × No solution found when resolving dependencies:
  ╰─▶ Because there is no version of xformers{platform_machine == 'x86_64' and sys_platform == 'linux'}==0.0.33+5d4b92a5.d20251029 and vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled depends on xformers{platform_machine == 'x86_64' and sys_platform == 'linux'}==0.0.33+5d4b92a5.d20251029, we can conclude that vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled cannot be used.
      And because only vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled is available and you require vllm, we can conclude that your requirements are unsatisfiable.

Solved by commenting out xformers in requirements/cuda.txt and manually installing xformers==0.0.33.dev1090.
There are some issues with the model: tool calling did not work for me (the vllm log shows an error - [serving_chat.py:256] RuntimeError: Kimi-K2 Tool parser could not locate tool call start/end tokens in the tokenizer!), and the model runs slower than one would expect for 3B active parameters - I get 25-30 tps of token generation. This is with dual A5000s, running with -tp 2.
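For anyone who wants to reproduce this, the workaround amounts to roughly the following (the sed pattern is just an illustration; edit requirements/cuda.txt by hand if it does not match how xformers is pinned there):

# comment out the pinned xformers line in the CUDA requirements
sed -i 's/^xformers/# xformers/' requirements/cuda.txt

# install a published xformers dev build instead of the local-version pin
pip install xformers==0.0.33.dev1090

# then install vllm from the PR checkout
VLLM_USE_PRECOMPILED=1 pip install -e .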

cyankiwi org

Thank you for trying the model. The problem also occurs on the vllm main branch.

While this problem persists, I would recommend the following as a quick fix:

# Install vllm without dependencies
VLLM_USE_PRECOMPILED=1 pip install --no-deps .
                       
# Install all other requirements except xformers
pip install -r requirements/common.txt
pip install numba==0.61.2
pip install "ray[cgraph]>=2.48.0"
pip install torch==2.9.0
pip install torchaudio==2.9.0
pip install torchvision==0.24.0
pip install flashinfer-python==0.4.1
                                    
# Install xformers WITHOUT its dependencies to prevent version changes
pip install --no-deps xformers==0.0.33.dev1090
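
A quick sanity check that the pins ended up as intended (just an example; the exact versions will differ by commit):

pip show vllm xformers torch flashinfer-python
python -c "import vllm, xformers; print(vllm.__version__, xformers.__version__)"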

Can confirm it works with two RTX 3090s at tp 2 with 30 t/s.
However, pipeline parallel (which I suppose would run faster) gives an error like "intermediate tensors is None".
Also, the tokenization or generation is a bit weird: when generating code it sometimes stops in the middle, and sometimes it randomly repeats the exact same token.

Inside HTML code, for example, it keeps appending "9px}" randomly throughout the code.

cyankiwi org

@DrRos @ztsvvstz I really appreciate your feedback. In models with a hybrid linear attention architecture, I keep the attention layers at BF16 precision for higher model accuracy, and this might be the reason for the slow inference speed.

I will keep this in mind in future model quantizations.

cpatonn -- This is also your account? I guess I have 2 places to look for updates.

Alright alright, I got some more info for ya :)
So with the newest vllm it's now quite fast at ~70 t/s, buuuut...
It only outputs "!!!!!!!!",
occasionally with some other random tokens in between, but mostly "!!!!!!!!!!!!!!!".
Honestly if we get this working properly at this speed I'd be quite happy ;p
I used the same params as before

cyankiwi org

@itsmebcc Yes, thank you for using my quant so far! I am starting to migrate from my HF account to this org account :)

cyankiwi org

@ztsvvstz Thank you for the info. May I ask which revision you are on? Could you try the latest commit, i.e., 30a14b034fa387470a512e8004527ad1c28af303?
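
In case it helps, updating an existing PR checkout to that commit and reinstalling looks roughly like this (assuming the checkout's remote already points at the PR branch, and reusing the --no-deps route from above):

git fetch origin
git checkout 30a14b034fa387470a512e8004527ad1c28af303
VLLM_USE_PRECOMPILED=1 pip install --no-deps .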

Didn't keep track of which commit I was on, sorry o:
But I can confirm that the newest vllm version does work without problems (so far).
Will do more speed/consistency tests later, but it seems like bugs such as stopping early mid code generation do not happen anymore :)
I would be particularly interested in pipeline parallel (did not test this yet with the newest version),
as in my experience it allows for higher t/s than tensor parallel.
Currently I'm at a throughput of ~1.5 GB/s on the PCIe lanes for my two GPUs and I suspect that to be quite a bottleneck (qwen3-next, for example, runs pretty fast at 110 t/s with pipeline parallel on 3 GPUs).
Thanks for your work, appreciate the fast responses
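
For reference, per-GPU PCIe throughput can be watched with something like the following (flags as I recall them; check nvidia-smi dmon -h, and note the rx/tx columns are in MB/s):

nvidia-smi dmon -s t -d 1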

Pipeline parallel throws this error:

"gpu_model_runner.py", line 2007, in sync_and_slice_intermediate_tensors
assert self.intermediate_tensors is not None
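
For anyone trying to reproduce it, a pipeline-parallel launch looks roughly like this (the model path is a placeholder and the other flags are just an example):

vllm serve /path/to/kimi-awq --pipeline-parallel-size 2 --tensor-parallel-size 1 --trust-remote-code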

I have it working pretty well.

I am running with this:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/owner/stuff/kimi-awq --tensor-parallel-size 2 --pipeline-parallel-size 1 --max-model-len 121000 --trust-remote-code --max-num-seqs 1 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enable-expert-parallel

on:

  • python==3.11.14 (conda env vllm-kimi)
  • vllm==0.11.1rc6.dev8+g09c9d32a5 (editable build from PR 27834)
  • torch==2.9.0+cu128, torchvision==0.24.0+cu128, torchaudio==2.9.0+cu128
  • xformers==0.0.33+5d4b92a5.d20251029 (custom wheel built from commit 5d4b92a5)
  • fla-core==0.4.0
  • transformers==4.57.1, huggingface-hub==0.36.0, tokenizers==0.22.1, sentencepiece==0.2.1
  • flashinfer-python==0.4.1
  • numpy==2.2.6, scipy==1.16.3, ray==2.51.0, cuda-python==13.0.3

with a 3090 + 4090.

Pipeline parallel does not work currently with KimiLinearForCausalLM

I have it working pretty well.

I am running with this:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/owner/stuff/kimi-awq --tensor-parallel-size 2 --pipeline-parallel-size 1 --max-model-len 512000 --trust-remote-code --max-num-seqs 2 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enable-expert-parallel

on:

  • python==3.11.14 (conda env vllm-kimi)
  • vllm==0.11.1rc6.dev8+g09c9d32a5 (editable build from PR 27834)
  • torch==2.9.0+cu128, torchvision==0.24.0+cu128, torchaudio==2.9.0+cu128
  • xformers==0.0.33+5d4b92a5.d20251029 (custom wheel built from commit 5d4b92a5)
  • fla-core==0.4.0
  • transformers==4.57.1, huggingface-hub==0.36.0, tokenizers==0.22.1, sentencepiece==0.2.1
  • flashinfer-python==0.4.1
  • numpy==2.2.6, scipy==1.16.3, ray==2.51.0, cuda-python==13.0.3

with a 3090 + 4090.

Pipeline parallel does not work currently with KimiLinearForCausalLM
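
Since the commands above enable --enable-auto-tool-choice and --tool-call-parser kimi_k2, a minimal tool-call request against the OpenAI-compatible endpoint would look roughly like this (port, tool schema and prompt are just examples):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/owner/stuff/kimi-awq",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }],
        "tool_choice": "auto"
      }'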

at what speeds? :)
Did some tests now, and it seems the 30 t/s I mentioned earlier only happens with pretty much no context (no chat template etc., just checking whether the model responds at all).
When applying the chat template with a proper prompt, the speed drops to:
Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:04<00:00, 4.42s/it, est. speed input: 4452.57 toks/s, output: 1.81 toks/s]

~40 tokens/s
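
For more comparable numbers than a single interactive prompt, a fixed-length benchmark run helps. Recent vLLM builds ship a bench subcommand; the flags below are from memory, so please verify with --help:

vllm bench throughput --model /path/to/kimi-awq --trust-remote-code --tensor-parallel-size 2 --input-len 4096 --output-len 512 --num-prompts 8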
