Great size choice! AWQ would be greatly appreciated.

#1
by rekrek - opened

Hi, very interested in smaller Kimi models to run locally. Kimi-VL was a bit small but had great context. This should enable even greater context with MLA + linear attention and be more intelligent!

Just hoping for a thinking model later 😁

This seems like a perfect fit.

Would you kindly provide an AWQ quantization for vLLM? Quantizing it with a larger sample size and longer sequence length than what can be done locally would ensure that the AWQ version performs at its best without unneeded degradation.

The community releases GGUF conversions but rarely AWQ.

Thanks

P.S. Typo in the model card:
High Throughput: Achieves up to $6\times$ faster decoding and significantly reduces time per output token (TPOT).

EDIT: Using cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit

Nice model and attempt. But the size is a bit bizarre...

Even the quantized model is too big for a single consumer GPU (for example, 24GB).
For enterprise usage, this model is a bit small.

It would be good to have a model around 20-36B, like gpt-oss-20b or the Qwen3-30B series,
and another one around the size of gpt-oss-120b, GLM-Air (110B), or Qwen3-80B.

@sunny2038 I think this is the right size for 2x24GB (3090/L4/7900XTX), 32GB (MI100/5090) or 48GB (L40, RTX 6000, A40): a Q4 quant takes roughly 25GB, leaving the rest for a very large context window thanks to the novel linear + MLA implementation. It should also run plenty fast on CPU, and perhaps compete with Qwen3 30B-A3B while making more sense than 80B-A3B? This is great news for enthusiast consumers and small businesses that are considering Granite 4 Small for large context on budget hardware. It might also be a nice fit for the AMD 395+ SoC with 128GB RAM.
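Back-of-the-envelope math for that ~25GB figure (a rough sketch only: it assumes ~4.25 effective bits per weight for a Q4-style quant and ignores KV cache and runtime overhead):

# Rough VRAM estimate for 4-bit quantized weights (illustrative only).
total_params = 48e9            # 48B total parameters (MoE, ~3B active)
bits_per_weight = 4.25         # ~4-bit weights plus group-wise scales/zeros
weight_gib = total_params * bits_per_weight / 8 / 1024**3
print(f"{weight_gib:.1f} GiB")  # ~23.8 GiB for weights alone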

The Kimi K2 model had a refresh that helped instruction following; I hope this model gets long-term support as well, plus a CODER variant with further pre-training on lots of code and specific instruction-following data for coding agents like aider/opencode/crush/qwen-code/gemini-cli, generated with Kimi K2.

Also, official Unsloth/Axolotl support or a blog post on training this model for enhanced performance in specific tasks/domains, and on how well it absorbs a training dataset, could be a huge boon for it. There are few finetunes of Qwen3 A3B compared to other models, and MoE finetuning is not as prominent as with dense models; I hope Moonshot AI pushes this model as a training target by providing a few sample training configuration scripts and benchmarks. Targeting a minimal configuration with QLoRA and full FT on 8x GPUs?
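For the minimal end of that, a QLoRA setup could look roughly like the sketch below (assuming standard PEFT + bitsandbytes support for this architecture; the target_modules names are guesses, not an official recipe):

# Hypothetical QLoRA sketch; target_modules are guesses for this architecture.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Linear-48B-A3B-Instruct",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable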

@rekrek Indeed, it could be fast on CPU thanks to the linear attention, and undoubtedly K2 is a good model.

But quantized Qwen3 30B-A3B can run on a single 24GB GPU. 1x24GB GPU vs 2x24GB GPU is a big difference for enthusiasts.

For the AMD 395+ SoC with 128GB RAM, quantized gpt-oss-120b/GLM-4.6 Air could be a better choice.

I completely agree with the rest of your comments.

I guess they need to wait for quant libs to support this architecture.

Currently trying on a 3090 (using 19GB with the current NUM_CALIBRATION_SAMPLES = 1024, MAX_SEQUENCE_LENGTH = 1024); this should really be run on a larger dataset with more samples and longer sequences. Can anyone run it on a B200?

A mixture of datasets from eaddario/imatrix-calibration, or a sample of Kimi Linear's SFT data, would be best.
Not yet tested whether AWQ works here or even quantizes correctly.
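A hedged sketch of what such a mixture could look like with the datasets library (the second dataset ID is a placeholder, and both sources need a common column schema before interleaving):

# Illustrative calibration mixture; "your-org/code-calibration" is a placeholder ID.
from datasets import load_dataset, interleave_datasets

chat = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
code = load_dataset("your-org/code-calibration", split="train")  # placeholder
# Both datasets must share the same columns (e.g. map everything to "messages") first.
mixed = interleave_datasets([chat, code], probabilities=[0.7, 0.3], seed=42)
mixed = mixed.shuffle(seed=42).select(range(1024))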

pip install torch transformers xformers torchvision accelerate wheel --extra-index-url https://download.pytorch.org/whl/cu128 -U
pip install flash-attn --no-build-isolation
pip install llmcompressor fla-core flash-linear-attention -U

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
# from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map=None,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Select calibration dataset.
# Could use more datasets for more diverse samples, covering more languages, coding, agentic...
DATASET_ID = "neuralmagic/LLM_compression_calibration"  # only 10000 samples
DATASET_SPLIT = "train"

# Bump this up to all samples and 16k length if it doesn't OOM
NUM_CALIBRATION_SAMPLES = 1024  # 10000
MAX_SEQUENCE_LENGTH = 1024

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def enhanced_preprocess(example):
    """Enhanced preprocessing with system prompt"""
    # Check which column exists
    if "conversations" in example:
        messages_key = "conversations"
    elif "messages" in example:
        messages_key = "messages"
    else:
        raise ValueError("Dataset must contain either 'conversations' or 'messages' column")

    converted_conversations = []
    for message in example[messages_key]:
        # Handle different message formats
        if "from" in message and "value" in message:
            # Original format: {"from": "human", "value": "..."}
            role = message["from"]
            if role == "human":
                role = "user"
            elif role == "gpt":
                role = "assistant"
            content = message["value"]
        elif "role" in message and "content" in message:
            # Standard chat format: {"role": "user", "content": "..."}
            role = message["role"]
            content = message["content"]
        else:
            continue

        # Create new message in standard format
        converted_conversations.append({
            "role": role,
            "content": content
        })

    # Add system prompt if missing
    if not any(msg["role"] == "system" for msg in converted_conversations):
        converted_conversations.insert(0, {
            "role": "system",
            "content": "You are a helpful AI assistant."
        })

    return {
        "text": tokenizer.apply_chat_template(
            converted_conversations,
            tokenize=False,
        )
    }

ds = ds.map(enhanced_preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)
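
# The ignore list below keeps norms, embeddings, and the KDA/conv/router-correction
# tensors in full precision; only plain Linear layers get AWQ INT4 weights.
# (Best-effort guess for the Kimi Linear architecture, not an official recipe.)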

recipe = [
    AWQModifier(
        ignore=["norm", "layernorm", "ln", "embed", "bias", "A_log", "dt_bias", "conv1d", "e_score_correction"],
        targets=["Linear"],
        scheme="W4A16",  # 4-bit weights, 16-bit activations
    )
]

# Apply algorithms.
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
)
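
# Optional sanity check: confirm the quantized model still generates coherent text
# before saving. dispatch_for_generation() spreads the model across available GPUs;
# the prompt here is just an illustrative example.
dispatch_for_generation(model)
sample = tokenizer("Explain linear attention in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=64)
print(tokenizer.decode(output[0]))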

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

https://huggingface.co/cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit

This does work for me on two RTX 3090s with 512k context.
However, I see some bugs in the output when testing code generation: it sometimes ends mid-code, and it sometimes emits a single random token across the whole code (which seems to be appended to the end of the generated sequence, but only sometimes).
Also, only tensor parallel works; if someone got pipeline parallel working too, I guess the speed could benefit a bit.

Can you paste the vLLM command and the vLLM version you used to get it working? I can't seem to get it running with nightly vLLM on my 4x3090 machine.

I can try to reproduce my steps, but I went through quite a bit of a hassle haha.

Also note that the owner of that repo changed the README a few hours ago, so this information is most likely outdated (the main vLLM branch should work, but I cannot test this right now).

  • cloned vLLM and checked out this specific branch (from the previous README)
    git fetch origin pull/27834/head:pr-27834
    git checkout pr-27834

  • this version has some conflicts with xformers, so I deleted the xformers requirement in the cloned vLLM repo (requirements/cuda.txt and maybe pyproject.toml)
    and installed xformers manually from its official GitHub

  • then launched offline inference with tensor-parallel 2, gpu-memory-utilization 0.8 and max-num-seqs 8, which pretty much worked out of the box

Note that it might be beneficial to give vLLM 1-2 days and then try the main repo again, as this version is just not quite there in terms of speed and functionality.
Will try the newest nightly vLLM myself tomorrow.

Attempted to run on Modal last night (so, early hours of November 1) with the GitHub main build of official vLLM (which wasn't easy to build because of a weirdly specific xformers dependency, but I got through it).

Failed with this:

File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1160, in causal_conv1d_update
    assert num_cache_lines >= batch
AssertionError

A bit more detail here: https://www.reddit.com/r/kimimania/comments/1om3bkn/trying_to_evaluate_kimi_linear_48b_on_modal/

Closing this as cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit works.

Needs current vLLM built from GitHub.
Less KV cache than expected?

With 2x3090:

VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0,1 vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit --port 8111 --api-key local \
    --gpu-memory-utilization 0.95 --max-model-len 74000 --tensor-parallel-size 2 --enable-prefix-caching --max-num-batched-tokens 74000 --max-num-seqs 8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2 --served-model-name Kimi-Linear-48B-A3B-Instruct --dtype float16
(Worker_TP0 pid=1064865) INFO 11-02 19:51:56 [gpu_worker.py:348] Available KV cache memory: 1.37 GiB
(EngineCore_DP0 pid=1064801) WARNING 11-02 19:51:56 [kv_cache_utils.py:979] Add 1 padding layers, may waste at most 5.00% KV cache memory
(EngineCore_DP0 pid=1064801) INFO 11-02 19:51:56 [kv_cache_utils.py:1229] GPU KV cache size: 45,120 tokens
(EngineCore_DP0 pid=1064801) INFO 11-02 19:51:56 [kv_cache_utils.py:1234] Maximum concurrency for 74,000 tokens per request: 2.33x
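
Rough arithmetic from those log lines (just dividing the two reported numbers, so the per-token figure lumps together whatever layers actually hold per-token state):

# KV cache bytes per token, straight from the log above.
kv_bytes = 1.37 * 1024**3   # "Available KV cache memory: 1.37 GiB"
tokens = 45_120             # "GPU KV cache size: 45,120 tokens"
print(f"{kv_bytes / tokens / 1024:.1f} KiB per token")  # ~31.8 KiB/token across all layers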

Tool calling needs fixing...

rekrek changed discussion status to closed
