Model Card

High-quality quantization of MiniMax-M2 without using an imatrix.

Run

Currently, llama.cpp does not return the <think> token for this model. If you know how to fix that, please share in the "Community" section!

As a workaround, you can inject the token in OpenWebUI using the provided inject_think_token_filter.txt. Filters can be added via Admin Panel -> Functions -> Filter -> the + button on the right.
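
You can observe the issue directly by querying the server's OpenAI-compatible chat endpoint once it is running (a minimal check, assuming the host and port from the launch commands below; the prompt is just a placeholder). The returned message content starts with the model's reasoning but no opening <think> tag:

# Minimal request against llama-server's OpenAI-compatible API; the content
# field of the reply begins mid-reasoning, without an opening <think> tag.
curl -s http://127.0.0.1:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}'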

llama.cpp - CPU experts offload

./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.05 \
    --repeat-penalty 1.01 --repeat-last-n 64 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=CPU" \
    -ot "blk\.([1-6][0-9])\.ffn_.*_exps.*=CPU" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
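
The -ot (--override-tensor) flags route tensors by matching their names with regular expressions: attention tensors for all blocks stay on CUDA0, expert tensors for blocks 0-4 stay on CUDA0, and expert tensors for blocks 5 and up go to the CPU. If you want to verify which blocks a pattern captures before loading the model, you can test it locally (a sketch: blk.N.ffn_gate_exps.weight is just one representative expert tensor name, and 69 is an assumed upper bound on the block index):

# Enumerate candidate expert tensor names and filter them with the two
# CPU patterns combined; the matches are the blocks that stay on the CPU.
for i in $(seq 0 69); do echo "blk.$i.ffn_gate_exps.weight"; done \
    | grep -E 'blk\.([5-9]|[1-6][0-9])\.ffn_.*_exps.*'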

llama.cpp - MI50 experts offload

./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.05 \
    --repeat-penalty 1.01 --repeat-last-n 64 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(4[0-3])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(4[4-7])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(4[8-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[0-1])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[2-5])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(5[6-9])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.(6[0-9])\.ffn_.*_exps.*=CUDA0" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
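
Either launch command leaves the server listening on 127.0.0.1:8090. A model this large can take a while to load, so it helps to poll llama-server's /health endpoint before pointing a client at it (a minimal sketch using curl):

# Block until the model has finished loading, then list the served model.
until curl -sf http://127.0.0.1:8090/health > /dev/null; do sleep 5; done
curl -s http://127.0.0.1:8090/v1/models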

Quantization Recipe

Quantized with llama.cpp. See the Custom Quants section in this detailed guide for all the quantization steps.

TARGET_MODEL="MiniMax-M2-HQ4_K"
mkdir -p ~/Env/models/anikifoss/$TARGET_MODEL
./build/bin/llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    --tensor-type attn_q=Q8_0 \
    --tensor-type attn_k=Q8_0 \
    --tensor-type attn_v=Q8_0 \
    --tensor-type ffn_down_exps=Q6_K \
    --tensor-type ffn_gate_exps=Q4_K \
    --tensor-type ffn_up_exps=Q4_K \
    /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-256x4.9B-BF16-00001-of-00010.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    Q8_0 \
    32
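
The two trailing positional arguments are the fallback quantization type (Q8_0, used for any tensor not matched by one of the --tensor-type overrides above) and the thread count (32). To sanity-check the result, you can run llama.cpp's perplexity tool against a reference text (a sketch; wiki.test.raw stands in for whatever evaluation file you have locally):

./build/bin/llama-perplexity \
    --model ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    -f wiki.test.raw \
    --threads 32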