Q4_K Quant of DeepSeek-R1 for the llama.cpp MLA pull request (#11446)

Requires a custom build of llama.cpp that includes this pull request:

https://github.com/ggerganov/llama.cpp/pull/11446
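
If you want to build that PR branch yourself, here is a minimal sketch. It assumes the PR ref is still fetchable from GitHub; the local branch name "mla-11446" is just a label, and the cmake invocation is the same as in the script further down:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# fetch the PR head into a local branch and build it
git fetch origin pull/11446/head:mla-11446
git checkout mla-11446
cmake -B build
cmake --build build --config Release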

** IMPORTANT NOTE **

If you try to load this with the main branch of llama.cpp, you'll see an error like this:

load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 1147, got 1025
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv    load_model: failed to load model, '/mount/checkpoints/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
terminate called without an active exception
Aborted (core dumped)
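
With the custom build from the PR above, the same model should load cleanly. A minimal sketch of serving it (the binary path assumes you built in ./llama.cpp/build as in the script below, and the split filename is a placeholder; point llama-server at the first split and the remaining splits are picked up automatically):

./llama.cpp/build/bin/llama-server \
    -m /mount/checkpoints/<FIRST_SPLIT_OF_DeepSeek-R1-11446-Q4_K>.gguf \
    -c 8192 --port 8080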

There's a Q3_K_M version here: daydream-org/DeepSeek-R1-GGUF-11446
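
To fetch one of these split GGUF repos, something like this should work with the Hugging Face CLI (the --include pattern and target directory are just examples; the same pattern applies to this Q4_K repo):

huggingface-cli download daydream-org/DeepSeek-R1-GGUF-11446 \
    --include "*.gguf" --local-dir DeepSeek-R1-GGUF-11446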

Created using the script below, written by evshiron:

export WORK_DIR=$(pwd)
python3 -m venv venv
source venv/bin/activate
pip3 install -U "huggingface_hub[cli]"

# the fp8 checkpoints are around 700GB
mkdir checkpoints
huggingface-cli download --resume-download --local-dir checkpoints/DeepSeek-R1 deepseek-ai/DeepSeek-R1

# my fork of llama.cpp, which includes PR #11446 plus changes that allow converting the fp8 HF checkpoints to bf16 GGUF directly using triton(-cpu), without intermediate checkpoints
git clone https://github.com/evshiron/llama.cpp --recursive
pushd llama.cpp
pip3 install -r requirements/requirements-convert_hf_to_gguf.txt
cmake -B build
cmake --build build --config Release
popd

# install triton-cpu for cpu-only dequant
git clone https://github.com/triton-lang/triton-cpu --recursive
pushd triton-cpu
pip3 install ninja cmake wheel pybind11
MAX_JOBS=32 pip3 install -e python
popd
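
# optional: quick sanity check that the triton-cpu fork is importable (it installs as the regular `triton` package)
python3 -c "import triton; print(triton.__version__)"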

# hopefully this just works; it takes an hour or more depending on your hardware, and the bf16 checkpoints are around 1.3TB
# the dequant process may need more than 64GB of RAM, but should fit within 360GB
python3 llama.cpp/convert_hf_to_gguf.py --outtype bf16 --split-max-size 50G checkpoints/DeepSeek-R1

# removing the fp8 checkpoints gives us 700GB back
mkdir checkpoints/DeepSeek-R1-BF16
mv checkpoints/DeepSeek-R1/*.gguf checkpoints/DeepSeek-R1-BF16
rm -r checkpoints/DeepSeek-R1

# then use llama-quantize to make the quants you want; Q4_K_M should come out to roughly 400GB
./llama.cpp/build/bin/llama-quantize --keep-split checkpoints/DeepSeek-R1-BF16/<THE_FIRST_OF_DeepSeek-R1-BF16_GGUF>.gguf Q4_K_M
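
# a sketch of the same step with an explicit output path (the second positional argument,
# before the quant type); the output directory and filename here are just placeholders
mkdir -p checkpoints/DeepSeek-R1-Q4_K_M
./llama.cpp/build/bin/llama-quantize --keep-split \
    checkpoints/DeepSeek-R1-BF16/<THE_FIRST_OF_DeepSeek-R1-BF16_GGUF>.gguf \
    checkpoints/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M.gguf \
    Q4_K_M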

The whole process took 16 hours on an EC2 instance, so I figured I'd share the result.

Script Credit/Source: daydream-org/DeepSeek-R1-GGUF-11446

GGUF · 672B params · deepseek2 architecture

Model tree: gghfez/DeepSeek-R1-11446-Q4_K, quantized from deepseek-ai/DeepSeek-R1