Are the F16 weights upcast MXFP4? -- Why no `gpt-oss-20b-MXFP4.gguf`?

#34
by rtzurtz

Follow-up question to https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/14 and https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/7:

Are the F16 weights maybe just upcast MXFP4 ones? And why does bartowski recommend using gpt-oss-20b-MXFP4.gguf (12.1 GB):

> Use this one:
> `gpt-oss-20b-MXFP4.gguf`
> The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything.

and why is everyone in https://github.com/ggml-org/llama.cpp/discussions/15396 also testing only gpt-oss-20b-MXFP4.gguf? Just as another example, lmstudio-community also only offers gpt-oss-20b-MXFP4.gguf.

Yes, I think using only MXFP4.gguf is the way to go with gpt-oss. Unsloth GGUFs aren't applicable to this model, AFAIK.
I think they made all their GGUFs anyway for completeness' sake. And perhaps the quantizations below Q4 also have value for people without enough VRAM. But if you can run Q4, it only makes sense to use the standard *-MXFP4.gguf files.

Unsloth AI org

The other MXFP4 GGUFs are actually quantized down to 8-bit, so they're not true 100% full precision. The F16 versions retain the model's full original precision. The difference shouldn't be much, but regardless, there is a difference between them.

Our MXFP4 versions (like the others) are actually the Q8 ones, while the true full precision is F16.
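
If you want to check what is actually inside each file, here is a minimal sketch (assuming the `gguf` Python package that ships with llama.cpp; the file paths are just placeholders for local downloads) that counts the per-tensor quantization types in a GGUF, so you can see for yourself which tensors are MXFP4, Q8_0, or F16 in each variant:

```python
# Minimal sketch: count per-tensor quantization types in a GGUF file.
# Assumes the `gguf` Python package (pip install gguf) that ships with llama.cpp;
# the file paths below are placeholders for wherever the GGUFs were downloaded.
from collections import Counter

from gguf import GGUFReader


def summarize_tensor_types(path: str) -> None:
    """Print how many tensors use each quantization type in the given GGUF."""
    reader = GGUFReader(path)
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    print(path)
    for type_name, n in counts.most_common():
        print(f"  {type_name}: {n} tensors")


# Compare the two variants to see which tensors actually differ:
summarize_tensor_types("gpt-oss-20b-MXFP4.gguf")
summarize_tensor_types("gpt-oss-20b-F16.gguf")
```

Based on the explanation above, you would expect the FFN expert tensors to show up as MXFP4 in both files, with most of the remaining tensors at Q8_0 in the MXFP4 GGUF and at F16 in the F16 GGUF (that's an expectation from this thread, not a verified dump).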
