Are the F16 weights upcast MXFP4? -- Why no `gpt-oss-20b-MXFP4.gguf`?

#34
by rtzurtz

Follow-up question to https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/14 and https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/7:

Are the F16 weights maybe just upcast MXFP4 ones? And why does bartowski recommend using gpt-oss-20b-MXFP4.gguf (12.1 GB):

> Use this one:
> `gpt-oss-20b-MXFP4.gguf`
> The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything.

and why is everyone in https://github.com/ggml-org/llama.cpp/discussions/15396 also testing only gpt-oss-20b-MXFP4.gguf? Just as another example, lmstudio-community also only offers gpt-oss-20b-MXFP4.gguf.

Yes, I think using only MXFP4.gguf is the way to go with gpt-oss. Unsloth GGUFs aren't applicable to this model, AFAIK.
I think they made all their GGUFs anyway for completeness' sake. And perhaps the quantizations below Q4 also have value for people without enough VRAM. But if you can run Q4, it only makes sense to use the standard *-MXFP4.gguf files.

Unsloth AI org

The other MXFP4 GGUFs are actually quantized down to 8-bit, so they're not true 100% full precision. The F16 versions retain the model's full original precision. The difference shouldn't be much, but regardless, there is a difference between them.

Our MXFP4 versions (like the others) are actually the Q8 ones, while the true full precision is F16.
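
If you want to check what is actually inside each file, here is a minimal sketch (assuming the `gguf` Python package that ships with llama.cpp; the file paths are just placeholders for local downloads) that counts the per-tensor quantization types in a GGUF, so you can see for yourself which tensors are MXFP4, Q8_0, or F16 in each variant:

```python
# Minimal sketch: count per-tensor quantization types in a GGUF file.
# Assumes the `gguf` Python package (pip install gguf) that ships with llama.cpp;
# the file paths below are placeholders for wherever the GGUFs were downloaded.
from collections import Counter

from gguf import GGUFReader


def summarize_tensor_types(path: str) -> None:
    """Print how many tensors use each quantization type in the given GGUF."""
    reader = GGUFReader(path)
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    print(path)
    for type_name, n in counts.most_common():
        print(f"  {type_name}: {n} tensors")


# Compare the two variants to see which tensors actually differ:
summarize_tensor_types("gpt-oss-20b-MXFP4.gguf")
summarize_tensor_types("gpt-oss-20b-F16.gguf")
```

Based on the explanation above, you would expect the FFN expert tensors to show up as MXFP4 in both files, with most of the remaining tensors at Q8_0 in the MXFP4 GGUF and at F16 in the F16 GGUF (that's an expectation from this thread, not a verified dump).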
