A 0.6B-parameter draft (speculative decoding) model for use with GLM-4.5, GLM-4.5-Air and GLM-4-32B-0414.

See GLM-4.5-DRAFT-0.6B-v3.0 for the models in transformers format, and a detailed explanation of how the model was created.

I've included the Q4_0 quants for 3 different context lengths:

NOTES:

The 14 attention heads of Qwen2.5-0.5B don't allow any of the other 4-bit quants to be made (and experimentation has shown that using more or fewer than 4 bits for speculative decoding is a waste of time anyway).
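
As a rough illustration of why the head count matters, here is a minimal sketch of the divisibility check. It assumes a head dimension of 64 (so rows of 14 × 64 = 896 elements) and block sizes of 32 elements for Q4_0 and 256 for a K-quant such as Q4_K. None of these figures is stated in this README, so treat them as assumptions rather than a description of llama.cpp's actual checks.

```python
# Assumptions (not taken from this README):
#   HEAD_DIM    - head dimension of Qwen2.5-0.5B
#   BLOCK_SIZES - elements per quantization block for each format
NUM_HEADS = 14
HEAD_DIM = 64
BLOCK_SIZES = {"Q4_0": 32, "Q4_K": 256}


def row_fits(row_len: int, block_size: int) -> bool:
    """Return True if a row splits into whole quantization blocks."""
    return row_len % block_size == 0


hidden_size = NUM_HEADS * HEAD_DIM  # 14 * 64 = 896

for name, block in BLOCK_SIZES.items():
    print(f"{name}: block={block}, fits={row_fits(hidden_size, block)}")
```

Under these assumptions, 896 divides evenly by 32 but not by 256, which is consistent with Q4_0 being the quant that works here.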
Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only use the longer-context versions when you actually need to process long contexts.
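
To make the "static" point concrete, here is a small sketch of how a fixed scaling factor behaves. The function name, the 32768-token base context and the three extended lengths are hypothetical values chosen for illustration. They are not read from the GGUF files or from llama.cpp.

```python
def static_yarn_factor(extended_ctx: int, original_ctx: int) -> float:
    """Scaling factor fixed by the chosen context version.

    Note that the prompt length is not an input: with a static factor,
    a short prompt is scaled exactly as much as a long one.
    """
    return extended_ctx / original_ctx


ORIGINAL_CTX = 32768  # hypothetical base context length

for extended_ctx in (32768, 65536, 131072):  # hypothetical context versions
    factor = static_yarn_factor(extended_ctx, ORIGINAL_CTX)
    for prompt_len in (512, 20000):
        print(f"ctx={extended_ctx} prompt={prompt_len} factor={factor}")
```

In this sketch a 512-token prompt run against the longest version still gets the full factor, which is why the shortest version that covers your workload is the safer pick.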