FP8-Quantized Seed-OSS Release for vLLM

DDG-64:

Hi @mratsim,
Thanks for releasing Seed-OSS-36B-Instruct-NVFP4.
I'm running it on a Blackwell device with vLLM and it works well.
Could you also publish an FP8-quantized variant?
Thanks again!

mratsim (Owner):

I'm trying to quantize the KV-cache to FP8 as well, so that the model plus quantized KV-cache fits on an RTX Pro 6000 (unsure if it will work: FP8 weights/activations with a BF16 KV-cache require 35.35 GiB + 128 GiB), but I'm currently stuck on an LLM Compressor bug: https://github.com/vllm-project/llm-compressor/issues/1881.
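For context, here is a minimal sketch of what FP8 weight/activation + FP8 KV-cache quantization looks like with LLM Compressor, modeled on the project's FP8 KV-cache example; the model id, calibration dataset, and sample counts are illustrative placeholders, not my exact script:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"  # illustrative source model

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Static FP8 (E4M3) weights + activations, plus an FP8 KV-cache scheme.
# The kv_cache_scheme section is presumably where the bug linked above bites.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    targets: ["Linear"]
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Small chat calibration set; dataset choice and sizes are illustrative.
NUM_SAMPLES, MAX_LEN = 512, 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

def tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_LEN, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# KV-cache scales need calibration, hence the dataset here.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Seed-OSS-36B-Instruct-FP8-FP8KV", save_compressed=True)
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-FP8-FP8KV")
```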

So I might just release an FP8 quant without KV-cache quantization, which is a shame given that this is the model where KV-cache quantization would be the most useful.
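The weight/activation-only FP8 path is much simpler, since the dynamic scheme needs no calibration data. A minimal sketch (model id and output path are again illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"  # illustrative source model

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8
# activations -- no calibration dataset required.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("Seed-OSS-36B-Instruct-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Seed-OSS-36B-Instruct-FP8-Dynamic")
```

Such a checkpoint could still be paired with vLLM's runtime KV-cache quantization (`--kv-cache-dtype fp8`), though without the calibrated scales a checkpoint-level scheme would provide.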
