FP8-Quantized Seed-OSS Release for vLLM
#1
opened by DDG-64
I'm trying to quantize the KV-cache to FP8 as well, so that the model plus the quantized KV-cache fits on an RTX Pro 6000 (unsure whether it will fit even then: FP8 weight/activation with a BF16 KV-cache requires 35.35 GiB + 128 GiB), but I'm currently stuck on an llm-compressor bug: https://github.com/vllm-project/llm-compressor/issues/1881. A rough sketch of the recipe is below.
So I might just release the FP8 quants without KV-cache quantization, which is too bad given that this is the model where KV-cache quantization would be the most useful.
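For context, here's a minimal sketch of the kind of recipe involved, adapted from llm-compressor's FP8 KV-cache example rather than my exact script: FP8 weights/activations on the Linear layers plus a static FP8 `kv_cache_scheme` (the part that trips the bug above). The Seed-OSS checkpoint ID, calibration dataset, and sample counts are placeholders to adjust.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

# Placeholder checkpoint ID; swap in the exact Seed-OSS model you want to quantize.
MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data is needed because the KV-cache scales are static (dynamic: false).
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# FP8 weights + activations on Linear layers (lm_head excluded), plus FP8 KV-cache scales.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save in compressed-tensors format so vLLM can load it directly.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-KV"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Dropping the `kv_cache_scheme` block is what gives the plain FP8 weight/activation quants mentioned above; the bug only shows up once the KV-cache scheme is included.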