High schooler by day, LLM builder by night. Driven by a deep love for both Physics and AI. Currently spending my runtime building on Hugging Face, experimenting with transformer architectures, and training custom LLMs.
Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.
Result: 5.33x smaller KV cache vs FP16, with 0.0916 mean KLD against a 16-bit KV cache reference on WikiText-2.
The important part: this is not just post-training KV cache quantization. Instead, we take a BitNet-style approach and train for the low-bit regime from the start.
KV1 is trained with a 2-bit-aware K/V path. Instead of training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint, closer in spirit to the BitNet idea of training for the low-bit regime.
During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.
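To make the packing step concrete, here is a minimal, dependency-free sketch of affine 2-bit quantization with four codes per byte. It is not the KV1 implementation: the function names, the per-vector min/max scaling, and the bit order inside each byte are all assumptions made for illustration.

```python
def quantize_2bit(vec):
    """Map floats to 2-bit codes (0..3) using a per-vector affine scale and offset.

    Assumption: min/max scaling over the whole vector; the real model may group differently.
    """
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # 4 levels means 3 steps; fall back to 1.0 for constant vectors
    codes = [min(3, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def pack_codes(codes):
    """Pack four 2-bit codes into each byte (assumed order: first code in the lowest bits)."""
    padded = codes + [0] * (-len(codes) % 4)  # pad to a multiple of four
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_codes(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n]


def dequantize(codes, scale, lo):
    """Map codes back to approximate float values."""
    return [lo + c * scale for c in codes]


vec = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.2, -0.9, 0.7]
codes, scale, lo = quantize_2bit(vec)
packed = pack_codes(codes)          # 8 values fit in 2 bytes
restored = dequantize(unpack_codes(packed, len(vec)), scale, lo)
```

Each restored value lands within half a quantization step of the original, which is the error the model has to learn to tolerate during training.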
WikiText-2 eval vs 16-bit KV cache reference:
Mean KLD: 0.0916 nats/token
Mean KLD: 0.1322 bits/token
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675
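As a quick sanity check on these numbers, the sketch below converts the KLD from nats to bits and works out how many bits per cached value a 5.33x shrink implies. The variable names are illustrative, and reading the extra bit as quantization metadata (such as scales and offsets) is my own interpretation, not something stated in the release.

```python
import math

kld_nats = 0.0916
kld_bits = kld_nats / math.log(2)  # 1 nat = 1/ln(2) bits

shrink = 5.33
effective_bits = 16 / shrink        # bits stored per cached value, all overhead included
overhead_bits = effective_bits - 2  # what is left after the 2-bit codes themselves

print(f"{kld_bits:.4f} bits/token")
print(f"{effective_bits:.2f} effective bits per value, {overhead_bits:.2f} of them overhead")
```

A pure 2-bit cache would be 8x smaller than FP16, so roughly one extra bit per value is being spent somewhere, plausibly on the affine parameters.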
If this approach is adopted in models like Qwen or Gemma, it may become possible to run 128K or even 256K context on a normal machine! Try it here: BananaMind/BananaMind-KV1-8M-2Bit-Experimental
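For a rough sense of what that would mean in memory terms, here is a back-of-the-envelope estimate. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not the published spec of any Qwen or Gemma model, and `kv_cache_bytes` is a helper made up for this sketch.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size in bytes for one sequence."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # factor 2: keys and values
    return values * bits_per_value / 8


# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 128 * 1024

fp16 = kv_cache_bytes(seq_len, bits_per_value=16, **cfg)
low_bit = kv_cache_bytes(seq_len, bits_per_value=16 / 5.33, **cfg)  # 5.33x shrink from the eval

print(f"FP16 cache:  {fp16 / 2**30:.1f} GiB")
print(f"2-bit cache: {low_bit / 2**30:.1f} GiB")
```

Under these assumptions a 128K-token cache drops from about 16 GiB to about 3 GiB, which is the difference between needing a datacenter GPU and fitting on a consumer card.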
Created a research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth. The "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.
Also, the original SpikeWhale project, the one that sparked all the other SpikeWhale-related projects. Every spiking primitive here is hand-written in plain PyTorch: the leaky integrate-and-fire (LIF) neuron dynamics, the fast-sigmoid surrogate gradient, and the backprop-through-time training loop. No snntorch, no spikingjelly, no norse, no bindsnet: the network is a genuine from-scratch SNN.
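For readers new to spiking networks, here is a minimal sketch of the two core primitives named above. It is written in dependency-free Python rather than PyTorch so it runs anywhere, and it is not the SpikeWhale code: the decay `beta`, the threshold, the `slope`, and the subtract-threshold reset are all assumed values and choices.

```python
def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    Returns the spike train and the membrane potential at each step (before reset).
    """
    mem, spikes, trace = 0.0, [], []
    for current in inputs:
        mem = beta * mem + current             # leak, then integrate the input
        spike = 1 if mem >= threshold else 0   # hard threshold (Heaviside step)
        trace.append(mem)
        spikes.append(spike)
        mem -= spike * threshold               # assumed soft reset: subtract the threshold
    return spikes, trace


def fast_sigmoid_grad(mem, threshold=1.0, slope=25.0):
    """Surrogate for d(spike)/d(mem): the derivative of x / (1 + slope * |x|)."""
    x = mem - threshold
    return 1.0 / (1.0 + slope * abs(x)) ** 2


spikes, trace = lif_forward([0.5] * 6)
surrogate = [fast_sigmoid_grad(m) for m in trace]
```

The hard threshold has zero gradient almost everywhere, so during backprop-through-time the surrogate is used in its place; it peaks at 1 when the membrane potential sits exactly at the threshold and falls off quickly on either side.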
A new model is coming! It's going to take a long time to train on my 5070 Ti, so expect a release in about a month. We think this model is going to be SOTA for its size. The Mini version will have 25M parameters and the Pro version 140M. The Pro version has a 3072-token context window (extensible up to 6K with RoPE), and the Mini version has a 4096-token context window (up to 8K with RoPE). Meanwhile, we are working on an Instruct version of our BananaMind 1.5 Base.
OLD POST CONTENTS EDITED: We have a new AI model and are currently training it on 27B tokens; training is 8% done. Follow us to be notified when it releases. Some info: Parameters: 75M. GPU: RTX Pro 6000. We expect to be able to release it in the coming days: https://huggingface.co/BananaMind/BananaMind-1.5-Base