DAT Byte Small (200M)
DAT Byte is a family of byte-level Differential-Attention Transformers, trained from scratch on an RTX 5090.
This model is the smallest in the family, with approximately 200 million parameters.
It was trained on a mix of Discord chat data, public domain books, and English Bible translations. Larger models in the family received a larger and more diverse training set.
Training Data
As the smallest DAT Byte model, this version was trained on less data than its larger family members. The training data was composed exclusively of the following sources:
- Gutenberg English — English books in the public domain (about 20GB total)
- OpenDiscord — Discord dumps in ChatML format
- Proprietary Discord dumps (similar structure and tone to OpenDiscord)
- A diverse set of public domain English Bible translations (~34MB total)
Only the datasets listed above were used. Data was shuffled during training, with Gutenberg English sampled roughly 35% of the time (an average estimate).
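For illustration, here is a minimal sketch of how such a sampling mixture could be implemented. Only the ~35% Gutenberg weight comes from the estimate above; the split across the remaining sources is a made-up placeholder, not the actual data loader.

```python
import random

# Hypothetical mixture weights; only the ~35% Gutenberg figure comes from the
# description above. The rest is an illustrative split.
SOURCES = {
    "gutenberg_english": 0.35,
    "open_discord": 0.30,
    "private_discord": 0.25,
    "bible_translations": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example according to the mixture."""
    names, weights = zip(*SOURCES.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_source(rng))  # e.g. "open_discord"
```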
The Discord datasets (combined ~693MB) were formatted in ChatML, with usernames serving as speaker roles, which lets the model learn natural dialogue structure and dynamics. The Discord data covers many topics, including a fair amount of code, so the model picks up the basic syntax patterns of some common programming languages. However, because it was not trained on large-scale, high-quality code samples, any generated code is unlikely to be reliable or production-quality.
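As a rough sketch of what this preprocessing might look like, the snippet below renders a Discord exchange in ChatML style with usernames as roles and maps it to byte-level token IDs. The mapping of the ChatML start/end markers to dedicated special-token IDs, and the ID layout of bytes 0-255 followed by three specials, are assumptions based on the 259-entry vocabulary listed under Architecture, not the actual pipeline.

```python
# A sketch, not the actual preprocessing code: ChatML-style turns with Discord
# usernames as roles, encoded at the byte level. Special-token IDs 256-258 are
# assumed from the 259-entry vocabulary (256 bytes + 3 specials).
IM_START, IM_END, EOS = 256, 257, 258

def encode_turn(username: str, text: str) -> list[int]:
    """Encode one ChatML turn as byte IDs framed by the assumed special tokens."""
    body = f"{username}\n{text}\n".encode("utf-8")
    return [IM_START] + list(body) + [IM_END]

conversation = (
    encode_turn("alice", "anyone know why my rust build fails?")
    + encode_turn("bob", "paste the error, probably a missing lifetime")
    + [EOS]
)
print(len(conversation), conversation[:12])
```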
Architecture
This model follows the structure proposed in Differential Transformer (Ye et al., 2024), which introduces Differential Attention. Differential Attention is particularly helpful for byte-level LLMs: it reduces attention noise and lets the model better grasp semantic meaning at such fine granularity.
Key architectural details:
- Model Type: Decoder-only Transformer
- Positional Encoding: RoPE (Rotary Positional Embeddings)
- Normalization: Pre-layernorm (LayerNorm before attention and MLP blocks)
- Hidden Size: 768
- FFN Size: 3,072
- Attention Heads: 12
- Layers: 28
- Vocabulary Size: 259 (256 byte tokens + 3 special tokens)
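For reference, below is a minimal, self-contained sketch of a single differential-attention head in PyTorch, following the formulation in Ye et al. (2024): two softmax attention maps are computed from split query/key projections, and the second map is subtracted after scaling by a learnable λ. The dimensions, the scalar λ re-parameterization, and the omission of multi-head concatenation, GroupNorm, and causal masking are simplifications for illustration, not the exact code used to train this model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    """One differential-attention head (simplified sketch)."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections so two attention maps can be compared.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.d_head = d_head
        self.lambda_init = lambda_init
        # Simplified re-parameterization of the learnable lambda.
        self.lambda_q1 = nn.Parameter(torch.zeros(d_head))
        self.lambda_k1 = nn.Parameter(torch.zeros(d_head))
        self.lambda_q2 = nn.Parameter(torch.zeros(d_head))
        self.lambda_k2 = nn.Parameter(torch.zeros(d_head))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Causal masking omitted for brevity;
        # a decoder-only model would mask future positions.
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        # Subtracting the second map cancels attention "noise" shared by both.
        return (a1 - lam * a2) @ v

x = torch.randn(1, 16, 768)              # hidden size from the list above
head = DiffAttentionHead(d_model=768, d_head=64)
print(head(x).shape)                     # torch.Size([1, 16, 64])
```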
Training
DAT Byte Small was trained for a total of 31,200 steps with a maximum sequence length of 2,048 and a minimum sequence length of 512. Batch size varied with sequence length but was usually in the 128-256 range (gradient accumulation was used). The model saw approximately 5-10 billion tokens during training; exact counts were not tracked, so this is an estimate. The learning rate averaged around 5e-5. Although training used sequences of at most 2,048 tokens, in testing the model extrapolates well beyond that length, helped by RoPE.
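The sketch below illustrates the gradient-accumulation pattern described above: a rough per-step token budget keeps the effective batch in the 128-256 sequence range across different sequence lengths. The budget, micro-batch size, and loop structure are hypothetical, not the actual training script.

```python
# Illustrative gradient-accumulation skeleton; all numbers and names are
# hypothetical, not the actual training configuration.
TOKENS_PER_STEP = 262_144   # rough per-step token budget (assumption)
MICRO_BATCH = 16            # sequences per forward/backward pass (assumption)

def accumulation_steps(seq_len: int) -> int:
    """Micro-batches to accumulate so one optimizer step sees ~128-256 sequences."""
    effective_batch = min(max(TOKENS_PER_STEP // seq_len, 128), 256)
    return max(effective_batch // MICRO_BATCH, 1)

def train_step(model, optimizer, micro_batches, seq_len):
    """One optimizer step over accumulation_steps(seq_len) micro-batches."""
    optimizer.zero_grad()
    n = accumulation_steps(seq_len)
    for micro in micro_batches[:n]:     # micro: (MICRO_BATCH, seq_len) byte IDs
        loss = model(micro)             # assume the model returns the LM loss
        (loss / n).backward()           # average gradients across micro-batches
    optimizer.step()

for seq_len in (512, 1024, 2048):
    print(seq_len, accumulation_steps(seq_len) * MICRO_BATCH)  # 256, 256, 128
```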
Benchmarks
Coming soon.
Usage
To run DAT Byte Small (200M), clone the repository and start the interactive chat script:
git clone https://huggingface.co/hudsongouge/DAT-Byte-Small
cd DAT-Byte-Small
python3 run.py -c
That's it! You're now in an interactive chat with the model!
Citation
If you use DAT Byte Small in your research, fine-tune it, or build on this work, please cite the original author:
BibTeX entry
@misc{gouge2025datbyte,
  title  = {DAT Byte: Byte-Level Differential Attention Transformers},
  author = {Hudson Gouge},
  year   = {2025},
  url    = {https://huggingface.co/hudsongouge/dat-byte-small},
  note   = {DAT Byte Small (200M) Model Card}
}
Please include this citation in any derivative work, publication, or project that makes use of the DAT Byte architecture or training artifacts.