arxiv:2511.18890

Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

Published on Nov 24 · Submitted by Yonggan Fu on Dec 1
AI-generated summary

The study identifies key architectural factors and efficient operators to optimize small language models for real-device latency, introducing the Nemotron-Flash family for improved accuracy and efficiency.

Abstract

Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
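To make the depth-width finding concrete, here is a toy back-of-the-envelope calculation (not from the paper; the parameter formula and the fixed per-layer latency cost are simplifying assumptions) showing why deep-thin models can win at iso-parameters yet lose on small-batch latency:

```python
# Toy comparison (not from the paper): deep-thin vs. shallow-wide transformer
# configs at a fixed ~1B parameter budget, under the simplifying assumption
# that small-batch decode latency grows with depth (each layer adds a fixed
# kernel-launch / memory round-trip cost).

def block_params(d_model):
    """Approximate parameters per block: 4*d^2 for attention (Q, K, V, O
    projections) plus 8*d^2 for a 4x-expanded FFN."""
    return 12 * d_model ** 2

def widest_fit(depth, budget):
    """Largest d_model (multiple of 64) whose `depth` blocks fit the budget."""
    d = 64
    while block_params(d + 64) * depth <= budget:
        d += 64
    return d

BUDGET = 1_000_000_000       # ~1B non-embedding parameters
PER_LAYER_MS = 0.05          # assumed fixed per-layer cost at batch size 1

for depth in (16, 32, 64):
    d = widest_fit(depth, BUDGET)
    print(f"depth={depth:2d}  d_model={d:4d}  "
          f"params={block_params(d) * depth / 1e9:.2f}B  "
          f"toy latency={depth * PER_LAYER_MS:.2f} ms/token")
```

Under this toy model all three configurations hold roughly 1B parameters, but per-token latency grows linearly with depth, which is why the latency-optimal depth-width ratio can differ from the parameter-optimal one.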

Community

Paper submitter · edited 2 days ago

👀 Small LMs often look lightweight, but on real GPUs fewer parameters do not automatically mean lower latency. At NVIDIA Research, we asked: what if small language models were designed around real-device latency rather than parameter count?

🚀 Releasing Nemotron-Flash
Nemotron-Flash is a family of hybrid SLMs designed around real-world latency and trained from scratch at 1B and 3B sizes, achieving SOTA accuracy, latency, and throughput. It has been integrated into TensorRT-LLM for production-grade inference, reaching up to 41K tokens/second on a single H100 GPU.
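For context, serving a TensorRT-LLM-integrated model through its high-level LLM API looks roughly like the sketch below (the model ID is a hypothetical placeholder, not a confirmed checkpoint name; substitute the actual Nemotron-Flash checkpoint from the HF collection):

```python
# Minimal sketch using TensorRT-LLM's high-level LLM API; assumes a recent
# tensorrt_llm install. The model ID below is a placeholder, not a confirmed
# checkpoint name.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="nvidia/Nemotron-Flash-3B")   # placeholder model ID
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(
    ["Explain why deep-thin models can be slow at batch size 1."],
    params,
)
print(outputs[0].outputs[0].text)
```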

📊 Model Performance

  • Nemotron-Flash-1B: +5.5% average accuracy vs. Qwen3-0.6B, 1.9× lower latency, and 45.6× higher throughput.
  • Nemotron-Flash-3B: +2–5.5% average accuracy vs. Qwen2.5-3B/Qwen3-1.7B, 1.3–1.7× lower latency, and 6.4–18.7× higher throughput.

💡 Why is Nemotron-Flash so fast?

  • Latency-optimal depth–width ratios: Deeper models are stronger at iso-parameter settings, while wider models are stronger at iso-latency. Nemotron-Flash identifies the sweet spot between the two (see the toy calculation after the abstract above).
  • Hybrid model architecture: A deliberate mix of Attention, Mamba2, DeltaNet, and FFN layers, discovered through evolutionary search (a toy version of such a search is sketched below).
  • More effective training via weight normalization: Weight normalization after each iteration reduces structured outlier weights and preserves larger effective learning rates for better convergence (see the second sketch after this list).
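To illustrate the hybrid-architecture point, here is a minimal, self-contained sketch of an evolutionary search over layer stacks. The per-operator latency and quality numbers and the fitness function are made-up stand-ins; the paper's actual search space and objectives differ:

```python
# Minimal sketch of evolutionary search over hybrid layer stacks.
# Illustrative only: the per-operator latency/quality numbers and the
# fitness function are made-up stand-ins, not values from the paper.
import random

OPS = ["attention", "mamba2", "deltanet", "ffn"]
LATENCY_MS = {"attention": 0.12, "mamba2": 0.06, "deltanet": 0.07, "ffn": 0.04}
QUALITY = {"attention": 1.00, "mamba2": 0.80, "deltanet": 0.85, "ffn": 0.50}

DEPTH = 24          # layers in the stack
BUDGET_MS = 2.0     # per-token latency budget

def fitness(arch):
    """Proxy quality, heavily penalized when the latency budget is exceeded."""
    lat = sum(LATENCY_MS[op] for op in arch)
    quality = sum(QUALITY[op] for op in arch)
    return quality - 100.0 * max(0.0, lat - BUDGET_MS)

def mutate(arch):
    """Swap one randomly chosen layer for a random operator."""
    child = list(arch)
    child[random.randrange(DEPTH)] = random.choice(OPS)
    return child

# Evolve: keep the 8 fittest stacks, refill the population with mutants.
population = [[random.choice(OPS) for _ in range(DEPTH)] for _ in range(32)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)
    elites = population[:8]
    population = elites + [mutate(random.choice(elites)) for _ in range(24)]

best = max(population, key=fitness)
print("best stack:", best)
print(f"latency: {sum(LATENCY_MS[op] for op in best):.2f} ms, "
      f"fitness: {fitness(best):.2f}")
```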
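And for the weight-normalization point, a sketch of one common form of post-step weight normalization: rescaling each Linear row back to its initialization norm after every optimizer step. This is an assumed variant for illustration; the paper's exact scheme may differ:

```python
# Sketch of post-step weight normalization (an assumed variant: rescale each
# Linear row back to its initialization norm after every optimizer step,
# which bounds structured outlier rows while keeping the update direction).
import torch

def record_row_norms(model):
    """Snapshot per-row weight norms of every Linear layer at initialization."""
    return {name: m.weight.detach().norm(dim=1, keepdim=True).clone()
            for name, m in model.named_modules()
            if isinstance(m, torch.nn.Linear)}

@torch.no_grad()
def renormalize(model, init_norms, eps=1e-8):
    """Rescale each row of every Linear weight back to its initial norm."""
    for name, m in model.named_modules():
        if isinstance(m, torch.nn.Linear):
            w = m.weight
            w.mul_(init_norms[name] / (w.norm(dim=1, keepdim=True) + eps))

# Training-loop usage:
#   init_norms = record_row_norms(model)
#   for batch in loader:
#       loss = compute_loss(model, batch)
#       loss.backward()
#       optimizer.step()
#       optimizer.zero_grad()
#       renormalize(model, init_norms)   # applied after every iteration
```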

Paper Link: https://arxiv.org/pdf/2511.18890

HF models:

