Recently, there has been a lot of buzz around a seemingly simple question that even state-of-the-art large language models (LLMs) fail to answer correctly: "Which is bigger? 9.9 or 9.11"
Despite various attempts and variations of prompting techniques, most frontier models still struggle to compare the two numbers accurately. This highlights a broader issue many of today's models encounter: they have limited mathematical reasoning capabilities [1]. While there are multiple conjectures about why this is the case, including the composition of pretraining data and the model architecture itself [2], we investigate one of the most fundamental processes in LLMs, tokenization, and how it affects a model's ability to do math, specifically arithmetic problems.
In this blog post, we discuss:
- Our detailed approach to comparing different methods of number tokenization
- Why reading from right to left is sometimes better than from left to right
- A clear frontrunner of tokenization methods for arithmetic in LLMs
A Brief History of Number Tokenization
Back in 2019, the GPT2 paper detailed its use of BPE (byte-pair encoding) as a tokenization method for language models [3]. This approach works by merging frequently occurring subwords into single units until the vocabulary reaches a target size.
Because of how this algorithm operates, the resulting vocabulary depends heavily on the training data fed into the tokenizer. This led to inconsistencies in how numbers are encoded [4]. Numbers that appear frequently in the training data (e.g. 1-100, years like 1945) are likely to be represented as a single token, while less frequently seen numbers are split into multiple tokens, as shown below:
[Figure: BPE (GPT2) tokenization heatmap for the numbers 1-1000, showing whether each number is encoded as one token or two tokens]
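You can observe this inconsistency directly by running a few numbers through GPT2's tokenizer. The snippet below is a minimal sketch using the transformers library; the exact splits you see depend on GPT2's learned BPE merges.

```python
from transformers import AutoTokenizer

# GPT2's BPE tokenizer: no special handling of digits
tok = AutoTokenizer.from_pretrained("gpt2")

# Frequent numbers (small values, common years) tend to stay a single token,
# while rarer numbers are broken into multiple subword pieces
for number in ["100", "1945", "2537", "987654"]:
    print(number, "->", tok.tokenize(number))
```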
Four years later, the herd of llamas began their stampede! Llama and Llama 2 used SentencePiece's BPE implementation with a notable tweak for numbers: they split all numbers into individual digits [5][6]. This meant there were only 10 unique tokens to represent any number, simplifying numerical representation for LLMs. Much later, DeepSeek released DeepSeek-V2 with a similar single-digit tokenizer [7].
Later on, Llama 3 took a different approach for handling numbers, tokenizing them in groups of three digits [8]. As a result, numbers from 1 to 999 each have unique tokens, while numbers from 1000 onward are composed of these tokens.
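To make the contrast concrete, here is a quick sketch comparing the two schemes on the same number. The model IDs are used purely for illustration (the Llama 3 repository is gated, so substitute any tokenizer you have access to), and the expected splits in the comments follow the descriptions above.

```python
from transformers import AutoTokenizer

# Single-digit scheme (Llama 1/2, DeepSeek): every digit becomes its own token
single_digit = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2")
# Three-digit scheme (Llama 3): digits are grouped in threes from the left
three_digit = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(single_digit.tokenize("1234567"))  # expected: one token per digit
print(three_digit.tokenize("1234567"))   # expected: ['123', '456', '7']
```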
A New Paradigm: Right-to-Left Tokenization
So far, the tokenization methods we've seen "processed" text from left to right. For instance, if the three-digit tokenizer encounters the sequence "12345," it will scan from the beginning, breaking it down into segments like "123" and "45".
Right-to-left (R2L) tokenization, on the other hand, processes numbers from the end to the beginning, grouping digits in threes from the right. Using R2L, the sequence "12345" would be tokenized by scanning from the right, first splitting off "345" and then moving to "12." Recently, there has also been some exploration of forcing this R2L tokenization behaviour in frontier closed-source models, which has been shown to benefit certain arithmetic operations, since the R2L representation prevents the misalignment of the operands [9]. It has also been rumored that Claude uses this R2L tokenization method [10].
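In plain Python, the two grouping strategies look roughly like this (an illustrative sketch of the chunking logic, not the tokenizers' actual implementation):

```python
def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    # Left-to-right: start at the front and emit up to `size` digits at a time
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    # Right-to-left: any leftover (1-2) digits end up at the front
    head = len(digits) % size
    tail = [digits[i:i + size] for i in range(head, len(digits), size)]
    return ([digits[:head]] if head else []) + tail

print(chunk_l2r("12345"))  # ['123', '45']
print(chunk_r2l("12345"))  # ['12', '345']
```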
To better understand what misalignment looks like, let's take 3789 + 8791 as an example:
[Figure: Three-digit L2R tokenization of 3789 + 8791 = 12580]
[Figure: Three-digit R2L tokenization of 3789 + 8791 = 12580]
In the three-digit L2R example, 9 + 1 should map to the final digit 0 of the result, but that 0 ends up grouped with the 8 to form the token 80, since the first three digits of the result (125) were already grouped into a single token. This 'shift' in the tokenization boundary adds complexity to the learning process, which has been shown to be detrimental to accuracy.
In the three-digit R2L example, each digit of 580 aligns neatly with its corresponding sub-operands 789 and 791, which is a more intuitive grouping for the model to learn.
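Running the operands and the result through the chunking helpers sketched above makes the (mis)alignment explicit:

```python
# Reusing chunk_l2r / chunk_r2l from the sketch above on 3789 + 8791 = 12580
for n in ["3789", "8791", "12580"]:
    print(n, "L2R:", chunk_l2r(n), "R2L:", chunk_r2l(n))

# L2R: 3789 -> ['378', '9'], 8791 -> ['879', '1'], 12580 -> ['125', '80']
# R2L: 3789 -> ['3', '789'], 8791 -> ['8', '791'], 12580 -> ['12', '580']
```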
This insight suggests that three-digit R2L tokenization could be an improvement over the standard three-digit L2R tokenization used by Llama 3.
To recap, here's an overview of the techniques used to handle number tokenization:
| How numbers are tokenized | tokenizer (model) |
|---|---|
| pure BPE; no special handling | gpt2 |
| split into single digits | llama, llama2, deepseek |
| split into groups of three digits (L2R); 1-999 have unique tokens | llama3 |
| split into groups of three digits (R2L) | Claude (?) |
Creating a fair comparison of different methods
The goal of this investigation is to compare these tokenizers and their different ways of processing numbers in a way that minimizes the influence of external factors such as model architecture, training configurations, and pre-training data on evaluation results.
Thus, one important design decision we made to address this goal was to evaluate models trained from scratch, where each model has the same data mixture, training configs, and a roughly equal compute budget (number of model parameters and training tokens). The only meaningful difference between the models should be the tokenizer used to tokenize the training data.
Experimental Setup
We picked three of the tokenizers mentioned previously: GPT2's BPE tokenizer, Llama 3's three-digit tokenizer, and Deepseek's single-digit tokenizer.
To test right-to-left tokenization, we created R2L versions of the Pure-BPE and three-digit tokenizers, where numbers are chunked into groups of three digits from the right before being tokenized. We didn't create an R2L version of the single-digit tokenizer, since it would produce the same result: numbers are split into individual digits either way 1. To achieve this, we added an extra preprocessing step that forces the R2L behaviour without producing additional tokens during inference:
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers, Regex
# Initialize the tokenizer to modify. This is a sketch: the gated Llama 3 repo is used
# here as an example of a three-digit (L2R) tokenizer; any BPE tokenizer on the Hub works.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Prepend a Split step that groups digit runs into threes from the right (the regex is one
# way to express this), then fall back to the tokenizer's original pre-tokenization steps.
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(pattern=Regex(r"\d{1,3}(?=(\d{3})*\b)"), behavior="isolated"),
    tokenizer.backend_tokenizer.pre_tokenizer,
])

print(tokenizer.tokenize("42069"))  # ['42', '069'] rather than ['420', '69']
Citation

title={From Digits to Decisions: How Tokenization Impacts Arithmetic in LLMs},
author={Garreth Lee, Guilherme Penedo, Leandro von Werra and Thomas Wolf},
url={https://huggingface.co/spaces/huggingface/number-tokenization-blog},
}