Russian-English BPE tokenizer

A tokenizer optimized for trilingual text, with extended coverage of Russian vocabulary and efficient handling of English and Toki Pona.

Key Features

  • Format: BPE (Byte-Pair Encoding)
  • Vocabulary size: 12,288 tokens
  • Languages: Russian + English + Toki Pona (just because I can, and it costs nothing)
  • Special tokens (see the usage sketch after this list):
    <|endoftext|>
    <|padding|>
    <|mask|>
    <|user|>
    <|assistant|>
    <|system|>
    <|end|>
    <|en|>
    <|ru|>
    <|tok|>
    <|
    |>
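
A minimal usage sketch (assuming the repository ships standard Hugging Face tokenizer files that load via `transformers`; the sample strings and the chat layout are illustrative, not an official template):

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub (assumes standard tokenizer.json packaging).
tokenizer = AutoTokenizer.from_pretrained("loim/whiff-tokenizer-12k")

# A trilingual sample tagged with the language tokens listed above.
text = "<|ru|>Привет, мир! <|en|>Hello, world! <|tok|>toki pona li pona."
ids = tokenizer.encode(text, add_special_tokens=False)
print(ids)
print(tokenizer.decode(ids))

# The role tokens suggest a chat-style layout; purely illustrative:
prompt = (
    "<|system|>You are a helpful assistant.<|end|>"
    "<|user|>Привет!<|end|>"
    "<|assistant|>"
)
print(tokenizer.tokenize(prompt))
```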

🧪 Tests
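
In the tables below: Tokens is the total token count over the test corpus; Compression is tokens per word, so lower is better; Vocab Used and Vocab Usage % report how many distinct vocabulary entries actually occur; Avg Token Length is characters per token; Perfect Detokenization is 1 when decoding the encoded ids reproduces the input exactly; Max Length is the tokenizer's declared maximum sequence length (1e30 effectively means unlimited). A sketch of how such metrics can be computed follows; the function name and the whitespace word split are assumptions, not the actual benchmark code:

```python
import time
from transformers import AutoTokenizer

def benchmark(model_name: str, text: str) -> dict:
    """Approximate the table columns for one tokenizer on one corpus."""
    tok = AutoTokenizer.from_pretrained(model_name)

    t0 = time.perf_counter()
    ids = tok.encode(text, add_special_tokens=False)
    tokenize_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    decoded = tok.decode(ids)
    detokenize_s = time.perf_counter() - t0

    return {
        "tokens": len(ids),
        "compression": len(ids) / len(text.split()),   # tokens per word
        "vocab_size": tok.vocab_size,
        "vocab_used": len(set(ids)),                   # distinct ids seen
        "vocab_usage_pct": 100 * len(set(ids)) / tok.vocab_size,
        "avg_token_length": len(text) / len(ids),      # chars per token
        "perfect_detokenization": int(decoded == text),
        "tokenization_time_s": tokenize_s,
        "detokenization_time_s": detokenize_s,
        "max_length": tok.model_max_length,
    }

# Example (hypothetical corpus variable):
# print(benchmark("loim/whiff-tokenizer-12k", english_corpus))
```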

English text (27,741,474 chars, 4,613,167 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | 1 | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | 1 | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | 1 | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | 1 | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | 1 | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | 1 | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | 1 | 19.4051 | 3.7256 | 1e30 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.081 | 4.5032 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.7957 | 3.6403 | 32768 |
| loim/whiff-tokenizer-12k | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | 1 | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | 1 | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | 0 | 15.711 | 11.2961 | 2048 |
| loim/whiff-tokenizer-8k | 7369398 | 1.6 | 8192 | 7456 | 91.0 | 3.8 | 1 | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | 0 | 16.4294 | 8.9582 | 2048 |

Russian text (16,315,296 chars, 2,185,925 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.3878 | 4.8323 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | 0 | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | 1 | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | 0 | 17.1418 | 2.9581 | 2048 |
| loim/whiff-tokenizer-12k | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | 1 | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.2 | 128000 | 21621 | 16.9 | 3.4 | 1 | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | 1 | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | 1 | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | 1 | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | 1 | 20.6121 | 1.9489 | 2048 |
| loim/whiff-tokenizer-8k | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | 1 | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | 1 | 19.4 | 6.16 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | 1 | 22.1165 | 4.2726 | 1e30 |

Toki Pona text (3,663,780 chars, 831,463 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| loim/whiff-tokenizer-12k | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | 1 | 4.145 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.6 | 131072 | 8428 | 6.4 | 2.7 | 1 | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | 1 | 2.6998 | 0.4471 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | 1 | 2.3745 | 2.2573 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | 1 | 2.181 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | 1 | 2.6367 | 0.3489 | 16384 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.359 | 1.2527 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | 1 | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | 1 | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | 1 | 2.4144 | 0.6391 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.2 | 50257 | 3881 | 7.7 | 2.0 | 0 | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | 0 | 1.9954 | 0.8262 | 2048 |
| loim/whiff-tokenizer-8k | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | 1 | 2.4541 | 0.3799 | 32768 |