Built with the same approach as Tereshchenko Blue, now trained on the full Kobza corpus.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
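A minimal sketch of how to check the fixed vocabulary size, assuming access to any Gemma-3 checkpoint on the Hub (the `google/gemma-3-1b-it` repo ID below is only an example; the Lapa repo ID is taken from this card):

```python
from transformers import AutoTokenizer

lapa = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
# Any Gemma-3 checkpoint works as the baseline; this repo ID is only an example.
gemma = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# The vocabulary stays at Gemma-3's original ~256K entries:
# the new Ukrainian tokens reuse IDs freed by the pruned scripts.
print(len(lapa), len(gemma))  # both should report the same size
```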

How is this possible?

We analyzed more than 16 of the world's most widely used writing systems and pruned roughly four-fifths of the tokens belonging to scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean).
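The pruning script itself is not published here; the sketch below only illustrates one way to bucket vocabulary entries by Unicode script (using the third-party `regex` package) before deciding which IDs to free. The function name and script list are our own assumptions, not the authors' code.

```python
import regex  # third-party "regex" package; supports \p{Script=...} classes

# Illustrative subset of the scripts listed in the table below.
SCRIPTS = ["Han", "Devanagari", "Bengali", "Arabic", "Hiragana", "Katakana",
           "Hangul", "Tamil", "Thai", "Hebrew"]

def dominant_script(token: str) -> str | None:
    """Return the first listed script that any character of the token belongs to."""
    for script in SCRIPTS:
        if regex.search(rf"\p{{Script={script}}}", token):
            return script
    return None

# Hypothetical usage: group a vocabulary (token string -> id) by script,
# then choose which IDs to free for the new Ukrainian tokens.
# by_script = {}
# for tok, idx in tokenizer.get_vocab().items():
#     if (s := dominant_script(tok)) is not None:
#         by_script.setdefault(s, []).append(idx)
```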

Replaced tokens

| Writing system | Tokens removed | Tokens retained |
| --- | ---: | ---: |
| Han (Chinese) | 16,488 | 4,122 |
| Devanagari (Hindi) | 10,976 | 2,743 |
| Bengali | 7,983 | 1,995 |
| Arabic | 6,730 | 1,682 |
| Hiragana / Katakana (Japanese) | 3,944 | 985 |
| Hangul (Korean) | 3,744 | 935 |
| Tamil | 3,080 | 770 |
| Thai | 1,740 | 435 |
| Malayalam | 1,566 | 391 |
| Telugu | 1,428 | 356 |
| Gujarati | 1,080 | 270 |
| Kannada | 1,016 | 253 |
| Ethiopic | 691 | 172 |
| Hebrew | 670 | 167 |
| Khmer | 481 | 119 |
| Sinhala | 435 | 108 |
| Myanmar | 410 | 102 |
| Lao | 243 | 60 |
| Gurmukhi | 215 | 53 |
| Tibetan | 107 | 26 |
| Oriya | 100 | 25 |
| Cyrillic | 13,398 | 0 |
| Gemma-3 `<unused-*>` | 6,139 | 102 |
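As a quick arithmetic check, each row of the table sits close to the four-fifths removal ratio mentioned above, for example:

```python
# Removed/retained counts copied from the table above.
for name, removed, kept in [("Han", 16_488, 4_122), ("Bengali", 7_983, 1_995)]:
    print(name, round(removed / (removed + kept), 3))  # ~0.8 for both
```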

Feature Overview:

  1. +81,492 new Cyrillic BPE tokens trained on the full Kobza corpus plus the Cyrillic slice of the Crimean Tatar corpus.
  2. Only the tokens listed in the "Replaced tokens" table above were replaced; tokens from other writing systems were left untouched.
  3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings (see the sketch after this list).
  4. Vocabulary size, special-token set, pre-/post-tokenization logic, and output formatting match Gemma-3 one-for-one.
  5. Reasoning tokens (`<think>` and `</think>`) are included.
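Because unchanged tokens keep their Gemma-3 IDs, their embedding rows can be reused as-is when adapting a Gemma-3 model to this tokenizer. The sketch below is a hedged illustration, not the authors' recipe: the base checkpoint ID and the mean-initialization of replaced rows are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The base checkpoint ID is an assumption; use whichever Gemma-3 model you adapt.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt", torch_dtype=torch.bfloat16)
old_vocab = AutoTokenizer.from_pretrained("google/gemma-3-1b-pt").get_vocab()
new_vocab = AutoTokenizer.from_pretrained("lapa-llm/tokenizer").get_vocab()

emb = base.get_input_embeddings().weight  # vocab size is unchanged, so no resize needed
mean_row = emb.data.mean(dim=0)

with torch.no_grad():
    for tok, idx in new_vocab.items():
        if old_vocab.get(tok) == idx:
            continue              # unchanged token: keep the original Gemma-3 row
        emb.data[idx] = mean_row  # replaced token: mean init is one simple choice
```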

Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
# "Всі красиві зберігають оптимізм" is Ukrainian for "All the beautiful keep their optimism".
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
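For comparison, running the same sentence through an unmodified Gemma-3 tokenizer (the repo ID below is only an example) should produce a noticeably longer sequence:

```python
from transformers import AutoTokenizer

gemma = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
print(len(gemma("Всі красиві зберігають оптимізм", add_special_tokens=False).input_ids))
# expect several more tokens than the 4 above
```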

"fixed" - means that we remove condition that allow to add empty <think></think> for hybrid approach. This significantly speeds up tokenization.
