DeBERTa Tokenizer (Extended for Dhivehi)
This repository contains a custom extension of the microsoft/deberta-v3-base tokenizer, enhanced with 10,000 frequent Dhivehi tokens. It significantly improves tokenization coverage and accuracy for Dhivehi while preserving English behavior.
Overview
- Base tokenizer: microsoft/deberta-v3-base (vocab size: 128,001)
- New Dhivehi tokens added: 10,000
- Source corpus: 58K+ Dhivehi sentences, 424,319 unique tokens (≥2 chars)
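The extension script itself is not part of this repository, but the general recipe (counting frequent Dhivehi tokens in the corpus and appending them to the stock vocabulary with `add_tokens`) can be sketched roughly as below. The corpus file name and the frequency cut-off are illustrative assumptions, not the exact procedure used here:

```python
from collections import Counter

from transformers import AutoTokenizer

# Start from the stock DeBERTa-v3 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Count whitespace-separated Dhivehi tokens of at least 2 characters.
# "dhivehi_sentences.txt" is a placeholder for the source corpus.
counter = Counter()
with open("dhivehi_sentences.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(tok for tok in line.split() if len(tok) >= 2)

# Append the most frequent tokens the tokenizer does not already know.
new_tokens = [tok for tok, _ in counter.most_common(10_000)]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")

tokenizer.save_pretrained("deberta-dhivehi-tokenizer-extended")
```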
Tokenization Comparison
English (No change expected)
Input: The quick brown fox jumps over the lazy dog.
Tokens (STOCK & CUSTOM):
['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁lazy', '▁dog', '.']
✔️ Token IDs identical — English tokenization is preserved.
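Because the new tokens are appended after the original vocabulary, existing English tokens and IDs should be untouched. A quick way to check this yourself:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "The quick brown fox jumps over the lazy dog."
assert stock.tokenize(text) == custom.tokenize(text)
assert stock.encode(text) == custom.encode(text)
print("English tokens and IDs are identical.")
```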
Dhivehi (Improved coverage)
Input: އީދުގެ ހަރަކާތް ފެށުމަށް މިރޭ ހުޅުމާލޭގައި އީދު މަޅި ރޯކުރަނީ
STOCK Tokens (fragmented):
['▁', 'އ', 'ީދ', 'ު', 'ގ', 'ެ', '▁', 'ހ', 'ަ', 'ރ', 'ަ', ...]
CUSTOM Tokens (clean and meaningful):
['އީދުގެ', '▁', 'ހަރަކާތް', '▁', 'ފެށުމަށް', ...]
✔️ Long, language-meaningful tokens reduce fragmentation and UNKs.
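The comparison above can be reproduced by running both tokenizers over the same Dhivehi sentence:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "އީދުގެ ހަރަކާތް ފެށުމަށް މިރޭ ހުޅުމާލޭގައި އީދު މަޅި ރޯކުރަނީ"
print("STOCK :", stock.tokenize(text))
print("CUSTOM:", custom.tokenize(text))
```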
Token ID Example
STOCK (fragmented):
Token IDs: [..., 3, 3, 3, ...] # many unknowns
CUSTOM (extended):
Token IDs: [137561, 130775, 129048, ...]
Clean and consistent token IDs for Thaana tokens.
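To see how many pieces fall back to the unknown token, you can compare the encoded IDs against each tokenizer's `unk_token_id`; the exact counts depend on the input sentence:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "އީދުގެ ހަރަކާތް ފެށުމަށް"
stock_ids = stock.encode(text, add_special_tokens=False)
custom_ids = custom.encode(text, add_special_tokens=False)

print("STOCK  unknowns:", stock_ids.count(stock.unk_token_id), "of", len(stock_ids))
print("CUSTOM unknowns:", custom_ids.count(custom.unk_token_id), "of", len(custom_ids))
print("CUSTOM IDs     :", custom_ids)
```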
How to Use
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")
tokens = tokenizer.tokenize("އީދުގެ ހަރަކާތް")
print(tokens)
```
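Note that this repository ships only the tokenizer. If you pair it with the base model's weights for fine-tuning, the embedding matrix has to be resized to the extended vocabulary first; a minimal sketch, assuming microsoft/deberta-v3-base as the starting checkpoint:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v3-base")

# Rows for the newly added Dhivehi tokens are randomly initialized and
# only become useful after fine-tuning on Dhivehi text.
model.resize_token_embeddings(len(tokenizer))
```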