DeBERTa Tokenizer (Extended for Dhivehi)
This repository contains a custom extension of the microsoft/deberta-v3-base tokenizer, enhanced with 10,000 frequent Dhivehi tokens. It significantly improves tokenization coverage and accuracy for Dhivehi while preserving English behavior.
Overview
- Base tokenizer: microsoft/deberta-v3-base (vocab size: 128,001)
- New Dhivehi tokens added: 10,000
- Source corpus: 58K+ Dhivehi sentences, 424,319 unique tokens (≥2 chars)
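The extension script itself is not part of this repository, but the general recipe (counting frequent Dhivehi tokens in the corpus and appending them to the stock vocabulary with `add_tokens`) can be sketched roughly as below. The corpus file name and the frequency cut-off are illustrative assumptions, not the exact procedure used here:

```python
from collections import Counter

from transformers import AutoTokenizer

# Start from the stock DeBERTa-v3 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Count whitespace-separated Dhivehi tokens of at least 2 characters.
# "dhivehi_sentences.txt" is a placeholder for the source corpus.
counter = Counter()
with open("dhivehi_sentences.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(tok for tok in line.split() if len(tok) >= 2)

# Append the most frequent tokens the tokenizer does not already know.
new_tokens = [tok for tok, _ in counter.most_common(10_000)]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")

tokenizer.save_pretrained("deberta-dhivehi-tokenizer-extended")
```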
Tokenization Comparison
English (No change expected)
Input: The quick brown fox jumps over the lazy dog.
Tokens (STOCK & CUSTOM):
['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁lazy', '▁dog', '.']
✔️ Token IDs identical — English tokenization is preserved.
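Because the new tokens are appended after the original vocabulary, existing English tokens and IDs should be untouched. A quick way to check this yourself:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "The quick brown fox jumps over the lazy dog."
assert stock.tokenize(text) == custom.tokenize(text)
assert stock.encode(text) == custom.encode(text)
print("English tokens and IDs are identical.")
```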
Dhivehi (Improved coverage)
Input: އީދުގެ ހަރަކާތް ފެށުމަށް މިރޭ ހުޅުމާލޭގައި އީދު މަޅި ރޯކުރަނީ
STOCK Tokens (fragmented):
['▁', 'އ', 'ީދ', 'ު', 'ގ', 'ެ', '▁', 'ހ', 'ަ', 'ރ', 'ަ', ...]
CUSTOM Tokens (clean and meaningful):
['އީދުގެ', '▁', 'ހަރަކާތް', '▁', 'ފެށުމަށް', ...]
✔️ Long, language-meaningful tokens reduce fragmentation and UNKs.
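The comparison above can be reproduced by running both tokenizers over the same Dhivehi sentence:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "އީދުގެ ހަރަކާތް ފެށުމަށް މިރޭ ހުޅުމާލޭގައި އީދު މަޅި ރޯކުރަނީ"
print("STOCK :", stock.tokenize(text))
print("CUSTOM:", custom.tokenize(text))
```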
Token ID Example
STOCK (fragmented):
Token IDs: [..., 3, 3, 3, ...] # many unknowns
CUSTOM (extended):
Token IDs: [137561, 130775, 129048, ...]
Clean and consistent token IDs for Thaana tokens.
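To see how many pieces fall back to the unknown token, you can compare the encoded IDs against each tokenizer's `unk_token_id`; the exact counts depend on the input sentence:

```python
from transformers import AutoTokenizer

stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
custom = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")

text = "އީދުގެ ހަރަކާތް ފެށުމަށް"
stock_ids = stock.encode(text, add_special_tokens=False)
custom_ids = custom.encode(text, add_special_tokens=False)

print("STOCK  unknowns:", stock_ids.count(stock.unk_token_id), "of", len(stock_ids))
print("CUSTOM unknowns:", custom_ids.count(custom.unk_token_id), "of", len(custom_ids))
print("CUSTOM IDs     :", custom_ids)
```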
How to Use
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")
tokens = tokenizer.tokenize("އީދުގެ ހަރަކާތް")
print(tokens)
```
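Note that this repository ships only the tokenizer. If you pair it with the base model's weights for fine-tuning, the embedding matrix has to be resized to the extended vocabulary first; a minimal sketch, assuming microsoft/deberta-v3-base as the starting checkpoint:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v3-base")

# Rows for the newly added Dhivehi tokens are randomly initialized and
# only become useful after fine-tuning on Dhivehi text.
model.resize_token_embeddings(len(tokenizer))
```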