# bert-dhivehi-tokenizer-extended

An extended BERT tokenizer built on `bert-base-multilingual-cased`, optimized for Dhivehi (Divehi/Thaana script).
## Overview

This tokenizer preserves the full English and multilingual coverage of the base model while adding the top 100,000 high-frequency Dhivehi tokens extracted from a large corpus. It ensures robust tokenization for both English and Dhivehi with no regressions.
## How It Was Built

- Base model: `bert-base-multilingual-cased`
- Corpus: ~16.7M lines of cleaned Dhivehi text (one sentence per line)
- Processing steps (see the sketch after this list):
  - Count word-level tokens in batches of 100,000 lines
  - Filter out rare tokens (frequency < 5)
  - Select the top 100,000 tokens by frequency
  - Extend the vocabulary with `tokenizer.add_tokens([...])`
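
The snippet below is a minimal sketch of these steps, assuming a plain-text corpus file (here called `dhivehi_corpus.txt`) with one sentence per line; the file name, batching, and threshold handling are illustrative, not the exact build script.

```python
# Minimal sketch of the extension process described above (illustrative only).
from collections import Counter
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

counts = Counter()
batch = []
with open("dhivehi_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        batch.append(line)
        if len(batch) == 100_000:          # process in batches of 100,000 lines
            for sentence in batch:
                counts.update(sentence.split())  # word-level counting
            batch = []
    for sentence in batch:                 # remaining lines
        counts.update(sentence.split())

# Keep tokens seen at least 5 times, then take the 100,000 most frequent
frequent = [tok for tok, c in counts.most_common() if c >= 5][:100_000]

# add_tokens() only adds tokens the base vocabulary does not already cover
num_added = tokenizer.add_tokens(frequent)
print(f"Added {num_added} new tokens")
tokenizer.save_pretrained("bert-dhivehi-tokenizer-extended")
```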
## Usage Example

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

# Tokenization test
text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"
print(tokenizer.tokenize(text_dv))
```
- English tokenization is unchanged
- Dhivehi text now tokenizes into meaningful word-level tokens with zero `[UNK]` (see the check below)
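
As a quick sanity check, the illustrative snippet below (not part of the original card) confirms that the Dhivehi sample produces no `[UNK]` tokens and that English tokenization matches the base model; the English sentence is an arbitrary example.

```python
# Illustrative check: no [UNK] in the Dhivehi sample, and English tokenization
# is identical to the base model.
from transformers import BertTokenizer

extended = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
base = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"
assert extended.unk_token not in extended.tokenize(text_dv)

text_en = "The quick brown fox jumps over the lazy dog."
assert extended.tokenize(text_en) == base.tokenize(text_en)
```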
## Intended Uses

- Downstream NER, QA, classification, or masking tasks in Dhivehi
- Fine-tuning `bert-base-multilingual-cased` with the extended vocabulary (see the sketch below)
- Especially useful for projects in the Maldivian/Thaana script
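
When fine-tuning with the extended tokenizer, the base model's embedding matrix has to be resized to the larger vocabulary. A minimal sketch, assuming a masked language modeling setup (the task choice is illustrative):

```python
# Minimal sketch: load the base model and resize its embeddings to match the
# extended vocabulary before fine-tuning (MLM shown as an example task).
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Embeddings for the added tokens are randomly initialized and learned during fine-tuning
model.resize_token_embeddings(len(tokenizer))
```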
## Limitations

- Vocabulary capped at 100,000 new tokens; very rare words may still be missed
- The added tokens are word-level, not subword-based; better performance may require further tokenizer retraining (see the sketch below)
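
One possible route to subword coverage, offered here only as a sketch and not something this repository provides, is to retrain a WordPiece tokenizer on the Dhivehi corpus via `train_new_from_iterator` on a fast tokenizer; the corpus path and vocabulary size below are assumptions.

```python
# Illustrative sketch of retraining a subword (WordPiece) tokenizer on a
# Dhivehi corpus; corpus path and vocab size are hypothetical.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # fast tokenizer

def corpus_iterator(path="dhivehi_corpus.txt", batch_size=1000):
    """Yield batches of lines from the corpus file."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

new_tokenizer = base.train_new_from_iterator(corpus_iterator(), vocab_size=100_000)
new_tokenizer.save_pretrained("bert-dhivehi-wordpiece")
```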
Files
vocab.txt
,tokenizer_config.json
,special_tokens_map.json
→ Standard tokenizer files- Note: No
tokenizer.json
; requiresBertTokenizer
, notFast
, due to manual extension