# bert-dhivehi-tokenizer-extended

An extended BERT tokenizer built on `bert-base-multilingual-cased`, optimized for Dhivehi (Divehi/Thaana script).
## Overview

This tokenizer preserves the full English and multilingual coverage of the base model while adding the top 100,000 high-frequency Dhivehi tokens extracted from a large corpus. It ensures robust tokenization for both English and Dhivehi with no regressions.
## How It Was Built

- Base model: `bert-base-multilingual-cased`
- Corpus: ~16.7M lines of cleaned Dhivehi text (one sentence per line)
- Processing steps (see the sketch after this list):
  - Count word-level tokens in batches of 100,000 lines
  - Filter out rare tokens (frequency < 5)
  - Select the top 100,000 tokens by frequency
  - Extend the vocabulary with `tokenizer.add_tokens([...])`
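
The snippet below is a minimal sketch of these steps, assuming a plain-text corpus file (here called `dhivehi_corpus.txt`) with one sentence per line; the file name, batching, and threshold handling are illustrative, not the exact build script.

```python
# Minimal sketch of the extension process described above (illustrative only).
from collections import Counter
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

counts = Counter()
batch = []
with open("dhivehi_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        batch.append(line)
        if len(batch) == 100_000:          # process in batches of 100,000 lines
            for sentence in batch:
                counts.update(sentence.split())  # word-level counting
            batch = []
    for sentence in batch:                 # remaining lines
        counts.update(sentence.split())

# Keep tokens seen at least 5 times, then take the 100,000 most frequent
frequent = [tok for tok, c in counts.most_common() if c >= 5][:100_000]

# add_tokens() only adds tokens the base vocabulary does not already cover
num_added = tokenizer.add_tokens(frequent)
print(f"Added {num_added} new tokens")
tokenizer.save_pretrained("bert-dhivehi-tokenizer-extended")
```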
## Usage Example

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

# Tokenization test
text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"
print(tokenizer.tokenize(text_dv))
```
- English tokenization is unchanged
- Dhivehi text now tokenizes into meaningful word-level tokens with zero `[UNK]` (see the check below)
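
As a quick sanity check, the illustrative snippet below (not part of the original card) confirms that the Dhivehi sample produces no `[UNK]` tokens and that English tokenization matches the base model; the English sentence is an arbitrary example.

```python
# Illustrative check: no [UNK] in the Dhivehi sample, and English tokenization
# is identical to the base model.
from transformers import BertTokenizer

extended = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
base = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"
assert extended.unk_token not in extended.tokenize(text_dv)

text_en = "The quick brown fox jumps over the lazy dog."
assert extended.tokenize(text_en) == base.tokenize(text_en)
```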
## Intended Uses

- Downstream NER, QA, classification, or masking tasks in Dhivehi
- Fine-tuning `bert-base-multilingual-cased` with the extended vocabulary (see the sketch below)
- Especially useful for projects in the Maldivian/Thaana script
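
When fine-tuning with the extended tokenizer, the base model's embedding matrix has to be resized to the larger vocabulary. A minimal sketch, assuming a masked language modeling setup (the task choice is illustrative):

```python
# Minimal sketch: load the base model and resize its embeddings to match the
# extended vocabulary before fine-tuning (MLM shown as an example task).
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Embeddings for the added tokens are randomly initialized and learned during fine-tuning
model.resize_token_embeddings(len(tokenizer))
```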
## Limitations

- Vocabulary capped at 100,000 new tokens; very rare words may still be missed
- The added tokens are word-level, not subword-based; better performance may require further tokenizer retraining (see the sketch below)
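
One possible route to subword coverage, offered here only as a sketch and not something this repository provides, is to retrain a WordPiece tokenizer on the Dhivehi corpus via `train_new_from_iterator` on a fast tokenizer; the corpus path and vocabulary size below are assumptions.

```python
# Illustrative sketch of retraining a subword (WordPiece) tokenizer on a
# Dhivehi corpus; corpus path and vocab size are hypothetical.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # fast tokenizer

def corpus_iterator(path="dhivehi_corpus.txt", batch_size=1000):
    """Yield batches of lines from the corpus file."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

new_tokenizer = base.train_new_from_iterator(corpus_iterator(), vocab_size=100_000)
new_tokenizer.save_pretrained("bert-dhivehi-wordpiece")
```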
Files
vocab.txt
,tokenizer_config.json
,special_tokens_map.json
→ Standard tokenizer files- Note: No
tokenizer.json
; requiresBertTokenizer
, notFast
, due to manual extension