bert-dhivehi-tokenizer-extended

An extended BERT tokenizer built upon bert-base-multilingual-cased, optimized for Dhivehi (also spelled Divehi), written in the Thaana script.

Overview

This tokenizer preserves all of the base model's English and multilingual coverage while adding the 100,000 highest-frequency Dhivehi tokens extracted from a large corpus. It provides robust tokenization for both English and Dhivehi, with no regressions relative to the base model.

How It Was Built

  • Base model: bert-base-multilingual-cased
  • Corpus: ~16.7M lines of cleaned Dhivehi text (one sentence per line)
  • Processing steps (sketched in code below):
    1. Count word-level tokens in batches of 100,000 lines
    2. Filter out rare tokens (frequency < 5)
    3. Select top 100,000 tokens by frequency
    4. Extend vocab using tokenizer.add_tokens([...])
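
The steps above can be reproduced roughly as follows. This is a minimal sketch rather than the exact build script: the corpus path is a placeholder, and whitespace splitting is assumed for the word-level counts.

from collections import Counter
from transformers import BertTokenizer

CORPUS_PATH = "dhivehi_corpus.txt"   # placeholder path; one sentence per line
MIN_FREQ = 5                         # drop tokens seen fewer than 5 times
TOP_K = 100_000                      # number of new tokens to add

# Steps 1-2: count word-level tokens in batches of 100,000 lines
counts = Counter()
batch = []
with open(CORPUS_PATH, encoding="utf-8") as f:
    for line in f:
        batch.append(line.strip())
        if len(batch) == 100_000:
            for sentence in batch:
                counts.update(sentence.split())  # assumed whitespace word counting
            batch = []
for sentence in batch:                           # flush the final partial batch
    counts.update(sentence.split())

# Step 3: keep tokens above the frequency threshold, then take the top 100,000
new_tokens = [tok for tok, freq in counts.most_common() if freq >= MIN_FREQ][:TOP_K]

# Step 4: extend the base vocabulary and save the result
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} new tokens")
tokenizer.save_pretrained("bert-dhivehi-tokenizer-extended")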

Usage Example

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

# Tokenization test
text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"
print(tokenizer.tokenize(text_dv))

  • English tokenization is unchanged from the base model
  • Dhivehi text now tokenizes into meaningful word-level tokens with zero [UNK] tokens (see the quick check below)
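
As a quick check of the two claims above, the extended tokenizer can be compared against the unmodified base tokenizer. The exact Dhivehi token output depends on the vocabulary, so only the comparison pattern is shown here.

from transformers import BertTokenizer

base = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
extended = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text_en = "The quick brown fox jumps over the lazy dog."
text_dv = "ޖެންޑާގެ ސްޓޭޓް އަޒްރާ"

# English should tokenize identically with both tokenizers
print("English unchanged:", base.tokenize(text_en) == extended.tokenize(text_en))

# Dhivehi should produce word-level tokens and no [UNK] with the extended vocab
dv_tokens = extended.tokenize(text_dv)
print("Dhivehi tokens:", dv_tokens)
print("[UNK] count:", dv_tokens.count("[UNK]"))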

Intended Uses

  • Downstream NER, QA, classification, or masked-language-modeling tasks in Dhivehi
  • Fine-tuning bert-base-multilingual-cased with the extended vocabulary (see the resizing sketch after this list)
  • Especially useful for projects in the Maldivian/Thaana script
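
When fine-tuning with this tokenizer, the model's token embedding matrix must be resized to match the extended vocabulary. A minimal sketch follows; BertForMaskedLM is only one possible head.

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Adds randomly initialized embedding rows for the new Dhivehi tokens;
# these rows are learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)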

Limitations

  • Vocabulary capped at 100,000 new tokens; very rare words may still be missed
  • The added Dhivehi tokens are whole words rather than WordPiece subwords; better coverage may require retraining the tokenizer

Files

  • vocab.txt, tokenizer_config.json, special_tokens_map.json
    → Standard tokenizer files
  • Note: there is no tokenizer.json, so the tokenizer must be loaded with the slow BertTokenizer (not BertTokenizerFast) because the vocabulary was extended manually