# πŸ”‘ Telugu BPE Tokenizer (23k vocab) β€” Vipplav

A Byte-Pair Encoding (BPE) tokenizer trained on over **3.4 lakh (340,000) cleaned Telugu text entries** from the AI4Bharat Sangraha dataset and other open sources. It is well suited for pretraining or fine-tuning Telugu language models.
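
Loading it requires the `sentencepiece` backend alongside πŸ€— `transformers` (assumed pip install; any recent versions should work):

```bash
pip install transformers sentencepiece
```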


## πŸ“Œ Highlights

- **Tokenizer type:** SentencePiece BPE
- **Vocabulary size:** 23,000
- **Character coverage:** 100% of the Telugu script
- **Libraries:** πŸ€— `transformers` + `sentencepiece`
- **Special tokens** (verified in the snippet below):
  - `<unk>` β€” unknown token
  - `<pad>` β€” padding
  - `<s>` β€” start of sequence
  - `</s>` β€” end of sequence
  - `\n`, `β‚Ή`, `β€’`, `-` β€” user-defined symbols preserved during training
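
These registrations can be checked with the standard πŸ€— special-token accessors. A minimal sketch (the exact pieces printed depend on the trained vocabulary; the rupee sign is expected to surface as its own piece because it was a user-defined symbol):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Vipplav/telugu-bpe-23k")

# All registered special tokens and their ids
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)

# User-defined symbols should survive as standalone pieces
print(tokenizer.tokenize("β‚Ή100"))
```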

## ✨ Example Usage

```python
from transformers import T5Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("Vipplav/telugu-bpe-23k")

# Sample Telugu input
text = "ΰ°ͺΰ°°ΰ°Ώΰ°Άΰ±€ΰ°²ΰ°¨ ఀేదీ: 15-06-2025"

# Tokenize the input into subword pieces
tokens = tokenizer.tokenize(text)

# Map pieces to ids and decode back to the original text
decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=True)

# Display results
print(f"\nπŸ“₯ Input   : {text}")
print(f"πŸ”€ Tokens  : {tokens}")
print(f"πŸ“ Decoded : {decoded}")
```

## πŸ“œ Citation

If you use this tokenizer, please cite:

**APA:**

Vipplav AI. (2025). *Telugu BPE Tokenizer (23k vocab)*. Hugging Face. https://huggingface.co/Vipplav/telugu-bpe-23k

AI4Bharat. (2023). *Sangraha: A Large-Scale Multidomain Corpus for Indian Languages*. Hugging Face Datasets. https://huggingface.co/datasets/ai4bharat/sangraha

**BibTeX:**

```bibtex
@misc{vipplav_telugu_tokenizer,
  author = {Vipplav AI},
  title  = {Telugu BPE Tokenizer (23k vocab)},
  year   = {2025},
  url    = {https://huggingface.co/Vipplav/telugu-bpe-23k}
}

@dataset{sangraha2023,
  author = {AI4Bharat},
  title  = {Sangraha: A Large-Scale Multidomain Corpus for Indian Languages},
  year   = {2023},
  url    = {https://huggingface.co/datasets/ai4bharat/sangraha}
}
```