# 💡 Telugu BPE Tokenizer (23k vocab) by Vipplav
A Byte-Pair Encoding (BPE) tokenizer trained on over **3.4 lakh (340,000) cleaned Telugu text keys** from the AI4Bharat Sangraha dataset and other open sources. This tokenizer is ideal for pretraining or fine-tuning Telugu language models.
## 📌 Highlights
- **Tokenizer Type:** SentencePiece BPE
- **Vocabulary Size:** 23,000
- **Character Coverage:** 100% Telugu script
- **Library:** 🤗 `transformers` + `sentencepiece`
- **Special Tokens:**
  - `<unk>` → Unknown token
  - `<pad>` → Padding
  - `<s>` → Start of sequence
  - `</s>` → End of sequence
  - `\n`, `₹`, `•`, `-` → User-defined symbols preserved during training (see the training sketch below)
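A tokenizer with these settings could be reproduced with the `sentencepiece` trainer roughly as below. This is a minimal sketch under assumed settings, not the actual training script: the corpus path and the exact trainer flags are hypothetical.

```python
import sentencepiece as spm

# Minimal training sketch (assumed settings, not the release script).
# "telugu_corpus.txt" is a hypothetical cleaned corpus, one sentence per line.
spm.SentencePieceTrainer.train(
    input="telugu_corpus.txt",
    model_prefix="telugu_bpe_23k",
    model_type="bpe",
    vocab_size=23000,
    character_coverage=1.0,                # retain the full Telugu script
    user_defined_symbols=["₹", "•", "-"],  # "\n" would additionally need an
                                           # identity normalization rule
    unk_piece="<unk>",
    bos_piece="<s>",
    eos_piece="</s>",
    pad_piece="<pad>",
    pad_id=3,                              # <pad> is disabled by default
)
```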
## ✨ Example Usage
```python
# Requires: pip install transformers sentencepiece
from transformers import T5Tokenizer

# Load tokenizer from Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("Vipplav/telugu-bpe-23k")

# Sample Telugu input ("Inspection date: 15-06-2025")
text = "పరిశీలన తేదీ: 15-06-2025"

# Tokenize the input
tokens = tokenizer.tokenize(text)

# Decode tokens back to text
decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=True)

# Display results
print(f"\n📥 Input   : {text}")
print(f"🔤 Tokens  : {tokens}")
print(f"📝 Decoded : {decoded}")
```
## 📖 Citation
If you use this tokenizer, please cite:
**APA:**

Vipplav AI. (2025). *Telugu BPE Tokenizer (23k vocab)*. Hugging Face. https://huggingface.co/Vipplav/telugu-bpe-23k

AI4Bharat. (2023). *Sangraha: A Large-Scale Multidomain Corpus for Indian Languages*. Hugging Face Datasets. https://huggingface.co/datasets/ai4bharat/sangraha
**BibTeX:**

```bibtex
@misc{vipplav_telugu_tokenizer,
  author = {Vipplav AI},
  title  = {Telugu BPE Tokenizer (23k vocab)},
  year   = {2025},
  url    = {https://huggingface.co/Vipplav/telugu-bpe-23k}
}

@dataset{sangraha2023,
  author = {AI4Bharat},
  title  = {Sangraha Dataset},
  year   = {2023},
  url    = {https://huggingface.co/datasets/ai4bharat/sangraha}
}
```