
Tokenizer Card for Ansh-256k!

The tokenizer model Ansh-256k is trained on a dataset covering the 22 official Indic languages and English. We propose the name Ansh because this tokenizer is designed to meticulously identify every essential token (Ansh in Sanskrit) of our diverse Indic languages. It is an advanced version of Ansh-160k, which was trained on 18 Indic languages and English.


Model Description

India is a vast, multilingual country with 22 official languages and more than 1,700 languages and dialects. Many of these languages share words with one another, sometimes even across language families. To capitalize on this observation, we trained our tokenizer with a vocabulary size of 256,000 (256k) by applying the Byte-Pair Encoding (BPE) algorithm to Wikipedia articles and the Sangraha dataset in 22 Indic languages and English. When compared on fertility scores against popular open-source tokenizers trained on multilingual Indic data, our model outperforms them in 20 of the 22 Indic languages.
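As a rough illustration of this setup, the snippet below trains a byte-level BPE tokenizer with a 256k vocabulary using the Hugging Face tokenizers library. This is a minimal sketch, not the authors' exact recipe: the corpus file names and special tokens are placeholders.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE with a 256k vocabulary (a minimal sketch, not the exact recipe).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=256_000,                       # the 256k vocabulary from this card
    special_tokens=["<unk>", "<s>", "</s>"],  # placeholder special tokens
)

# "corpus_*.txt" are hypothetical text dumps of the Wikipedia + Sangraha data.
tokenizer.train(files=["corpus_hi.txt", "corpus_ta.txt", "corpus_en.txt"], trainer=trainer)
tokenizer.save("ansh-256k.json")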

How to Get Started with the Model 👨🏻‍💻

Use the code below to get started with the model.

from transformers import AutoTokenizer

try:
    # Load the Ansh-256k tokenizer from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/Ansh-256k")
    print("Tokenizer loaded successfully!")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    print("Please ensure you have the correct model name and are connected to the internet.")
    exit()

input_text = "Hello, world! This is an example of how to use the tokenizer."
#input_text = 'मुझे यह presentation कल morning तक submit करना है।'  # Hindi-English code-mixed example
#input_text = 'What is the capital city of India?'

# Encode the text into token IDs.
encoded_input = tokenizer.encode(input_text)
print("\nOriginal Text:", input_text)
print("Encoded (Token IDs):", encoded_input)

# Decode the token IDs back into text.
decoded_output = tokenizer.decode(encoded_input)
print("Decoded Text:", decoded_output)
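To inspect the subword pieces behind those IDs, the standard convert_ids_to_tokens method (available on all Hugging Face tokenizers) maps them back to their string forms:

# Map token IDs back to their subword strings for inspection.
tokens = tokenizer.convert_ids_to_tokens(encoded_input)
print("Subword Tokens:", tokens)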

Evaluation

[More Information Needed]

Results 🏆

Comparison of fertility scores between the Ansh tokenizers and popular open-source tokenizers trained on multilingual Indic data, across the 22 official Indic languages and English. Lower is better; a sketch for computing these scores follows the table.
| Language | Ansh-256k | Sarvam-1 | Gemma-3 | Llama-3.1 | IndicBERTv2 | MuRIL | NLLB | XLM-RoBERTa | Ansh-128k | Ansh-160k |
|----------|-----------|----------|---------|-----------|-------------|-------|------|-------------|-----------|-----------|
| Tamil | 1.732 | 2.590 | 2.524 | 11.941 | 1.790 | 1.844 | 2.742 | 2.486 | 1.915 | 1.899 |
| Kannada | 1.684 | 2.654 | 3.349 | 14.239 | 1.815 | 1.953 | 2.846 | 2.507 | 1.909 | 1.862 |
| Malayalam | 1.957 | 3.363 | 3.612 | 16.064 | 2.177 | 2.337 | 3.406 | 2.968 | 2.210 | 2.236 |
| Maithili | 1.398 | 2.503 | 2.152 | 3.246 | 1.695 | 1.832 | 1.955 | 2.133 | 1.474 | 1.561 |
| Konkani | 1.770 | 2.992 | 2.727 | 4.037 | 2.221 | 2.491 | 2.617 | 2.581 | 1.941 | 2.072 |
| Telugu | 1.747 | 2.693 | 3.143 | 13.240 | 1.873 | 2.069 | 2.859 | 2.552 | 1.940 | 2.010 |
| Odia | 1.401 | 2.494 | 4.523 | 15.535 | 1.539 | 1.714 | 2.149 | 2.196 | 1.546 | 1.587 |
| Bengali | 1.408 | 2.045 | 1.767 | 8.200 | 1.461 | 1.442 | 2.205 | 2.140 | 1.542 | 1.509 |
| Nepali | 1.272 | 2.358 | 2.027 | 3.611 | 1.411 | 1.413 | 1.898 | 1.643 | 1.376 | 1.428 |
| Punjabi | 1.310 | 1.726 | 2.789 | 7.855 | 1.341 | 1.420 | 1.843 | 1.798 | 1.415 | 1.434 |
| Urdu | 1.230 | 8.417 | 1.687 | 3.003 | 1.393 | 1.314 | 1.589 | 1.430 | 1.285 | 1.270 |
| Hindi | 1.195 | 1.480 | 1.442 | 2.757 | 1.272 | 1.276 | 1.546 | 1.525 | 1.245 | 1.246 |
| Gujarati | 1.423 | 2.093 | 2.358 | 9.651 | 1.459 | 1.587 | 2.145 | 2.062 | 1.537 | 1.495 |
| Kashmiri | 1.406 | 9.248 | 3.053 | 4.026 | 2.646 | 2.131 | 2.849 | 2.985 | 1.540 | 1.619 |
| Marathi | 1.463 | 1.979 | 2.012 | 4.010 | 1.521 | 1.579 | 2.207 | 2.011 | 1.585 | 1.573 |
| Sindhi | 1.226 | 8.165 | 2.101 | 2.938 | 1.630 | 1.354 | 1.621 | 1.532 | 1.300 | 1.333 |
| Assamese | 1.528 | 4.334 | 2.728 | 8.051 | 1.686 | 1.770 | 2.191 | 2.875 | 1.662 | 1.724 |
| Sanskrit | 2.254 | 3.949 | 3.562 | 5.034 | 2.732 | 2.855 | 3.453 | 3.344 | 2.444 | 2.470 |
| Bodo | 1.375 | 3.136 | 3.057 | 3.855 | 1.886 | 2.761 | 3.008 | 3.068 | 1.486 | 2.499 |
| Santhali | 1.333 | 14.402 | 5.634 | 13.456 | 1.966 | 1.144 | 2.994 | 2.095 | 1.414 | 4.538 |
| Dogri | 1.438 | 1.789 | 1.658 | 2.810 | 1.457 | 1.512 | 1.721 | 1.717 | 1.539 | 1.525 |
| Manipuri | 4.395 | 13.496 | 9.272 | 13.184 | 2.497 | 1.436 | 2.237 | 2.326 | 4.416 | 4.407 |
| English | 1.415 | 1.743 | 1.415 | 1.384 | 1.373 | 1.368 | 1.480 | 1.470 | 1.545 | 1.449 |
| Overall | 1.526 | 5.963 | 3.123 | 6.024 | 1.893 | 1.899 | 2.498 | 2.439 | 1.641 | 2.348 |
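Fertility is the average number of subword tokens a tokenizer produces per word; lower is better. The sketch below computes it with naive whitespace word splitting on a tiny sample. The exact evaluation corpus and word-segmentation protocol behind the table are not specified on this card, and "google/muril-base-cased" is an assumed checkpoint for MuRIL, so treat the printed numbers as illustrative only.

from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# A tiny illustrative sample; the card does not specify the evaluation corpus.
sample = ["मुझे यह presentation कल morning तक submit करना है।"]

# Swap in other checkpoints from the table to reproduce more comparisons.
for name in ["LingoIITGN/Ansh-256k", "google/muril-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {fertility(tok, sample):.3f}")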

Model Card Contact ✉️

Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in
