---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---
# CodeSearchNet Multilingual Tokenizer
A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.
## Model Details
### Model Description
This tokenizer is based on GPT-2's tokenizer but retrained specifically for source code across multiple programming languages. It provides more efficient tokenization for code compared to general-purpose tokenizers.
- **Model type:** BPE Tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Finetuned from:** GPT-2 tokenizer
## Uses
### Direct Use
This tokenizer is designed for preprocessing source code before training or inference with language models (a short preprocessing sketch follows the list below). It's particularly useful for:
- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants
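As a minimal preprocessing sketch (the snippets, padding choice, and maximum length below are illustrative assumptions, not requirements of this tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Illustrative code snippets to preprocess as a batch
snippets = [
    "def add(a, b):\n    return a + b",
    "function add(a, b) { return a + b; }",
]

# GPT-2-style tokenizers typically have no padding token; reuse EOS if that is the case here
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(snippets, padding=True, truncation=True, max_length=512)
print([len(ids) for ids in batch["input_ids"]])
```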
## Performance
Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves the following reductions (see the measurement sketch after the list):
- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average
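As a rough illustration of how such a comparison can be run (a sketch, not the exact evaluation script behind the numbers above), token counts from the two tokenizers can be compared on any snippet:

```python
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Any code snippet works here; this one is just an example
sample = "def greet(name):\n    return f'Hello, {name}!'"

n_code = len(code_tok.encode(sample))
n_gpt2 = len(gpt2_tok.encode(sample))
print(f"GPT-2: {n_gpt2} tokens | this tokenizer: {n_code} tokens "
      f"({100 * (1 - n_code / n_gpt2):.1f}% fewer)")
```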
## How to Get Started
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage: tokenize a short Java snippet
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)    # subword strings
token_ids = tokenizer.encode(code)   # integer token IDs
```
## Training Details
### Training Data
Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:
- ~2M code functions across 6 programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation
### Training Procedure
- **Base model:** GPT-2 tokenizer (50,257-token vocabulary)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all 6 languages in CodeSearchNet (see the retraining sketch below)
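A minimal sketch of this kind of retraining with the Hugging Face `train_new_from_iterator` API. The dataset name, per-language configs, and the `whole_func_string` field are assumptions about the CodeSearchNet layout on the Hub, not a verbatim copy of the original training script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from the GPT-2 tokenizer and relearn its BPE merges on code
old_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Assumed Hub layout: one config per language, code text in "whole_func_string"
languages = ["python", "java", "javascript", "php", "ruby", "go"]
datasets = [load_dataset("code_search_net", lang, split="train") for lang in languages]

def batch_iterator(batch_size=1000):
    for ds in datasets:
        for i in range(0, len(ds), batch_size):
            yield ds[i : i + batch_size]["whole_func_string"]

# Learn a new 64,000-token BPE vocabulary from the combined corpus
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=64000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```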
## Technical Specifications
### Model Architecture
- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000 (see the quick check below)
- **Special tokens:** Inherited from GPT-2 tokenizer
- **Subword handling:** Optimized for code syntax and patterns
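A quick way to confirm the vocabulary size and inspect the inherited special tokens after loading:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(len(tokenizer))                # expected vocabulary size: 64000
print(tokenizer.special_tokens_map)  # GPT-2-style special tokens, e.g. '<|endoftext|>'
```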
## Citation
```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}
```
## Dataset Reference
```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```