---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---
# CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from six programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.
## Model Details

### Model Description

This tokenizer is based on the GPT-2 tokenizer but retrained specifically on source code from multiple programming languages. It tokenizes code more efficiently than general-purpose tokenizers, producing noticeably shorter token sequences for the same input (see Performance below).

- **Model type:** BPE tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Retrained from:** GPT-2 tokenizer
## Uses

### Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models; a short preprocessing sketch follows the list below. It is particularly useful for:

- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants
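As an illustration (not from the original card), the sketch below batch-encodes a few code snippets into fixed-length tensors, the typical preprocessing step before a causal language model. The snippet strings are invented for the example, and the pad-token handling assumes the GPT-2 convention of shipping without a dedicated padding token.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# GPT-2-style tokenizers have no pad token by default; reuse EOS so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative snippets only.
snippets = [
    "def add(a, b):\n    return a + b",
    "function add(a, b) { return a + b; }",
]

batch = tokenizer(
    snippets,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, sequence_length)
```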
## Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves the following reductions in token count (a measurement sketch follows the list):

- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average
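As a rough illustration of how such a comparison can be measured (the evaluation corpus behind the numbers above is not reproduced here), the sketch below counts the tokens produced by both tokenizers on a single snippet; results on any one snippet will differ from the reported averages.

```python
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

snippet = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"

n_code = len(code_tok.encode(snippet))
n_gpt2 = len(gpt2_tok.encode(snippet))

print(f"code tokenizer: {n_code} tokens, GPT-2: {n_gpt2} tokens")
print(f"reduction: {100 * (n_gpt2 - n_code) / n_gpt2:.1f}%")
```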
## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage: tokenize a small Java snippet.
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)   # subword strings
token_ids = tokenizer.encode(code)  # integer ids
```
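Continuing the example above, the ids can be decoded back to text as a quick sanity check; byte-level BPE is lossless, so the round trip is expected to reproduce the original snippet.

```python
# Round-trip check (disable space clean-up so code whitespace is left untouched).
round_trip = tokenizer.decode(token_ids, clean_up_tokenization_spaces=False)
print(round_trip == code)
```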
## Training Details

### Training Data

Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:

- ~2M code functions across the six programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation
### Training Procedure

- **Base tokenizer:** GPT-2 tokenizer (50,257-token vocabulary)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all six languages in CodeSearchNet (see the retraining sketch after this list)
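The exact training script is not included in this card; the sketch below shows the standard `train_new_from_iterator` workflow for retraining a fast GPT-2 tokenizer. The Hub dataset id, column name, and single-language split are assumptions made for the example; in practice the functions from all six languages would be concatenated.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed dataset id/config; only the Python split is shown for brevity.
dataset = load_dataset("code_search_net", "python", split="train")

def batch_iterator(batch_size=1000):
    # Stream raw function bodies in batches to the tokenizer trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

base = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Learn a new 64k BPE vocabulary while keeping GPT-2's byte-level
# pre-tokenization and special tokens.
new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=64_000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```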
## Technical Specifications

### Model Architecture

- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000
- **Special tokens:** Inherited from the GPT-2 tokenizer (see the check after this list)
- **Subword handling:** Optimized for code syntax and patterns
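The inherited special-token configuration can be inspected directly. For a GPT-2-derived tokenizer the expectation is a single `<|endoftext|>` token serving as BOS/EOS/UNK, but the exact map should be confirmed against the released tokenizer files.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(len(tokenizer))                # vocabulary size (expected: 64,000)
print(tokenizer.special_tokens_map)  # e.g. {'bos_token': '<|endoftext|>', ...}
```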
## Citation

```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub}
}
```
## Dataset Reference

```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```