---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---
# CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from six programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.
## Model Details

### Model Description

This tokenizer is based on the GPT-2 tokenizer but retrained specifically on source code from multiple programming languages. It tokenizes code more efficiently than general-purpose tokenizers, producing noticeably shorter token sequences for the same input (see Performance below).

- **Model type:** BPE tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Retrained from:** GPT-2 tokenizer
## Uses

### Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models; a short preprocessing sketch follows the list below. It is particularly useful for:

- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants
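As an illustration (not from the original card), the sketch below batch-encodes a few code snippets into fixed-length tensors, the typical preprocessing step before a causal language model. The snippet strings are invented for the example, and the pad-token handling assumes the GPT-2 convention of shipping without a dedicated padding token.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# GPT-2-style tokenizers have no pad token by default; reuse EOS so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative snippets only.
snippets = [
    "def add(a, b):\n    return a + b",
    "function add(a, b) { return a + b; }",
]

batch = tokenizer(
    snippets,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, sequence_length)
```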
## Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves the following reductions in token count (a measurement sketch follows the list):

- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average
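As a rough illustration of how such a comparison can be measured (the evaluation corpus behind the numbers above is not reproduced here), the sketch below counts the tokens produced by both tokenizers on a single snippet; results on any one snippet will differ from the reported averages.

```python
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

snippet = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"

n_code = len(code_tok.encode(snippet))
n_gpt2 = len(gpt2_tok.encode(snippet))

print(f"code tokenizer: {n_code} tokens, GPT-2: {n_gpt2} tokens")
print(f"reduction: {100 * (n_gpt2 - n_code) / n_gpt2:.1f}%")
```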
## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage: tokenize a small Java snippet.
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)   # subword strings
token_ids = tokenizer.encode(code)  # integer ids
```
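Continuing the example above, the ids can be decoded back to text as a quick sanity check; byte-level BPE is lossless, so the round trip is expected to reproduce the original snippet.

```python
# Round-trip check (disable space clean-up so code whitespace is left untouched).
round_trip = tokenizer.decode(token_ids, clean_up_tokenization_spaces=False)
print(round_trip == code)
```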
## Training Details

### Training Data

Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:

- ~2M code functions across the six programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation
### Training Procedure

- **Base tokenizer:** GPT-2 tokenizer (50,257-token vocabulary)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all six languages in CodeSearchNet (see the retraining sketch after this list)
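The exact training script is not included in this card; the sketch below shows the standard `train_new_from_iterator` workflow for retraining a fast GPT-2 tokenizer. The Hub dataset id, column name, and single-language split are assumptions made for the example; in practice the functions from all six languages would be concatenated.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed dataset id/config; only the Python split is shown for brevity.
dataset = load_dataset("code_search_net", "python", split="train")

def batch_iterator(batch_size=1000):
    # Stream raw function bodies in batches to the tokenizer trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

base = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Learn a new 64k BPE vocabulary while keeping GPT-2's byte-level
# pre-tokenization and special tokens.
new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=64_000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```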
## Technical Specifications

### Model Architecture

- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000
- **Special tokens:** Inherited from the GPT-2 tokenizer (see the check after this list)
- **Subword handling:** Optimized for code syntax and patterns
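The inherited special-token configuration can be inspected directly. For a GPT-2-derived tokenizer the expectation is a single `<|endoftext|>` token serving as BOS/EOS/UNK, but the exact map should be confirmed against the released tokenizer files.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(len(tokenizer))                # vocabulary size (expected: 64,000)
print(tokenizer.special_tokens_map)  # e.g. {'bos_token': '<|endoftext|>', ...}
```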
## Citation

```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub}
}
```
## Dataset Reference

```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```