---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---

# CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.

## Model Details

### Model Description

This tokenizer is based on GPT-2's tokenizer but retrained specifically on source code from multiple programming languages. It tokenizes code more compactly than a general-purpose tokenizer; see the Performance section for per-language numbers.

- **Model type:** BPE Tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Finetuned from:** GPT-2 tokenizer

## Uses

### Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models (see the batch-preprocessing sketch after this list). It's particularly useful for:

- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants
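
A minimal sketch of batch preprocessing for model training. The sample snippets and the pad-token choice are assumptions made for illustration; GPT-2-style tokenizers ship without a padding token, so one is reused from EOS here:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# GPT-2-style tokenizers have no padding token; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

snippets = [
    "def add(a, b):\n    return a + b",
    "func Add(a, b int) int { return a + b }",
]

# Padded, truncated tensors ready to feed a language model.
batch = tokenizer(snippets, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(batch["input_ids"].shape)
```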

## Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves the following reductions (a side-by-side comparison sketch follows the list):

- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average  
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average
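
The snippet below is only an illustrative way to run your own side-by-side count against the stock GPT-2 tokenizer; the helper function and sample string are arbitrary, and this is not the script behind the numbers above:

```python
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

def token_reduction(sample: str) -> float:
    """Percentage of tokens saved relative to the stock GPT-2 tokenizer."""
    n_code = len(code_tok.encode(sample))
    n_gpt2 = len(gpt2_tok.encode(sample))
    return 100.0 * (n_gpt2 - n_code) / n_gpt2

sample = "def greet(name):\n    return f'Hello, {name}!'"
print(f"{token_reduction(sample):.1f}% fewer tokens than GPT-2")
```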

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example: tokenize a Java snippet
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)      # list of subword strings
token_ids = tokenizer.encode(code)     # list of vocabulary ids
```

## Training Details

### Training Data

Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:
- ~2M code functions across 6 programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation

### Training Procedure

- **Base model:** GPT-2 tokenizer (50,257 vocab)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all 6 languages in CodeSearchNet (the general recipe is sketched below)
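
The exact training script is not included here; the sketch below only illustrates the general recipe with `train_new_from_iterator` from `transformers`, using a tiny placeholder corpus in place of the ~2M CodeSearchNet functions:

```python
from transformers import AutoTokenizer

# Start from the stock GPT-2 tokenizer and learn a new 64k BPE vocabulary on code.
base = AutoTokenizer.from_pretrained("openai-community/gpt2")

functions = [
    "def add(a, b):\n    return a + b",
    "public int add(int a, int b) { return a + b; }",
]  # placeholder corpus; the real run used CodeSearchNet functions from all 6 languages

def batches(corpus, batch_size=1000):
    """Yield batches of raw source-code strings for the trainer."""
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = base.train_new_from_iterator(batches(functions), vocab_size=64000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```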

## Technical Specifications

### Model Architecture
- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000
- **Special tokens:** Inherited from GPT-2 tokenizer (see the quick check after this list)
- **Subword handling:** Optimized for code syntax and patterns
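
A quick check of these specifications once the tokenizer is loaded (outputs are indicative; GPT-2-style tokenizers typically expose only an `<|endoftext|>`-style special token):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(tokenizer.vocab_size)          # expected: 64000
print(tokenizer.special_tokens_map)  # special tokens inherited from GPT-2
```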

## Citation

```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}
```

## Dataset Reference

```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```