LeviDeHaan committed on
Commit 70345e9 · verified · 1 parent: 35d9ee2

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+smollm-security-nginx02-merged.gguf filter=lfs diff=lfs merge=lfs -text
Modelfile ADDED
@@ -0,0 +1,16 @@
# ollama modelfile auto-generated by llamafactory

FROM .

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
<|im_start|>assistant
{{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|>
{{ end }}{{ end }}"""

SYSTEM """You are a helpful AI assistant named SmolLM, trained by Hugging Face."""

PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
README.md ADDED
@@ -0,0 +1,344 @@
---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- security
- log-analysis
- threat-detection
- nginx
- text-classification
- lora
- cpu
- llama-cpp
language:
- en
library_name: transformers
pipeline_tag: text-classification
datasets:
- nginx_security
metrics:
- accuracy
model-index:
- name: SecInt-SmolLM2-360M-nginx
  results:
  - task:
      type: text-classification
      name: Security Log Classification
    metrics:
    - type: accuracy
      value: 99.0
      name: Accuracy
---

# SecInt-SmolLM2-360M-nginx

**SecInt** (Security Intelligence Monitor) is a fine-tuned SmolLM2-360M model for real-time nginx security log classification. This is the first model in the SecInt series, designed to automatically detect security threats, errors, and normal traffic patterns in web server logs.

## Model Overview

- **Base Model**: [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
- **Model Size**: 360M parameters (~691MB)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Multi-class text classification (3 classes)
- **Classes**: `hack`, `error`, `normal`
- **Inference**: CPU-optimized (~2GB RAM, 32 tokens/sec)
- **Format**: Safetensors + GGUF (llama.cpp compatible)

## Key Features

- **99%+ Accuracy** on production security logs
- **Real-time Detection**: <100ms latency per classification
- **CPU Inference**: No GPU required; runs on commodity hardware
- **Production-Tested**: Running in production since October 2025, processing logs from 8 domains
- **Lightweight**: Only ~2GB RAM needed
- **Fast**: 32 tokens/second on CPU

## Quick Start

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "LeviDeHaan/SecInt-SmolLM2-360M-nginx"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example log entry
log_entry = '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'

# System prompt with classification rules
system_prompt = """You are a security log analyzer. Classify the log entry as one of: hack, error, or normal.

HACK - Any of these patterns indicate an attack:
- Scanning for sensitive files: .env, .git, .php, config.php, wp-admin, phpmyadmin
- SQL injection attempts, XSS attempts
- Invalid login attempts, brute force, "invalid user", "failed password"
- Exploit attempts: /cgi-bin/, shell commands, malformed requests
- 403/404 errors with suspicious paths
- "access forbidden by rule" with .env, .git, admin, wp-, .php
- Scanner user-agents: sqlmap, nikto, zgrab, nuclei
- Webshell access attempts

ERROR - Application errors:
- 500 errors, crashes, exceptions
- SSL/TLS errors
- Database connection failures
- [emerg], [alert], [crit], [error] log levels

NORMAL - Everything else:
- 200/304 responses to legitimate paths
- Regular API calls, static files
- Known good bots: googlebot, facebookbot

Respond with only one word: hack, error, or normal."""

# Format prompt using chat template
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Classify this log entry as hack, error, or normal.\n\n{log_entry}"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate classification
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=10,
        temperature=0.01,
        top_p=0.38,
        top_k=10,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Extract result
result = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
print(f"Classification: {result}")  # Output: hack
```

### Using llama.cpp

The model includes a GGUF file for efficient CPU inference:

```bash
# Download the GGUF model
huggingface-cli download LeviDeHaan/SecInt-SmolLM2-360M-nginx smollm-security-nginx02-merged.gguf

# Run inference with llama.cpp
./llama-cli -m smollm-security-nginx02-merged.gguf \
    --temp 0.01 \
    --top-p 0.38 \
    --top-k 10 \
    --seed 42 \
    -p "<|im_start|>system\nYou are a security log analyzer...<|im_end|>\n<|im_start|>user\nClassify this log entry...<|im_end|>\n<|im_start|>assistant\n"
```

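If you prefer to stay in Python, the same GGUF file can be driven through the `llama-cpp-python` bindings. The following is a minimal sketch, not part of the repository: it assumes `pip install llama-cpp-python`, that the GGUF file sits in the working directory, and that you paste in the full system prompt from the Quick Start section.

```python
from llama_cpp import Llama

# Load the GGUF model; n_ctx matches the Modelfile's num_ctx.
# If the GGUF does not embed a chat template, pass chat_format="chatml".
llm = Llama(model_path="smollm-security-nginx02-merged.gguf", n_ctx=4096, verbose=False)

# Paste the complete system prompt from the Quick Start section here.
system_prompt = "You are a security log analyzer. Classify the log entry as one of: hack, error, or normal. ..."

log_entry = '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify this log entry as hack, error, or normal.\n\n{log_entry}"},
    ],
    temperature=0.01,
    top_p=0.38,
    top_k=10,
    max_tokens=10,
)
print(result["choices"][0]["message"]["content"].strip())  # expected: hack
```
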
## Training Details

### Dataset

- **Source**: Real production nginx logs from 8 domains
- **Total Examples**: 1,646 labeled samples
- **Class Distribution**:
  - `hack`: 800 examples (48.6%) - SQL injection, path traversal, scanner activity, exploit attempts
  - `error`: 46 examples (2.8%) - 500 errors, SSL failures, application crashes
  - `normal`: 800 examples (48.6%) - Legitimate traffic, API calls, static file requests

### LoRA Configuration

```yaml
LoRA Rank (r): 8
LoRA Alpha: 16
LoRA Dropout: 0.05
Target Modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
RSLoRA: enabled
```

### Training Hyperparameters

```yaml
Learning Rate: 2e-05
Scheduler: cosine_with_restarts
Warmup Steps: 5
Batch Size: 10 per device
Gradient Accumulation: 8 steps
Effective Batch Size: 80
Epochs: 10
Max Sequence Length: 2048 tokens
Optimizer: AdamW (betas=0.9,0.999, eps=1e-08)
Seed: 42
```

### Training Results

- **Training Duration**: ~50 minutes (210 steps)
- **Final Loss**: 0.2575
- **Throughput**: 3,121 tokens/second
- **Total Tokens**: 9.29M
- **Hardware**: CPU training (no GPU required)

## Use Cases

### Real-time Web Server Security Monitoring

SecInt is designed for integration into security monitoring systems to provide automated threat detection (a minimal integration loop is sketched after this list):

1. **Log Ingestion**: Monitor nginx access/error logs
2. **Classification**: Identify attacks, errors, and normal traffic
3. **Alerting**: Trigger notifications for security threats
4. **Analytics**: Track attack patterns and trends
5. **Response**: Feed into incident response workflows

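As a concrete illustration of steps 1-3, here is a minimal tail-and-classify loop. It is a sketch, not part of the SecInt distribution: `classify_log` stands in for whichever inference path you chose above, and the log path and alert handling are placeholders.

```python
import time

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path

def follow(path):
    """Yield lines appended to a log file, tail -f style."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # wait for new entries
                continue
            yield line.rstrip("\n")

for entry in follow(LOG_PATH):
    label = classify_log(entry)  # hypothetical wrapper around one of the model calls above
    if label == "hack":
        print(f"[ALERT] {entry}")  # step 3: hand off to your alerting system
    elif label == "error":
        print(f"[ERROR] {entry}")
```
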
### Typical Integration Architecture

```
nginx logs → Log Parser → SecInt Classifier → Alert System
                                ↓
                       Database Storage → Dashboard
```

### Detection Capabilities

The model can identify:

**Attack Patterns (hack)**:
- File/directory scanning (`.env`, `.git`, `config.php`, `wp-admin`, `phpmyadmin`)
- SQL injection (`UNION SELECT`, `OR 1=1`, etc.)
- Cross-site scripting (XSS) attempts
- Path traversal (`../../../`)
- Command injection attempts
- Known exploit attempts (PHPUnit RCE, ThinkPHP, etc.)
- Webshell access (c99, r57, alfa, wso)
- Scanner signatures (sqlmap, nikto, zgrab, nuclei)
- Brute force attacks (failed passwords, invalid users)
- Request obfuscation (null bytes, encoding tricks)

**Application Errors (error)**:
- HTTP 500 errors
- SSL/TLS handshake failures
- Application crashes and exceptions
- Database connection errors
- Critical log levels ([emerg], [alert], [crit])

**Normal Traffic (normal)**:
- HTTP 200/304 responses to legitimate paths
- API endpoints and authenticated requests
- Static file serving (CSS, JS, images)
- Known good bots (Googlebot, etc.)

## Performance Metrics

### Production Environment (October 2025)

- **Accuracy**: 99%+ on security logs
- **Inference Speed**: 32 tokens/second (CPU)
- **Latency**: <100ms per classification
- **Memory Usage**: ~2GB RAM
- **Uptime**: 99.9%+ (stable, no crashes)
- **Processing Rate**: 6-200 log entries per 60s batch
- **Attack Detection Rate**: ~36 attacks/hour average

### Optimization Features

When deployed in the full SecInt system:
- **Intelligent Caching**: 95%+ cache hit rate reduces redundant LLM calls (see the sketch after this list)
- **Session Tracking**: Sampling mode after 50 requests from the same IP
- **Whitelist Support**: Known-good traffic bypasses classification
- **Batch Processing**: Groups requests for efficient processing

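The caching and sampling logic lives in the surrounding SecInt system rather than in the model, but the core idea is easy to sketch. The snippet below is a hypothetical illustration: it normalizes volatile fields (IPs, timestamps) out of each log line so that repeated attack patterns hit the cache instead of triggering a fresh LLM call.

```python
import re
from functools import lru_cache

def normalize(entry: str) -> str:
    """Strip volatile fields so identical patterns share one cache key."""
    entry = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", "<ip>", entry)                   # IPv4 addresses
    entry = re.sub(r"\[\d{2}/\w{3}/\d{4}:[\d:]{8} [+-]\d{4}\]", "<ts>", entry)  # nginx timestamps
    return entry

@lru_cache(maxsize=4096)
def classify_cached(normalized_entry: str) -> str:
    return classify_log(normalized_entry)  # hypothetical model call, as in the examples above

def classify(entry: str) -> str:
    return classify_cached(normalize(entry))
```
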
## Recommended Inference Settings

For optimal security classification results:

```python
temperature = 0.01   # Very deterministic
max_tokens = 1024    # Upper bound only; the classification output is a single word
top_k = 10           # Limit vocabulary
top_p = 0.38         # Nucleus sampling
seed = 42            # Fixed for consistency
```

These settings ensure consistent, deterministic classification suitable for production security monitoring.

## Prompt Template

The model requires the SmolLM2 chat template format. **Critical**: Use the exact system prompt shown in the Quick Start section for best results. The system prompt contains:

1. Clear task definition
2. Detailed attack pattern definitions (HACK class)
3. Error pattern definitions (ERROR class)
4. Normal traffic definitions (NORMAL class)
5. Instruction to respond with a single word only

Deviation from this prompt format may significantly reduce accuracy.

## Limitations

- **nginx-Specific**: Trained exclusively on nginx log format; may require fine-tuning for Apache, IIS, or other web servers
- **Prompt-Dependent**: Requires the exact prompt template for optimal performance
- **CPU Inference**: Optimized for CPU; no GPU-specific optimizations
- **English Only**: Trained on English-language logs
- **Context Length**: Limited to 2048 tokens per log entry
- **Class Balance**: Fewer error examples (2.8%) may affect error detection sensitivity
- **No Multi-log Context**: Classifies individual log entries; does not correlate across multiple logs

## Model Architecture

Built on SmolLM2-360M-Instruct, a decoder-only transformer model optimized for instruction following:

- **Parameters**: 360M
- **Architecture**: Transformer decoder with grouped-query attention
- **Context Length**: 8192 positions (fine-tuned on sequences up to 2048 tokens)
- **Vocabulary Size**: 49,152 tokens
- **Base Training**: Pre-trained on a diverse text corpus, instruction-tuned

LoRA fine-tuning targets all attention and MLP projection layers for maximum adaptation to security log classification while maintaining base model knowledge.

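These figures can be read straight out of the shipped `config.json`. A quick check, assuming `transformers` is installed:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("LeviDeHaan/SecInt-SmolLM2-360M-nginx")
print(cfg.num_hidden_layers)        # 32
print(cfg.hidden_size)              # 960
print(cfg.num_attention_heads)      # 15 query heads
print(cfg.num_key_value_heads)      # 5 KV heads -> grouped-query attention, 3 query heads per KV head
print(cfg.max_position_embeddings)  # 8192
print(cfg.vocab_size)               # 49152
```
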
## Citation

If you use this model in your research or production systems, please cite:

```bibtex
@misc{secint-smollm2-nginx,
  author = {Levi DeHaan},
  title = {SecInt: SmolLM2-360M Fine-tuned for nginx Security Log Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx}}
}
```

## Acknowledgments

- **Hugging Face** for the SmolLM2-360M-Instruct base model
- **llama.cpp** team for efficient CPU inference capabilities
- **LLaMA-Factory** for the streamlined LoRA fine-tuning framework

## License

This model is released under the Apache 2.0 license, consistent with the base SmolLM2 model. You are free to use, modify, and distribute this model for commercial and non-commercial purposes.

## Project

SecInt is part of the **Security Intelligence Monitor v2** project, a comprehensive real-time security monitoring system for web servers. The full system includes:

- Multi-format log ingestion (nginx, Apache, custom)
- AI-powered threat classification
- Threat intelligence enrichment (GeoIP, Shodan)
- Breach detection (7+ detection rules)
- Real-time alerting (Pushover, email, webhooks)
- Interactive dashboard (Streamlit)
- Attack session management
- SQLite-based persistence and analytics

For more information about the full SecInt system, visit: [logwatcher project](https://github.com/LeviDeHaan/logwatcher)

## Model Card Contact

For questions, issues, or collaboration opportunities:
- **Hugging Face**: [@LeviDeHaan](https://huggingface.co/LeviDeHaan)
- **Model Repository**: [SecInt-SmolLM2-360M-nginx](https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx)
chat_template.jinja ADDED
@@ -0,0 +1,6 @@
{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
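This is the ChatML-style template that `tokenizer.apply_chat_template` applies. To inspect exactly what the model receives, the file can be rendered standalone; a small sketch, assuming `jinja2` is installed (Transformers renders chat templates with Jinja2 internally):

```python
from jinja2 import Template

template = Template(open("chat_template.jinja").read())
prompt = template.render(
    messages=[{"role": "user", "content": "Classify this log entry: ..."}],
    add_generation_prompt=True,
)
print(prompt)
# <|im_start|>system
# You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
# <|im_start|>user
# Classify this log entry: ...<|im_end|>
# <|im_start|>assistant
```
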
config.json ADDED
@@ -0,0 +1,38 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers.js_config": {
    "kv_cache_dtype": {
      "fp16": "float16",
      "q4f16": "float16"
    }
  },
  "transformers_version": "4.52.4",
  "use_cache": true,
  "vocab_size": 49152
}
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2,
  "transformers_version": "4.52.4"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:71fd019aa98aa6c25c5cf8f4c4ab16814504578c155944ce77bfd5b78d911da0
size 723674912
smollm-security-nginx02-merged.gguf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9fc5c2b3e948d21f4c27e3a13bb6a9be710a29ea7954a8470dee8f25df5b8c48
size 725553184
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "bos_token": {
    "content": "<|im_start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,156 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "bos_token": "<|im_start|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "extra_special_tokens": {},
  "model_max_length": 8192,
  "pad_token": "<|im_end|>",
  "padding_side": "left",
  "split_special_tokens": false,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
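The special-token wiring above (BOS `<|im_start|>`, EOS/PAD `<|im_end|>`, UNK `<|endoftext|>`) can be verified after loading; a short sanity check, assuming `transformers` is installed:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("LeviDeHaan/SecInt-SmolLM2-360M-nginx")
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <|im_start|> <|im_end|> <|endoftext|>
print(tok.convert_tokens_to_ids("<|im_end|>"))      # 2 (also the pad token id)
```
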
vocab.json ADDED
The diff for this file is too large to render. See raw diff