πŸ§ͺ CodeLLaMA Unit Test Generator β€” Full Merged Model (v2)

This is a merged model that combines codellama/CodeLlama-7b-hf with a LoRA adapter fine-tuned on embedded C/C++ code paired with high-quality unit tests written with GoogleTest and CppUTest. This version adds improved output formatting, stop tokens, and test cleanup mechanisms.
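
The merged checkpoint is the result of folding the LoRA adapter into the base weights. A minimal sketch of that step with PEFT is shown below; the adapter path is a placeholder, not the actual training artifact.

# Sketch: merging a LoRA adapter into its base model with PEFT.
# "path/to/lora-adapter" is a placeholder for the adapter produced during training.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = merged.merge_and_unload()  # folds the LoRA weights into the base model
merged.save_pretrained("codellama_utests_full_new_ver2")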

🎯 Use Cases

  • Generate comprehensive unit tests for embedded C/C++ functions
  • Focus on edge cases, boundary conditions, and error handling

🧠 Training Summary

  • Base model: codellama/CodeLlama-7b-hf
  • LoRA fine-tuned with:
    • Special tokens: <|system|>, <|user|>, <|assistant|>, // END_OF_TESTS
    • Instruction-style prompts
    • Explicit test output formatting
    • Test labels cleaned via regex that strips #include headers and main() (see the sketch after this list)
  • Datasets: athrv/Embedded_Unittest2
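
The regex-based label cleanup might look roughly like the sketch below; the exact patterns used during training are not published.

# Sketch of the cleanup described above: strip #include lines and any trailing
# main() function so that only TEST(...) bodies remain as training labels.
import re

def clean_test_label(code: str) -> str:
    # Drop #include / framework header lines
    code = re.sub(r'^\s*#include.*$', '', code, flags=re.MULTILINE)
    # Drop a trailing main() function, if present
    code = re.sub(r'int\s+main\s*\([^)]*\)\s*\{.*?\}\s*$', '', code, flags=re.DOTALL)
    return code.strip()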

πŸ“Œ Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Utkarsh524/codellama_utests_full_new_ver2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = """<|system|>
Generate comprehensive unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
<|user|>
Create tests for:
int add(int a, int b) { return a + b; }
<|assistant|>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("// END_OF_TESTS"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
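
If the output echoes the prompt or the stop marker, a small post-processing step (a sketch, not part of the model) keeps only the generated tests:

# Decode only the newly generated tokens and trim anything after the stop marker.
gen_tokens = outputs[0][inputs["input_ids"].shape[1]:]
tests = tokenizer.decode(gen_tokens, skip_special_tokens=True)
tests = tests.split("// END_OF_TESTS")[0].strip()
print(tests)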

Training & Optimization Details

  • Dataset: athrv/Embedded_Unittest2 (filtered for valid code-test pairs)
  • Preprocessing: token length filtering (≀4096), special token injection
  • Quantization: 8-bit (BitsAndBytesConfig) with llm_int8_threshold=6.0
  • LoRA config: r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj
  • Training: 4 epochs, batch size 4 (effective 8), lr=2e-4, FP16
  • Optimization: paged AdamW 8-bit, gradient checkpointing, custom data collator
  • Special tokens added: <|system|>, <|user|>, <|assistant|>, // END_OF_TESTS
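
As a reference, the quantization and LoRA settings above could be expressed with transformers and peft roughly as follows (a sketch, not the actual training script):

# Sketch: 8-bit quantization and LoRA configuration matching the table above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)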

Tips for Best Results

  • Temperature: 0.2–0.4
  • Top-p: 0.85–0.95
  • Max New Tokens: 256–2048 (512 is a good default); see the generation sketch after this list
  • Input Formatting:
    • Include complete function signatures
    • Remove unnecessary comments
    • Keep functions under 200 lines
    • For long functions, split into logical units
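
Applied to the example above, these settings might look like the following (values picked from the suggested ranges):

# Sampling-based generation using the recommended temperature and top-p ranges.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("// END_OF_TESTS"),
)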

Feedback & Citation

Dataset Credit: athrv/Embedded_Unittest2
Report Issues: via the model's Hugging Face page

Maintainer: Utkarsh524
Model Version: v2 (trained for 4 epochs)
