---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- codeparrot/github-code
- openai/humaneval
- google-research-datasets/mbpp
- deepmind/code_contests
language:
- code
- en
base_model: meta-llama/Llama-2-7b-hf
tags:
- code
- code-generation
- python
- javascript
- java
- cpp
- rust
- go
- lua
- typescript
- programming
- software-engineering
- code-completion
- code-translation
- debugging
- algorithm
pipeline_tag: text-generation
library_name: transformers
metrics:
- pass@1
- pass@10
- code_eval
model-index:
- name: Troviku-1.1
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval
      type: openai/humaneval
    metrics:
    - type: pass@1
      value: 72.0
      name: Pass@1
    - type: pass@10
      value: 89.0
      name: Pass@10
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: MBPP
      type: google-research-datasets/mbpp
    metrics:
    - type: pass@1
      value: 68.0
      name: Pass@1
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: CodeContests
      type: deepmind/code_contests
    metrics:
    - type: pass@1
      value: 45.0
      name: Pass@1
---

# Troviku-1.1

## Model Card

### Model Details

**Organization:** OpenTrouter  
**Model Type:** Autoregressive Transformer Language Model  
**Model Version:** 1.1.0  
**Release Date:** January 15, 2025  
**Model License:** Apache 2.0  
**Languages:** Multi-language (25+ programming languages)  
**Model Size:** 7 billion parameters  
**Context Length:** 8,192 tokens  
**Base Model:** Llama-2-7b-hf  


### Model Description

Troviku-1.1 is the first model in the Troviku series, a family of large language models built for code generation, code analysis, and software development tasks. It is a 7-billion-parameter decoder-only transformer trained on code repositories, technical documentation, and algorithmic implementations. On the benchmarks reported in the Performance section it reaches 72.0% pass@1 on HumanEval and 68.0% pass@1 on MBPP, ahead of the larger open models in the comparison table and behind GPT-4-turbo and Claude-3.5-Sonnet.

**Developed by:** OpenTrouter Research Team  
**Funded by:** OpenTrouter Inc., with compute support from cloud infrastructure partners  
**Model Family:** Troviku series  
**Base Architecture:** Transformer decoder with grouped-query attention  
**Training Framework:** PyTorch 2.1 with DeepSpeed ZeRO-3  
**Fine-tuning Methods:** Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)

### Intended Use

**Primary Use Cases:**
- Code generation and autocomplete in IDE environments
- Algorithm implementation and optimization
- Code translation between programming languages
- Debugging and error resolution assistance
- Technical documentation generation
- Code review and quality assessment
- Test case generation and validation
- Educational programming assistance

**Intended Users:**
- Professional software developers and engineers
- Computer science students and educators
- DevOps and infrastructure engineers
- Data scientists and ML engineers
- Open-source contributors
- Technical writers and documentation specialists

**Out-of-Scope Uses:**
- Generating malicious code, exploits, or malware
- Creating code for illegal activities or bypassing security measures
- Production-critical systems without human review and testing
- Medical diagnosis or treatment recommendation systems
- Legal document generation or legal advice
- Financial trading algorithms without regulatory compliance review
- Autonomous systems where failures could cause physical harm

## Training Data

### Data Sources

The model was trained on a carefully curated dataset comprising:

1. **The Stack v2 (50% of training data)**
   - Source: bigcode/the-stack-v2
   - Permissively licensed source code from GitHub
   - 3.8 million repositories across 600+ programming languages
   - Focus on top 25 languages with quality filtering
   - License: MIT, Apache 2.0, BSD-3-Clause

2. **GitHub Code Dataset (30% of training data)**
   - Source: codeparrot/github-code
   - Curated code snippets and functions
   - High-quality repositories with active maintenance
   - Filtered for code quality and documentation
   - License: Multiple open-source licenses

3. **Technical Documentation (10% of training data)**
   - Official language documentation (Python, JavaScript, Java, C++, etc.)
   - API references and SDK documentation
   - Framework and library documentation
   - License: CC BY 4.0, MIT, Apache 2.0

4. **Benchmark Datasets (5% of training data)**
   - HumanEval: openai/humaneval
   - MBPP: google-research-datasets/mbpp
   - CodeContests: deepmind/code_contests
   - License: MIT, Apache 2.0

5. **Educational Content (5% of training data)**
   - Programming tutorials and guides
   - Algorithm explanations and implementations
   - Stack Overflow posts under CC BY-SA 4.0
   - License: CC BY-SA 4.0

**Total Training Tokens:** 500 billion tokens  
**Training Duration:** 45 days on 512 NVIDIA A100 GPUs  
**Dataset Size:** Approximately 2.3 TB of text data  
**Languages Covered:** Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB
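
To make the mixture concrete, the sketch below splits the stated 500 billion tokens across the five sources. It assumes the percentages above are shares of tokens rather than of bytes or files, which the card does not specify; the dictionary keys are illustrative names, not identifiers from the training pipeline.

```python
# Token budget implied by the stated mixture (a sketch, not the actual data loader).
TOTAL_TOKENS = 500_000_000_000  # "Total Training Tokens" above

# Shares copied from the Data Sources list, in percent.
# Assumption: the shares refer to tokens.
MIXTURE_PERCENT = {
    "the_stack_v2": 50,
    "github_code": 30,
    "technical_docs": 10,
    "benchmarks": 5,
    "educational": 5,
}


def tokens_per_source(total_tokens: int, mixture_percent: dict[str, int]) -> dict[str, int]:
    """Split a token budget according to integer percentage shares."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture shares must sum to 100%")
    return {name: total_tokens * pct // 100 for name, pct in mixture_percent.items()}


budget = tokens_per_source(TOTAL_TOKENS, MIXTURE_PERCENT)
for name, tokens in budget.items():
    print(f"{name:>15}: {tokens / 1e9:.0f}B tokens")
```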

### Data Preprocessing

**Quality Filtering:**
- Removed repositories with fewer than 10 stars or with no activity for over 2 years
- Filtered out code with syntax errors or poor quality metrics
- Removed duplicates and near-duplicates using MinHash LSH
- Excluded code containing profanity, hate speech, or toxic content
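
The near-duplicate step can be illustrated with a small, dependency-free MinHash LSH sketch. The card names the technique but not its settings, so the shingle size, the 64 hash functions, and the 16 bands below are assumptions chosen for readability, and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the whitespace-normalized text."""
    norm = " ".join(text.split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def _seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item`; each seed acts as one independent hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value under each of `num_perm` hash functions."""
    return [min(_seeded_hash(seed, it) for it in items) for seed in range(num_perm)]


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Pairs of documents whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "def add(a, b):\n    return a + b",
    "b": "def add(a, b):\n        return a + b",  # same code, different indentation
    "c": "SELECT name FROM users WHERE id = 1",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = lsh_candidate_pairs(signatures)
print(pairs)
```

Documents `a` and `b` normalize to the same shingle set, so they land in the same buckets and are flagged as a candidate pair; `c` shares no shingles with them. A production pipeline would typically verify each candidate pair with an exact or estimated Jaccard similarity before dropping a file.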

**Privacy Protection:**
- Scanned for and removed personally identifiable information (PII)
- Filtered out API keys, passwords, and credentials
- Removed private email addresses and phone numbers
- Excluded internal company code and proprietary information
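
A credential and PII scan of this kind is often implemented as a set of regular expressions applied to each file. The sketch below is illustrative only: the three patterns and the placeholder strings are assumptions, not the filters actually used for Troviku-1.1, and real scanners cover many more secret formats.

```python
import re

# Illustrative patterns (assumptions, not the project's actual rule set).
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SECRET_ASSIGNMENT_RE = re.compile(
    r"""(?i)\b(password|passwd|secret|api_key|token)\b(\s*[:=]\s*)(["'])[^"'\n]{4,}\3"""
)


def _mask_secret(match: re.Match) -> str:
    """Keep the variable name and quoting, replace only the secret value."""
    return f"{match.group(1)}{match.group(2)}{match.group(3)}<REDACTED>{match.group(3)}"


def redact(source: str) -> tuple[str, dict[str, int]]:
    """Return the redacted text and the number of hits per pattern."""
    counts: dict[str, int] = {}
    source, counts["hardcoded_secret"] = SECRET_ASSIGNMENT_RE.subn(_mask_secret, source)
    source, counts["aws_access_key_id"] = AWS_KEY_RE.subn("<AWS_KEY>", source)
    source, counts["email"] = EMAIL_RE.subn("<EMAIL>", source)
    return source, counts


sample = (
    'API_KEY = "sk-test-1234567890"\n'
    'AWS = "AKIAIOSFODNN7EXAMPLE"\n'
    "# contact: jane.doe@example.com\n"
)
cleaned, counts = redact(sample)
print(cleaned)
print(counts)
```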

**License Compliance:**
- Verified all source code adheres to permissive open-source licenses
- Excluded GPL and other copyleft-licensed code to prevent license contamination
- Maintained attribution records for all training sources
- Regular audits to ensure compliance with license terms

**Bias Mitigation:**
- Balanced representation across programming languages
- Included code from diverse geographic regions and communities
- Filtered out code with discriminatory variable names or comments
- Ensured representation of different coding styles and paradigms

### Training Procedure

**Phase 1: Pretraining (35 days)**
- Objective: Causal language modeling on code corpus
- Batch size: 4 million tokens per batch
- Learning rate: 3e-4 with cosine decay
- Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8)
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision: bfloat16
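
The learning-rate schedule can be sketched as follows. The peak rate of 3e-4 and the 4-million-token batch come from the list above; the warmup length, the floor at 10% of the peak, and the assumption that all 500 billion tokens are consumed during pretraining are illustrative choices the card does not state.

```python
import math

TOTAL_TOKENS = 500_000_000_000   # stated total; assumed to be consumed in Phase 1
TOKENS_PER_BATCH = 4_000_000     # stated batch size
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_BATCH  # 125,000 optimizer steps


def cosine_lr(
    step: int,
    total_steps: int = TOTAL_STEPS,
    peak_lr: float = 3e-4,
    warmup_steps: int = 2_000,    # assumption: not given in the card
    min_lr_ratio: float = 0.1,    # assumption: not given in the card
) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `peak_lr * min_lr_ratio`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 2_000, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>7}: lr = {cosine_lr(step):.2e}")
```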

**Phase 2: Supervised Fine-tuning (7 days)**
- Dataset: 150,000 high-quality code examples with human annotations
- Focus areas: Code quality, security, best practices
- Task types: Generation, completion, translation, debugging
- Evaluation: Held-out validation set with expert review

**Phase 3: RLHF (3 days)**
- Reward model trained on 50,000 human preference comparisons
- PPO optimization with KL penalty (β=0.01)
- Focus: Code correctness, safety, and alignment with user intent
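
The card gives only the KL coefficient (β=0.01), not the exact reward formulation. One common way to apply a KL penalty in PPO-based RLHF is to subtract a per-token penalty proportional to the log-probability gap between the policy and a frozen reference model, and to add the reward model's score on the final token. The sketch below shows that shaping step under those assumptions; the function name and inputs are hypothetical.

```python
def kl_shaped_rewards(
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    reward_model_score: float,
    beta: float = 0.01,
) -> list[float]:
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the score on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must be non-empty and of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += reward_model_score  # sequence-level reward lands on the final token
    return rewards


# Toy log-probabilities for a three-token completion.
rewards = kl_shaped_rewards(
    policy_logprobs=[-1.0, -0.5, -2.0],
    ref_logprobs=[-1.2, -0.5, -1.5],
    reward_model_score=1.0,
)
print(rewards)
```

Tokens where the policy is more confident than the reference are penalized, and tokens where it is less confident receive a small bonus, which keeps the fine-tuned policy close to the SFT model.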

## Performance

### Benchmark Results

| Benchmark | Dataset | Metric | Score |
|-----------|---------|--------|-------|
| HumanEval | openai/humaneval | pass@1 | 72.0% |
| HumanEval | openai/humaneval | pass@10 | 89.0% |
| MBPP | google-research-datasets/mbpp | pass@1 | 68.0% |
| MBPP | google-research-datasets/mbpp | pass@10 | 84.0% |
| CodeContests | deepmind/code_contests | pass@1 | 45.0% |
| MultiPL-E | Python | pass@1 | 72.0% |
| MultiPL-E | JavaScript | pass@1 | 68.0% |
| MultiPL-E | Java | pass@1 | 65.0% |
| MultiPL-E | C++ | pass@1 | 61.0% |
| DS-1000 | Data Science | pass@1 | 58.0% |
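
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled completions passes a problem's unit tests. The card does not say how many samples per problem were drawn or which estimator was used; the sketch below shows the unbiased estimator commonly used with HumanEval, where `n` samples are generated per problem and `c` of them pass.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical per-problem results as (samples drawn, samples passing).
results = [(20, 5), (20, 0), (20, 20)]
for k in (1, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

The benchmark score is the mean of this per-problem estimate over all problems.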

### Performance by Language

| Language | Pass@1 | Pass@10 | Notes |
|----------|--------|---------|-------|
| Python | 72.0% | 88.0% | Strongest performance |
| JavaScript | 68.0% | 85.0% | Web development focused |
| TypeScript | 67.0% | 84.0% | Type-safe JS variant |
| Java | 65.0% | 82.0% | Enterprise applications |
| C++ | 61.0% | 78.0% | System programming |
| Rust | 58.0% | 75.0% | Memory safety focused |
| Go | 64.0% | 80.0% | Concurrent programming |
| Ruby | 59.0% | 74.0% | Web frameworks |
| PHP | 60.0% | 76.0% | Web development |
| Swift | 56.0% | 72.0% | iOS development |

### Comparison to Other Models

| Model | HumanEval Pass@1 | MBPP Pass@1 | Parameters |
|-------|------------------|-------------|------------|
| GPT-4-turbo | 84.0% | 80.0% | Unknown |
| Claude-3.5-Sonnet | 82.0% | 78.0% | Unknown |
| **Troviku-1.1** | **72.0%** | **68.0%** | **7B** |
| CodeLlama-34B | 68.0% | 62.0% | 34B |
| StarCoder2-15B | 66.0% | 60.0% | 15B |
| WizardCoder-15B | 64.0% | 58.0% | 15B |

## Quick Start

### Installation

```bash
pip install troviku-client transformers torch
```

### Using Transformers Library

```python
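import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the weights in bfloat16, the precision listed for this model.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "def calculate_fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds only the generated continuation;
# max_length would also count the prompt tokens.
outputs = model.generate(**inputs, max_new_tokens=200)
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)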
```

### Using Troviku Client

```python
from troviku_client import TrovikuClient, Language

client = TrovikuClient(api_key="your_api_key")

response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)

print(response.code)
```

### API Integration

```python
import requests

url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7
}

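# Set a timeout and raise on HTTP errors instead of printing an error body.
response = requests.post(url, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json())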
```

## Model Architecture

**Architecture Type:** Transformer Decoder  
**Number of Layers:** 32  
**Hidden Size:** 4096  
**Attention Heads:** 32  
**Key-Value Heads:** 8 (Grouped Query Attention)  
**Intermediate Size:** 14336  
**Activation Function:** SiLU (Swish)  
**Vocabulary Size:** 32,768 tokens  
**Positional Encoding:** RoPE (Rotary Position Embedding)  
**Normalization:** RMSNorm  
**Precision:** bfloat16
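
As a sanity check, the parameter count implied by these hyperparameters can be computed directly. The sketch assumes a gated feed-forward block with three projection matrices (the usual pairing with SiLU), no bias terms, and an output head that is not tied to the input embedding; none of these details are stated in the card.

```python
def count_parameters(
    layers: int = 32,
    hidden: int = 4096,
    heads: int = 32,
    kv_heads: int = 8,
    intermediate: int = 14336,
    vocab: int = 32768,
) -> int:
    """Parameter count of a decoder-only transformer with grouped-query attention."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim

    # Attention: query and output projections are hidden x hidden;
    # key and value projections shrink to hidden x kv_dim under GQA.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Assumption: gated MLP with gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden

    per_layer = attention + mlp + norms
    # Assumption: untied input embedding and output head, plus a final RMSNorm.
    embeddings = 2 * vocab * hidden + hidden
    return layers * per_layer + embeddings


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the total is about 7.25 billion, consistent with the stated 7-billion-parameter model size.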

## Hardware Requirements

### Minimum Requirements
- **GPU:** 16GB VRAM (e.g., NVIDIA RTX 4080, RTX A4000)
- **RAM:** 32GB system memory
- **Storage:** 20GB for model weights

### Recommended Requirements
- **GPU:** 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada)
- **RAM:** 64GB system memory
- **Storage:** 50GB for model, cache, and datasets

### Quantization Support
- **int8:** 8GB VRAM, 2x faster inference
- **int4:** 4GB VRAM, 4x faster inference
- **GPTQ:** Optimized 4-bit quantization
- **AWQ:** Activation-aware quantization
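
The VRAM figures above can be related to raw weight sizes with some simple arithmetic. The sketch below uses the nominal 7 billion parameters and the architecture values from this card; it counts only weights and the key-value cache, so actual usage is higher once activations, quantization metadata, and framework overhead are included.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Raw weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, seq_len: int, bytes_per_value: int = 2
) -> int:
    """Key-value cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


NUM_PARAMS = 7e9  # nominal model size from the card

for name, bits in [("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB of weights")

# Full 8,192-token context with a bfloat16 cache (2 bytes per value).
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=4096 // 32, seq_len=8192)
print(f"KV cache at 8,192 tokens: {cache / 2**30:.2f} GiB")
```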

## Limitations

### Technical Limitations
- Context window limited to 8,192 tokens
- May generate syntactically correct but logically flawed code
- Performance degrades on very specialized or proprietary frameworks
- Limited understanding of complex multi-file codebases
- May not always follow organization-specific coding standards
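
Because the prompt and the generated tokens share the 8,192-token window, long inputs have to be trimmed before generation. The sketch below shows one simple policy, keeping the most recent tokens, which suits code completion where the text nearest the cursor matters most. The function name is hypothetical and it operates on token IDs already produced by the tokenizer.

```python
CONTEXT_LENGTH = 8192  # context window stated in this card


def fit_to_context(
    prompt_ids: list[int], max_new_tokens: int, context_length: int = CONTEXT_LENGTH
) -> list[int]:
    """Drop the oldest prompt tokens so prompt + generation fits in the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    if len(prompt_ids) <= budget:
        return prompt_ids
    return prompt_ids[-budget:]  # keep the tail, closest to the completion point


long_prompt = list(range(10_000))  # stand-in for real token IDs
trimmed = fit_to_context(long_prompt, max_new_tokens=512)
print(len(trimmed))
```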

### Language-Specific Limitations
- Stronger performance on popular languages (Python, JavaScript, Java)
- Weaker performance on rare or legacy languages
- Limited knowledge of cutting-edge language features released after training cutoff
- May struggle with highly domain-specific DSLs

### Safety Considerations
- Generated code should always be reviewed by experienced developers
- Security-critical code requires thorough security audits
- May inadvertently suggest vulnerable code patterns
- Not suitable for safety-critical systems without extensive testing

### Bias Considerations
- May reflect biases present in training data (e.g., over-representation of certain coding styles)
- Training data predominantly from English-language repositories
- Potential underrepresentation of non-Western coding conventions
- May perpetuate historical biases in variable naming and comments

## Ethical Considerations

### Environmental Impact
- **Training Emissions:** Approximately 25 tons CO2 equivalent
- **Mitigation:** Used renewable energy data centers, carbon offset programs
- **Inference Efficiency:** Optimized for low-latency, energy-efficient deployment

### Attribution and Licensing
- All training data sourced from permissively licensed repositories
- Respects original authors' licensing terms
- Provides attribution capabilities in generated code comments
- Excludes copyleft-licensed code to prevent license contamination

### Dual-Use Concerns
The model could potentially be misused for:
- Generating malicious code or exploits
- Automating spam or phishing campaigns
- Creating code to circumvent security measures

**Mitigation Strategies:**
- Refusal training for malicious code generation requests
- Usage monitoring and rate limiting
- Terms of service enforcement
- Community reporting mechanisms
- Collaboration with security researchers

## License

This model is released under the **Apache License 2.0**.

### License Terms Summary
- **Commercial Use:** Permitted
- **Modification:** Permitted
- **Distribution:** Permitted
- **Patent Use:** Permitted
- **Private Use:** Permitted

**Conditions:**
- Include a copy of the license and the copyright notice
- State any changes made to modified files
- Retain attribution notices from the original work

**Limitations:**
- No trademark use
- No liability or warranty

See the [LICENSE](LICENSE) file for full details.

## Citation

If you use Troviku-1.1 in your research or projects, please cite:

```bibtex
@misc{troviku2025,
  title={Troviku-1.1: A Specialized Code Generation Model},
  author={OpenTrouter Research Team},
  year={2025},
  publisher={OpenTrouter},
  howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}},
  note={Apache License 2.0}
}
```

## Support and Community

- **Documentation:** [https://docs.opentrouter.ai/troviku](https://docs.opentrouter.ai/troviku)
- **Issues:** [GitHub Issues](https://github.com/OpenTrouter/Troviku-1.1/issues)
- **Discord:** [OpenTrouter Community](https://discord.gg/opentrouter)
- **Email:** support@opentrouter.ai
- **Twitter:** [@OpenTrouter](https://twitter.com/opentrouter)

## Acknowledgments

The Troviku team acknowledges:
- The open-source community for providing training data
- BigCode project for The Stack v2 dataset
- Hugging Face for infrastructure and hosting
- NVIDIA for compute support
- All contributors who helped with model evaluation and testing

## Version History

### v1.1.0 (Current - November 3, 2025)
- Initial release of the Troviku series
- Support for 25+ programming languages
- Optimized inference performance
- Enhanced code quality and safety features
- RLHF alignment for improved code generation

### Upcoming Features (v1.2.0)
- Extended context window to 16,384 tokens
- Improved multi-file code understanding
- Enhanced support for rare programming languages
- Better handling of code comments and documentation
- Integration with popular IDEs