---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---
# CodeReview Bench Leaderboard
A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).
## Features
- **Multi-Language Support**: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Language Comments**: Supports both Russian and English comment languages
- **Comprehensive Metrics**:
- LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
- Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots (see the sketch after this list)
- **Easy Submission**: Submit your model results via web interface
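The radar comparison mentioned above can also be reproduced offline. The snippet below is a minimal sketch (not the app's actual plotting code) that draws such a plot with matplotlib; the model names and scores are made up for illustration.
```python
import numpy as np
import matplotlib.pyplot as plt

dimensions = ["Readability", "Relevance", "Explanation Clarity", "Problem Identification",
              "Actionability", "Completeness", "Specificity", "Contextual Adequacy",
              "Consistency", "Brevity"]
scores = {  # hypothetical scores on the 0-10 quality scale
    "model-a": [8.5, 9.0, 7.8, 8.2, 8.7, 8.0, 7.5, 8.3, 8.8, 7.2],
    "model-b": [7.9, 8.4, 8.1, 7.6, 8.0, 8.5, 8.2, 7.9, 8.1, 8.6],
}

angles = np.linspace(0, 2 * np.pi, len(dimensions), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for name, vals in scores.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dimensions, fontsize=7)
ax.set_ylim(0, 10)
ax.legend(loc="upper right")
plt.show()
```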
## Metrics
### LLM-based Multimetric
- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
### Exact-Match Metrics
- **Pass@1**: Percentage of correct reviews on first attempt
- **Pass@5**: Percentage of correct reviews in top 5 attempts
- **Pass@10**: Percentage of correct reviews in top 10 attempts
- **BLEU@10**: BLEU score for top 10 review candidates
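To make the pass@k definitions concrete, the sketch below computes them from ranked candidates. It assumes each example comes with per-candidate correctness flags ordered by rank, which is a straightforward reading of the definitions above rather than the leaderboard's confirmed scoring code.
```python
from typing import List

def pass_at_k(per_example_correct: List[List[bool]], k: int) -> float:
    """Fraction of examples with at least one correct review among the top-k candidates.

    `per_example_correct[i][j]` is True if the j-th ranked candidate for example i
    is judged correct (exact match).
    """
    if not per_example_correct:
        return 0.0
    hits = sum(any(flags[:k]) for flags in per_example_correct)
    return hits / len(per_example_correct)

# Hypothetical correctness flags for three examples, ten candidates each
flags = [
    [False, True] + [False] * 8,   # correct at rank 2
    [True] + [False] * 9,          # correct at rank 1
    [False] * 10,                  # never correct
]
print(pass_at_k(flags, 1))   # 0.333...
print(pass_at_k(flags, 5))   # 0.666...
print(pass_at_k(flags, 10))  # 0.666...
```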
## Programming Languages Supported
- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other
## Comment Languages
- Russian (ru)
- English (en)
## Example Categories
- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other
## Installation
```bash
pip install -r requirements.txt
```
## Usage
```bash
python app.py
```
## Submission Format
Submit your results as a JSONL file where each line contains:
```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
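To assemble a submission file programmatically, write one JSON object per line. A minimal sketch (the file and variable names are illustrative, not mandated by the leaderboard):
```python
import json

results = [
    {
        "model_name": "your-model-name",
        "programming_language": "python",
        "comment_language": "en",
        "readability": 8.5, "relevance": 9.0, "explanation_clarity": 7.8,
        "problem_identification": 8.2, "actionability": 8.7, "completeness": 8.0,
        "specificity": 7.5, "contextual_adequacy": 8.3, "consistency": 8.8,
        "brevity": 7.2,
        "pass_at_1": 0.75, "pass_at_5": 0.88, "pass_at_10": 0.92,
        "bleu_at_10": 0.65,
        "total_evaluations": 100,
    },
]

# One JSON object per line -- this is the JSONL format expected above.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```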
## Environment Variables
The app reads its configuration from environment variables; set the ones required by your deployment before launching.
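The exact variable names are deployment-specific. As an illustration only (these names are assumptions, not the app's confirmed configuration), reading them might look like this:
```python
import os

# Hypothetical variable names -- check your deployment's actual configuration.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")       # key for the LLM-based judge, if used
HF_TOKEN = os.getenv("HF_TOKEN")                   # token for private dataset access, if needed
RESULTS_DATASET = os.getenv("RESULTS_DATASET", "") # dataset repo that stores submissions

if not OPENAI_API_KEY:
    print("Warning: OPENAI_API_KEY is not set; LLM-based metrics may be unavailable.")
```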
## Citation
If you use this leaderboard in your research, please cite the CodeReviewBench project.
## Additional Features
- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection
### Technical Improvements
- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display
## Metrics & Evaluation
### Performance Metrics
- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in single attempt (0.0-1.0)
- **Pass@5**: Success rate in 5 attempts (0.0-1.0)
- **Pass@10**: Success rate in 10 attempts (0.0-1.0)
### Quality Dimensions
1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant to the code changes?
3. **Explanation Clarity**: How well does it explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does it understand context?
9. **Consistency**: How consistent across different reviews?
10. **Brevity**: How concise without losing important information?
## Security Features
### Rate Limiting
- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
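A sliding-window limiter along these lines can enforce the 5-per-24-hours rule. This is a simplified in-memory sketch, not the app's actual implementation:
```python
import time
from collections import defaultdict
from typing import Dict, List, Optional

WINDOW_SECONDS = 24 * 60 * 60   # rolling 24-hour window
MAX_SUBMISSIONS = 5             # per IP, per window

_submissions: Dict[str, List[float]] = defaultdict(list)  # ip -> submission timestamps

def allow_submission(ip: str, now: Optional[float] = None) -> bool:
    """Return True (and record the attempt) if this IP still has quota in the window."""
    now = time.time() if now is None else now
    recent = [t for t in _submissions[ip] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_SUBMISSIONS:
        _submissions[ip] = recent
        return False
    recent.append(now)
    _submissions[ip] = recent
    return True
```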
### Data Validation
- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 β€ Pass@5 β€ Pass@10)**
- **Required field validation**
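Since the stack already uses Pydantic, checks like these can be expressed as a model. The sketch below mirrors the rules listed above; the class name is illustrative and only a subset of the quality fields is shown.
```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SubmissionRow(BaseModel):
    """Illustrative Pydantic v2 model mirroring the validation rules above."""
    model_config = ConfigDict(protected_namespaces=())  # allow the `model_name` field

    model_name: str = Field(min_length=1)
    # Quality dimensions use a 0-10 scale (other dimensions follow the same pattern).
    readability: float = Field(ge=0.0, le=10.0)
    brevity: float = Field(ge=0.0, le=10.0)
    # Performance metrics use a 0.0-1.0 scale.
    pass_at_1: float = Field(ge=0.0, le=1.0)
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)
    bleu_at_10: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_pass_monotonic(self) -> "SubmissionRow":
        # Logical consistency: Pass@1 <= Pass@5 <= Pass@10
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("pass@k values must be non-decreasing in k")
        return self
```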
### Audit Trail
- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**
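For the audit trail, one straightforward approach is to store a partially masked IP and a UTC timestamp with each accepted submission. A small sketch (the record layout is illustrative):
```python
import datetime

def mask_ip(ip: str) -> str:
    """Partially mask an IPv4 address for privacy, e.g. '203.0.113.42' -> '203.0.x.x'."""
    parts = ip.split(".")
    return ".".join(parts[:2] + ["x", "x"]) if len(parts) == 4 else "masked"

def audit_record(ip: str, model_name: str) -> dict:
    return {
        "model_name": model_name,
        "ip_masked": mask_ip(ip),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

print(audit_record("203.0.113.42", "your-model-name"))
```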
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration
## Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check the documentation
- Contact the maintainers
---
**Built with ❤️ for the code review research community**