---
title: CodeReviewBench
emoji: 😎
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---

# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).

## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Language Comments**: Supports both Russian and English comment languages
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots (see the sketch after this list)
- **Easy Submission**: Submit your model results via web interface
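
The leaderboard's plotting code is not shown in this README, but as an illustration, a radar comparison like the one described above can be sketched with Plotly's `Scatterpolar` (model names and scores here are made up):

```python
import plotly.graph_objects as go

# Hypothetical per-model scores on a few quality dimensions (0-10 scale).
dimensions = ["readability", "relevance", "actionability", "completeness", "brevity"]
scores = {
    "model-a": [8.5, 9.0, 8.7, 8.0, 7.2],
    "model-b": [7.9, 8.4, 8.1, 8.6, 8.0],
}

fig = go.Figure()
for model, values in scores.items():
    fig.add_trace(go.Scatterpolar(
        r=values + values[:1],              # repeat the first point to close the polygon
        theta=dimensions + dimensions[:1],
        name=model,
    ))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])))
fig.show()
```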

## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
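
The exact judge prompt is not part of this README; purely as an illustration, a per-sample scoring request covering these ten dimensions might be templated like this (the `{diff}` and `{review}` placeholders are hypothetical):

```python
# Hypothetical prompt template for an LLM judge; the real prompt used by the
# benchmark may differ in wording, scale, and output format.
JUDGE_PROMPT = """You are grading a code review comment.

Code diff:
{diff}

Review comment:
{review}

Score each dimension from 0 to 10 and reply as a JSON object with the keys:
readability, relevance, explanation_clarity, problem_identification,
actionability, completeness, specificity, contextual_adequacy,
consistency, brevity.
"""

prompt = JUDGE_PROMPT.format(diff="...", review="...")
```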

### Exact-Match Metrics

- **Pass@1**: Percentage of correct reviews on first attempt
- **Pass@5**: Percentage of correct reviews in top 5 attempts
- **Pass@10**: Percentage of correct reviews in top 10 attempts
- **BLEU@10**: BLEU score for top 10 review candidates
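
This README does not spell out how these are computed. A common choice, assumed in the sketch below, is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n sampled reviews of which c are correct, and, for BLEU@10, the best sentence-level BLEU among the candidates (here via `sacrebleu`):

```python
from math import comb

import sacrebleu  # assumed dependency; not listed in this README

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generated reviews is correct, given that c of the n are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def bleu_at_k(candidates: list[str], reference: str, k: int = 10) -> float:
    """One plausible reading of BLEU@10: best sentence BLEU over the top-k candidates."""
    return max(sacrebleu.sentence_bleu(c, [reference]).score / 100.0
               for c in candidates[:k])

print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # ~0.92
```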

## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)

## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```

## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
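
A conforming file can be produced with nothing more than the standard `json` module; the values below are placeholders copied from the example above:

```python
import json

# One record per evaluated configuration; keys mirror the example above.
records = [{
    "model_name": "your-model-name",
    "programming_language": "python",
    "comment_language": "en",
    "readability": 8.5, "relevance": 9.0, "explanation_clarity": 7.8,
    "problem_identification": 8.2, "actionability": 8.7, "completeness": 8.0,
    "specificity": 7.5, "contextual_adequacy": 8.3, "consistency": 8.8,
    "brevity": 7.2,
    "pass_at_1": 0.75, "pass_at_5": 0.88, "pass_at_10": 0.92,
    "bleu_at_10": 0.65, "total_evaluations": 100,
}]

with open("submission.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```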

## Environment Variables

Set the required environment variables before launching the app.


## ✨ Additional Features

- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection

### 🔧 Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display

## 📈 Metrics & Evaluation

### Performance Metrics

- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in single attempt (0.0-1.0)
- **Pass@5**: Success rate in 5 attempts (0.0-1.0)
- **Pass@10**: Success rate in 10 attempts (0.0-1.0)

### Quality Dimensions

1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant to the code changes?
3. **Explanation Clarity**: How well does it explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does it understand context?
9. **Consistency**: How consistent across different reviews?
10. **Brevity**: How concise without losing important information?

## 🔒 Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
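
The limiter's actual implementation is not included in this README; a minimal in-memory sketch of the 5-per-24-hours policy could look like this:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60
MAX_SUBMISSIONS = 5

_recent = defaultdict(list)  # ip -> timestamps of accepted submissions

def allow_submission(ip: str) -> bool:
    """Accept a submission only if the IP has fewer than MAX_SUBMISSIONS
    accepted submissions within the last 24 hours."""
    now = time.time()
    window = [t for t in _recent[ip] if now - t < WINDOW_SECONDS]
    if len(window) >= MAX_SUBMISSIONS:
        _recent[ip] = window
        return False
    window.append(now)
    _recent[ip] = window
    return True
```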

### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≀ Pass@5 ≀ Pass@10)**
- **Required field validation**
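
The actual schema is not reproduced here; below is a sketch of how these checks could be expressed with Pydantic v2, with field names taken from the submission format above (only a few fields shown):

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class Submission(BaseModel):
    # "model_name" collides with pydantic's protected "model_" namespace.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1, max_length=100)
    readability: float = Field(ge=0.0, le=10.0)  # quality scores: 0-10
    pass_at_1: float = Field(ge=0.0, le=1.0)     # performance scores: 0.0-1.0
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def pass_rates_monotonic(self) -> "Submission":
        # Logical consistency: Pass@1 <= Pass@5 <= Pass@10.
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected pass@1 <= pass@5 <= pass@10")
        return self

Submission.model_validate({"model_name": "m", "readability": 8.0,
                           "pass_at_1": 0.9, "pass_at_5": 0.5, "pass_at_10": 0.95})
# -> ValidationError: expected pass@1 <= pass@5 <= pass@10
```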

### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**
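
The masking scheme is not documented here; one simple illustration drops the final IPv4 octet before anything is written to the log:

```python
def mask_ip(ip: str) -> str:
    """Mask the last octet of an IPv4 address:
    '203.0.113.42' -> '203.0.113.xxx'."""
    parts = ip.split(".")
    if len(parts) == 4:          # leave non-IPv4 strings untouched
        parts[-1] = "xxx"
    return ".".join(parts)
```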

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## 📞 Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**