---
title: CodeReviewBench
emoji: 😎
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---
# CodeReview Bench Leaderboard
A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).
## Features
- **Multi-Language Support**: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual-Language Comments**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
- LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
- Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots (a plotting sketch follows this list)
- **Easy Submission**: Submit your model results via web interface
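The plotting code in `app.py` is not reproduced here; the snippet below is a minimal sketch of one way a per-model radar chart over the quality dimensions can be drawn with Plotly. The `radar_plot` helper, the example scores, and the fixed 0-10 axis are illustrative assumptions, not the app's actual API.

```python
import plotly.graph_objects as go

def radar_plot(scores: dict[str, float], model_name: str) -> go.Figure:
    """Plot one model's quality-dimension scores on a closed radar chart."""
    categories = list(scores)
    values = list(scores.values())
    fig = go.Figure(
        go.Scatterpolar(
            r=values + values[:1],              # repeat the first point to close the loop
            theta=categories + categories[:1],
            fill="toself",
            name=model_name,
        )
    )
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])), showlegend=True)
    return fig

# In a Gradio app the figure can be rendered with
# gr.Plot(value=radar_plot({"readability": 8.5, "relevance": 9.0, "brevity": 7.2}, "gpt-4o-mini"))
```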
## Metrics
### LLM-based Multimetric
- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
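The README does not spell out how these dimensions are scored; a common setup is to ask a judge model for a 0-10 score per dimension and parse the result. The sketch below assumes an OpenAI-compatible chat API and a hypothetical prompt; it is illustrative only, not the benchmark's actual judge.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible judge endpoint is configured

DIMENSIONS = [
    "readability", "relevance", "explanation_clarity", "problem_identification",
    "actionability", "completeness", "specificity", "contextual_adequacy",
    "consistency", "brevity",
]

def judge_review(client: OpenAI, diff: str, review: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model for a 0-10 score on every quality dimension (hypothetical prompt)."""
    prompt = (
        "Rate the following code review on each dimension from 0 to 10. "
        f"Reply with a JSON object whose keys are: {', '.join(DIMENSIONS)}.\n\n"
        f"Code change:\n{diff}\n\nReview:\n{review}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return {k: float(scores[k]) for k in DIMENSIONS}
```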
### Exact-Match Metrics
- **Pass@1**: Percentage of correct reviews on first attempt
- **Pass@5**: Percentage of correct reviews in top 5 attempts
- **Pass@10**: Percentage of correct reviews in top 10 attempts
- **BLEU@10**: BLEU score for top 10 review candidates
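The exact scoring code is not shown here; the sketch below follows the definitions above under two stated assumptions: a review counts as "correct" when it exactly matches the reference comment, and BLEU@10 is taken as the best sentence-level BLEU among the first ten candidates. Both are interpretations, not confirmed details of the benchmark.

```python
from nltk.translate.bleu_score import sentence_bleu  # pip install nltk

def pass_at_k(candidates: list[str], reference: str, k: int) -> float:
    """1.0 if any of the first k candidate reviews matches the reference exactly, else 0.0."""
    return float(any(c.strip() == reference.strip() for c in candidates[:k]))

def bleu_at_k(candidates: list[str], reference: str, k: int = 10) -> float:
    """Best sentence-level BLEU among the first k candidates."""
    ref_tokens = [reference.split()]
    return max(sentence_bleu(ref_tokens, c.split()) for c in candidates[:k])

# Leaderboard-level values are the mean over all examples, e.g.
# pass_at_5 = sum(pass_at_k(cands, ref, 5) for cands, ref in dataset) / len(dataset)
```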
## Programming Languages Supported
- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other
## Comment Languages
- Russian (ru)
- English (en)
## Example Categories
- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other
## Installation
```bash
pip install -r requirements.txt
```
## Usage
```bash
python app.py
```
## Submission Format
Submit your results as a JSONL file where each line contains:
```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
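As a sanity check before uploading, each row can be validated and serialized as JSON Lines locally. This is a minimal sketch, not the leaderboard's own loader; the field names are taken from the example above, and the range checks mirror the validation rules described under Security Features below.

```python
import json

def write_submission(rows: list[dict], path: str = "submission.jsonl") -> None:
    """Write one result object per line, with basic range checks on the scores."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Pass@k values are fractions in [0, 1] and must be non-decreasing in k;
            # LLM-based quality scores are on a 0-10 scale.
            assert 0.0 <= row["pass_at_1"] <= row["pass_at_5"] <= row["pass_at_10"] <= 1.0
            assert all(0.0 <= row[m] <= 10.0 for m in ("readability", "relevance", "brevity"))
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```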
## Environment Variables
The app reads its configuration from environment variables; set them before launching `app.py`.
## ✨ Leaderboard Features
- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection
### πŸ”§ Technical Improvements
- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display
## πŸ“ˆ Metrics & Evaluation
### Performance Metrics
- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in single attempt (0.0-1.0)
- **Pass@5**: Success rate in 5 attempts (0.0-1.0)
- **Pass@10**: Success rate in 10 attempts (0.0-1.0)
### Quality Dimensions
1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant to the code changes?
3. **Explanation Clarity**: How well does it explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does it understand context?
9. **Consistency**: How consistent across different reviews?
10. **Brevity**: How concise without losing important information?
## πŸ”’ Security Features
### Rate Limiting
- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
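The deployed limiter is not shown here, but the policy above (5 submissions per IP per 24 hours) can be expressed as a simple in-memory sliding window. The sketch below is an illustration under that assumption, not the app's actual implementation.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # 24-hour window
MAX_SUBMISSIONS = 5             # per IP within the window

_history: dict[str, deque] = defaultdict(deque)  # ip -> submission timestamps

def allow_submission(ip: str, now: float | None = None) -> bool:
    """Return True and record the attempt if the IP is under the limit, else False."""
    now = time.time() if now is None else now
    window = _history[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # forget submissions older than 24 hours
    if len(window) >= MAX_SUBMISSIONS:
        return False
    window.append(now)
    return True
```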
### Data Validation
- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≀ Pass@5 ≀ Pass@10)**
- **Required field validation**
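Pydantic is named as the validation layer under Technical Improvements; the model below is a hypothetical, trimmed-down illustration of how the range and consistency checks above could be expressed in Pydantic v2. The field subset and constraints are assumptions, not the app's full schema.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SubmissionRow(BaseModel):
    """Illustrative subset of the submission schema (not the app's full model)."""
    # Pydantic v2 reserves the "model_" prefix by default; this opts the field back in.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1, max_length=100)
    pass_at_1: float = Field(ge=0.0, le=1.0)
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)
    readability: float = Field(ge=0.0, le=10.0)  # quality scores use the 0-10 scale

    @model_validator(mode="after")
    def check_pass_monotonic(self) -> "SubmissionRow":
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected Pass@1 <= Pass@5 <= Pass@10")
        return self
```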
### Audit Trail
- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**
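How the IP addresses are partially masked is not specified; one common approach, shown here as a hypothetical sketch, is to keep only the network part of an IPv4 address and log a short digest for anything else.

```python
import hashlib

def mask_ip(ip: str) -> str:
    """Keep only the network part of an IPv4 address for the audit log (e.g. '203.0.113.xxx')."""
    parts = ip.split(".")
    if len(parts) == 4:
        return ".".join(parts[:3]) + ".xxx"
    # IPv6 or malformed input: log a short, non-reversible digest instead
    return hashlib.sha256(ip.encode()).hexdigest()[:12]
```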
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## πŸ“„ License
This project is licensed under the MIT License - see the LICENSE file for details.
## πŸ™ Acknowledgments
- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration
## πŸ“ž Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check the documentation
- Contact the maintainers
---
**Built with ❀️ for the code review research community**