---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---
# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).
## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages, including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Comment Languages**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots (a minimal plotting sketch follows this list)
- **Easy Submission**: Submit your model results via the web interface
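The radar comparison can also be reproduced outside the app. Below is a minimal sketch using Plotly's `Scatterpolar` trace; the model names and scores are purely illustrative, and the leaderboard's own plotting code may differ.

```python
import plotly.graph_objects as go

# Illustrative quality scores (0-10 scale) for two hypothetical models.
metrics = ["Readability", "Relevance", "Actionability", "Completeness", "Brevity"]
scores = {
    "model-a": [9.0, 8.7, 8.5, 8.2, 7.9],
    "model-b": [8.4, 8.9, 8.1, 8.6, 7.5],
}

fig = go.Figure()
for model, values in scores.items():
    fig.add_trace(go.Scatterpolar(
        r=values + values[:1],          # repeat the first point to close the polygon
        theta=metrics + metrics[:1],
        fill="toself",
        name=model,
    ))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])), showlegend=True)
fig.show()
```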
## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
### Exact-Match Metrics

- **Pass@1**: Percentage of correct reviews on the first attempt
- **Pass@5**: Percentage of correct reviews within the top 5 attempts
- **Pass@10**: Percentage of correct reviews within the top 10 attempts (a pass@k estimator sketch follows this list)
- **BLEU@10**: BLEU score over the top 10 review candidates
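The scoring script itself is not shown in this README, so treat the following only as a sketch: a common way to estimate Pass@k from n generated reviews per task, c of which are judged correct, is the unbiased estimator popularized by the HumanEval benchmark.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of them correct.

    Returns the probability that at least one of k samples drawn
    without replacement is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generated reviews for a task, 3 judged correct.
print(pass_at_k(n=10, c=3, k=1))   # ≈ 0.3
print(pass_at_k(n=10, c=3, k=5))   # ≈ 0.92
```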
## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)
## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other
## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```
## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
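For reference, a small helper like the one below can assemble such a file. `write_submission` is a hypothetical name and not part of this repository; only the field names come from the format above.

```python
import json

REQUIRED_FIELDS = {
    "model_name", "programming_language", "comment_language",
    "pass_at_1", "pass_at_5", "pass_at_10", "bleu_at_10", "total_evaluations",
}

def write_submission(rows: list[dict], path: str = "submission.jsonl") -> None:
    """Write one JSON object per line, checking that the core fields are present."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

write_submission([{
    "model_name": "your-model-name",
    "programming_language": "python",
    "comment_language": "en",
    "readability": 8.5, "relevance": 9.0, "explanation_clarity": 7.8,
    "problem_identification": 8.2, "actionability": 8.7, "completeness": 8.0,
    "specificity": 7.5, "contextual_adequacy": 8.3, "consistency": 8.8,
    "brevity": 7.2, "pass_at_1": 0.75, "pass_at_5": 0.88, "pass_at_10": 0.92,
    "bleu_at_10": 0.65, "total_evaluations": 100,
}])
```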
## Environment Variables

Set the following environment variables:

## Citation
## Additional Features

- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection
### Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display
## Metrics & Evaluation

### Performance Metrics

- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in a single attempt (0.0-1.0)
- **Pass@5**: Success rate within 5 attempts (0.0-1.0)
- **Pass@10**: Success rate within 10 attempts (0.0-1.0)
### Quality Dimensions

1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant are they to the code changes?
3. **Explanation Clarity**: How well does the review explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does it understand context?
9. **Consistency**: How consistent is the style across different reviews?
10. **Brevity**: How concise is the review without losing important information?
## Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours** (a sliding-window sketch follows this list)
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
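The repository's actual limiter is not shown in this README, so the following is only a sketch of the policy described above (5 submissions per IP per rolling 24 hours), kept in memory for simplicity.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # rolling 24-hour window
MAX_SUBMISSIONS = 5             # per IP within the window

_submissions: dict[str, deque] = defaultdict(deque)

def allow_submission(ip: str, now: float | None = None) -> bool:
    """Return True and record the attempt if this IP is under the limit."""
    now = time.time() if now is None else now
    history = _submissions[ip]
    # Drop timestamps that have aged out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    if len(history) >= MAX_SUBMISSIONS:
        return False
    history.append(now)
    return True
```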
### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance metrics, 0-10 for quality metrics)**
- **Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)** (a validation sketch follows this list)
- **Required field validation**
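The actual Pydantic models live in the application code and are not reproduced here. The sketch below, with a hypothetical `SubmissionRow` name and only a subset of the quality fields, illustrates how the range and consistency checks above can be expressed with Pydantic v2.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SubmissionRow(BaseModel):
    # "model_name" would otherwise trigger Pydantic's protected "model_" namespace warning.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1, max_length=100)
    programming_language: str
    comment_language: str
    # Quality metrics are scored 0-10; only two are shown here for brevity.
    readability: float = Field(ge=0, le=10)
    brevity: float = Field(ge=0, le=10)
    # Performance metrics are rates in [0.0, 1.0].
    pass_at_1: float = Field(ge=0, le=1)
    pass_at_5: float = Field(ge=0, le=1)
    pass_at_10: float = Field(ge=0, le=1)
    bleu_at_10: float = Field(ge=0, le=1)
    total_evaluations: int = Field(gt=0)

    @model_validator(mode="after")
    def check_pass_monotonic(self) -> "SubmissionRow":
        # More attempts can only help: Pass@1 <= Pass@5 <= Pass@10.
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected pass_at_1 <= pass_at_5 <= pass_at_10")
        return self
```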
### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy; an example masking convention follows this list)**
- **Timestamp recording**
- **Data integrity checks**
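How the masking works is not specified in this README; one common convention, shown purely as an assumption here, is to zero the last octet of an IPv4 address before logging.

```python
import json
import time

def mask_ip(ip: str) -> str:
    """Zero the last octet of an IPv4 address, e.g. 203.0.113.42 -> 203.0.113.0."""
    parts = ip.split(".")
    return ".".join(parts[:3] + ["0"]) if len(parts) == 4 else "masked"

def audit_entry(model_name: str, ip: str) -> str:
    # One JSON line per submission: timestamp, masked IP, and the submitted model.
    return json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "ip": mask_ip(ip),
        "model_name": model_name,
    })

print(audit_entry("your-model-name", "203.0.113.42"))
```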
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**