---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---

# CodeReview Bench Leaderboard
A comprehensive benchmark and leaderboard for code review generation models, inspired by CodeReviewBench.
## Features
- Multi-Language Support: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- Dual Language Comments: Supports both Russian and English comment languages
- Comprehensive Metrics:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- Interactive Visualization: Compare model performance across categories with radar plots (see the sketch below this list)
- Easy Submission: Submit your model results via web interface
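The radar-plot comparison mentioned above can also be reproduced outside the app. The snippet below is a minimal sketch using Plotly (which Gradio plots can wrap); the model names and per-category scores are made-up placeholders, not leaderboard data.

```python
import plotly.graph_objects as go

# Hypothetical per-category quality scores (0-10) for two illustrative models.
categories = ["Readability", "Relevance", "Actionability", "Completeness", "Brevity"]
scores = {
    "model-a": [8.5, 9.0, 8.7, 8.0, 7.2],
    "model-b": [7.9, 8.4, 8.1, 8.6, 8.0],
}

fig = go.Figure()
for name, values in scores.items():
    # Repeat the first point so the polygon closes.
    fig.add_trace(go.Scatterpolar(
        r=values + values[:1],
        theta=categories + categories[:1],
        fill="toself",
        name=name,
    ))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])), title="Per-category comparison")
fig.show()
```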
## Metrics

### LLM-based Multimetric
- Readability: How easy the review is to understand
- Relevance: How relevant the review is to the code
- Explanation Clarity: How clear the explanations are
- Problem Identification: How well problems are identified
- Actionability: How actionable the suggestions are
- Completeness: How complete the review is
- Specificity: How specific the feedback is
- Contextual Adequacy: How well the review fits the context
- Consistency: How consistent the review style is
- Brevity: How concise the review is
### Exact-Match Metrics
- Pass@1: Percentage of examples where the first generated review is correct
- Pass@5: Percentage of examples with at least one correct review among the top 5 candidates
- Pass@10: Percentage of examples with at least one correct review among the top 10 candidates
- BLEU@10: BLEU score computed over the top 10 review candidates
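For reference, Pass@k is usually reported with the unbiased estimator popularized by the HumanEval paper; whether this leaderboard uses exactly that formula is an assumption, but the sketch below shows the standard computation given n generated review candidates of which c are judged correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled candidates is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidates per review, 3 judged correct.
print(pass_at_k(10, 3, 1))   # 0.3
print(pass_at_k(10, 3, 5))   # ~0.917
```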
## Programming Languages Supported
- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other
## Comment Languages
- Russian (ru)
- English (en)
## Example Categories
- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other
## Installation

```bash
pip install -r requirements.txt
```
## Usage

```bash
python app.py
```
## Submission Format
Submit your results as a JSONL file where each line contains:
```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
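Since each line must be a standalone JSON object, a submission file can be assembled with a few lines of Python. This is only a sketch: the file name and the rows are illustrative, and just a subset of the fields from the example above is shown.

```python
import json

# Placeholder rows; fill in every metric field from the format above.
rows = [
    {
        "model_name": "your-model-name",
        "programming_language": "python",
        "comment_language": "en",
        "readability": 8.5,
        "pass_at_1": 0.75,
        "pass_at_5": 0.88,
        "pass_at_10": 0.92,
        "bleu_at_10": 0.65,
        "total_evaluations": 100,
    },
]

# One JSON object per line, no trailing commas, UTF-8 encoded.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```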
## Environment Variables
Set the following environment variables:
## Citation
## Enhanced Features
- Multi-tab Interface: Organized navigation with dedicated sections
- Advanced Filtering: Real-time filtering by multiple criteria
- Dark Theme: Modern, GitHub-inspired dark interface
- IP-based Submissions: Secure submission tracking
- Comprehensive Analytics: Detailed performance insights
- Data Export: Multiple export formats
- Rate Limiting: Anti-spam protection
## Technical Improvements
- Modular Architecture: Clean separation of concerns
- Type Safety: Full type annotations throughout
- Error Handling: Comprehensive error handling and logging
- Data Validation: Multi-layer validation with Pydantic
- Performance: Optimized data processing and display
## Metrics & Evaluation

### Performance Metrics
- BLEU: Text similarity score (0.0-1.0)
- Pass@1: Success rate in single attempt (0.0-1.0)
- Pass@5: Success rate in 5 attempts (0.0-1.0)
- Pass@10: Success rate in 10 attempts (0.0-1.0)
### Quality Dimensions
- Readability: How clear and readable are the reviews?
- Relevance: How relevant to the code changes?
- Explanation Clarity: How well does it explain issues?
- Problem Identification: How effectively does it identify problems?
- Actionability: How actionable are the suggestions?
- Completeness: How thorough are the reviews?
- Specificity: How specific are the comments?
- Contextual Adequacy: How well does it understand context?
- Consistency: How consistent across different reviews?
- Brevity: How concise without losing important information?
## Security Features

### Rate Limiting
- 5 submissions per IP per 24 hours
- Automatic IP tracking and logging
- Graceful error handling for rate limits
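The app's actual rate-limiting code is not part of this README; the sketch below shows one way the 5-per-24-hours rule could be enforced with an in-memory timestamp log (function and variable names are hypothetical).

```python
import time
from collections import defaultdict

MAX_SUBMISSIONS = 5
WINDOW_SECONDS = 24 * 60 * 60

# Submission timestamps per IP; a real deployment would persist these.
_history: dict[str, list[float]] = defaultdict(list)

def allow_submission(ip: str) -> bool:
    """Return True if this IP has made fewer than MAX_SUBMISSIONS in the window."""
    now = time.time()
    recent = [t for t in _history[ip] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_SUBMISSIONS:
        _history[ip] = recent
        return False
    recent.append(now)
    _history[ip] = recent
    return True
```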
### Data Validation
- Model name format validation
- Score range validation (0.0-1.0 for performance, 0-10 for quality)
- Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)
- Required field validation
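The README mentions Pydantic-based validation; the sketch below is one plausible shape for it, not the project's actual model. Field names follow the submission format above, and only a few of the score fields are included.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SubmissionRow(BaseModel):
    # "model_name" collides with Pydantic's protected "model_" namespace,
    # so the protection is relaxed to avoid a warning.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1)
    readability: float = Field(ge=0.0, le=10.0)   # quality scores: 0-10
    pass_at_1: float = Field(ge=0.0, le=1.0)      # performance scores: 0.0-1.0
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_pass_order(self):
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected Pass@1 <= Pass@5 <= Pass@10")
        return self
```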
### Audit Trail
- Complete submission logging
- IP address tracking (partially masked for privacy)
- Timestamp recording
- Data integrity checks
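Partial IP masking can be as simple as dropping the last octet before writing the audit log; the helper below is an illustrative sketch, not the app's implementation.

```python
def mask_ip(ip: str) -> str:
    """Mask the last octet of an IPv4 address, e.g. 203.0.113.42 -> 203.0.113.xxx."""
    parts = ip.split(".")
    if len(parts) == 4:
        parts[-1] = "xxx"
    return ".".join(parts)
```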
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Inspired by CodeReviewBench
- Built with Gradio for the web interface
- Thanks to the open-source community for tools and inspiration
## Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check the documentation
- Contact the maintainers
Built with ❤️ for the code review research community