---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---

# CodeReview Bench Leaderboard
A comprehensive benchmark and leaderboard for code review generation models, inspired by CodeReviewBench.
## Features
- Multi-Language Support: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- Dual Language Comments: Supports both Russian and English comment languages
- Comprehensive Metrics:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- Interactive Visualization: Compare model performance across categories with radar plots (see the sketch below this list)
- Easy Submission: Submit your model results via web interface
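The radar-plot comparison mentioned above can also be reproduced outside the app. The snippet below is a minimal sketch using Plotly (which Gradio plots can wrap); the model names and per-category scores are made-up placeholders, not leaderboard data.

```python
import plotly.graph_objects as go

# Hypothetical per-category quality scores (0-10) for two illustrative models.
categories = ["Readability", "Relevance", "Actionability", "Completeness", "Brevity"]
scores = {
    "model-a": [8.5, 9.0, 8.7, 8.0, 7.2],
    "model-b": [7.9, 8.4, 8.1, 8.6, 8.0],
}

fig = go.Figure()
for name, values in scores.items():
    # Repeat the first point so the polygon closes.
    fig.add_trace(go.Scatterpolar(
        r=values + values[:1],
        theta=categories + categories[:1],
        fill="toself",
        name=name,
    ))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])), title="Per-category comparison")
fig.show()
```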
## Metrics

### LLM-based Multimetric
- Readability: How easy the review is to understand
- Relevance: How relevant the review is to the code
- Explanation Clarity: How clear the explanations are
- Problem Identification: How well problems are identified
- Actionability: How actionable the suggestions are
- Completeness: How complete the review is
- Specificity: How specific the feedback is
- Contextual Adequacy: How well the review fits the context
- Consistency: How consistent the review style is
- Brevity: How concise the review is
### Exact-Match Metrics
- Pass@1: Percentage of examples where the first generated review is correct
- Pass@5: Percentage of examples with at least one correct review among the top 5 candidates
- Pass@10: Percentage of examples with at least one correct review among the top 10 candidates
- BLEU@10: BLEU score computed over the top 10 review candidates
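For reference, Pass@k is usually reported with the unbiased estimator popularized by the HumanEval paper; whether this leaderboard uses exactly that formula is an assumption, but the sketch below shows the standard computation given n generated review candidates of which c are judged correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled candidates is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidates per review, 3 judged correct.
print(pass_at_k(10, 3, 1))   # 0.3
print(pass_at_k(10, 3, 5))   # ~0.917
```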
## Programming Languages Supported
- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other
## Comment Languages
- Russian (ru)
- English (en)
## Example Categories
- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other
## Installation

```bash
pip install -r requirements.txt
```
## Usage

```bash
python app.py
```
## Submission Format
Submit your results as a JSONL file where each line contains:
```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
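Since each line must be a standalone JSON object, a submission file can be assembled with a few lines of Python. This is only a sketch: the file name and the rows are illustrative, and just a subset of the fields from the example above is shown.

```python
import json

# Placeholder rows; fill in every metric field from the format above.
rows = [
    {
        "model_name": "your-model-name",
        "programming_language": "python",
        "comment_language": "en",
        "readability": 8.5,
        "pass_at_1": 0.75,
        "pass_at_5": 0.88,
        "pass_at_10": 0.92,
        "bleu_at_10": 0.65,
        "total_evaluations": 100,
    },
]

# One JSON object per line, no trailing commas, UTF-8 encoded.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```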
## Environment Variables
Set the following environment variables:
## Citation
## Enhanced Features
- Multi-tab Interface: Organized navigation with dedicated sections
- Advanced Filtering: Real-time filtering by multiple criteria
- Dark Theme: Modern, GitHub-inspired dark interface
- IP-based Submissions: Secure submission tracking
- Comprehensive Analytics: Detailed performance insights
- Data Export: Multiple export formats
- Rate Limiting: Anti-spam protection
## Technical Improvements
- Modular Architecture: Clean separation of concerns
- Type Safety: Full type annotations throughout
- Error Handling: Comprehensive error handling and logging
- Data Validation: Multi-layer validation with Pydantic
- Performance: Optimized data processing and display
## Metrics & Evaluation

### Performance Metrics
- BLEU: Text similarity score (0.0-1.0)
- Pass@1: Success rate in single attempt (0.0-1.0)
- Pass@5: Success rate in 5 attempts (0.0-1.0)
- Pass@10: Success rate in 10 attempts (0.0-1.0)
### Quality Dimensions
- Readability: How clear and readable are the reviews?
- Relevance: How relevant to the code changes?
- Explanation Clarity: How well does it explain issues?
- Problem Identification: How effectively does it identify problems?
- Actionability: How actionable are the suggestions?
- Completeness: How thorough are the reviews?
- Specificity: How specific are the comments?
- Contextual Adequacy: How well does it understand context?
- Consistency: How consistent across different reviews?
- Brevity: How concise without losing important information?
## Security Features

### Rate Limiting
- 5 submissions per IP per 24 hours
- Automatic IP tracking and logging
- Graceful error handling for rate limits
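The app's actual rate-limiting code is not part of this README; the sketch below shows one way the 5-per-24-hours rule could be enforced with an in-memory timestamp log (function and variable names are hypothetical).

```python
import time
from collections import defaultdict

MAX_SUBMISSIONS = 5
WINDOW_SECONDS = 24 * 60 * 60

# Submission timestamps per IP; a real deployment would persist these.
_history: dict[str, list[float]] = defaultdict(list)

def allow_submission(ip: str) -> bool:
    """Return True if this IP has made fewer than MAX_SUBMISSIONS in the window."""
    now = time.time()
    recent = [t for t in _history[ip] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_SUBMISSIONS:
        _history[ip] = recent
        return False
    recent.append(now)
    _history[ip] = recent
    return True
```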
### Data Validation
- Model name format validation
- Score range validation (0.0-1.0 for performance, 0-10 for quality)
- Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)
- Required field validation
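The README mentions Pydantic-based validation; the sketch below is one plausible shape for it, not the project's actual model. Field names follow the submission format above, and only a few of the score fields are included.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SubmissionRow(BaseModel):
    # "model_name" collides with Pydantic's protected "model_" namespace,
    # so the protection is relaxed to avoid a warning.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1)
    readability: float = Field(ge=0.0, le=10.0)   # quality scores: 0-10
    pass_at_1: float = Field(ge=0.0, le=1.0)      # performance scores: 0.0-1.0
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_pass_order(self):
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected Pass@1 <= Pass@5 <= Pass@10")
        return self
```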
### Audit Trail
- Complete submission logging
- IP address tracking (partially masked for privacy)
- Timestamp recording
- Data integrity checks
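Partial IP masking can be as simple as dropping the last octet before writing the audit log; the helper below is an illustrative sketch, not the app's implementation.

```python
def mask_ip(ip: str) -> str:
    """Mask the last octet of an IPv4 address, e.g. 203.0.113.42 -> 203.0.113.xxx."""
    parts = ip.split(".")
    if len(parts) == 4:
        parts[-1] = "xxx"
    return ".".join(parts)
```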
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Inspired by CodeReviewBench
- Built with Gradio for the web interface
- Thanks to the open-source community for tools and inspiration
## Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check the documentation
- Contact the maintainers
Built with ❤️ for the code review research community