---
title: CodeReviewBench
emoji: 😎
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
  - openai/gpt-4o-mini
  - openai/gpt-4o
  - claude-3-7-sonnet
  - deepseek/deepseek-r1
---

CodeReviewBench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by the original CodeReviewBench benchmark.

Features

  • Multi-Language Support: Evaluates models across 17+ programming languages including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
  • Dual Comment Languages: Supports review comments in both Russian and English
  • Comprehensive Metrics:
    • LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
    • Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
  • Interactive Visualization: Compare model performance across categories with radar plots (see the plotting sketch after this list)
  • Easy Submission: Submit your model results via web interface
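
A minimal sketch of how such a radar comparison could be built. Plotly is an assumption here (a common choice for Gradio apps), and radar_figure is a hypothetical helper; the axes are the ten quality dimensions listed under Metrics below:

import plotly.graph_objects as go

# The ten LLM-judged quality dimensions, used as radar axes.
QUALITY_METRICS = [
    "readability", "relevance", "explanation_clarity",
    "problem_identification", "actionability", "completeness",
    "specificity", "contextual_adequacy", "consistency", "brevity",
]

def radar_figure(results: dict[str, dict[str, float]]) -> go.Figure:
    """Overlay one radar trace per model; results maps model name -> metric scores."""
    fig = go.Figure()
    for model_name, scores in results.items():
        fig.add_trace(go.Scatterpolar(
            r=[scores[m] for m in QUALITY_METRICS],
            theta=QUALITY_METRICS,
            fill="toself",
            name=model_name,
        ))
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 10])))
    return fig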

Metrics

LLM-based Multimetric

  • Readability: How easy the review is to understand
  • Relevance: How relevant the review is to the code
  • Explanation Clarity: How clear the explanations are
  • Problem Identification: How well problems are identified
  • Actionability: How actionable the suggestions are
  • Completeness: How complete the review is
  • Specificity: How specific the feedback is
  • Contextual Adequacy: How well the review fits the context
  • Consistency: How consistent the review style is
  • Brevity: How concise the review is
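
These dimensions are scored by an LLM judge. The sketch below shows one plausible shape for such a judge, assuming the OpenAI Python client; the prompt wording, the choice of gpt-4o-mini, and the judge_review helper are illustrative, not the Space's actual implementation:

import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the following code review on a 0-10 scale for each dimension:
readability, relevance, explanation_clarity, problem_identification, actionability,
completeness, specificity, contextual_adequacy, consistency, brevity.
Return a JSON object mapping each dimension name to its score.

Code diff:
{diff}

Review comment:
{review}"""

def judge_review(diff: str, review: str) -> dict[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(diff=diff, review=review)}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)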

Exact-Match Metrics

  • Pass@1: Fraction of examples where the first generated review is correct
  • Pass@5: Fraction of examples where at least one of the top 5 candidates is correct
  • Pass@10: Fraction of examples where at least one of the top 10 candidates is correct
  • BLEU@10: BLEU score over the top 10 review candidates
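
For reference, a short sketch of how these could be computed. The pass_at_k function is the standard unbiased estimator over n sampled candidates; treating BLEU@10 as the best sentence-level BLEU among the top 10 candidates, and the use of sacrebleu, are assumptions:

from math import comb

import sacrebleu

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n candidates generated, c of them judged correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct candidate
    return 1.0 - comb(n - c, k) / comb(n, k)

def bleu_at_10(candidates: list[str], reference: str) -> float:
    """Best sentence-level BLEU (scaled to 0-1) among up to 10 candidates."""
    return max(sacrebleu.sentence_bleu(c, [reference]).score for c in candidates[:10]) / 100.0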

Programming Languages Supported

  • Python
  • JavaScript
  • Java
  • C++
  • C#
  • TypeScript
  • Go
  • Rust
  • Swift
  • Kotlin
  • Ruby
  • PHP
  • C
  • Scala
  • R
  • Dart
  • Other

Comment Languages

  • Russian (ru)
  • English (en)

Example Categories

  • Bug Fix
  • Code Style
  • Performance
  • Security
  • Refactoring
  • Documentation
  • Testing
  • Architecture
  • Other

Installation

pip install -r requirements.txt

Usage

python app.py

Submission Format

Submit your results as a JSONL file where each line is a single JSON object with the following fields (pretty-printed here for readability):

{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
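
For example, a record like the one above could be appended to a submission file as follows (submission.jsonl is an illustrative file name):

import json

record = {
    "model_name": "your-model-name",
    "programming_language": "python",
    "comment_language": "en",
    # ... the ten quality scores as in the example above ...
    "pass_at_1": 0.75,
    "pass_at_5": 0.88,
    "pass_at_10": 0.92,
    "bleu_at_10": 0.65,
    "total_evaluations": 100,
}

with open("submission.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one compact object per line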

Environment Variables

Set the following environment variables:

✨ Interface Features

  • Multi-tab Interface: Organized navigation with dedicated sections
  • Advanced Filtering: Real-time filtering by multiple criteria
  • Dark Theme: Modern, GitHub-inspired dark interface
  • IP-based Submissions: Secure submission tracking
  • Comprehensive Analytics: Detailed performance insights
  • Data Export: Multiple export formats
  • Rate Limiting: Anti-spam protection

🔧 Technical Improvements

  • Modular Architecture: Clean separation of concerns
  • Type Safety: Full type annotations throughout
  • Error Handling: Comprehensive error handling and logging
  • Data Validation: Multi-layer validation with Pydantic
  • Performance: Optimized data processing and display

📈 Metrics & Evaluation

Performance Metrics

  • BLEU: Text similarity score (0.0-1.0)
  • Pass@1: Success rate in single attempt (0.0-1.0)
  • Pass@5: Success rate in 5 attempts (0.0-1.0)
  • Pass@10: Success rate in 10 attempts (0.0-1.0)

Quality Dimensions

  1. Readability: How clear and readable are the reviews?
  2. Relevance: How relevant to the code changes?
  3. Explanation Clarity: How well does it explain issues?
  4. Problem Identification: How effectively does it identify problems?
  5. Actionability: How actionable are the suggestions?
  6. Completeness: How thorough are the reviews?
  7. Specificity: How specific are the comments?
  8. Contextual Adequacy: How well does it understand context?
  9. Consistency: How consistent across different reviews?
  10. Brevity: How concise without losing important information?

🔒 Security Features

Rate Limiting

  • 5 submissions per IP per 24 hours
  • Automatic IP tracking and logging
  • Graceful error handling for rate limits
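
A minimal sketch of how such a limiter could work; the in-memory deque-per-IP approach and the check_rate_limit helper are illustrative, not the Space's actual mechanism:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60
MAX_SUBMISSIONS = 5

_recent: dict[str, deque] = defaultdict(deque)  # IP -> timestamps of recent submissions

def check_rate_limit(ip: str) -> bool:
    """Return True and record the attempt if this IP is under the limit."""
    now = time.time()
    window = _recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # evict timestamps older than 24 hours
    if len(window) >= MAX_SUBMISSIONS:
        return False
    window.append(now)
    return True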

Data Validation

  • Model name format validation
  • Score range validation (0.0-1.0 for performance, 0-10 for quality)
  • Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)
  • Required field validation
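
The README mentions Pydantic, so a validation model might look roughly like this (field names follow the submission format above; the exact model is an assumption):

from pydantic import BaseModel, ConfigDict, Field, model_validator

class Submission(BaseModel):
    # Allow the model_name field despite Pydantic's reserved "model_" namespace.
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(min_length=1)
    readability: float = Field(ge=0.0, le=10.0)
    # ... the other 0-10 quality fields, elided for brevity ...
    pass_at_1: float = Field(ge=0.0, le=1.0)
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)
    bleu_at_10: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_pass_monotonicity(self) -> "Submission":
        # By definition, adding candidates can only raise the pass rate.
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("pass@k must be non-decreasing in k")
        return self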

Audit Trail

  • Complete submission logging
  • IP address tracking (partially masked for privacy)
  • Timestamp recording
  • Data integrity checks
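
Masking an IP for the audit log could be as simple as dropping the last octet; mask_ip and log_submission are hypothetical helpers:

import logging
from datetime import datetime, timezone

logger = logging.getLogger("submissions")

def mask_ip(ip: str) -> str:
    """Mask the last octet of an IPv4 address, e.g. 203.0.113.42 -> 203.0.113.xxx."""
    parts = ip.split(".")
    return ".".join(parts[:-1] + ["xxx"]) if len(parts) == 4 else "masked"

def log_submission(ip: str, model_name: str) -> None:
    logger.info("submission model=%s ip=%s at=%s",
                model_name, mask_ip(ip), datetime.now(timezone.utc).isoformat())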

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by the original CodeReviewBench benchmark
  • Built with Gradio for the web interface
  • Thanks to the open-source community for tools and inspiration

📞 Support

For questions, issues, or contributions:

  • Open an issue on GitHub
  • Check the documentation
  • Contact the maintainers

Built with ❤️ for the code review research community