jmercat committed
Commit 3a9cbd7 · 1 Parent(s): 8daa4df

Fix HuggingFace Space configuration - add proper SDK settings and clean requirements

Files changed (2):
  1. README.md +22 -63
  2. requirements.txt +0 -9
README.md CHANGED
@@ -1,82 +1,41 @@
  ---
- title: OpenThoughts Model Benchmark Explorer
  emoji: 📊
  colorFrom: blue
  colorTo: red
  sdk: streamlit
  sdk_version: 1.28.0
- app_file: benchmark_explorer_app.py
  pinned: false
- license: mit
  ---

- # 🔬 OpenThoughts Evalchemy Benchmark Explorer
-
- Exploring correlations and relationships in LLM performance across different reasoning benchmarks.
- This explorer is built on top of the [OpenThoughts](https://github.com/open-thoughts/open-thoughts) project to explore the models we have trained and evaluated, as well as external models we have evaluated.
- All evaluation results were produced and logged using [Evalchemy](https://github.com/mlfoundations/evalchemy).

  ## Features

- ### 📊 Overview Dashboard
- - Key metrics and dataset statistics
- - Benchmark coverage visualization
- - Quick correlation insights
- - Category-based analysis
-
- ### 🔥 Interactive Heatmap
- - Multiple correlation methods (Pearson, Spearman, Kendall)
- - Interactive hover tooltips
- - Real-time correlation statistics
- - Distribution analysis
-
- ### 📈 Scatter Plot Explorer
- - Dynamic benchmark selection
- - Interactive scatter plots with regression lines
- - Multiple correlation coefficients
- - Data point exploration
-
- ### 🎯 Model Performance Analysis
- - Model search and filtering
- - Performance rankings
- - Radar chart comparisons
- - Side-by-side model analysis
-
- ### 📋 Statistical Summary
- - Comprehensive dataset statistics
- - Benchmark-wise analysis
- - Export capabilities
- - Correlation summaries
-
- ### 🔬 Uncertainty Analysis
- - Measurement precision analysis
- - Error bar visualizations with 95% CI
- - Signal-to-noise ratios
- - Uncertainty-aware correlations
-
- ## Benchmark Categories
-
- - **Math** (red): AIME24, AIME25, AMC23, MATH500
- - **Code** (blue): CodeElo, CodeForces, LiveCodeBench v2 & v5
- - **Science** (green): GPQADiamond, JEEBench
- - **General** (orange): MMLUPro, HLE
-
- ## Data Filtering Options
-
- - Category-based filtering
- - Zero-value filtering with threshold
- - Minimum coverage requirements
- - Dynamic slider ranges based on actual data
-
- ## Architecture
-
- - **Frontend**: Streamlit with Plotly interactive visualizations
- - **Backend**: Pandas/NumPy for data processing, SciPy for statistics
- - **Caching**: Smart caching for performance optimization
- - **Real-time**: On-the-fly correlation computation for dynamic filtering
-
- ## Usage
-
- The application automatically loads benchmark data and provides six specialized analysis modules. Use the sidebar controls to filter data and customize the analysis based on your needs.
-
- Perfect for researchers, practitioners, and anyone interested in understanding the relationships between different AI evaluation benchmarks.

  ---
+ title: OpenThoughts Benchmark Explorer
  emoji: 📊
  colorFrom: blue
  colorTo: red
  sdk: streamlit
  sdk_version: 1.28.0
+ app_file: app.py
  pinned: false
+ license: apache-2.0
  ---

+ # OpenThoughts Evalchemy Benchmark Explorer
+
+ A comprehensive web application for exploring OpenThoughts benchmark correlations and model performance.

  ## Features

+ - Interactive correlation heatmaps
+ - Scatter plot explorer with uncertainty analysis
+ - Model performance comparisons
+ - Statistical summaries and uncertainty analysis
+
+ ## Usage
+
+ The app automatically loads benchmark data and provides multiple views for analysis:
+
+ 1. **Overview Dashboard**: High-level summary of benchmarks and correlations
+ 2. **Interactive Heatmap**: Correlation matrix visualization
+ 3. **Scatter Explorer**: Detailed pairwise benchmark comparisons
+ 4. **Model Performance**: Individual model analysis
+ 5. **Statistical Summary**: Correlation statistics across methods
+ 6. **Uncertainty Analysis**: Measurement reliability analysis
+
+ ## Data Files
+
+ The app requires two CSV files:
+ - `comprehensive_benchmark_scores.csv`: Main benchmark scores
+ - `benchmark_standard_errors.csv`: Standard error estimates (optional)
+
+ These files should be in the root directory of the repository.
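As context for the README changes above: the correlation-heatmap view it describes (Pearson, Spearman, Kendall over per-model benchmark scores) boils down to a pandas correlation matrix. A minimal sketch, assuming a scores table with one row per model and one column per benchmark; the model names and values below are invented for illustration, and the real layout of `comprehensive_benchmark_scores.csv` may differ:

```python
import pandas as pd

# Invented scores in the shape the README implies for
# comprehensive_benchmark_scores.csv: rows are models, columns are
# benchmarks (benchmark names taken from the README).
scores = pd.DataFrame(
    {
        "AIME24": [0.30, 0.55, 0.70],
        "MATH500": [0.62, 0.81, 0.90],
        "GPQADiamond": [0.35, 0.48, 0.60],
    },
    index=["model-a", "model-b", "model-c"],
)

# Correlation matrix backing the heatmap; pandas supports all three
# methods the README lists: "pearson", "spearman", "kendall".
corr = scores.corr(method="spearman")
print(corr.round(2))
```

With these invented, strictly increasing columns every pairwise Spearman correlation is 1.0; on real data the off-diagonal entries are what the heatmap visualizes.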
requirements.txt CHANGED
@@ -1,12 +1,3 @@
- fastapi
- uvicorn
- requests
- sqlalchemy
- asyncpg
- aiohttp
- python-json-logger
- psycopg2-binary
- antlr4-python3-runtime==4.11
  streamlit>=1.28.0
  pandas>=2.0.0
  numpy>=1.24.0
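The README's uncertainty-analysis view mentions error bars with a 95% CI built from `benchmark_standard_errors.csv`. Under the usual normal approximation that interval is score ± 1.96 × SE. A minimal sketch of that step, with invented values; the real file layouts may differ:

```python
import pandas as pd

# Invented data mirroring the two CSV files the README names; rows are
# models, columns are benchmarks. Values are for illustration only.
scores = pd.DataFrame(
    {"AIME24": [0.30, 0.55], "MATH500": [0.62, 0.81]},
    index=["model-a", "model-b"],
)
se = pd.DataFrame(
    {"AIME24": [0.03, 0.02], "MATH500": [0.01, 0.01]},
    index=["model-a", "model-b"],
)

# 95% confidence interval under a normal approximation:
# score +/- 1.96 * standard error, computed elementwise.
half_width = 1.96 * se
lower = scores - half_width
upper = scores + half_width
print(lower.round(4))
print(upper.round(4))
```

The `lower`/`upper` frames are what an error-bar plot (e.g. Plotly's `error_y`) would consume for each model-benchmark pair.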