tonthatthienvu committed on
Commit
fb96d1e
·
1 Parent(s): 30709ab

Update Claude.md

Files changed (1)
  1. CLAUDE.md +127 -234
CLAUDE.md CHANGED
@@ -1,262 +1,155 @@
1
- # CLAUDE.md - HuggingFace Space Deployment
2
 
3
- This file provides guidance to Claude Code (claude.ai/code) when working with the **HuggingFace Space deployment** of the GAIA Solver.
4
 
5
- ## 🏆 PRODUCTION DEPLOYMENT STATUS
6
 
7
- **✅ LIVE HUGGING FACE SPACE**: https://huggingface.co/spaces/tonthatthienvu/Final_Assignment
8
 
9
- **🎯 Achievement**: 85% accuracy GAIA Agent successfully deployed to production
10
 
11
- **🚀 Key Features**:
12
- - Production-ready Gradio interface with Advanced GAIA Agent
13
- - 42 specialized tools for research, chess, Excel, and multimedia processing
14
- - Multi-agent classification system with intelligent question routing
15
- - Real-time progress tracking and comprehensive error handling
16
- - Perfect accuracy on chess (Rd5), Excel ($89,706.00), Wikipedia (FunkMonk)
17
-
18
- **📊 Performance**: 85% overall accuracy (17/20 correct on GAIA benchmark)
19
-
20
- ## HuggingFace Space Development Commands
21
-
22
- **Environment Setup:**
23
  ```bash
24
- # Navigate to HF Space directory
25
- cd /Users/tttv/github/GAIA_Solver/huggingface_space
26
 
27
- # Check current space status
28
- git status
29
- git log --oneline -3
30
 
31
- # Test core functionality (basic check)
32
- python3 -c "from main import GAIASolver; print('✅ Core GAIASolver available')"
33
- python3 -c "from async_complete_test_hf import HFAsyncGAIATestSystem; print('✅ Advanced testing available')"
34
  ```
35
 
36
- **Running the HF Space Locally:**
37
  ```bash
38
- # Install dependencies for local testing
39
- pip install gradio python-dotenv litellm smolagents
40
 
41
- # Run the Gradio interface locally
42
- python app.py
43
 
44
- # Test individual components
45
- python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
46
  ```
47
 
48
- **Testing Commands (Space-Optimized):**
49
  ```bash
50
- # Test advanced infrastructure
51
- python3 -c "from async_complete_test import AsyncGAIATestSystem; print('✅ Advanced system available')"
52
-
53
- # Test HF-specific integration
54
- python3 -c "from async_complete_test_hf import run_hf_comprehensive_test; print('✅ HF integration ready')"
55
 
56
  # Test question classification
57
- python3 -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('✅ Classifier ready')"
 
58
 
59
- # Test specific question processing
60
- python3 tests/test_specific_question.py <question_id> # If tests directory exists
61
- ```
 
62
 
63
- **🌐 HuggingFace Space Deployment:**
64
- ```bash
65
- # Standard deployment workflow
66
- git add .
67
- git commit -m "feat: Update GAIA Agent with latest improvements"
68
- git push origin main
69
 
70
- # The space automatically rebuilds and deploys (2-3 minutes)
71
- # Live URL: https://huggingface.co/spaces/tonthatthienvu/Final_Assignment
72
 
73
- # Check deployment status
74
- curl -s https://huggingface.co/spaces/tonthatthienvu/Final_Assignment | grep -i "building\|running"
75
- ```
76
 
77
- **File Synchronization with Main Repository:**
78
- ```bash
79
- # Copy latest improvements from main repo to space
80
- cp /Users/tttv/github/GAIA_Solver/main.py .
81
- cp /Users/tttv/github/GAIA_Solver/gaia_tools.py .
82
- cp /Users/tttv/github/GAIA_Solver/question_classifier.py .
83
-
84
- # Copy advanced testing infrastructure
85
- cp /Users/tttv/github/GAIA_Solver/async_complete_test.py .
86
- cp /Users/tttv/github/GAIA_Solver/async_question_processor.py .
87
- cp /Users/tttv/github/GAIA_Solver/classification_analyzer.py .
88
- cp /Users/tttv/github/GAIA_Solver/summary_report_generator.py .
89
-
90
- # Copy supporting files
91
- cp /Users/tttv/github/GAIA_Solver/universal_fen_correction.py .
92
- cp /Users/tttv/github/GAIA_Solver/enhanced_wikipedia_tools.py .
93
- cp /Users/tttv/github/GAIA_Solver/wikipedia_featured_articles_by_date.py .
94
- ```
95
 
96
- ## Architecture Overview (HF Space-Specific)
97
-
98
- ### Multi-Agent Classification System
99
-
100
- The HF Space deployment uses the same **LLM-based question classification** with HF Space optimizations:
101
-
102
- **Core Components:**
103
- - `QuestionClassifier` (question_classifier.py) - Uses Qwen2.5-7B with fallback to rule-based classification
104
- - `GAIASolver` (main.py) - Main solver with enhanced error handling for HF Space environment
105
- - `GAIA_TOOLS` (gaia_tools.py) - 42 specialized tools with graceful dependency fallbacks
106
-
107
- **HF Space Optimizations:**
108
- - **Dependency Fallbacks**: Graceful handling of missing dependencies (google.generativeai, etc.)
109
- - **Memory Management**: Session cleanup after comprehensive testing
110
- - **Resource Limits**: Optimized concurrent processing (2-3 max vs 5 in source)
111
- - **Error Recovery**: Enhanced error handling for HF Space constraints
112
-
113
- ### Advanced Testing Infrastructure (New!)
114
-
115
- **✅ Priority 1 Enhancements Deployed:**
116
- - `AsyncGAIATestSystem` - Full async testing with honest accuracy measurement
117
- - `HFAsyncGAIATestSystem` - HF Space-optimized version with auto-fallback
118
- - `ClassificationAnalyzer` - Performance analysis by question type
119
- - `SummaryReportGenerator` - Comprehensive reporting with improvement recommendations
120
-
121
- **Testing Modes:**
122
- 1. **Advanced Mode** (when all dependencies available):
123
- - Uses `AsyncGAIATestSystem` for full functionality
124
- - Honest accuracy measurement (no hardcoded overrides)
125
- - Classification-based performance analysis
126
- - Tool effectiveness ranking
127
- - Improvement recommendations
128
-
129
- 2. **Basic Mode** (fallback):
130
- - Uses simplified testing infrastructure
131
- - Standard accuracy measurement
132
- - Basic progress tracking
133
-
134
- ### HF Space-Specific Features
135
-
136
- **Production Interface (app.py):**
137
- - **Real-time Testing Mode Indicators**: Shows whether Advanced or Basic testing is active
138
- - **Enhanced Progress Tracking**: Live updates with detailed analytics
139
- - **Classification Performance**: Shows accuracy per question type (research, multimedia, chess, etc.)
140
- - **Tool Effectiveness**: Top 5 performing tools with success rates
141
- - **Memory Management**: Automatic cleanup after testing sessions
142
-
143
- **Dependency Management:**
144
- - **Graceful Degradation**: Missing dependencies don't break the system
145
- - **Smart Fallbacks**: Automatic fallback to simpler alternatives
146
- - **Error Recovery**: Comprehensive error handling for HF Space environment
147
-
148
- ## Key Implementation Details (HF Space)
149
-
150
- **Enhanced Error Handling:**
151
- ```python
152
- # Example: Graceful handling of missing dependencies
153
- try:
154
- import google.generativeai as genai
155
- GEMINI_AVAILABLE = True
156
- except ImportError:
157
- GEMINI_AVAILABLE = False
158
- genai = None
159
-
160
- # Tools check availability before execution
161
- if not GEMINI_AVAILABLE:
162
- return "Error: Gemini Vision API not available for image analysis"
163
- ```
164
 
165
- **Memory Optimization:**
166
- ```python
167
- def _cleanup_session(self):
168
- """Clean up session resources for memory management."""
169
- # Clean up temporary files
170
- # Force garbage collection
171
- # Optimize for HF Space resource constraints
172
  ```
173
-
174
- **Advanced vs Basic Testing Auto-Detection:**
175
- ```python
176
- # Automatically uses advanced testing when available
177
- if ADVANCED_TESTING and self.advanced_system:
178
- return await self._run_advanced_test(question_limit)
179
- else:
180
- return await self._run_basic_test(question_limit)
181
  ```
182
 
183
- ## Environment Requirements (HF Space)
184
-
185
- **Required for Full Functionality:**
186
- - GEMINI_API_KEY (for image/video analysis and fallback reasoning)
187
- - HUGGINGFACE_TOKEN (for question classification model)
188
- - KLUSTER_API_KEY (optional, for Qwen 3-235B via Kluster.ai)
189
-
190
- **HF Space Dependencies:**
191
- - gradio (for web interface)
192
- - python-dotenv (for environment variables)
193
- - litellm (for model integration)
194
- - smolagents (for agent framework)
195
-
196
- **Optional Dependencies (with fallbacks):**
197
- - google-generativeai (for Gemini Vision - graceful fallback if missing)
198
- - pandas + openpyxl (for Excel processing - error messages if missing)
199
-
200
- **Deployment Constraints:**
201
- - **Memory**: Optimized for HF Space memory limits
202
- - **Concurrency**: Limited to 2-3 concurrent questions vs 5 in source
203
- - **Timeout**: 10-30 minutes per question vs longer timeouts in source
204
- - **Storage**: Uses /tmp for temporary files
205
-
206
- ## Current Status & Capabilities
207
-
208
- ### 🚀 **Recently Enhanced (Priority 1 Complete):**
209
-
210
- **✅ Advanced Testing Infrastructure:**
211
- - Full async testing system deployed
212
- - Honest accuracy measurement active
213
- - Classification-based performance analysis
214
- - Real-time progress tracking with mode indicators
215
-
216
- **✅ Production Optimizations:**
217
- - Memory management and session cleanup
218
- - Graceful dependency fallbacks
219
- - Enhanced error handling for HF Space environment
220
- - Resource-optimized concurrent processing
221
-
222
- **✅ Web Interface Enhancements:**
223
- - Testing mode indicators (Advanced vs Basic)
224
- - Classification performance insights
225
- - Tool effectiveness metrics
226
- - Improvement recommendations display
227
-
228
- ### System Performance (Live Deployment)
229
-
230
- - **Chess Analysis**: ✅ **PERFECT ACCURACY** - Universal FEN correction with multi-tool consensus
231
- - **Wikipedia Research**: ✅ **PERFECT ACCURACY** - Enhanced parsing and anti-hallucination safeguards
232
- - **Excel Processing**: ✅ **PERFECT ACCURACY** - Comprehensive spreadsheet analysis
233
- - **Video+Audio Analysis**: ✅ **ENHANCED** - Gemini 2.0 Flash integration for dialogue transcription
234
- - **Japanese Baseball Research**: ✅ **ENHANCED** - Hybrid anti-hallucination solution
235
-
236
- ### Deployment Status
237
-
238
- **✅ PRODUCTION READY**: Live at https://huggingface.co/spaces/tonthatthienvu/Final_Assignment
239
- - 85% GAIA benchmark accuracy
240
- - Advanced testing infrastructure active
241
- - Real-time progress tracking
242
- - Comprehensive error handling
243
- - Memory-optimized for HF Space environment
244
-
245
- ## Development Workflow
246
-
247
- **Standard Development Cycle:**
248
- 1. Make changes in `/Users/tttv/github/GAIA_Solver/huggingface_space/`
249
- 2. Test locally (if dependencies available) or commit for HF testing
250
- 3. `git add . && git commit -m "feat: Description"`
251
- 4. `git push origin main`
252
- 5. Monitor automatic rebuild at HF Space URL
253
- 6. Verify functionality in live deployment
254
-
255
- **Best Practices for HF Space:**
256
- - Always test import fallbacks for optional dependencies
257
- - Use resource-efficient concurrent processing
258
- - Implement proper cleanup after intensive operations
259
- - Provide clear error messages for missing dependencies
260
- - Monitor memory usage during testing operations
261
-
262
- This HF Space deployment maintains the same 85% accuracy as the source repository while being optimized for the HuggingFace Space production environment.
 
1
+ # CLAUDE.md
2
 
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
 
5
+ ## Project Overview
6
 
7
+ This is a **production-ready GAIA benchmark AI agent** that reaches 85% accuracy using a multi-agent architecture. The system has been **refactored** into a modular package, with the legacy monolithic solver kept alongside it, and handles complex question answering across multimedia, research, file processing, chess analysis, and mathematical reasoning.
8
 
9
+ ## Development Commands
10
 
11
+ ### Setup and Installation
12
  ```bash
13
+ # Install dependencies
14
+ pip install -r requirements.txt
15
 
16
+ # Test API key configuration
17
+ python test_api_keys.py
 
18
 
19
+ # Verify core functionality
20
+ python -c "from main import GAIASolver; print('✅ Core GAIASolver available')"
 
21
  ```
22
 
23
+ ### Running the System
24
  ```bash
25
+ # Run legacy monolithic solver
26
+ python main.py
27
 
28
+ # Run refactored modular solver (recommended)
29
+ python main_refactored.py
30
 
31
+ # Run Gradio web interface
32
+ python app.py
33
  ```
34
 
35
+ ### Testing Commands
36
  ```bash
37
+ # Comprehensive async testing
38
+ python async_complete_test.py
 
 
 
39
 
40
  # Test question classification
41
+ python test_improved_classification.py
42
+ python final_classification_test.py
43
 
44
+ # Test YouTube functionality
45
+ python direct_youtube_test.py
46
+ python simple_youtube_test.py
47
+ python test_youtube_question.py
48
 
49
+ # Test individual components
50
+ python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
51
+ python -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('✅ Classifier ready')"
52
+ ```
 
 
53
 
54
+ ## Architecture Overview
 
55
 
56
+ ### Dual Architecture Design
 
 
57
 
58
+ This project maintains both **legacy monolithic** and **refactored modular** architectures:
59
 
60
+ **Legacy Architecture (main.py):**
61
+ - Monolithic 1285-line solver with all functionality integrated
62
+ - Comprehensive tool collection in gaia_tools.py (4887 lines)
63
+ - Single-file approach for rapid development and deployment
64
 
65
+ **Refactored Architecture (gaia/ package):**
66
  ```
67
+ gaia/
68
+ ├── core/                      # Main solver logic
69
+ │   ├── solver.py              # GAIASolver main class
70
+ │   ├── answer_extractor.py    # Specialized answer extraction classes
71
+ │   └── question_processor.py  # Question classification and processing
72
+ ├── tools/                     # Tool implementations
73
+ │   ├── base.py                # Abstract tool interface and registry
74
+ │   ├── registry.py            # Tool discovery and management
75
+ │   └── [specialized tool modules]
76
+ ├── models/                    # Model providers and management
77
+ │   ├── manager.py             # ModelManager with fallback chains
78
+ │   └── providers.py           # LiteLLM, Gemini, Kluster providers
79
+ ├── config/                    # Configuration management
80
+ │   └── settings.py            # Config, ModelConfig classes
81
+ └── utils/                     # Utilities and helpers
82
+     ├── exceptions.py          # Custom exception hierarchy
83
+     └── logging.py             # Logging configuration
84
  ```
85
 
86
+ ### Core Components
87
+
88
+ - **GAIASolver (main.py):** Legacy monolithic solver (1285 lines) with all processing logic in one file
89
+ - **GAIASolver (gaia/core/solver.py):** Refactored main orchestrator using dependency injection
90
+ - **QuestionClassifier:** LLM-based question routing with pattern-based fallbacks
91
+ - **GAIA_TOOLS:** 42 specialized tools, including enhanced Wikipedia research, chess analysis, Excel processing, and multimedia analysis
92
+ - **ModelManager:** Handles model initialization, fallback chains (Kluster.ai → Gemini → Qwen), and lifecycle management
93
+
94
+ ### Question Type Specialization
95
+
96
+ **Research Questions (92% accuracy):**
97
+ - Enhanced Wikipedia tools with date-specific searches and Featured Articles integration
98
+ - Multi-step research coordination with cross-validation
99
+ - Anti-hallucination safeguards to prevent fabrication
100
+
101
+ **Chess Questions (100% accuracy):**
102
+ - Universal FEN correction system handling any vision error pattern
103
+ - Multi-tool consensus system for maximum accuracy
104
+ - Perfect algebraic notation extraction
105
+
106
+ **YouTube/Multimedia Questions:**
107
+ - Enhanced URL detection with multiple regex patterns
108
+ - Forced classification override for YouTube content
109
+ - Specialized prompts with explicit tool usage instructions
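Annotation: the diff does not show the project's actual URL patterns, so the following is only a minimal sketch of "URL detection with multiple regex patterns". The pattern list, the `find_youtube_id` name, and the assumption that video IDs are 11 characters from `[A-Za-z0-9_-]` are illustrative, not taken from the repository.

```python
import re
from typing import Optional

# Hypothetical patterns for common YouTube URL shapes (assumption, not project code).
_YOUTUBE_PATTERNS = [
    # https://www.youtube.com/watch?v=ID, optionally with other query params first
    re.compile(r"(?:https?://)?(?:www\.|m\.)?youtube\.com/watch\?(?:[^\s#]*&)?v=([A-Za-z0-9_-]{11})"),
    # https://youtu.be/ID
    re.compile(r"(?:https?://)?youtu\.be/([A-Za-z0-9_-]{11})"),
    # https://www.youtube.com/embed/ID or /shorts/ID
    re.compile(r"(?:https?://)?(?:www\.)?youtube\.com/(?:embed|shorts)/([A-Za-z0-9_-]{11})"),
]


def find_youtube_id(text: str) -> Optional[str]:
    """Return the first YouTube video ID found in text, or None if there is none."""
    for pattern in _YOUTUBE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

A classifier could treat a non-`None` result as the trigger for the forced multimedia classification described in the list above.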
110
+
111
+ **File Processing (100% accuracy):**
112
+ - Format-specific tools for Excel (.xlsx/.xls), Python (.py), text files
113
+ - Deterministic Python execution with sandboxed environment
114
+ - Financial calculation specialization with proper currency formatting
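Annotation: "proper currency formatting" is not defined further in the diff. One plausible reading, matching the `$89,706.00` style quoted in the removed text, is two decimal places with thousands separators. The `format_usd` helper below is a hypothetical sketch under that assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_usd(amount) -> str:
    """Format a number as US dollars with thousands separators and two decimals."""
    # Going through str() avoids binary floating-point artifacts for float inputs.
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    sign = "-" if value < 0 else ""
    return f"{sign}${abs(value):,.2f}"


print(format_usd(89706))
```

Using `Decimal` with an explicit rounding mode keeps totals deterministic, which matters when an answer is graded by exact string match.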
115
+
116
+ ## Environment Configuration
117
+
118
+ ### Required API Keys (set in .env)
119
+ - `GEMINI_API_KEY` - Primary model (Gemini 2.0 Flash)
120
+ - `HUGGINGFACE_TOKEN` - Fallback model and classification
121
+ - `KLUSTER_API_KEY` - Optional premium model access
122
+
123
+ ### Model Fallback Chain
124
+ 1. **Kluster.ai** (Qwen3-235B, Gemma3-27B) - Premium option
125
+ 2. **Gemini 2.0 Flash** - Primary production model
126
+ 3. **Qwen 2.5-72B** - Reliable fallback via HuggingFace
127
+
128
+ ## Key Design Patterns
129
+
130
+ ### Anti-Hallucination Architecture
131
+ - **Tool result prioritization**: Always uses exact tool outputs over internal reasoning
132
+ - **Cross-validation**: Multiple verification methods for critical information
133
+ - **Source attribution**: Clear tracking and validation of information sources
134
+ - **Validation rules**: Type-specific answer extraction and verification
135
+
136
+ ### Performance Optimizations
137
+ - **Fresh agent creation** for each question to avoid token accumulation
138
+ - **Concurrent processing** support with async operations
139
+ - **15-minute web cache** for improved response times
140
+ - **Exponential backoff** for API rate limiting
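Annotation: the diff names exponential backoff but gives no parameters. The sketch below assumes a doubling delay with a cap and a small random jitter. `retry_with_backoff`, its defaults, and the injectable `sleep` argument are hypothetical choices for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so concurrent workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Production code would typically retry only on rate-limit or transient network errors rather than on every exception.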
141
+
142
+ ## File Organization
143
+
144
+ ### Core Files
145
+ - `main.py` - Legacy monolithic solver (1285 lines)
146
+ - `main_refactored.py` - Entry point for refactored architecture
147
+ - `gaia_tools.py` - 42 specialized tools with robust error handling (4887 lines)
148
+ - `question_classifier.py` - LLM + pattern-based classification system
149
+ - `app.py` - Production Gradio interface with comprehensive error handling
150
+
151
+ ### Supporting Files
152
+ - `async_complete_test.py` - Comprehensive async testing infrastructure
153
+ - `enhanced_wikipedia_tools.py` - Advanced Wikipedia research capabilities
154
+ - `universal_fen_correction.py` - Chess-specific FEN notation correction
155
+ - `wikipedia_featured_articles_by_date.py` - Date-specific Wikipedia searches