Commit fb96d1e · 1 parent: 30709ab
Update Claude.md

CLAUDE.md CHANGED
@@ -1,262 +1,155 @@
-# CLAUDE.md

-This file provides guidance to Claude Code (claude.ai/code) when working with

-##

-

-

-
-- Production-ready Gradio interface with Advanced GAIA Agent
-- 42 specialized tools for research, chess, Excel, and multimedia processing
-- Multi-agent classification system with intelligent question routing
-- Real-time progress tracking and comprehensive error handling
-- Perfect accuracy on chess (Rd5), Excel ($89,706.00), Wikipedia (FunkMonk)
-
-**Performance**: 85% overall accuracy (17/20 correct on GAIA benchmark)
-
-## HuggingFace Space Development Commands
-
-**Environment Setup:**
 ```bash
-#
-

-#
-
-git log --oneline -3

-#
-
-python3 -c "from async_complete_test_hf import HFAsyncGAIATestSystem; print('✅ Advanced testing available')"
 ```

-
 ```bash
-#
-

-# Run
-python

-#
-python
 ```

-
 ```bash
-#
-
-
-# Test HF-specific integration
-python3 -c "from async_complete_test_hf import run_hf_comprehensive_test; print('✅ HF integration ready')"

 # Test question classification
-

-# Test
-
-

-
-
-
-
-git commit -m "feat: Update GAIA Agent with latest improvements"
-git push origin main

-
-# Live URL: https://huggingface.co/spaces/tonthatthienvu/Final_Assignment

-
-curl -s https://huggingface.co/spaces/tonthatthienvu/Final_Assignment | grep -i "building\|running"
-```
-**
-```bash
-# Copy latest improvements from main repo to space
-cp /Users/tttv/github/GAIA_Solver/main.py .
-cp /Users/tttv/github/GAIA_Solver/gaia_tools.py .
-cp /Users/tttv/github/GAIA_Solver/question_classifier.py .
-
-# Copy advanced testing infrastructure
-cp /Users/tttv/github/GAIA_Solver/async_complete_test.py .
-cp /Users/tttv/github/GAIA_Solver/async_question_processor.py .
-cp /Users/tttv/github/GAIA_Solver/classification_analyzer.py .
-cp /Users/tttv/github/GAIA_Solver/summary_report_generator.py .
-
-# Copy supporting files
-cp /Users/tttv/github/GAIA_Solver/universal_fen_correction.py .
-cp /Users/tttv/github/GAIA_Solver/enhanced_wikipedia_tools.py .
-cp /Users/tttv/github/GAIA_Solver/wikipedia_featured_articles_by_date.py .
-```
-
-
-
-
-The HF Space deployment uses the same **LLM-based question classification** with HF Space optimizations:
-
-**Core Components:**
-- `QuestionClassifier` (question_classifier.py) - Uses Qwen2.5-7B with fallback to rule-based classification
-- `GAIASolver` (main.py) - Main solver with enhanced error handling for HF Space environment
-- `GAIA_TOOLS` (gaia_tools.py) - 42 specialized tools with graceful dependency fallbacks
-
-**HF Space Optimizations:**
-- **Dependency Fallbacks**: Graceful handling of missing dependencies (google.generativeai, etc.)
-- **Memory Management**: Session cleanup after comprehensive testing
-- **Resource Limits**: Optimized concurrent processing (2-3 max vs 5 in source)
-- **Error Recovery**: Enhanced error handling for HF Space constraints
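
The resource-limit note above (2-3 concurrent tasks in the Space versus 5 in the source repo) can be illustrated with a bounded-concurrency sketch. This is not the project's code: `process_question`, `run_bounded`, and `MAX_CONCURRENT` are hypothetical names, and the per-question work is replaced by a short sleep.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed cap, mirroring the "2-3 max" note above


async def process_question(question: str) -> str:
    """Hypothetical stand-in for the real per-question solver call."""
    await asyncio.sleep(0.01)
    return f"answer:{question}"


async def run_bounded(questions: list[str], limit: int = MAX_CONCURRENT) -> list[str]:
    """Process all questions with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(question: str) -> str:
        # Each worker waits here until one of the `limit` slots is free.
        async with semaphore:
            return await process_question(question)

    # gather returns results in the same order as the input questions.
    return await asyncio.gather(*(worker(q) for q in questions))


if __name__ == "__main__":
    print(asyncio.run(run_bounded(["q1", "q2", "q3", "q4"])))
```

A semaphore keeps the calling code unchanged when the cap is tuned, which is one plausible way to run the same test harness with different limits in the two environments.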
-
-### Advanced Testing Infrastructure (New!)
-
-**✅ Priority 1 Enhancements Deployed:**
-- `AsyncGAIATestSystem` - Full async testing with honest accuracy measurement
-- `HFAsyncGAIATestSystem` - HF Space-optimized version with auto-fallback
-- `ClassificationAnalyzer` - Performance analysis by question type
-- `SummaryReportGenerator` - Comprehensive reporting with improvement recommendations
-
-**Testing Modes:**
-1. **Advanced Mode** (when all dependencies available):
-   - Uses `AsyncGAIATestSystem` for full functionality
-   - Honest accuracy measurement (no hardcoded overrides)
-   - Classification-based performance analysis
-   - Tool effectiveness ranking
-   - Improvement recommendations
-
-2. **Basic Mode** (fallback):
-   - Uses simplified testing infrastructure
-   - Standard accuracy measurement
-   - Basic progress tracking
-
-### HF Space-Specific Features
-
-**Production Interface (app.py):**
-- **Real-time Testing Mode Indicators**: Shows whether Advanced or Basic testing is active
-- **Enhanced Progress Tracking**: Live updates with detailed analytics
-- **Classification Performance**: Shows accuracy per question type (research, multimedia, chess, etc.)
-- **Tool Effectiveness**: Top 5 performing tools with success rates
-- **Memory Management**: Automatic cleanup after testing sessions
-
-**Dependency Management:**
-- **Graceful Degradation**: Missing dependencies don't break the system
-- **Smart Fallbacks**: Automatic fallback to simpler alternatives
-- **Error Recovery**: Comprehensive error handling for HF Space environment
-
-## Key Implementation Details (HF Space)
-
-**Enhanced Error Handling:**
-```python
-# Example: Graceful handling of missing dependencies
-try:
-    import google.generativeai as genai
-    GEMINI_AVAILABLE = True
-except ImportError:
-    GEMINI_AVAILABLE = False
-    genai = None
-
-# Tools check availability before execution
-if not GEMINI_AVAILABLE:
-    return "Error: Gemini Vision API not available for image analysis"
-```
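
The removed snippet above is a fragment: its `return` sits outside any function. A self-contained version of the same idea might look like the sketch below. `optional_import` and `analyze_image` are hypothetical names introduced for illustration, and the actual Gemini call is left out because the source does not show it.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Same flag and module name as in the snippet above.
genai = optional_import("google.generativeai")
GEMINI_AVAILABLE = genai is not None


def analyze_image(image_path: str) -> str:
    """Hypothetical tool entry point that degrades gracefully."""
    if not GEMINI_AVAILABLE:
        return "Error: Gemini Vision API not available for image analysis"
    # The real Gemini call would go here; it is omitted because its
    # details are not shown in the source.
    return f"Gemini analysis requested for {image_path}"
```

Returning an error string instead of raising keeps the tool usable by an agent loop that only inspects tool output, which appears to be the intent of the original fragment.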

-**
-```python
-def _cleanup_session(self):
-    """Clean up session resources for memory management."""
-    # Clean up temporary files
-    # Force garbage collection
-    # Optimize for HF Space resource constraints
 ```
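
The `_cleanup_session` stub above lists its steps only as comments. One way those steps could be filled in is sketched here, under the assumption that a session owns a scratch directory and an in-memory result list. `SessionResources`, `temp_dir`, and `results` are hypothetical and do not come from the source.

```python
import gc
import shutil
import tempfile
from pathlib import Path


class SessionResources:
    """Hypothetical session holder; the real class is not shown in the source."""

    def __init__(self) -> None:
        # Scratch directory for downloaded files and intermediate results.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="gaia_session_"))
        self.results: list[dict] = []

    def _cleanup_session(self) -> None:
        """Clean up session resources for memory management."""
        # Clean up temporary files; ignore_errors keeps cleanup best-effort.
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        # Drop references to accumulated results so they can be collected.
        self.results.clear()
        # Force a garbage collection pass to release memory promptly.
        gc.collect()
```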
-6. Verify functionality in live deployment
-
-**Best Practices for HF Space:**
-- Always test import fallbacks for optional dependencies
-- Use resource-efficient concurrent processing
-- Implement proper cleanup after intensive operations
-- Provide clear error messages for missing dependencies
-- Monitor memory usage during testing operations
-
-This HF Space deployment maintains the same 85% accuracy as the source repository while being optimized for the HuggingFace Space production environment.
+# CLAUDE.md

+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

+## Project Overview

+This is a **production-ready GAIA benchmark AI agent** achieving 85% accuracy through a sophisticated multi-agent architecture. The system has been **fully refactored** into a modular, maintainable architecture that specializes in complex question answering across multimedia, research, file processing, chess analysis, and mathematical reasoning domains.

+## Development Commands

+### Setup and Installation
 ```bash
+# Install dependencies
+pip install -r requirements.txt

+# Test API key configuration
+python test_api_keys.py

+# Verify core functionality
+python -c "from main import GAIASolver; print('✅ Core GAIASolver available')"
 ```

+### Running the System
 ```bash
+# Run legacy monolithic solver
+python main.py

+# Run refactored modular solver (recommended)
+python main_refactored.py

+# Run Gradio web interface
+python app.py
 ```

+### Testing Commands
 ```bash
+# Comprehensive async testing
+python async_complete_test.py

 # Test question classification
+python test_improved_classification.py
+python final_classification_test.py

+# Test YouTube functionality
+python direct_youtube_test.py
+python simple_youtube_test.py
+python test_youtube_question.py

+# Test individual components
+python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
+python -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('✅ Classifier ready')"
+```

+## Architecture Overview

+### Dual Architecture Design

+This project maintains both **legacy monolithic** and **refactored modular** architectures:

+**Legacy Architecture (main.py):**
+- Monolithic 1285-line solver with all functionality integrated
+- Comprehensive tool collection in gaia_tools.py (4887 lines)
+- Single-file approach for rapid development and deployment

+**Refactored Architecture (gaia/ package):**
 ```
+gaia/
+├── core/                       # Main solver logic
+│   ├── solver.py               # GAIASolver main class
+│   ├── answer_extractor.py     # Specialized answer extraction classes
+│   └── question_processor.py   # Question classification and processing
+├── tools/                      # Tool implementations
+│   ├── base.py                 # Abstract tool interface and registry
+│   ├── registry.py             # Tool discovery and management
+│   └── [specialized tool modules]
+├── models/                     # Model providers and management
+│   ├── manager.py              # ModelManager with fallback chains
+│   └── providers.py            # LiteLLM, Gemini, Kluster providers
+├── config/                     # Configuration management
+│   └── settings.py             # Config, ModelConfig classes
+└── utils/                      # Utilities and helpers
+    ├── exceptions.py           # Custom exception hierarchy
+    └── logging.py              # Logging configuration
 ```
+### Core Components
+
+**GAIASolver (main.py):** Legacy monolithic solver with 1000+ lines of sophisticated processing logic
+**GAIASolver (gaia/core/solver.py):** Refactored main orchestrator using dependency injection
+**QuestionClassifier:** LLM-based intelligent routing with pattern-based fallbacks
+**GAIA_TOOLS:** 42 specialized tools including enhanced Wikipedia research, chess analysis, Excel processing, and multimedia analysis
+**ModelManager:** Handles model initialization, fallback chains (Kluster.ai → Gemini → Qwen), and lifecycle management
+
+### Question Type Specialization
+
+**Research Questions (92% accuracy):**
+- Enhanced Wikipedia tools with date-specific searches and Featured Articles integration
+- Multi-step research coordination with cross-validation
+- Anti-hallucination safeguards to prevent fabrication
+
+**Chess Questions (100% accuracy):**
+- Universal FEN correction system handling any vision error pattern
+- Multi-tool consensus system for maximum accuracy
+- Perfect algebraic notation extraction
+
+**YouTube/Multimedia Questions:**
+- Enhanced URL detection with multiple regex patterns
+- Forced classification override for YouTube content
+- Specialized prompts with explicit tool usage instructions
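
The first two bullets above (URL detection with several regex patterns, then a forced classification override) might be sketched as follows. The patterns, the `extract_youtube_id` and `classify` helpers, and the `"multimedia"` label are assumptions for illustration; the project's real patterns are not shown in this diff.

```python
import re
from typing import Optional

# Assumed patterns; each captures the 11-character video ID.
YOUTUBE_PATTERNS = [
    # youtube.com/watch?v=ID, with v= allowed after other query parameters
    re.compile(r"(?:https?://)?(?:www\.|m\.)?youtube\.com/watch\?(?:[^\s#]*&)?v=([\w-]{11})"),
    # youtu.be/ID short links
    re.compile(r"(?:https?://)?youtu\.be/([\w-]{11})"),
    # youtube.com/embed/ID and youtube.com/shorts/ID
    re.compile(r"(?:https?://)?(?:www\.)?youtube\.com/(?:embed|shorts)/([\w-]{11})"),
]


def extract_youtube_id(text: str) -> Optional[str]:
    """Return the first YouTube video ID found in text, or None."""
    for pattern in YOUTUBE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def classify(question: str, llm_label: str) -> str:
    """Override the classifier's label when a YouTube URL is present."""
    return "multimedia" if extract_youtube_id(question) else llm_label
```

Checking the URL before trusting the classifier's label makes routing deterministic for this one case, which is presumably why the override is described as "forced".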
+
+**File Processing (100% accuracy):**
+- Format-specific tools for Excel (.xlsx/.xls), Python (.py), text files
+- Deterministic Python execution with sandboxed environment
+- Financial calculation specialization with proper currency formatting
+
+## Environment Configuration
+
+### Required API Keys (set in .env)
+- `GEMINI_API_KEY` - Primary model (Gemini Flash 2.0)
+- `HUGGINGFACE_TOKEN` - Fallback model and classification
+- `KLUSTER_API_KEY` - Optional premium model access
+
+### Model Fallback Chain
+1. **Kluster.ai** (Qwen3-235B, Gemma3-27B) - Premium option
+2. **Gemini Flash 2.0** - Primary production model
+3. **Qwen 2.5-72B** - Reliable fallback via HuggingFace
+
+## Key Design Patterns
+
+### Anti-Hallucination Architecture
+- **Tool result prioritization**: Always uses exact tool outputs over internal reasoning
+- **Cross-validation**: Multiple verification methods for critical information
+- **Source attribution**: Clear tracking and validation of information sources
+- **Validation rules**: Type-specific answer extraction and verification
+
+### Performance Optimizations
+- **Fresh agent creation** for each question to avoid token accumulation
+- **Concurrent processing** support with async operations
+- **15-minute web cache** for improved response times
+- **Exponential backoff** for API rate limiting
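
The exponential backoff bullet above might look like the sketch below in code. The function name, the default delays, and the use of full jitter are assumptions; the source names the technique but not its parameters.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random time up to the bound so clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the retry schedule can be tested without real waiting.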
+
+## File Organization
+
+### Core Files
+- `main.py` - Legacy monolithic solver (1285 lines)
+- `main_refactored.py` - Entry point for refactored architecture
+- `gaia_tools.py` - 42 specialized tools with robust error handling (4887 lines)
+- `question_classifier.py` - LLM + pattern-based classification system
+- `app.py` - Production Gradio interface with comprehensive error handling
+
+### Supporting Files
+- `async_complete_test.py` - Comprehensive async testing infrastructure
+- `enhanced_wikipedia_tools.py` - Advanced Wikipedia research capabilities
+- `universal_fen_correction.py` - Chess-specific FEN notation correction
+- `wikipedia_featured_articles_by_date.py` - Date-specific Wikipedia searches