---
title: 🚀 GAIA Multi-Agent System - BENCHMARK OPTIMIZED
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
---
# 🚀 GAIA Multi-Agent System - BENCHMARK OPTIMIZED
A **GAIA benchmark-optimized AI agent system** specifically designed for **exact-match evaluation** with aggressive response cleaning and direct answer formatting.
## 🎯 **GAIA Benchmark Compliance**
### **🔥 Exact-Match Optimization**
- **Direct Answers Only**: No "The answer is" prefixes or explanations
- **Clean Responses**: Complete removal of thinking processes and reasoning
- **Perfect Formatting**: Numbers, facts, or comma-separated lists as required
- **API-Ready**: Responses formatted exactly for GAIA submission
### **🧠 Multi-Model AI Integration**
- **10+ AI Models**: DeepSeek-R1, GPT-4o, Llama-3.3-70B, Kimi-Dev-72B, and more
- **6 AI Providers**: Together, Novita, Featherless-AI, Fireworks-AI, HuggingFace, OpenAI
- **Priority-Based Fallback**: Intelligent model selection with graceful degradation
- **Aggressive Cleaning**: Specialized extraction for benchmark compliance
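
The priority-based fallback can be pictured roughly as follows. This is a minimal sketch, not the code in `app.py`: `query_with_fallback` and the `ModelCall` type are hypothetical names, and each provider is assumed to be wrapped in a plain function that takes the question and returns raw text.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a function that sends a question to one model and returns raw text.
ModelCall = Callable[[str], str]


def query_with_fallback(
    question: str,
    models: Sequence[Tuple[str, ModelCall]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, call) pair in priority order.

    Returns (model_name, answer) for the first model that produces a
    non-empty answer, or (None, None) if every model fails.
    """
    for name, call in models:
        try:
            answer = call(question)
        except Exception:
            # Provider error, timeout, rate limit, etc.: move on to the next model.
            continue
        if answer and answer.strip():
            return name, answer.strip()
    # Nothing usable came back; the caller decides on a default response.
    return None, None
```

Listing the models from highest to lowest priority keeps the selection logic in one place: reordering the list changes the preference without touching the loop.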
### **⚡ Performance Features**
- **Fallback Speed**: <100ms responses for common questions
- **High Accuracy**: Optimized for GAIA Level 1 questions (targeting 30%+ score)
- **Exact Match**: Designed for GAIA's strict evaluation criteria
- **Response Validation**: Built-in compliance checking
## 🏗️ **GAIA-Optimized Architecture**
### **Core Components**
```
🎯 GAIA Benchmark-Optimized System
├── 🤖 BasicAgent (GAIA Interface)
├── 🧠 MultiModelGAIASystem (Optimized Core)
├── 🔧 Multi-Provider AI Clients (10+ Models)
│   ├── 🔥 Together (DeepSeek-R1, Llama-3.3-70B)
│   ├── ⚡ Novita (MiniMax-M1-80k, DeepSeek variants)
│   ├── 🪶 Featherless-AI (Kimi-Dev-72B, Jan-nano)
│   ├── 🚀 Fireworks-AI (Llama-3.1-8B)
│   ├── 🤗 HF-Inference (Specialized tasks)
│   └── 🤖 OpenAI (GPT-4o, GPT-3.5-turbo)
├── 🛡️ Enhanced Fallback System (Exact answers)
├── 🧽 Aggressive Response Cleaning (Benchmark compliance)
└── 🎨 Gradio Interface (GAIA evaluation ready)
```
### **GAIA Processing Pipeline**
1. **Question Analysis** → Determine question type and expected format
2. **Fallback Check** → Fast, accurate answers for simple questions
3. **AI Model Query** → Multi-model reasoning with DeepSeek-R1 priority
4. **Response Extraction** → Aggressive cleaning to remove all reasoning
5. **Format Compliance** → Final validation for exact-match submission
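
The five stages can be wired together roughly like this. It is a simplified sketch under stated assumptions: `answer_question` and all of its parameters are hypothetical, question analysis is reduced to whitespace and case normalization, and the format check only rejects empty or multi-line answers.

```python
from typing import Callable, Dict


def answer_question(
    question: str,
    fallback_answers: Dict[str, str],
    query_model: Callable[[str], str],
    clean: Callable[[str], str] = str.strip,
    default: str = "",
) -> str:
    """Run one question through a fallback -> model -> clean -> validate chain."""
    # Steps 1-2: normalize the question and look it up in the fast fallback table.
    key = " ".join(question.lower().split())
    if key in fallback_answers:
        return fallback_answers[key]

    # Step 3: ask the model; any failure produces the default answer.
    try:
        raw = query_model(question)
    except Exception:
        return default

    # Step 4: reduce the raw output to a direct answer.
    answer = clean(raw)

    # Step 5: minimal format check (non-empty, single line).
    if not answer or "\n" in answer:
        return default
    return answer
```

Passing `clean` and `query_model` in as functions keeps each stage testable on its own with stub callables.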
## 🚀 **Getting Started**
### **Installation**
```bash
# Clone the repository
git clone <your-repo-url>
cd Final_Assignment_Template
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
```
### **Configuration**
1. **Set HF Token** (Required for AI models):
```bash
export HF_TOKEN="your_hf_token_here"
```
2. **Set OpenAI Key** (Optional, for GPT models):
```bash
export OPENAI_API_KEY="your_openai_key_here"
```
3. **Test GAIA Compliance**:
```bash
python test_gaia.py
```
4. **Launch Web Interface**:
```bash
python app.py
```
## 🧪 **Testing & Validation**
### **GAIA Compliance Testing**
```bash
# Run comprehensive GAIA compliance tests
python test_gaia.py
# Expected output:
# ✅ Responses are GAIA compliant
# ✅ Reasoning is properly cleaned
# ✅ API format is correct
# ✅ Ready for exact-match evaluation
```
### **Expected GAIA Results**
- ✅ **Math**: "What is 15 + 27?" → "42" (not "The answer is 42")
- ✅ **Geography**: "What is the capital of Germany?" → "Berlin" (not "The capital of Germany is Berlin")
- ✅ **Science**: "How many planets are in our solar system?" → "8" (not "There are 8 planets")
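
A quick way to spot-check these expectations is a strict string comparison over a handful of cases. The sketch below is illustrative only: `exact_match_rate` and `EXPECTED` are hypothetical names, and the official GAIA scorer may normalize answers differently from a plain `==` check.

```python
from typing import Callable, Dict

# Cases taken from the examples above.
EXPECTED: Dict[str, str] = {
    "What is 15 + 27?": "42",
    "What is the capital of Germany?": "Berlin",
    "How many planets are in our solar system?": "8",
}


def exact_match_rate(agent: Callable[[str], str], cases: Dict[str, str]) -> float:
    """Fraction of cases where the agent's answer equals the expected string."""
    if not cases:
        return 0.0
    hits = sum(
        1 for question, expected in cases.items()
        if agent(question).strip() == expected
    )
    return hits / len(cases)
```

Any callable that maps a question string to an answer string can be passed as `agent`, so the same check works for a stub, the fallback system alone, or the full agent.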
## 📊 **GAIA Benchmark Performance**
### **Target Metrics**
- **Level 1 Questions**: Targeting 30%+ accuracy for course completion
- **Response Time**: <5 seconds average per question
- **Compliance Rate**: 90%+ exact-match format compliance
- **Fallback Coverage**: 100% availability even without AI models
### **Question Types Optimized**
| Type | GAIA Format | Example Response |
|------|-------------|------------------|
| 🧮 **Mathematical** | Just the number | "42" |
| 🌍 **Geographical** | Just the place name | "Paris" |
| 🔬 **Scientific** | Just the fact/value | "8" |
| 📝 **Factual** | Concise answer | "H2O" |
| 📊 **Lists** | Comma-separated | "apples, oranges, bananas" |
## 🔧 **Technical Implementation**
### **Response Cleaning Process**
```python
# GAIA-optimized cleaning pipeline:
# 1. Remove <think> tags completely
# 2. Extract explicit answer markers
# 3. Remove reasoning phrases
# 4. Clean formatting artifacts
# 5. Validate compliance
# 6. Return direct answer only
```
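
As a rough illustration of steps 1 to 4 and 6, the cleaning could look like the sketch below. It is not the project's actual implementation: `clean_response` and the regular expressions are assumptions chosen for the example, and the trailing-period rule will also trim abbreviations such as "U.S.".

```python
import re

# Assumed patterns for this sketch; the real system may use different ones.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_MARKER_RE = re.compile(r"(?:final answer|answer)\s*(?:is\b|:)\s*(.+)", re.IGNORECASE)
_PREFIX_RE = re.compile(
    r"^(?:the (?:answer|result) is\b|result:|answer:)\s*", re.IGNORECASE
)


def clean_response(raw: str) -> str:
    """Reduce raw model output to a bare answer string."""
    # 1. Remove complete <think>...</think> blocks, including multi-line ones.
    text = _THINK_RE.sub("", raw).strip()

    # 2. If explicit answer markers exist, keep what follows the last one.
    markers = _MARKER_RE.findall(text)
    if markers:
        text = markers[-1]
    else:
        # 3. Otherwise keep only the last non-empty line, dropping earlier reasoning.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = lines[-1] if lines else ""

    # 4. Clean formatting artifacts: leftover prefixes, markdown, quotes, final period.
    text = _PREFIX_RE.sub("", text.strip())
    text = text.strip("*`\"' \t")
    if text.endswith("."):
        text = text[:-1].rstrip()

    # 6. Return the direct answer only (step 5, validation, is left to the caller).
    return text
```

For example, `clean_response("<think>15 + 27 = 42</think>\nThe answer is 42.")` yields `"42"`.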
### **Key Dependencies**
```txt
gradio>=5.34.2 # Web interface with OAuth
huggingface_hub # Multi-model AI integration
transformers # Model support
requests # API communication
pandas # Results handling
openai # GPT model access
```
### **Environment Variables**
```bash
# Required for HuggingFace models
HF_TOKEN="hf_your_token_here"
# Optional: needed only for OpenAI (GPT) models
OPENAI_API_KEY="sk-your_openai_key_here"
# Auto-set in HuggingFace Spaces
SPACE_ID="your_space_id"
SPACE_HOST="your_space_host"
```
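
Before launching, it can help to confirm which credentials are actually visible to the process. The helper below is a small sketch; `check_credentials` is a hypothetical name and is not part of the project.

```python
import os
from typing import Dict, Mapping


def check_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which of the expected credentials are set to a non-blank value."""
    return {
        "HF_TOKEN": bool(env.get("HF_TOKEN", "").strip()),
        "OPENAI_API_KEY": bool(env.get("OPENAI_API_KEY", "").strip()),
    }


if __name__ == "__main__":
    for name, present in check_credentials().items():
        print(f"{name}: {'set' if present else 'missing'}")
```

Taking the environment as a parameter makes the check easy to exercise with a plain dictionary instead of the real process environment.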
## 🌟 **GAIA Optimization Features**
### **Aggressive Response Cleaning**
- **Thinking Process Removal**: Complete elimination of `<think>` tags
- **Reasoning Removal**: Strips phrases such as "Let me think", "First", "Therefore"
- **Answer Isolation**: Extracts only the final answer value
- **Format Standardization**: Numbers, names, lists only
### **Exact-Match Compliance**
- **No Prefixes**: Removes "The answer is", "Result:", etc.
- **Clean Numbers**: "42" not "42." or "The result is 42"
- **Direct Facts**: "Paris" not "The capital is Paris"
- **Concise Lists**: "red, blue, green" not "The colors are red, blue, and green"
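
The rules above can be turned into a simple pre-submission check. This is a sketch under assumptions: `compliance_issues` and the phrase lists are hypothetical and cover only the patterns named in this README.

```python
from typing import List

# Assumed phrase lists, based on the examples in this README.
_BANNED_PREFIXES = ("the answer is", "the result is", "result:", "answer:", "final answer")
_REASONING_PHRASES = ("let me think", "therefore", "first,")


def compliance_issues(answer: str) -> List[str]:
    """Return a list of exact-match formatting problems; an empty list means compliant."""
    issues: List[str] = []
    stripped = answer.strip()
    lowered = stripped.lower()

    if not lowered:
        issues.append("empty answer")
    if "<think>" in lowered or "</think>" in lowered:
        issues.append("contains think tags")
    if lowered.startswith(_BANNED_PREFIXES):
        issues.append("starts with an answer prefix")
    if any(phrase in lowered for phrase in _REASONING_PHRASES):
        issues.append("contains reasoning phrase")
    if "\n" in stripped:
        issues.append("spans multiple lines")
    if lowered.endswith("."):
        issues.append("ends with a period")
    return issues
```

Returning the list of problems rather than a bare boolean makes it easier to log why a given answer was rejected.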
### **API Submission Ready**
- **JSON Format**: Perfect structure for GAIA API
- **Error Handling**: Graceful failures with default responses
- **Validation**: Built-in compliance checking before submission
- **Logging**: Detailed tracking for debugging
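
For illustration, a submission payload might be assembled as shown below. The field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) are assumptions that this README does not confirm, so check them against the scoring API you submit to; `build_submission` is a hypothetical helper.

```python
import json
from typing import Dict


def build_submission(username: str, agent_code: str, answers: Dict[str, str]) -> str:
    """Serialize cleaned answers into a JSON string (assumed schema, see note above)."""
    payload = {
        "username": username.strip(),
        "agent_code": agent_code,
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in answers.items()
        ],
    }
    return json.dumps(payload)
```

Building the payload in one function keeps the schema in a single place if the API expects different keys.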
## 📈 **Deployment**
### **Local Development**
```bash
python app.py
# Access at http://localhost:7860
```
### **Hugging Face Spaces**
1. Fork this repository
2. Create new Space on Hugging Face
3. Set `HF_TOKEN` and `OPENAI_API_KEY` as secrets in the Space settings
4. Deploy automatically with OAuth enabled
### **Production Optimization**
- Multi-model fallback ensures high availability
- Aggressive caching for common questions
- API rate limit management
- Comprehensive error handling
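
Caching and rate-limit handling could be approached along these lines. Both helpers are hypothetical sketches rather than the project's code: `make_cached` keeps answers in an in-memory dictionary keyed by the normalized question, and `with_retries` retries a call with exponential backoff.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def make_cached(answer_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap answer_fn so repeated questions (ignoring case and spacing) hit a cache."""
    cache: Dict[str, str] = {}

    def cached_answer(question: str) -> str:
        key = " ".join(question.lower().split())
        if key not in cache:
            cache[key] = answer_fn(question)
        return cache[key]

    return cached_answer


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with delays of base_delay * 2**n seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the last error to the caller.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the backoff can be exercised without real waiting; in normal use the default `time.sleep` applies.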
## 🎯 **GAIA Benchmark Ready!**
Your GAIA-optimized multi-agent system is specifically designed for:
- 🎯 **Exact-Match Evaluation** with clean, direct answers
- 🧠 **Multi-Model Intelligence** via DeepSeek-R1 and 9 other models
- 🛡️ **Reliable Fallback** for 100% question coverage
- 📏 **Perfect Compliance** with GAIA submission requirements
- 🚀 **Production Ready** with comprehensive testing
**Target Achievement**: 30%+ score on GAIA Level 1 questions for course completion
**Next Steps**:
1. Set your `HF_TOKEN` and `OPENAI_API_KEY`
2. Run `python test_gaia.py` to verify compliance
3. Deploy to HuggingFace Spaces
4. Submit to the GAIA benchmark! 🚀
**Note**: The system provides reliable fallback responses even without API keys, ensuring baseline functionality for all question types.