---
title: GAIA Multi-Agent System - BENCHMARK OPTIMIZED
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

# GAIA Multi-Agent System - BENCHMARK OPTIMIZED

A **GAIA benchmark-optimized AI agent system** designed for **exact-match evaluation**, with aggressive response cleaning and direct answer formatting.

## **GAIA Benchmark Compliance**

### **Exact-Match Optimization**
- **Direct Answers Only**: No "The answer is" prefixes or explanations
- **Clean Responses**: Complete removal of thinking processes and reasoning
- **Perfect Formatting**: Numbers, facts, or comma-separated lists as required
- **API-Ready**: Responses formatted exactly for GAIA submission

### **Multi-Model AI Integration**
- **10+ AI Models**: DeepSeek-R1, GPT-4o, Llama-3.3-70B, Kimi-Dev-72B, and more
- **6 AI Providers**: Together, Novita, Featherless-AI, Fireworks-AI, HuggingFace, OpenAI
- **Priority-Based Fallback**: Models are tried in priority order, degrading gracefully to the next one when a model fails
- **Aggressive Cleaning**: Specialized extraction for benchmark compliance

### **Performance Features**
- **Fallback Speed**: <100ms responses for common questions
- **Accuracy Target**: Optimized for GAIA Level 1 questions (targeting 30%+ score)
- **Exact Match**: Designed for GAIA's strict evaluation criteria
- **Response Validation**: Built-in compliance checking

## **GAIA-Optimized Architecture**

### **Core Components**

```
GAIA Benchmark-Optimized System
├── BasicAgent (GAIA Interface)
├── MultiModelGAIASystem (Optimized Core)
├── Multi-Provider AI Clients (10+ Models)
│   ├── Together (DeepSeek-R1, Llama-3.3-70B)
│   ├── Novita (MiniMax-M1-80k, DeepSeek variants)
│   ├── Featherless-AI (Kimi-Dev-72B, Jan-nano)
│   ├── Fireworks-AI (Llama-3.1-8B)
│   ├── HF-Inference (Specialized tasks)
│   └── OpenAI (GPT-4o, GPT-3.5-turbo)
├── Enhanced Fallback System (Exact answers)
├── Aggressive Response Cleaning (Benchmark compliance)
└── Gradio Interface (GAIA evaluation ready)
```

### **GAIA Processing Pipeline**

1. **Question Analysis** → Determine question type and expected format
2. **Fallback Check** → Fast, accurate answers for simple questions
3. **AI Model Query** → Multi-model reasoning with DeepSeek-R1 priority
4. **Response Extraction** → Aggressive cleaning to remove all reasoning
5. **Format Compliance** → Final validation for exact-match submission
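
The five stages above can be strung together in a single function. The sketch below only illustrates the control flow: `FALLBACK_ANSWERS`, `classify`, `answer_question`, and the `query_model` callable are hypothetical names, and the classification rules are deliberately crude.

```python
import re
from typing import Callable

# Hypothetical lookup table used by the fallback check (step 2).
FALLBACK_ANSWERS = {
    "what is the capital of germany?": "Berlin",
}


def classify(question: str) -> str:
    """Step 1: guess the question type from surface patterns."""
    if re.search(r"\d+\s*[-+*/]\s*\d+", question):
        return "math"
    if question.lower().startswith("how many"):
        return "count"
    return "fact"


def answer_question(question: str, query_model: Callable[[str], str]) -> str:
    qtype = classify(question)                               # 1. question analysis
    known = FALLBACK_ANSWERS.get(question.strip().lower())
    if known is not None:                                    # 2. fallback check
        return known
    raw = query_model(question)                              # 3. AI model query
    raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)  # 4. extraction
    answer = raw.strip().rstrip(".")
    if qtype in ("math", "count"):                           # 5. format compliance
        match = re.search(r"-?\d+(?:\.\d+)?", answer)
        if match:
            answer = match.group(0)
    return answer
```

For example, `answer_question("What is 15 + 27?", lambda q: "<think>15+27</think>The answer is 42.")` reduces the model output to the bare number, while a question found in the lookup table never reaches the model.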

## **Getting Started**

### **Installation**

```bash
# Clone the repository
git clone <your-repo-url>
cd Final_Assignment_Template

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt
```

### **Configuration**

1. **Set HF Token** (Required for AI models):
   ```bash
   export HF_TOKEN="your_hf_token_here"
   ```

2. **Set OpenAI Key** (Optional, for GPT models):
   ```bash
   export OPENAI_API_KEY="your_openai_key_here"
   ```

3. **Test GAIA Compliance**:
   ```bash
   python test_gaia.py
   ```

4. **Launch Web Interface**:
   ```bash
   python app.py
   ```

## **Testing & Validation**

### **GAIA Compliance Testing**

```bash
# Run comprehensive GAIA compliance tests
python test_gaia.py

# Expected output:
# ✅ Responses are GAIA compliant
# ✅ Reasoning is properly cleaned
# ✅ API format is correct
# ✅ Ready for exact-match evaluation
```

### **Expected GAIA Results**
- ✅ **Math**: "What is 15 + 27?" → "42" (not "The answer is 42")
- ✅ **Geography**: "What is the capital of Germany?" → "Berlin" (not "The capital of Germany is Berlin")
- ✅ **Science**: "How many planets are in our solar system?" → "8" (not "There are 8 planets")
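
The expected results above amount to strict string comparison, which can be checked with a few lines. The sketch below assumes the agent is any callable from a question to an answer string; `run_checks` and `toy_agent` are hypothetical, and the actual contents of `test_gaia.py` are not shown in this README.

```python
from typing import Callable

# Question/expected-answer pairs taken from the examples above.
CASES = [
    ("What is 15 + 27?", "42"),
    ("What is the capital of Germany?", "Berlin"),
    ("How many planets are in our solar system?", "8"),
]


def run_checks(agent: Callable[[str], str]) -> list[tuple[str, str, bool]]:
    """Call the agent on each case and record whether the answer matches exactly."""
    results = []
    for question, expected in CASES:
        predicted = agent(question)
        results.append((question, predicted, predicted == expected))
    return results


def toy_agent(question: str) -> str:
    """Stand-in agent: one direct answer, one wordy answer, one empty answer."""
    canned = {
        "What is 15 + 27?": "42",
        "What is the capital of Germany?": "The capital of Germany is Berlin",
    }
    return canned.get(question, "")


for question, predicted, ok in run_checks(toy_agent):
    print("PASS" if ok else "FAIL", repr(predicted), "<-", question)
```

The wordy geography answer fails even though it contains the right city, which is exactly the behavior exact-match scoring penalizes.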

## **GAIA Benchmark Performance**

### **Target Metrics**
- **Level 1 Questions**: Targeting 30%+ accuracy for course completion
- **Response Time**: <5 seconds average per question
- **Compliance Rate**: 90%+ exact-match format compliance
- **Fallback Coverage**: 100% availability even without AI models

### **Question Types Optimized**

| Type | GAIA Format | Example Response |
|------|-------------|------------------|
| **Mathematical** | Just the number | "42" |
| **Geographical** | Just the place name | "Paris" |
| **Scientific** | Just the fact/value | "8" |
| **Factual** | Concise answer | "H2O" |
| **Lists** | Comma-separated | "apples, oranges, bananas" |

## **Technical Implementation**

### **Response Cleaning Process**

```python
# GAIA-optimized cleaning pipeline (outline of the steps, in order):
# 1. Remove <think> tags completely
# 2. Extract explicit answer markers
# 3. Remove reasoning phrases
# 4. Clean formatting artifacts
# 5. Validate compliance
# 6. Return direct answer only
```
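
The six steps listed above could be implemented roughly like this. It is a sketch under stated assumptions rather than the project's cleaning code: the function name `clean_response` and all three regular expressions are hypothetical and would need tuning to the phrasing your models actually produce.

```python
import re

# Hypothetical patterns for think blocks, answer markers, and reasoning phrases.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
ANSWER_MARKER = re.compile(r"(?:final\s+answer|answer)\s*(?:is\b|:)\s*", re.IGNORECASE)
REASONING_PREFIX = re.compile(
    r"^(?:let me think[.,:]?|first,?|therefore,?|so,?)\s+", re.IGNORECASE
)


def clean_response(raw: str) -> str:
    text = THINK_BLOCK.sub("", raw)                   # 1. drop <think>...</think> blocks
    markers = list(ANSWER_MARKER.finditer(text))
    if markers:                                       # 2. keep text after the last marker
        text = text[markers[-1].end():]
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = lines[0] if lines else ""                  #    only the first non-empty line
    text = REASONING_PREFIX.sub("", text)             # 3. strip a leading reasoning phrase
    text = text.strip(" \t*`\"'")                     # 4. remove markdown/quote artifacts
    if text.endswith("."):
        text = text[:-1]                              #    "42." becomes "42"
    if "<think>" in text.lower() or "</think>" in text.lower():
        return ""                                     # 5. reject leftover tags
    return text                                       # 6. direct answer only
```

Keeping only the text after the last answer marker is a design choice: reasoning models often restate intermediate guesses before the final one. Dropping a trailing period also shortens abbreviations such as "U.S.A.", so a production version may want an exception list.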

### **Key Dependencies**

```txt
gradio>=5.34.2      # Web interface with OAuth
huggingface_hub     # Multi-model AI integration
transformers        # Model support
requests            # API communication
pandas              # Results handling
openai              # GPT model access
```

### **Environment Variables**

```bash
# Required for HuggingFace models
HF_TOKEN="hf_your_token_here"

# Optional: needed only for OpenAI (GPT) models
OPENAI_API_KEY="sk-your_openai_key_here"

# Auto-set in HuggingFace Spaces
SPACE_ID="your_space_id"
SPACE_HOST="your_space_host"
```
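
At startup, these variables can be used to decide which model groups are usable. The sketch below assumes that `HF_TOKEN` gates the Hugging Face-routed models and `OPENAI_API_KEY` gates the GPT models; the function name `available_providers` and the group labels are hypothetical.

```python
import os
from typing import Mapping, Optional


def available_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the provider groups that can be used with the configured keys."""
    env = os.environ if env is None else env
    providers = []
    if env.get("HF_TOKEN"):
        providers.append("huggingface")  # assumption: HF token covers HF-routed models
    if env.get("OPENAI_API_KEY"):
        providers.append("openai")
    if not providers:
        providers.append("fallback")     # rule-based answers only, no API keys needed
    return providers


print(available_providers())
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test.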

## **GAIA Optimization Features**

### **Aggressive Response Cleaning**
- **Thinking Process Removal**: Complete elimination of `<think>` tags
- **Reasoning Phrase Removal**: Removes "Let me think", "First", "Therefore"
- **Answer Isolation**: Extracts only the final answer value
- **Format Standardization**: Numbers, names, lists only

### **Exact-Match Compliance**
- **No Prefixes**: Removes "The answer is", "Result:", etc.
- **Clean Numbers**: "42" not "42." or "The result is 42"
- **Direct Facts**: "Paris" not "The capital is Paris"
- **Concise Lists**: "red, blue, green" not "The colors are red, blue, and green"
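
Two of the rules above, clean numbers and concise lists, are simple enough to show directly. The helpers below are hypothetical and assume that sentence prefixes such as "The colors are" have already been removed by an earlier cleaning step.

```python
import re


def normalize_number(text: str) -> str:
    """Drop surrounding whitespace and a trailing period; decimals are untouched."""
    return text.strip().rstrip(".")


def normalize_list(text: str) -> str:
    """Rejoin comma-separated items with ', ' and drop a leading 'and'/'or'."""
    items = [
        re.sub(r"^(?:and|or)\s+", "", part.strip(), flags=re.IGNORECASE)
        for part in text.split(",")
    ]
    return ", ".join(item for item in items if item)


print(normalize_number("42."))
print(normalize_list("red, blue, and green"))
```

Note that this only handles a conjunction that follows a comma; "red, blue and green" would keep "blue and green" as a single item, since splitting on every "and" would break names that contain the word.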

### **API Submission Ready**
- **JSON Format**: Answers serialized in the structure the GAIA API expects
- **Error Handling**: Graceful failures with default responses
- **Validation**: Built-in compliance checking before submission
- **Logging**: Detailed tracking for debugging
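
As an illustration of the JSON format and default responses, the sketch below builds a submission payload. The field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) and the `agent_code` URL pattern are assumptions modeled on the Hugging Face Agents Course template, not something this README specifies, so check them against your `app.py`. The `"Unknown"` default and the function name `build_submission` are likewise hypothetical.

```python
import json


def build_submission(username: str, space_id: str, answers: dict[str, str]) -> dict:
    """Assemble a submission payload, substituting a default for empty answers."""
    return {
        "username": username.strip(),
        # Assumed convention: link to the Space's code so the run can be reviewed.
        "agent_code": f"https://huggingface.co/spaces/{space_id}/tree/main",
        "answers": [
            {"task_id": task_id, "submitted_answer": answer.strip() or "Unknown"}
            for task_id, answer in answers.items()
        ],
    }


payload = build_submission("alice", "alice/gaia-agent", {"task-1": " 42 ", "task-2": ""})
print(json.dumps(payload, indent=2))
```

Stripping whitespace at this last step matters because a stray space is enough to fail an exact-match comparison.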

## **Deployment**

### **Local Development**
```bash
python app.py
# Access at http://localhost:7860
```

### **Hugging Face Spaces**
1. Fork this repository
2. Create a new Space on Hugging Face
3. Set `HF_TOKEN` and, optionally, `OPENAI_API_KEY` as secrets in the Space settings
4. Deploy automatically with OAuth enabled

### **Production Optimization**
- Multi-model fallback ensures high availability
- Aggressive caching for common questions
- API rate limit management
- Comprehensive error handling

## **GAIA Benchmark Ready!**

Your GAIA-optimized multi-agent system is specifically designed for:

- **Exact-Match Evaluation** with clean, direct answers
- **Multi-Model Intelligence** via DeepSeek-R1 and 9 other models
- **Reliable Fallback** for 100% question coverage
- **Perfect Compliance** with GAIA submission requirements
- **Production Ready** with comprehensive testing

**Target Achievement**: 30%+ score on GAIA Level 1 questions for course completion

**Next Steps**:
1. Set your `HF_TOKEN` and, if you want GPT models, `OPENAI_API_KEY`
2. Run `python test_gaia.py` to verify compliance
3. Deploy to HuggingFace Spaces
4. Submit to GAIA benchmark!

**Note**: The system provides reliable fallback responses even without API keys, ensuring baseline functionality for all question types.