title: GAIA Multi-Agent System - BENCHMARK OPTIMIZED
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
GAIA Multi-Agent System - BENCHMARK OPTIMIZED
An AI agent system tuned for the GAIA benchmark's exact-match evaluation. It cleans model output aggressively and returns only the direct answer in the required format.
GAIA Benchmark Compliance
Exact-Match Optimization
- Direct Answers Only: No "The answer is" prefixes or explanations
- Clean Responses: Complete removal of thinking processes and reasoning
- Perfect Formatting: Numbers, facts, or comma-separated lists as required
- API-Ready: Responses formatted exactly for GAIA submission
Multi-Model AI Integration
- 10+ AI Models: DeepSeek-R1, GPT-4o, Llama-3.3-70B, Kimi-Dev-72B, and more
- 6 AI Providers: Together, Novita, Featherless-AI, Fireworks-AI, HuggingFace, OpenAI
- Priority-Based Fallback: Intelligent model selection with graceful degradation
- Aggressive Cleaning: Specialized extraction for benchmark compliance
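The priority-based fallback described above can be sketched as a loop over clients in priority order. This is a minimal illustration, not the project's actual code: `query_with_fallback`, `failing_client`, and `echo_client` are hypothetical names, and real clients would call a provider API instead of returning canned text.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a model client takes a question and returns raw text.
ModelFn = Callable[[str], str]


def query_with_fallback(
    question: str,
    clients: Sequence[Tuple[str, ModelFn]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, client) pair in priority order.

    Returns (model_name, response) from the first client that answers with
    non-empty text, or (None, None) if every client fails.
    """
    for name, client in clients:
        try:
            response = client(question)
        except Exception:
            # Any provider error (timeout, quota, bad key) moves on to the next model.
            continue
        if response and response.strip():
            return name, response.strip()
    return None, None


# Stand-in clients for illustration only.
def failing_client(question: str) -> str:
    raise RuntimeError("provider unavailable")


def echo_client(question: str) -> str:
    return " 42 "


used_model, answer = query_with_fallback(
    "What is 15 + 27?",
    [("primary", failing_client), ("secondary", echo_client)],
)
print(used_model, answer)
```

Returning the model name alongside the answer makes it easy to log which provider actually served each question.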
Performance Features
- Fallback Speed: <100ms responses for common questions
- Accuracy Target: Optimized for GAIA Level 1 questions, aiming for a score of 30% or higher
- Exact Match: Designed for GAIA's strict evaluation criteria
- Response Validation: Built-in compliance checking
GAIA-Optimized Architecture
Core Components
GAIA Benchmark-Optimized System
├── BasicAgent (GAIA Interface)
├── MultiModelGAIASystem (Optimized Core)
├── Multi-Provider AI Clients (10+ Models)
│   ├── Together (DeepSeek-R1, Llama-3.3-70B)
│   ├── Novita (MiniMax-M1-80k, DeepSeek variants)
│   ├── Featherless-AI (Kimi-Dev-72B, Jan-nano)
│   ├── Fireworks-AI (Llama-3.1-8B)
│   ├── HF-Inference (Specialized tasks)
│   └── OpenAI (GPT-4o, GPT-3.5-turbo)
├── Enhanced Fallback System (Exact answers)
├── Aggressive Response Cleaning (Benchmark compliance)
└── Gradio Interface (GAIA evaluation ready)
GAIA Processing Pipeline
- Question Analysis → Determine question type and expected format
- Fallback Check → Fast, accurate answers for simple questions
- AI Model Query → Multi-model reasoning with DeepSeek-R1 priority
- Response Extraction → Aggressive cleaning to remove all reasoning
- Format Compliance → Final validation for exact-match submission
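One way to wire these stages together is a small function that takes each stage as a callable. This is a sketch under assumptions: `run_pipeline`, its parameters, and the default answer string are hypothetical, and the question-analysis stage is left out for brevity.

```python
from typing import Callable, Optional


def run_pipeline(
    question: str,
    fallback: Callable[[str], Optional[str]],
    ask_model: Callable[[str], str],
    clean: Callable[[str], str],
    default: str = "Unable to provide answer",  # hypothetical default
) -> str:
    """Chain the stages: fallback check, model query, cleaning, final check."""
    # Fallback check: a fast answer short-circuits the model call.
    quick = fallback(question)
    if quick is not None:
        return quick
    # Model query: any failure degrades to the default answer.
    try:
        raw = ask_model(question)
    except Exception:
        return default
    # Extraction and compliance: strip reasoning, never return an empty answer.
    answer = clean(raw)
    return answer if answer else default


result = run_pipeline(
    "What is the capital of Germany?",
    fallback=lambda q: None,
    ask_model=lambda q: "Reasoning... FINAL ANSWER: Berlin",
    clean=lambda raw: raw.split("FINAL ANSWER:")[-1].strip(),
)
print(result)
```

Passing the stages in as callables keeps each one testable in isolation and lets the model client be swapped for a stub.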
Getting Started
Installation
# Clone the repository
git clone <your-repo-url>
cd Final_Assignment_Template
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Configuration
Set HF Token (Required for AI models):
export HF_TOKEN="your_hf_token_here"
Set OpenAI Key (Optional, for GPT models):
export OPENAI_API_KEY="your_openai_key_here"
Test GAIA Compliance:
python test_gaia.py
Launch Web Interface:
python app.py
Testing & Validation
GAIA Compliance Testing
# Run comprehensive GAIA compliance tests
python test_gaia.py
# Expected output:
# ✅ Responses are GAIA compliant
# ✅ Reasoning is properly cleaned
# ✅ API format is correct
# ✅ Ready for exact-match evaluation
Expected GAIA Results
- Math: "What is 15 + 27?" → "42" (not "The answer is 42")
- Geography: "What is the capital of Germany?" → "Berlin" (not "The capital of Germany is Berlin")
- Science: "How many planets are in our solar system?" → "8" (not "There are 8 planets")
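For the math case above, a fallback can answer simple integer arithmetic without calling any model. The sketch below is deliberately naive and only an assumption about how such a fallback might look: `arithmetic_fallback` is a hypothetical name, it handles `+`, `-`, and `*` on integers only, and it would misread text such as a year range as a subtraction.

```python
import operator
import re
from typing import Optional

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
_ARITH = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)")


def arithmetic_fallback(question: str) -> Optional[str]:
    """Answer 'What is A op B?' style questions without calling a model."""
    match = _ARITH.search(question)
    if match is None:
        return None  # Not a simple arithmetic question; defer to a model.
    left, op, right = match.groups()
    # Return a bare number string, matching the exact-match format.
    return str(_OPS[op](int(left), int(right)))


print(arithmetic_fallback("What is 15 + 27?"))
print(arithmetic_fallback("What is the capital of Germany?"))
```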
GAIA Benchmark Performance
Target Metrics
- Level 1 Questions: Targeting 30%+ accuracy for course completion
- Response Time: <5 seconds average per question
- Compliance Rate: 90%+ exact-match format compliance
- Fallback Coverage: 100% availability even without AI models
Question Types Optimized
| Type | GAIA Format | Example Response |
|---|---|---|
| Mathematical | Just the number | "42" |
| Geographical | Just the place name | "Paris" |
| Scientific | Just the fact/value | "8" |
| Factual | Concise answer | "H2O" |
| Lists | Comma-separated | "apples, oranges, bananas" |
Technical Implementation
Response Cleaning Process
# GAIA-optimized cleaning pipeline:
1. Remove <think> tags completely
2. Extract explicit answer markers
3. Remove reasoning phrases
4. Clean formatting artifacts
5. Validate compliance
6. Return direct answer only
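The steps above could be implemented roughly as follows. This is a minimal sketch, not the project's actual cleaner: `clean_response` is a hypothetical name, and the marker and prefix patterns are assumptions chosen to match the examples in this README.

```python
import re

# Assumed patterns; the project's real marker and phrase lists may differ.
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_MARKER = re.compile(r"(?:final answer|answer)\s*[:\-]\s*(.+)", re.IGNORECASE)
_PREFIX = re.compile(r"^(?:the (?:answer|result) is|result:)\s*", re.IGNORECASE)


def clean_response(raw: str) -> str:
    """Reduce a verbose model response to a bare answer string."""
    # 1. Remove <think> blocks completely.
    text = _THINK.sub("", raw)
    # 2. Prefer an explicit marker such as "FINAL ANSWER: ...".
    marker = _MARKER.search(text)
    if marker:
        text = marker.group(1)
    else:
        # 3. Otherwise keep the last non-empty line, dropping earlier reasoning.
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        text = lines[-1] if lines else ""
    # 3 (cont.). Strip leftover lead-in phrases.
    text = _PREFIX.sub("", text.strip())
    # 4. Clean formatting artifacts: markdown emphasis, backticks, quotes.
    text = text.strip("*`\"' ")
    if text.endswith("."):
        text = text[:-1]
    # 5-6. An empty string signals that no answer could be isolated.
    return text


print(clean_response("<think>15 plus 27</think>The answer is 42."))
print(clean_response("Let me think.\nFINAL ANSWER: **Paris**"))
```

Removing one trailing period is a simplification: it would also shorten an answer that legitimately ends in an abbreviation.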
Key Dependencies
gradio>=5.34.2 # Web interface with OAuth
huggingface_hub # Multi-model AI integration
transformers # Model support
requests # API communication
pandas # Results handling
openai # GPT model access
Environment Variables
# Required for HuggingFace models
HF_TOKEN="hf_your_token_here"
# Optional: only needed for OpenAI (GPT) models
OPENAI_API_KEY="sk-your_openai_key_here"
# Auto-set in HuggingFace Spaces
SPACE_ID="your_space_id"
SPACE_HOST="your_space_host"
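In Python these variables can be read once at startup into a small settings object. The sketch below is an assumption about structure, not the app's actual code: `Settings` and `load_settings` are hypothetical names. Missing or empty values become `None`, so the caller can decide whether to fall back.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    openai_api_key: Optional[str]
    space_id: Optional[str]
    space_host: Optional[str]

    @property
    def running_in_space(self) -> bool:
        # SPACE_ID is set automatically inside Hugging Face Spaces.
        return self.space_id is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the four variables; missing or empty values become None."""
    def get(name: str) -> Optional[str]:
        return env.get(name) or None

    return Settings(
        hf_token=get("HF_TOKEN"),
        openai_api_key=get("OPENAI_API_KEY"),
        space_id=get("SPACE_ID"),
        space_host=get("SPACE_HOST"),
    )


settings = load_settings({"HF_TOKEN": "hf_example", "SPACE_ID": ""})
print(settings.hf_token, settings.openai_api_key, settings.running_in_space)
```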
GAIA Optimization Features
Aggressive Response Cleaning
- Thinking Process Removal: Complete elimination of <think> tags
- Reasoning Phrase Removal: Strips phrases such as "Let me think", "First", "Therefore"
- Answer Isolation: Extracts only the final answer value
- Format Standardization: Numbers, names, lists only
Exact-Match Compliance
- No Prefixes: Removes "The answer is", "Result:", etc.
- Clean Numbers: "42" not "42." or "The result is 42"
- Direct Facts: "Paris" not "The capital is Paris"
- Concise Lists: "red, blue, green" not "The colors are red, blue, and green"
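A heuristic check for these rules might look like the sketch below. `is_exact_match_ready` and its prefix list are hypothetical, derived only from the examples listed here; GAIA's real scorer compares against ground truth, which a format check like this cannot replace.

```python
import re

# Assumed list of disallowed lead-ins, taken from the examples above.
_BAD_PREFIXES = (
    "the answer is",
    "the result is",
    "result:",
    "the capital is",
    "the colors are",
)


def is_exact_match_ready(answer: str) -> bool:
    """Return True if the answer looks like a bare value, name, or list."""
    text = answer.strip()
    if not text or "\n" in text:
        return False
    if text.lower().startswith(_BAD_PREFIXES):
        return False
    if text.endswith("."):
        return False  # "42." fails; "42" passes
    if re.search(r",\s*and\s", text, re.IGNORECASE):
        return False  # lists should read "red, blue, green" without "and"
    return True


for candidate in ["42", "42.", "The capital is Paris", "red, blue, green"]:
    print(repr(candidate), is_exact_match_ready(candidate))
```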
API Submission Ready
- JSON Format: Perfect structure for GAIA API
- Error Handling: Graceful failures with default responses
- Validation: Built-in compliance checking before submission
- Logging: Detailed tracking for debugging
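As an illustration of the submission format, the sketch below serializes answers into a JSON body. The field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) and the `"unknown"` default are assumptions, not confirmed by this README, so verify them against the scoring API before use. `build_submission` is a hypothetical helper.

```python
import json
from typing import Dict


def build_submission(username: str, agent_code: str, answers: Dict[str, str]) -> str:
    """Serialize answers keyed by task id into a JSON submission body."""
    payload = {
        "username": username.strip(),
        "agent_code": agent_code,
        "answers": [
            # Empty answers are replaced by a default so every task has an entry.
            {"task_id": task_id, "submitted_answer": answer.strip() or "unknown"}
            for task_id, answer in answers.items()
        ],
    }
    return json.dumps(payload)


body = build_submission(" alice ", "https://example.com/code", {"t1": "42", "t2": " "})
print(body)
```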
Deployment
Local Development
python app.py
# Access at http://localhost:7860
Hugging Face Spaces
- Fork this repository
- Create new Space on Hugging Face
- Set HF_TOKEN and OPENAI_API_KEY as repository secrets
- Deploy automatically with OAuth enabled
Production Optimization
- Multi-model fallback ensures high availability
- Aggressive caching for common questions
- API rate limit management
- Comprehensive error handling
GAIA Benchmark Ready!
Your GAIA-optimized multi-agent system is specifically designed for:
- Exact-Match Evaluation with clean, direct answers
- Multi-Model Intelligence via DeepSeek-R1 and 9 other models
- Reliable Fallback for 100% question coverage
- Perfect Compliance with GAIA submission requirements
- Production Ready with comprehensive testing
Target Achievement: 30%+ score on GAIA Level 1 questions for course completion
Next Steps:
- Set your HF_TOKEN and OPENAI_API_KEY
- Run python test_gaia.py to verify compliance
- Deploy to HuggingFace Spaces
- Submit to GAIA benchmark!
Note: The system provides reliable fallback responses even without API keys, ensuring baseline functionality for all question types.