	---
	title: 🚀 GAIA Multi-Agent System - BENCHMARK OPTIMIZED
	emoji: 🕵🏻‍♂️
	colorFrom: indigo
	colorTo: indigo
	sdk: gradio
	sdk_version: 5.25.2
	app_file: app.py
	pinned: false
	hf_oauth: true
	# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
	hf_oauth_expiration_minutes: 480
	---

	# 🚀 GAIA Multi-Agent System - BENCHMARK OPTIMIZED

	An AI agent system tuned for the GAIA benchmark's exact-match evaluation: it strips reasoning from model output and returns only the direct answer, formatted for submission.

	## 🎯 GAIA Benchmark Compliance

	### 🔥 Exact-Match Optimization
	- Direct Answers Only: No "The answer is" prefixes or explanations
	- Clean Responses: Complete removal of thinking processes and reasoning
	- Perfect Formatting: Numbers, facts, or comma-separated lists as required
	- API-Ready: Responses formatted exactly for GAIA submission

	### 🧠 Multi-Model AI Integration
	- 10+ AI Models: DeepSeek-R1, GPT-4o, Llama-3.3-70B, Kimi-Dev-72B, and more
	- 6 AI Providers: Together, Novita, Featherless-AI, Fireworks-AI, HuggingFace, OpenAI
	- Priority-Based Fallback: Intelligent model selection with graceful degradation
	- Aggressive Cleaning: Specialized extraction for benchmark compliance
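
Priority-based fallback can be pictured as walking an ordered list of model clients and returning the first usable response. The sketch below is illustrative only: `ModelClient`, `query_with_fallback`, and the stub clients are hypothetical names, not the project's actual API, and real provider calls would replace the stubs.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a model client is any callable that takes a question
# and returns raw text, raising an exception when the provider fails.
ModelClient = Callable[[str], str]


def query_with_fallback(
    question: str,
    clients: Sequence[Tuple[str, ModelClient]],
) -> Optional[Tuple[str, str]]:
    """Try each (name, client) pair in priority order.

    Returns (model_name, raw_response) from the first client that yields a
    non-empty response, or None if every client fails.
    """
    for name, client in clients:
        try:
            response = client(question)
        except Exception:
            # Degrade gracefully: move on to the next model in the list.
            continue
        if response and response.strip():
            return name, response
    return None


# Stub clients standing in for real providers (assumptions for illustration).
def broken_client(question: str) -> str:
    raise RuntimeError("provider unavailable")


def backup_client(question: str) -> str:
    return "Berlin"


result = query_with_fallback(
    "What is the capital of Germany?",
    [("primary", broken_client), ("backup", backup_client)],
)
print(result)
```

Keeping the priority order in a plain list makes it easy to reorder models or drop a provider whose key is missing.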

	### ⚡ Performance Features
	- Fallback Speed: <100ms responses for common questions
	- High Accuracy: Optimized for GAIA Level 1 questions (targeting 30%+ score)
	- Exact Match: Designed for GAIA's strict evaluation criteria
	- Response Validation: Built-in compliance checking

	## 🏗️ GAIA-Optimized Architecture

	### Core Components

	```
	🎯 GAIA Benchmark-Optimized System
	├── 🤖 BasicAgent (GAIA Interface)
	├── 🧠 MultiModelGAIASystem (Optimized Core)
	├── 🔧 Multi-Provider AI Clients (10+ Models)
	│ ├── 🔥 Together (DeepSeek-R1, Llama-3.3-70B)
	│ ├── ⚡ Novita (MiniMax-M1-80k, DeepSeek variants)
	│ ├── 🪶 Featherless-AI (Kimi-Dev-72B, Jan-nano)
	│ ├── 🚀 Fireworks-AI (Llama-3.1-8B)
	│ ├── 🤗 HF-Inference (Specialized tasks)
	│ └── 🤖 OpenAI (GPT-4o, GPT-3.5-turbo)
	├── 🛡️ Enhanced Fallback System (Exact answers)
	├── 🧽 Aggressive Response Cleaning (Benchmark compliance)
	└── 🎨 Gradio Interface (GAIA evaluation ready)
	```

	### GAIA Processing Pipeline

	1. Question Analysis → Determine question type and expected format
	2. Fallback Check → Fast, accurate answers for simple questions
	3. AI Model Query → Multi-model reasoning with DeepSeek-R1 priority
	4. Response Extraction → Aggressive cleaning to remove all reasoning
	5. Format Compliance → Final validation for exact-match submission
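
The five stages above can be sketched as a single function that tries a rule-based fallback before calling a model. This is a minimal sketch under stated assumptions: `arithmetic_fallback`, `answer_question`, `stub_model`, and `strip_think` are hypothetical names, and the fallback only recognizes questions of the form "What is A op B?".

```python
import re
from typing import Callable, Optional


def arithmetic_fallback(question: str) -> Optional[str]:
    """Hypothetical fast path: answer 'What is A op B?' without a model."""
    match = re.fullmatch(
        r"\s*what is\s+(-?\d+)\s*([+\-*])\s*(-?\d+)\s*\??\s*",
        question,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    return str({"+": a + b, "-": a - b, "*": a * b}[op])


def answer_question(
    question: str,
    fallback: Callable[[str], Optional[str]],
    query_model: Callable[[str], str],
    clean: Callable[[str], str],
) -> str:
    # Stages 1-2: if the fallback recognizes the question, answer directly.
    quick = fallback(question)
    if quick is not None:
        return quick.strip()
    # Stage 3: otherwise ask a model for a raw response.
    raw = query_model(question)
    # Stage 4: strip reasoning from the raw response.
    answer = clean(raw).strip()
    # Stage 5: keep a single line so the result suits exact-match scoring.
    if "\n" in answer:
        answer = answer.splitlines()[-1].strip()
    return answer


# Stubs standing in for a real model client and cleaner (assumptions).
def stub_model(question: str) -> str:
    return "<think>Germany's capital city is...</think>Berlin"


def strip_think(raw: str) -> str:
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)


print(answer_question("What is 15 + 27?", arithmetic_fallback, stub_model, strip_think))
print(answer_question("What is the capital of Germany?", arithmetic_fallback, stub_model, strip_think))
```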

	## 🚀 Getting Started

	### Installation

	```bash
	# Clone the repository
	git clone <your-repo-url>
	cd Final_Assignment_Template  # use the directory name your clone created

	# Create virtual environment
	python -m venv .venv
	source .venv/bin/activate # Linux/Mac
	# or
	.venv\Scripts\activate # Windows

	# Install dependencies
	pip install -r requirements.txt
	```

	### Configuration

	1. Set HF Token (Required for AI models):
	```bash
	export HF_TOKEN="your_hf_token_here"
	```

	2. Set OpenAI Key (Optional, for GPT models):
	```bash
	export OPENAI_API_KEY="your_openai_key_here"
	```

	3. Test GAIA Compliance:
	```bash
	python test_gaia.py
	```

	4. Launch Web Interface:
	```bash
	python app.py
	```

	## 🧪 Testing & Validation

	### GAIA Compliance Testing

	```bash
	# Run comprehensive GAIA compliance tests
	python test_gaia.py

	# Expected output:
	# ✅ Responses are GAIA compliant
	# ✅ Reasoning is properly cleaned
	# ✅ API format is correct
	# ✅ Ready for exact-match evaluation
	```

	### Expected GAIA Results
	- ✅ Math: "What is 15 + 27?" → "42" (not "The answer is 42")
	- ✅ Geography: "What is the capital of Germany?" → "Berlin" (not "The capital of Germany is Berlin")
	- ✅ Science: "How many planets are in our solar system?" → "8" (not "There are 8 planets")
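
A check like the one these examples describe can be written as a strict string comparison against a small table of expected answers. The sketch below does not reproduce `test_gaia.py`; `EXPECTED`, `find_mismatches`, and both stub agents are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Question/answer pairs taken from the examples above.
EXPECTED: Dict[str, str] = {
    "What is 15 + 27?": "42",
    "What is the capital of Germany?": "Berlin",
    "How many planets are in our solar system?": "8",
}


def find_mismatches(
    agent: Callable[[str], str],
    expected: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (question, wanted, got) for every answer that is not an exact match."""
    mismatches = []
    for question, wanted in expected.items():
        got = agent(question)
        if got != wanted:  # exact match: no trimming, no case folding
            mismatches.append((question, wanted, got))
    return mismatches


# Two stub agents: one answers directly, one wraps the answer in a sentence.
def direct_agent(question: str) -> str:
    return EXPECTED[question]


def verbose_agent(question: str) -> str:
    return f"The answer is {EXPECTED[question]}"


print(find_mismatches(direct_agent, EXPECTED))
print(len(find_mismatches(verbose_agent, EXPECTED)))
```

The verbose agent fails every case even though it contains the right value, which is the failure mode the cleaning steps are meant to prevent.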

	## 📊 GAIA Benchmark Performance

	### Target Metrics
	- Level 1 Questions: Targeting 30%+ accuracy for course completion
	- Response Time: <5 seconds average per question
	- Compliance Rate: 90%+ exact-match format compliance
	- Fallback Coverage: 100% availability even without AI models

	### Question Types Optimized

	| Type | GAIA Format | Example Response |
	|------|-------------|------------------|
	| 🧮 Mathematical | Just the number | "42" |
	| 🌍 Geographical | Just the place name | "Paris" |
	| 🔬 Scientific | Just the fact/value | "8" |
	| 📝 Factual | Concise answer | "H2O" |
	| 📊 Lists | Comma-separated | "apples, oranges, bananas" |

	## 🔧 Technical Implementation

	### Response Cleaning Process

	```python
	# GAIA-optimized cleaning pipeline (outline of the steps, applied in order):
	# 1. Remove <think> tags completely
	# 2. Extract explicit answer markers
	# 3. Remove reasoning phrases
	# 4. Clean formatting artifacts
	# 5. Validate compliance
	# 6. Return direct answer only
	```
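
One way to realize these steps with regular expressions is sketched below. It is an assumption-laden illustration, not the project's actual cleaner: `clean_response`, the three patterns, and the choice to keep the last remaining line are all hypothetical.

```python
import re

# Hypothetical patterns; the real system may use different ones.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
ANSWER_MARKER = re.compile(r"(?:final answer|answer)\s*[:\-]\s*(.+)", re.IGNORECASE)
REASONING_LINE = re.compile(r"^(?:let me think|first|therefore)\b", re.IGNORECASE)


def clean_response(raw: str) -> str:
    # 1. Remove <think>...</think> blocks, including their contents.
    text = THINK_BLOCK.sub("", raw)
    # 2. If an explicit marker such as "Final answer:" exists, keep what follows
    #    the last one (up to the end of that line).
    markers = ANSWER_MARKER.findall(text)
    if markers:
        text = markers[-1]
    else:
        # 3. Otherwise drop lines that open with a reasoning phrase and keep
        #    the last remaining line.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        kept = [line for line in lines if not REASONING_LINE.match(line)]
        text = kept[-1] if kept else ""
    # 4. Clean formatting artifacts: markdown emphasis, quotes, trailing period.
    text = text.strip().lstrip("*`\"' ").rstrip("*`\"' .")
    # 5-6. Return the direct answer only (may be empty if nothing survived).
    return text


print(clean_response("<think>15 + 27 = 42</think>\nFinal answer: **42**."))
print(clean_response("Let me think about this.\nFirst, Germany is in Europe.\nBerlin"))
```

Stripping a trailing period also shortens answers that legitimately end in one (for example an abbreviation), so a production cleaner would need a narrower rule.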

	### Key Dependencies

	```txt
	gradio>=5.34.2 # Web interface with OAuth
	huggingface_hub # Multi-model AI integration
	transformers # Model support
	requests # API communication
	pandas # Results handling
	openai # GPT model access
	```

	### Environment Variables

	```bash
	# Required for HuggingFace models
	HF_TOKEN="hf_your_token_here"

	# Optional: only needed for OpenAI (GPT) models
	OPENAI_API_KEY="sk-your_openai_key_here"

	# Auto-set in HuggingFace Spaces
	SPACE_ID="your_space_id"
	SPACE_HOST="your_space_host"
	```
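
At startup, an application could read these variables to decide which providers are usable. The helper below is a hypothetical sketch (`detect_providers` is not part of the project); it only reports which variables are set and never logs their values.

```python
import os
from typing import Dict, Mapping


def detect_providers(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which credentials and Space variables are present and non-empty."""
    return {
        "huggingface": bool(env.get("HF_TOKEN")),
        "openai": bool(env.get("OPENAI_API_KEY")),
        "running_in_space": bool(env.get("SPACE_ID")),
    }


status = detect_providers()
for name, available in status.items():
    print(f"{name}: {'available' if available else 'not configured'}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.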

	## 🌟 GAIA Optimization Features

	### Aggressive Response Cleaning
	- Thinking Process Removal: Complete elimination of `<think>` tags and their contents
	- Reasoning Phrase Removal: Strips lead-ins such as "Let me think", "First", "Therefore"
	- Answer Isolation: Extracts only the final answer value
	- Format Standardization: Numbers, names, lists only

	### Exact-Match Compliance
	- No Prefixes: Removes "The answer is", "Result:", etc.
	- Clean Numbers: "42" not "42." or "The result is 42"
	- Direct Facts: "Paris" not "The capital is Paris"
	- Concise Lists: "red, blue, green" not "The colors are red, blue, and green"
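
A few of these rules can be sketched as a small normalizer. The code is illustrative and covers only some cases: `normalize_answer` and `PREFIX` are hypothetical names, and free-form lead-ins such as "The colors are" are not handled.

```python
import re

# Hypothetical prefix pattern covering "The answer is", "The result is",
# "Answer:", and "Result:".
PREFIX = re.compile(
    r"^(?:the\s+(?:final\s+)?(?:answer|result)\s+is|answer\s*:|result\s*:)\s*",
    re.IGNORECASE,
)


def normalize_answer(text: str) -> str:
    # Drop a leading "The answer is" / "Result:" style prefix.
    text = PREFIX.sub("", text.strip())
    # Drop a trailing period or space: "42." becomes "42".
    text = text.rstrip(". ")
    # Treat "a, b, and c" as a list; "1,000" has no space after the comma
    # and is left untouched.
    if re.search(r",\s", text):
        items = [
            re.sub(r"^and\s+", "", part.strip(), flags=re.IGNORECASE)
            for part in re.split(r",\s+", text)
        ]
        text = ", ".join(item for item in items if item)
    return text


for sample in ["The answer is 42.", "Result: 42", "red, blue, and green", "1,000"]:
    print(repr(normalize_answer(sample)))
```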

	### API Submission Ready
	- JSON Format: Perfect structure for GAIA API
	- Error Handling: Graceful failures with default responses
	- Validation: Built-in compliance checking before submission
	- Logging: Detailed tracking for debugging
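
A submission payload might be assembled as follows. The field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) and the `DEFAULT_ANSWER` placeholder are assumptions made for illustration; the README does not specify the API schema, so check it against the scoring endpoint you submit to.

```python
import json
from typing import Dict

# Hypothetical placeholder used when the agent produced no answer.
DEFAULT_ANSWER = "unknown"


def build_submission(username: str, agent_code: str, answers: Dict[str, str]) -> str:
    """Serialize answers keyed by task ID into a JSON submission body."""
    payload = {
        "username": username,
        "agent_code": agent_code,
        "answers": [
            {
                "task_id": task_id,
                # Fall back to a default so every task gets an entry.
                "submitted_answer": answer.strip() or DEFAULT_ANSWER,
            }
            for task_id, answer in answers.items()
        ],
    }
    return json.dumps(payload, ensure_ascii=False)


body = build_submission(
    "example-user",
    "https://example.com/agent-code",
    {"task-1": " 42 ", "task-2": ""},
)
print(body)
```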

	## 📈 Deployment

	### Local Development
	```bash
	python app.py
	# Access at http://localhost:7860
	```

	### Hugging Face Spaces
	1. Fork this repository
	2. Create new Space on Hugging Face
	3. Set `HF_TOKEN` and `OPENAI_API_KEY` as Space secrets (Settings → Variables and secrets)
	4. Deploy automatically with OAuth enabled

	### Production Optimization
	- Multi-model fallback ensures high availability
	- Aggressive caching for common questions
	- API rate limit management
	- Comprehensive error handling
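
Caching and rate-limit handling can be sketched with the standard library alone. The helpers below are hypothetical (`make_cached_agent`, `with_retries`), and exponential backoff is one common choice rather than a description of what the project does.

```python
import time
from functools import lru_cache
from typing import Callable, List, TypeVar

T = TypeVar("T")


def make_cached_agent(agent: Callable[[str], str], maxsize: int = 256) -> Callable[[str], str]:
    """Wrap an agent so repeated questions are answered from memory."""

    @lru_cache(maxsize=maxsize)
    def cached(question: str) -> str:
        return agent(question)

    return cached


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing call, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage with a stub agent that records how often it is really called.
calls: List[str] = []


def stub_agent(question: str) -> str:
    calls.append(question)
    return "42"


cached_agent = make_cached_agent(stub_agent)
cached_agent("What is 15 + 27?")
cached_agent("What is 15 + 27?")
print(len(calls))
```

Injecting `sleep` as a parameter keeps the retry helper testable without real delays.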

	## 🎯 GAIA Benchmark Ready!

	Your GAIA-optimized multi-agent system is specifically designed for:

	- 🎯 Exact-Match Evaluation with clean, direct answers
	- 🧠 Multi-Model Intelligence via DeepSeek-R1 and 9 other models
	- 🛡️ Reliable Fallback that returns a response for every question, even when no AI model is available
	- 📏 Perfect Compliance with GAIA submission requirements
	- 🚀 Production Ready with comprehensive testing

	Target Achievement: 30%+ score on GAIA Level 1 questions for course completion

	Next Steps:
	1. Set your `HF_TOKEN` and `OPENAI_API_KEY`
	2. Run `python test_gaia.py` to verify compliance
	3. Deploy to HuggingFace Spaces
	4. Submit to GAIA benchmark! 🚀

	Note: The system provides reliable fallback responses even without API keys, ensuring baseline functionality for all question types.