A newer version of the Gradio SDK is available:
5.34.2
title: GAIA Agent Project
emoji: π±
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
hf_oauth: true
GAIA Agent Project
AI agent for the GAIA benchmark, built for the Hugging Face Agents Course Certificate of Excellence.
Overview
This project implements an AI agent that can solve tasks from the GAIA (General AI Assistants) benchmark. The agent uses xAI's Grok API for reasoning and includes tools for web search, file handling, and mathematical calculations.
Goal
Achieve β₯30% score on the GAIA benchmark to earn the Certificate of Excellence from the Hugging Face Agents Course.
Project Structure
βββ agent.py # Main GAIA agent implementation
βββ tools.py # Tool implementations (web search, file handling)
βββ evaluate.py # Evaluation script and scoring
βββ test_agent.py # Test suite for verification
βββ requirements.txt # Python dependencies
βββ README.md # This file
βββ .gitignore # Git ignore rules
βββ submission.jsonl # Generated submission file
Setup
1. Install Dependencies
pip install -r requirements.txt
2. API Configuration
The agent uses xAI's Grok API. The API key is already configured in the code for this project.
3. Optional: SerpAPI for Enhanced Web Search
For better web search results, you can sign up for SerpAPI:
- Visit https://serpapi.com/ and create an account
- Get your API key
- Update the
serpapi_key
inagent.py
Usage
Quick Test
Run the test suite to verify everything is working:
python test_agent.py
Full Evaluation
Run the full evaluation on sample tasks:
python evaluate.py
Run with maximum number of tasks limit:
python evaluate.py --max-tasks 10
Run with custom dataset:
python evaluate.py --dataset path/to/gaia_dataset.jsonl
Components
Agent (agent.py
)
- GAIAAgent: Main agent class that processes GAIA tasks
- call_grok(): Interface to xAI Grok API with retry logic
- process_task(): Main task processing pipeline
- extract_final_answer(): Extracts formatted answers from responses
Tools (tools.py
)
- web_search(): Web search with SerpAPI fallback to DuckDuckGo
- read_file(): Handles text, CSV, and image files
- execute_code(): Safe Python code execution (limited)
- calculate_simple_math(): Basic mathematical calculations
Evaluation (evaluate.py
)
- evaluate_agent(): Main evaluation function
- load_gaia_dataset(): Loads GAIA dataset from JSON/JSONL
- normalize_answer(): Normalizes answers for comparison
- create_sample_dataset(): Creates sample tasks for testing
Features
- β xAI Grok API integration with retry logic
- β Web search capabilities (SerpAPI + DuckDuckGo fallback)
- β Multi-format file handling (text, CSV, images)
- β OCR support for image-based tasks (with pytesseract)
- β Safe code execution environment
- β Comprehensive evaluation system
- β JSONL submission format generation
- β Progress tracking and scoring
GAIA Task Types
The agent handles different GAIA task levels:
- Level 1: Simple questions requiring basic knowledge
- Level 2: Multi-step reasoning tasks
- Level 3: Complex tasks involving files, images, or code
Sample Tasks
The evaluation includes sample tasks like:
- Basic arithmetic: "What is 15 + 27?"
- General knowledge: "What is the capital of France?"
- Date calculations: "How many days are in a leap year?"
- Multi-step math: "What is 2 * 6 * 7?"
- Historical facts: "What year did World War II end?"
Scoring
- Target: β₯30% accuracy for Certificate of Excellence
- Current leaderboard top score: ~76%
- Evaluation provides detailed per-task feedback
- Generates
submission.jsonl
in required format
Troubleshooting
API Issues
- Verify internet connection
- Check API key validity
- Monitor rate limits
Import Errors
- Ensure all dependencies are installed:
pip install -r requirements.txt
- For OCR: Install system dependency
tesseract-ocr
File Reading Issues
- Check file paths and permissions
- Verify file formats are supported
Development
Testing
Run the test suite before making changes:
python test_agent.py
Adding New Tools
- Implement the tool function in
tools.py
- Import and use in
agent.py
- Add tests in
test_agent.py
Improving Performance
- Optimize prompts for better reasoning
- Add more sophisticated web search
- Enhance file processing capabilities
- Implement better answer extraction
Submission
- Run evaluation:
python evaluate.py
- Upload
submission.jsonl
to the Hugging Face leaderboard - Verify score β₯30% for certificate eligibility
Resources
License
This project is created for educational purposes as part of the Hugging Face Agents Course.
Good luck achieving the 30% score for your Certificate of Excellence! π