CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a production-ready GAIA benchmark AI agent that achieves 85% accuracy using a multi-agent architecture. The system has been refactored into a modular, maintainable architecture and specializes in complex question answering across the multimedia, research, file processing, chess analysis, and mathematical reasoning domains.

Development Commands

Setup and Installation

# Install dependencies
pip install -r requirements.txt

# Test API key configuration
python test_api_keys.py

# Verify core functionality
python -c "from main import GAIASolver; print('✅ Core GAIASolver available')"

Running the System

# Run legacy monolithic solver
python main.py

# Run refactored modular solver (recommended)
python main_refactored.py

# Run Gradio web interface
python app.py

Testing Commands

# Comprehensive async testing
python async_complete_test.py

# Test question classification
python test_improved_classification.py
python final_classification_test.py

# Test YouTube functionality
python direct_youtube_test.py
python simple_youtube_test.py
python test_youtube_question.py

# Test individual components
python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
python -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('✅ Classifier ready')"

Architecture Overview

Dual Architecture Design

This project maintains both legacy monolithic and refactored modular architectures:

Legacy Architecture (main.py):

Monolithic 1285-line solver with all functionality integrated
Comprehensive tool collection in gaia_tools.py (4887 lines)
Single-file approach for rapid development and deployment

Refactored Architecture (gaia/ package):

gaia/
├── core/           # Main solver logic
│   ├── solver.py           # GAIASolver main class
│   ├── answer_extractor.py # Specialized answer extraction classes
│   └── question_processor.py # Question classification and processing
├── tools/          # Tool implementations  
│   ├── base.py            # Abstract tool interface and registry
│   ├── registry.py        # Tool discovery and management
│   └── [specialized tool modules]
├── models/         # Model providers and management
│   ├── manager.py         # ModelManager with fallback chains
│   └── providers.py       # LiteLLM, Gemini, Kluster providers
├── config/         # Configuration management
│   └── settings.py        # Config, ModelConfig classes
└── utils/          # Utilities and helpers
    ├── exceptions.py      # Custom exception hierarchy
    └── logging.py         # Logging configuration
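
The tools/ layout pairs an abstract tool interface with a registry. A minimal sketch of that pattern follows. The names BaseTool, ToolRegistry, and EchoTool are hypothetical and are not taken from the repository, so treat this as an illustration of the idea rather than the actual interface.

```python
from abc import ABC, abstractmethod


class BaseTool(ABC):
    """Hypothetical abstract tool interface (not the repository's actual class)."""

    name: str = ""

    @abstractmethod
    def run(self, query: str) -> str:
        """Execute the tool and return its result as text."""


class ToolRegistry:
    """Hypothetical registry mapping tool names to tool instances."""

    def __init__(self) -> None:
        self._tools: dict[str, BaseTool] = {}

    def register(self, tool: BaseTool) -> None:
        # Reject unnamed and duplicate tools so lookups stay unambiguous.
        if not tool.name:
            raise ValueError("tool must define a non-empty name")
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def get(self, name: str) -> BaseTool:
        return self._tools[name]

    def names(self) -> list[str]:
        return sorted(self._tools)


class EchoTool(BaseTool):
    """Toy tool used only to exercise the registry."""

    name = "echo"

    def run(self, query: str) -> str:
        return query


registry = ToolRegistry()
registry.register(EchoTool())
```

Keeping registration explicit makes it easy to see which tools a solver instance can call, at the cost of one register() call per tool.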

Core Components

GAIASolver (main.py): Legacy monolithic solver with 1000+ lines of sophisticated processing logic
GAIASolver (gaia/core/solver.py): Refactored main orchestrator using dependency injection
QuestionClassifier: LLM-based intelligent routing with pattern-based fallbacks
GAIA_TOOLS: 42 specialized tools including enhanced Wikipedia research, chess analysis, Excel processing, and multimedia analysis
ModelManager: Handles model initialization, fallback chains (Kluster.ai → Gemini → Qwen), and lifecycle management
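
The "LLM-based routing with pattern-based fallbacks" idea can be sketched as follows. The category labels, the regex patterns, and the llm_classify callable are all assumptions made for illustration; the real QuestionClassifier may differ in every detail.

```python
import re
from typing import Callable

# Hypothetical category labels and patterns, chosen for illustration only.
FALLBACK_PATTERNS = {
    "multimedia": re.compile(r"youtube\.com|youtu\.be|\bvideo\b|\baudio\b", re.I),
    "chess": re.compile(r"\bchess\b|\bcheckmate\b", re.I),
    "file_processing": re.compile(r"\.xlsx?\b|\.py\b|\battached file\b", re.I),
}
DEFAULT_CATEGORY = "research"
VALID_CATEGORIES = set(FALLBACK_PATTERNS) | {DEFAULT_CATEGORY}


def classify_by_pattern(question: str) -> str:
    """Return the first category whose pattern matches, else the default."""
    for category, pattern in FALLBACK_PATTERNS.items():
        if pattern.search(question):
            return category
    return DEFAULT_CATEGORY


def classify(question: str, llm_classify: Callable[[str], str]) -> str:
    """Try the LLM first; fall back to patterns on errors or unknown labels."""
    try:
        label = llm_classify(question).strip().lower()
    except Exception:
        # Any LLM failure (network, quota, parsing) routes to the patterns.
        return classify_by_pattern(question)
    return label if label in VALID_CATEGORIES else classify_by_pattern(question)
```

Passing the LLM call in as a parameter keeps the fallback path testable without network access.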

Question Type Specialization

Research Questions (92% accuracy):

Enhanced Wikipedia tools with date-specific searches and Featured Articles integration
Multi-step research coordination with cross-validation
Anti-hallucination safeguards to prevent fabrication

Chess Questions (100% accuracy):

Universal FEN correction system handling any vision error pattern
Multi-tool consensus system for maximum accuracy
Perfect algebraic notation extraction
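
One plausible reading of "multi-tool consensus" is a vote over the moves proposed by independent tools. The sketch below assumes a simple majority rule with tie rejection. The function name consensus_move and the min_votes threshold are hypothetical, and the repository's actual consensus logic may weigh tools differently.

```python
from collections import Counter
from typing import Optional


def consensus_move(candidates: list[Optional[str]], min_votes: int = 2) -> Optional[str]:
    """Return the move proposed by the most tools, or None without agreement."""
    # Ignore tools that returned nothing or only whitespace.
    votes = Counter(move.strip() for move in candidates if move and move.strip())
    if not votes:
        return None
    (best, count), *rest = votes.most_common()
    # Reject ties and weak majorities rather than guessing.
    if count < min_votes or (rest and rest[0][1] == count):
        return None
    return best
```

Returning None on disagreement lets the caller escalate, for example by re-running the position analysis, instead of emitting a low-confidence move.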

YouTube/Multimedia Questions:

Enhanced URL detection with multiple regex patterns
Forced classification override for YouTube content
Specialized prompts with explicit tool usage instructions
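
URL detection with several regex patterns, followed by a forced classification override, might look roughly like this. The patterns, the 11-character video ID assumption, and the "multimedia" label are illustrative guesses and not the repository's actual values.

```python
import re
from typing import Optional

# Illustrative patterns only; each captures an 11-character video ID.
YOUTUBE_PATTERNS = [
    re.compile(r"(?:https?://)?(?:www\.|m\.)?youtube\.com/watch\?(?:[^\s#]*&)?v=([\w-]{11})"),
    re.compile(r"(?:https?://)?youtu\.be/([\w-]{11})"),
    re.compile(r"(?:https?://)?(?:www\.)?youtube\.com/(?:embed|shorts)/([\w-]{11})"),
]


def find_youtube_id(text: str) -> Optional[str]:
    """Return the first video ID found by any pattern, or None."""
    for pattern in YOUTUBE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def classify_with_override(question: str, predicted: str) -> str:
    """Force the (hypothetical) multimedia category when a YouTube URL is present."""
    return "multimedia" if find_youtube_id(question) else predicted
```

Applying the override after classification means a confident but wrong label from the classifier cannot route a video question away from the video tools.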

File Processing (100% accuracy):

Format-specific tools for Excel (.xlsx/.xls), Python (.py), text files
Deterministic Python execution with sandboxed environment
Financial calculation specialization with proper currency formatting

Environment Configuration

Required API Keys (set in .env)

GEMINI_API_KEY - Primary model (Gemini Flash 2.0)
HUGGINGFACE_TOKEN - Fallback model and classification
KLUSTER_API_KEY - Optional premium model access
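
A quick check for the keys listed above could be sketched as follows. It assumes the values from .env have already been loaded into the process environment by whatever mechanism the project uses, and it treats KLUSTER_API_KEY as optional, as described. The helper name missing_keys is hypothetical; test_api_keys.py may work differently.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("GEMINI_API_KEY", "HUGGINGFACE_TOKEN")
OPTIONAL_KEYS = ("KLUSTER_API_KEY",)  # absence is not an error


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required keys: " + ", ".join(missing))
    else:
        print("All required keys are set")
```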

Model Fallback Chain

Kluster.ai (Qwen3-235B, Gemma3-27B) - Premium option
Gemini Flash 2.0 - Primary production model
Qwen 2.5-72B - Reliable fallback via HuggingFace
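
The chain above amounts to trying providers in order until one succeeds. Here is a minimal sketch, assuming each provider can be wrapped as a plain callable that takes a prompt. The function complete_with_fallback and the Provider alias are hypothetical and do not describe ModelManager's real interface.

```python
from typing import Callable, Sequence

# (provider name, callable that takes a prompt and returns a completion)
Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name with the response makes it possible to log which link of the chain actually answered a given question.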

Key Design Patterns

Anti-Hallucination Architecture

Tool result prioritization: Always uses exact tool outputs over internal reasoning
Cross-validation: Multiple verification methods for critical information
Source attribution: Clear tracking and validation of information sources
Validation rules: Type-specific answer extraction and verification

Performance Optimizations

Fresh agent creation for each question to avoid token accumulation
Concurrent processing support with async operations
15-minute web cache for improved response times
Exponential backoff for API rate limiting
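
The caching and backoff items can be illustrated with a small time-to-live cache and a retry helper. This is a sketch under stated assumptions: TTLCache and retry_with_backoff are hypothetical names, the jitter scheme is one common choice, and the base and maximum delays are arbitrary defaults. Only the 15-minute lifetime comes from the list above.

```python
import random
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired, drop it
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)


def retry_with_backoff(call: Callable[[], Any], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry call() with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% jitter so concurrent workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Injecting the clock and the sleep function keeps both helpers testable without waiting in real time.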

File Organization

Core Files

main.py - Legacy monolithic solver (1285 lines)
main_refactored.py - Entry point for refactored architecture
gaia_tools.py - 42 specialized tools with robust error handling (4887 lines)
question_classifier.py - LLM + pattern-based classification system
app.py - Production Gradio interface with comprehensive error handling

Supporting Files

async_complete_test.py - Comprehensive async testing infrastructure
enhanced_wikipedia_tools.py - Advanced Wikipedia research capabilities
universal_fen_correction.py - Chess-specific FEN notation correction
wikipedia_featured_articles_by_date.py - Date-specific Wikipedia searches

Local Configuration Notes

The Hugging Face token (HUGGINGFACE_TOKEN) can be read from the secrets in .env.