--- title: KnowledgeBridge emoji: 📚 colorFrom: yellow colorTo: red sdk: docker pinned: false license: mit short_description: 'A sophisticated AI-powered knowledge retrieval and analysis ' tags: - agent-demo-track --- # KnowledgeBridge 🚀 **An AI-Enhanced Knowledge Discovery Platform with Document Processing & Vector Search** A production-ready AI-powered knowledge retrieval system featuring real document upload, OCR processing, vector embeddings, and distributed computing for large-scale document analysis and semantic search. ![Security Status](https://img.shields.io/badge/Security-Hardened-green) ![TypeScript](https://img.shields.io/badge/TypeScript-100%25-blue) ![AI Models](https://img.shields.io/badge/AI-Nebius%20DeepSeek-purple) ![License](https://img.shields.io/badge/License-MIT-yellow) ## 🎯 Hackathon Submission **🤖 Track 3: Agentic Demo Showcase** **Submitted to**: [Hugging Face Agents-MCP-Hackathon](https://huggingface.co/Agents-MCP-Hackathon) **Live Demo**: [Try KnowledgeBridge on Hugging Face Spaces](https://huggingface.co/spaces/Agents-MCP-Hackathon/KnowledgeBridge [Video Link]{https://drive.google.com/drive/folders/1iQafhb7PmO6zWW-JDq1eWGo8KN10Ctdf?usp=sharing} ### **🚀 "Show us the most incredible things that your agents can do!"** KnowledgeBridge demonstrates sophisticated AI agent orchestration through multi-modal knowledge discovery, intelligent query enhancement, and autonomous research synthesis. ## 🤖 Agentic Capabilities Showcase ### 🧠 **Multi-Agent Orchestration** - **Coordinated Search Agents**: Simultaneous deployment across GitHub, Wikipedia, ArXiv, and web sources - **Intelligent Load Balancing**: Agents dynamically distribute workload based on query type and source availability - **Fallback Agent Strategy**: Backup agents activate when primary sources fail or timeout - **Real-Time Coordination**: Agents communicate results and adapt search strategies collaboratively ### 🔍 **Query Enhancement Agents** - **Intent Recognition Agents**: AI agents analyze user intent and suggest optimal search strategies - **Semantic Expansion Agents**: Agents enhance queries with related terms and concepts - **Context-Aware Agents**: Agents consider previous searches and user preferences - **Multi-Modal Query Agents**: Agents adapt search approach based on content type (code, academic, general) ### 📊 **Document Processing & Analysis Agents** - **OCR Processing Agents**: Autonomous PDF and image text extraction using Modal's distributed Tesseract OCR - **Vector Embedding Agents**: Generate 1536-dimensional embeddings and build FAISS indices at scale - **Batch Processing Agents**: Coordinate distributed document processing across Modal compute nodes - **Research Synthesis Agents**: AI agents combine insights from multiple sources into coherent analysis - **Quality Assessment Agents**: Agents evaluate source credibility and content relevance ### 🛡️ **Security & Validation Agents** - **URL Validation Agents**: Intelligent agents verify link accessibility and content authenticity - **Rate Limiting Agents**: Protective agents prevent API abuse (100 requests/15min, 10/min for sensitive endpoints) - **Input Sanitization Agents**: Security agents validate and clean all user inputs - **Error Recovery Agents**: Resilient agents handle failures gracefully and maintain system stability ### 🌐 **Intelligent Integration Agents** - **ArXiv Academic Agents**: Specialized agents for academic paper validation and retrieval - **GitHub Repository Agents**: Code-focused agents with author filtering and relevance scoring - **Wikipedia Knowledge Agents**: Authoritative content agents with intelligent caching strategies - **Cross-Platform Synthesis Agents**: Agents that combine and rank results across all sources ## 🏗️ Technical Architecture ### **Frontend Stack** - **React 18** with TypeScript for type-safe development - **Wouter Router** for lightweight client-side routing - **TanStack Query** for efficient data fetching and caching - **Radix UI + Tailwind CSS** for accessible, modern components - **Framer Motion** for smooth animations and transitions ### **Backend Stack** - **Node.js + Express** with comprehensive middleware - **SQLite Database** with real document storage and metadata - **File Upload System** supporting PDFs, images, text files (50MB each) - **Express Rate Limit** for API protection - **Helmet.js** for security headers ### **AI & Distributed Computing** - **Nebius AI Platform** - Advanced LLM and embedding capabilities - **DeepSeek-R1-0528** for chat completions and document analysis - **BAAI/bge-en-icl** for embedding generation (1536 dimensions) - **Query Enhancement** and intelligent content analysis - **Modal.com Platform** - Production heavy workloads - **OCR Processing**: PDF/image text extraction with PyPDF2 + Tesseract - **FAISS Vector Indexing**: Distributed index building for large document collections - **High-Performance Search**: Sub-second similarity search across millions of vectors - **Batch Processing**: Concurrent document processing with 2-4GB memory per task - **Persistent Storage**: Modal volumes for cross-session index storage ## 🚀 Quick Start ### **Environment Configuration** Create a `.env` file in the project root: ```bash # Nebius AI Configuration (Required) NEBIUS_API_KEY=your_nebius_api_key_here # Modal Configuration (Optional - for advanced processing) MODAL_TOKEN_ID=your_modal_token_id MODAL_TOKEN_SECRET=your_modal_token_secret MODAL_BASE_URL=https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run # GitHub Configuration (Optional - for repository search) GITHUB_TOKEN=your_github_token_here # Node Environment NODE_ENV=development ``` ### **Development Setup** ```bash # Install dependencies npm install # Start development server npm run dev # Build for production npm run build # Type checking npm run check ``` The application will be available at `http://localhost:5000` ## 🎯 Usage Guide ### **Document Upload & Processing** 1. **Upload Documents**: Drag and drop PDFs, images, text files (up to 50MB each) 2. **Automatic Processing**: OCR extraction via Modal for PDFs/images, embedding generation 3. **Status Tracking**: Monitor processing status (pending → processing → completed) 4. **Batch Operations**: Process multiple documents and build vector indices ### **Vector Search** 1. **Semantic Search**: Query your processed documents using vector similarity 2. **Index Management**: Build FAISS indices from your document collections 3. **Performance Comparison**: Side-by-side vector vs. keyword search results 4. **Relevance Scoring**: AI-powered relevance scores with detailed metrics ### **AI-Enhanced Search** 1. **Traditional Search**: Natural language queries across web sources 2. **Query Enhancement**: AI-powered query improvement suggestions 3. **Multi-Source Results**: Combined results from GitHub, Wikipedia, ArXiv 4. **Research Synthesis**: AI analysis and synthesis of search results ### **Knowledge Management** - **Document Library**: Manage uploaded documents with metadata - **Citation Generation**: Export results in multiple academic formats - **Knowledge Graph**: Interactive visualization of document relationships ## 🔧 API Reference ### **Document Management** ```typescript POST /api/documents/upload // Multipart form data with files[] // Optional: title, source GET /api/documents/list // Query params: limit, offset, sourceType, processingStatus POST /api/documents/process/:id { operations: ["extract_text", "generate_embedding", "build_index"]; indexName?: string; } POST /api/documents/process/batch { documentIds: number[]; operations: ["extract_text", "generate_embedding"]; indexName?: string; } DELETE /api/documents/:id // Deletes document and associated file ``` ### **Vector Search & Indexing** ```typescript POST /api/documents/search/vector { query: string; indexName?: string; maxResults?: number; } POST /api/documents/index/build { documentIds?: number[]; // Optional: specific documents indexName?: string; } GET /api/documents/status/:id // Returns processing status and metadata ``` ### **Traditional Search & AI** ```typescript POST /api/search { query: string; searchType: "semantic" | "keyword" | "hybrid"; limit: number; filters?: { sourceTypes?: string[]; }; } POST /api/analyze-document { content: string; analysisType: "summary" | "classification" | "key_points"; useMarkdown?: boolean; } POST /api/enhance-query { query: string; context?: string; } ``` ### **Health Check** ```typescript GET /api/health // Returns comprehensive health status of all services including: // - Nebius AI (embeddings, chat completions) // - Modal.com (API connectivity, function availability) // - External APIs (GitHub, Wikipedia, ArXiv) ``` ## 🚀 Performance & Reliability ### **Performance Metrics** - **Document Upload**: <1s for files up to 50MB with progress tracking - **OCR Processing**: 5-15 seconds per PDF/image via Modal distributed computing - **Vector Search**: <500ms for similarity search across large document collections - **Index Building**: 10-60 seconds for 100-1000 documents using FAISS - **Nebius AI**: - Document analysis: 3-5 seconds for comprehensive analysis - Embedding generation: 500ms-1s per document - Query enhancement: 1-2 seconds - **Traditional Search**: <100ms for local database queries ### **Production Scalability** - **Distributed Computing**: Modal automatically scales compute resources (2-4GB per task) - **Concurrent Processing**: Parallel document processing across multiple nodes - **Persistent Storage**: SQLite for metadata, Modal volumes for vector indices - **Batch Operations**: Process hundreds of documents simultaneously - **Intelligent Caching**: Optimized repeated operations and query results - **Graceful Fallbacks**: Continues operation when external services unavailable - **Resource Optimization**: Automatic cleanup and memory management ### **Error Handling** - React Error Boundaries prevent UI crashes - Comprehensive API error responses - Automatic retry logic for network requests - User-friendly error messages ## 🔒 Security Features ### **Input Protection** - Request body size limits (10MB) - Comprehensive input sanitization - SQL injection prevention - XSS protection with CSP headers ### **API Security** - Rate limiting on all endpoints - Secure environment variable handling - No hardcoded credentials - Proper error logging without information disclosure ### **Infrastructure Security** - Helmet.js security headers - CORS configuration - Secure cookie handling - Production-ready error handling ## 🛠️ Development ### **Code Quality** - 100% TypeScript coverage - ESLint + Prettier configuration - Comprehensive error handling - Type-safe API contracts with Zod validation ### **Testing** ```bash # Type checking npm run check # Development server npm run dev # Production build npm run build ``` ## 🎉 Latest Features - ✅ **Document Upload System**: Real file upload with drag-and-drop, supporting PDFs, images, text files - ✅ **OCR Processing Pipeline**: Modal-powered text extraction from PDFs and images using Tesseract - ✅ **Vector Search Engine**: FAISS-based semantic search with distributed index building - ✅ **SQLite Database**: Persistent storage replacing in-memory data with full metadata tracking - ✅ **Batch Processing**: Concurrent document processing across Modal's distributed compute nodes - ✅ **Production Ready**: Real heavy workloads utilizing Modal's computational capabilities ## 📚 Production Architecture ### **Complete Document Processing Pipeline** **📄 Document Upload → 🔄 Processing → 🔍 Search → 📊 Analysis** 1. **Upload & Storage**: - Multi-file drag-and-drop interface (PDFs, images, text files) - SQLite database with full metadata tracking - File validation and organization by date 2. **Modal Distributed Processing**: - OCR text extraction using Tesseract for images/PDFs - Parallel processing across compute nodes (2-4GB per task) - Batch operations for large document collections 3. **AI Analysis & Embeddings**: - Nebius AI generates 1536-dimensional embeddings - Document classification and content analysis - Quality assessment and metadata enrichment 4. **Vector Index & Search**: - FAISS index building via Modal's distributed computing - High-performance semantic similarity search - Persistent storage across sessions ### **Service Integration & Division of Responsibilities** ## **🧠 Nebius AI: Language Intelligence & AI Reasoning** ### **Used For:** - **📝 Document Analysis**: Classification, summarization, key points extraction, quality scoring - **🔍 Search Intelligence**: Query enhancement, intent understanding, relevance scoring - **💭 AI Reasoning**: Research synthesis, explanations, conversational responses - **🎯 Embeddings**: Real-time text-to-vector conversion using BAAI/bge-en-icl model - **📊 Content Understanding**: All language comprehension and semantic analysis ### **Specific Endpoints:** - `/api/analyze-document` - Document analysis with DeepSeek-R1 model - `/api/enhance-query` - AI-powered query improvement - `/api/embeddings` - Generate vector embeddings - `/api/research-synthesis` - Combine insights from multiple sources - `/api/ai-search` - Enhanced semantic search --- ## **⚡ Modal.com: Heavy Computation & Distributed Processing** ### **Used For:** - **📄 OCR Processing**: PDF and image text extraction using Tesseract - **🔧 Vector Operations**: FAISS index building and high-performance search - **📦 Batch Processing**: Concurrent processing of large document collections - **💾 Infrastructure**: Serverless scaling, persistent storage, distributed compute - **🚀 Heavy Workloads**: All computationally intensive operations ### **Specific Endpoints:** - `/api/documents/process/:id` - OCR text extraction via Modal - `/api/documents/index/build` - FAISS vector index creation - `/api/documents/search/vector` - High-performance vector search - `/api/documents/process/batch` - Distributed batch processing ### **Live Deployment**: [Modal App](https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run) --- ## **🔄 How They Work Together** ### **Document Processing Pipeline:** 1. **Upload** → Local file storage 2. **OCR** → **Modal** extracts text from PDFs/images 3. **Analysis** → **Nebius** analyzes content and generates embeddings 4. **Indexing** → **Modal** builds FAISS vector index 5. **Search** → **Modal** performs vector search, **Nebius** scores relevance ### **Search Workflow:** 1. **Query Enhancement** → **Nebius** improves user queries 2. **Vector Search** → **Modal** finds similar documents 3. **Traditional Search** → Local database + external APIs 4. **Ranking** → **Nebius** scores and ranks combined results 5. **Synthesis** → **Nebius** generates insights --- ## **📊 Clear Division:** | Feature | Nebius AI | Modal.com | |---------|-----------|-----------| | **OCR Processing** | ❌ | ✅ | | **Document Analysis** | ✅ | ❌ | | **Vector Search** | ❌ | ✅ | | **Query Enhancement** | ✅ | ❌ | | **Batch Processing** | ❌ | ✅ | | **Embeddings** | ✅ | ✅* | | **Research Synthesis** | ✅ | ❌ | *Modal only for batch embeddings, Nebius for real-time **Nebius = "The Brain"** (AI intelligence) **Modal = "The Engine"** (computational power) ### **Intelligent Fallbacks** - **Modal Unavailable**: Local processing for text files, basic search - **Nebius Unavailable**: Mock embeddings, simplified analysis - **Network Issues**: Cached results and offline functionality ## 🏆 Track 3: Agentic Demo Showcase Features ### **🤖 "Show us the most incredible things that your agents can do!"** KnowledgeBridge demonstrates sophisticated multi-agent systems in action: ### **🧠 Autonomous Agent Workflows** - **Smart Agent Coordination**: Multiple specialized agents work together to fulfill complex research tasks - **Adaptive Agent Behavior**: Agents dynamically adjust strategies based on query complexity and source availability - **Multi-Modal Agent Processing**: Different agent types (search, analysis, validation) collaborate seamlessly - **Intelligent Agent Fallbacks**: Backup agents activate automatically when primary agents encounter issues ### **🔍 Real-Time Agent Decision Making** - **Query Analysis Agents**: Instantly determine optimal search strategies across 4+ sources - **Load Balancing Agents**: Distribute workload intelligently based on API response times and rate limits - **Quality Control Agents**: Evaluate and filter results in real-time for relevance and authenticity - **Synthesis Agents**: Combine disparate information sources into coherent, actionable insights ### **📊 Advanced Agent Orchestration** - **Parallel Agent Execution**: Simultaneous deployment of search agents across GitHub, Wikipedia, ArXiv - **Agent Communication Protocols**: Real-time coordination between agents for optimal resource utilization - **Adaptive Agent Learning**: Agents improve performance based on user interactions and feedback - **Error Recovery Agents**: Autonomous problem-solving when individual agents encounter failures ### **🛡️ Production-Grade Agent Infrastructure** - **Security Agent Monitoring**: Continuous protection against abuse with intelligent rate limiting - **Validation Agent Networks**: Multi-layer content verification and URL authenticity checking - **Performance Agent Optimization**: Automatic scaling and resource management for enterprise workloads - **Resilience Agent Systems**: Graceful degradation and fault tolerance across all agent operations ### **⚡ Agent Performance Metrics** - **Sub-second Agent Response**: Query analysis and routing in <100ms - **Concurrent Agent Processing**: 4+ agents working simultaneously on complex research tasks - **Intelligent Agent Caching**: Smart result storage and retrieval for enhanced performance - **Scalable Agent Architecture**: Horizontal scaling support for enterprise deployment ## 📄 License MIT License - see [LICENSE](LICENSE) file for details. ## 🔗 Related Resources ### **AI Services** - [Nebius AI Documentation](https://docs.nebius.ai/) - Advanced language models and embeddings - [Modal Documentation](https://modal.com/docs) - Serverless computing platform - **Live Modal App**: [https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run](https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run) - **Modal API Docs**: [https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run/docs](https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run/docs) ### **Frontend Technologies** - [React Query Documentation](https://tanstack.com/query/latest) - [Radix UI Components](https://www.radix-ui.com/) - [Tailwind CSS](https://tailwindcss.com/) ### **AI Models** - [DeepSeek Models](https://platform.deepseek.com/) - Advanced reasoning capabilities - [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) - Embedding model for semantic search --- ## 🚀 Agents-MCP-Hackathon Submission Summary **KnowledgeBridge** showcases the incredible power of AI agents through: 🤖 **Multi-Agent Orchestration** - Coordinated intelligence across search, analysis, and synthesis agents 🔍 **Real-Time Decision Making** - Agents adapt strategies and optimize performance dynamically 📊 **Advanced Agent Workflows** - Complex multi-step processes handled autonomously 🛡️ **Production-Ready Agent Infrastructure** - Enterprise-grade security and resilience **Track 3: Agentic Demo Showcase** - Demonstrating what happens when sophisticated AI agents work together to revolutionize knowledge discovery and research workflows. **Built for the Hugging Face Agents-MCP-Hackathon** 🏆 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference