KnowledgeBridge / README.md
fazeel007's picture
Fix nebius AI
24425b1
metadata
title: KnowledgeBridge
emoji: πŸ“š
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: 'A sophisticated AI-powered knowledge retrieval and analysis '
tags:
  - agent-demo-track

KnowledgeBridge

πŸš€ An AI-Enhanced Knowledge Discovery Platform with Document Processing & Vector Search

A production-ready AI-powered knowledge retrieval system featuring real document upload, OCR processing, vector embeddings, and distributed computing for large-scale document analysis and semantic search.

Security Status TypeScript AI Models License

🎯 Hackathon Submission

πŸ€– Track 3: Agentic Demo Showcase

Submitted to: Hugging Face Agents-MCP-Hackathon

Live Demo: [Try KnowledgeBridge on Hugging Face Spaces](https://huggingface.co/spaces/Agents-MCP-Hackathon/KnowledgeBridge

[Video Link]{https://drive.google.com/drive/folders/1iQafhb7PmO6zWW-JDq1eWGo8KN10Ctdf?usp=sharing}

πŸš€ "Show us the most incredible things that your agents can do!"

KnowledgeBridge demonstrates sophisticated AI agent orchestration through multi-modal knowledge discovery, intelligent query enhancement, and autonomous research synthesis.

πŸ€– Agentic Capabilities Showcase

🧠 Multi-Agent Orchestration

  • Coordinated Search Agents: Simultaneous deployment across GitHub, Wikipedia, ArXiv, and web sources
  • Intelligent Load Balancing: Agents dynamically distribute workload based on query type and source availability
  • Fallback Agent Strategy: Backup agents activate when primary sources fail or timeout
  • Real-Time Coordination: Agents communicate results and adapt search strategies collaboratively

πŸ” Query Enhancement Agents

  • Intent Recognition Agents: AI agents analyze user intent and suggest optimal search strategies
  • Semantic Expansion Agents: Agents enhance queries with related terms and concepts
  • Context-Aware Agents: Agents consider previous searches and user preferences
  • Multi-Modal Query Agents: Agents adapt search approach based on content type (code, academic, general)

πŸ“Š Document Processing & Analysis Agents

  • OCR Processing Agents: Autonomous PDF and image text extraction using Modal's distributed Tesseract OCR
  • Vector Embedding Agents: Generate 1536-dimensional embeddings and build FAISS indices at scale
  • Batch Processing Agents: Coordinate distributed document processing across Modal compute nodes
  • Research Synthesis Agents: AI agents combine insights from multiple sources into coherent analysis
  • Quality Assessment Agents: Agents evaluate source credibility and content relevance

πŸ›‘οΈ Security & Validation Agents

  • URL Validation Agents: Intelligent agents verify link accessibility and content authenticity
  • Rate Limiting Agents: Protective agents prevent API abuse (100 requests/15min, 10/min for sensitive endpoints)
  • Input Sanitization Agents: Security agents validate and clean all user inputs
  • Error Recovery Agents: Resilient agents handle failures gracefully and maintain system stability

🌐 Intelligent Integration Agents

  • ArXiv Academic Agents: Specialized agents for academic paper validation and retrieval
  • GitHub Repository Agents: Code-focused agents with author filtering and relevance scoring
  • Wikipedia Knowledge Agents: Authoritative content agents with intelligent caching strategies
  • Cross-Platform Synthesis Agents: Agents that combine and rank results across all sources

πŸ—οΈ Technical Architecture

Frontend Stack

  • React 18 with TypeScript for type-safe development
  • Wouter Router for lightweight client-side routing
  • TanStack Query for efficient data fetching and caching
  • Radix UI + Tailwind CSS for accessible, modern components
  • Framer Motion for smooth animations and transitions

Backend Stack

  • Node.js + Express with comprehensive middleware
  • SQLite Database with real document storage and metadata
  • File Upload System supporting PDFs, images, text files (50MB each)
  • Express Rate Limit for API protection
  • Helmet.js for security headers

AI & Distributed Computing

  • Nebius AI Platform - Advanced LLM and embedding capabilities
    • DeepSeek-R1-0528 for chat completions and document analysis
    • BAAI/bge-en-icl for embedding generation (1536 dimensions)
    • Query Enhancement and intelligent content analysis
  • Modal.com Platform - Production heavy workloads
    • OCR Processing: PDF/image text extraction with PyPDF2 + Tesseract
    • FAISS Vector Indexing: Distributed index building for large document collections
    • High-Performance Search: Sub-second similarity search across millions of vectors
    • Batch Processing: Concurrent document processing with 2-4GB memory per task
    • Persistent Storage: Modal volumes for cross-session index storage

πŸš€ Quick Start

Environment Configuration

Create a .env file in the project root:

# Nebius AI Configuration (Required)
NEBIUS_API_KEY=your_nebius_api_key_here

# Modal Configuration (Optional - for advanced processing)
MODAL_TOKEN_ID=your_modal_token_id
MODAL_TOKEN_SECRET=your_modal_token_secret
MODAL_BASE_URL=https://fazeelusmani18--knowledgebridge-main-fastapi-app.modal.run

# GitHub Configuration (Optional - for repository search)
GITHUB_TOKEN=your_github_token_here

# Node Environment
NODE_ENV=development

Development Setup

# Install dependencies
npm install

# Start development server
npm run dev

# Build for production
npm run build

# Type checking
npm run check

The application will be available at http://localhost:5000

🎯 Usage Guide

Document Upload & Processing

  1. Upload Documents: Drag and drop PDFs, images, text files (up to 50MB each)
  2. Automatic Processing: OCR extraction via Modal for PDFs/images, embedding generation
  3. Status Tracking: Monitor processing status (pending β†’ processing β†’ completed)
  4. Batch Operations: Process multiple documents and build vector indices

Vector Search

  1. Semantic Search: Query your processed documents using vector similarity
  2. Index Management: Build FAISS indices from your document collections
  3. Performance Comparison: Side-by-side vector vs. keyword search results
  4. Relevance Scoring: AI-powered relevance scores with detailed metrics

AI-Enhanced Search

  1. Traditional Search: Natural language queries across web sources
  2. Query Enhancement: AI-powered query improvement suggestions
  3. Multi-Source Results: Combined results from GitHub, Wikipedia, ArXiv
  4. Research Synthesis: AI analysis and synthesis of search results

Knowledge Management

  • Document Library: Manage uploaded documents with metadata
  • Citation Generation: Export results in multiple academic formats
  • Knowledge Graph: Interactive visualization of document relationships

πŸ”§ API Reference

Document Management

POST /api/documents/upload
// Multipart form data with files[]
// Optional: title, source

GET /api/documents/list
// Query params: limit, offset, sourceType, processingStatus

POST /api/documents/process/:id
{
  operations: ["extract_text", "generate_embedding", "build_index"];
  indexName?: string;
}

POST /api/documents/process/batch
{
  documentIds: number[];
  operations: ["extract_text", "generate_embedding"];
  indexName?: string;
}

DELETE /api/documents/:id
// Deletes document and associated file

Vector Search & Indexing

POST /api/documents/search/vector
{
  query: string;
  indexName?: string;
  maxResults?: number;
}

POST /api/documents/index/build
{
  documentIds?: number[]; // Optional: specific documents
  indexName?: string;
}

GET /api/documents/status/:id
// Returns processing status and metadata

Traditional Search & AI

POST /api/search
{
  query: string;
  searchType: "semantic" | "keyword" | "hybrid";
  limit: number;
  filters?: { sourceTypes?: string[]; };
}

POST /api/analyze-document
{
  content: string;
  analysisType: "summary" | "classification" | "key_points";
  useMarkdown?: boolean;
}

POST /api/enhance-query
{
  query: string;
  context?: string;
}

Health Check

GET /api/health
// Returns comprehensive health status of all services including:
// - Nebius AI (embeddings, chat completions)
// - Modal.com (API connectivity, function availability)
// - External APIs (GitHub, Wikipedia, ArXiv)

πŸš€ Performance & Reliability

Performance Metrics

  • Document Upload: <1s for files up to 50MB with progress tracking
  • OCR Processing: 5-15 seconds per PDF/image via Modal distributed computing
  • Vector Search: <500ms for similarity search across large document collections
  • Index Building: 10-60 seconds for 100-1000 documents using FAISS
  • Nebius AI:
    • Document analysis: 3-5 seconds for comprehensive analysis
    • Embedding generation: 500ms-1s per document
    • Query enhancement: 1-2 seconds
  • Traditional Search: <100ms for local database queries

Production Scalability

  • Distributed Computing: Modal automatically scales compute resources (2-4GB per task)
  • Concurrent Processing: Parallel document processing across multiple nodes
  • Persistent Storage: SQLite for metadata, Modal volumes for vector indices
  • Batch Operations: Process hundreds of documents simultaneously
  • Intelligent Caching: Optimized repeated operations and query results
  • Graceful Fallbacks: Continues operation when external services unavailable
  • Resource Optimization: Automatic cleanup and memory management

Error Handling

  • React Error Boundaries prevent UI crashes
  • Comprehensive API error responses
  • Automatic retry logic for network requests
  • User-friendly error messages

πŸ”’ Security Features

Input Protection

  • Request body size limits (10MB)
  • Comprehensive input sanitization
  • SQL injection prevention
  • XSS protection with CSP headers

API Security

  • Rate limiting on all endpoints
  • Secure environment variable handling
  • No hardcoded credentials
  • Proper error logging without information disclosure

Infrastructure Security

  • Helmet.js security headers
  • CORS configuration
  • Secure cookie handling
  • Production-ready error handling

πŸ› οΈ Development

Code Quality

  • 100% TypeScript coverage
  • ESLint + Prettier configuration
  • Comprehensive error handling
  • Type-safe API contracts with Zod validation

Testing

# Type checking
npm run check

# Development server
npm run dev

# Production build
npm run build

πŸŽ‰ Latest Features

  • βœ… Document Upload System: Real file upload with drag-and-drop, supporting PDFs, images, text files
  • βœ… OCR Processing Pipeline: Modal-powered text extraction from PDFs and images using Tesseract
  • βœ… Vector Search Engine: FAISS-based semantic search with distributed index building
  • βœ… SQLite Database: Persistent storage replacing in-memory data with full metadata tracking
  • βœ… Batch Processing: Concurrent document processing across Modal's distributed compute nodes
  • βœ… Production Ready: Real heavy workloads utilizing Modal's computational capabilities

πŸ“š Production Architecture

Complete Document Processing Pipeline

πŸ“„ Document Upload β†’ πŸ”„ Processing β†’ πŸ” Search β†’ πŸ“Š Analysis

  1. Upload & Storage:

    • Multi-file drag-and-drop interface (PDFs, images, text files)
    • SQLite database with full metadata tracking
    • File validation and organization by date
  2. Modal Distributed Processing:

    • OCR text extraction using Tesseract for images/PDFs
    • Parallel processing across compute nodes (2-4GB per task)
    • Batch operations for large document collections
  3. AI Analysis & Embeddings:

    • Nebius AI generates 1536-dimensional embeddings
    • Document classification and content analysis
    • Quality assessment and metadata enrichment
  4. Vector Index & Search:

    • FAISS index building via Modal's distributed computing
    • High-performance semantic similarity search
    • Persistent storage across sessions

Service Integration & Division of Responsibilities

🧠 Nebius AI: Language Intelligence & AI Reasoning

Used For:

  • πŸ“ Document Analysis: Classification, summarization, key points extraction, quality scoring
  • πŸ” Search Intelligence: Query enhancement, intent understanding, relevance scoring
  • πŸ’­ AI Reasoning: Research synthesis, explanations, conversational responses
  • 🎯 Embeddings: Real-time text-to-vector conversion using BAAI/bge-en-icl model
  • πŸ“Š Content Understanding: All language comprehension and semantic analysis

Specific Endpoints:

  • /api/analyze-document - Document analysis with DeepSeek-R1 model
  • /api/enhance-query - AI-powered query improvement
  • /api/embeddings - Generate vector embeddings
  • /api/research-synthesis - Combine insights from multiple sources
  • /api/ai-search - Enhanced semantic search

⚑ Modal.com: Heavy Computation & Distributed Processing

Used For:

  • πŸ“„ OCR Processing: PDF and image text extraction using Tesseract
  • πŸ”§ Vector Operations: FAISS index building and high-performance search
  • πŸ“¦ Batch Processing: Concurrent processing of large document collections
  • πŸ’Ύ Infrastructure: Serverless scaling, persistent storage, distributed compute
  • πŸš€ Heavy Workloads: All computationally intensive operations

Specific Endpoints:

  • /api/documents/process/:id - OCR text extraction via Modal
  • /api/documents/index/build - FAISS vector index creation
  • /api/documents/search/vector - High-performance vector search
  • /api/documents/process/batch - Distributed batch processing

Live Deployment: Modal App


πŸ”„ How They Work Together

Document Processing Pipeline:

  1. Upload β†’ Local file storage
  2. OCR β†’ Modal extracts text from PDFs/images
  3. Analysis β†’ Nebius analyzes content and generates embeddings
  4. Indexing β†’ Modal builds FAISS vector index
  5. Search β†’ Modal performs vector search, Nebius scores relevance

Search Workflow:

  1. Query Enhancement β†’ Nebius improves user queries
  2. Vector Search β†’ Modal finds similar documents
  3. Traditional Search β†’ Local database + external APIs
  4. Ranking β†’ Nebius scores and ranks combined results
  5. Synthesis β†’ Nebius generates insights

πŸ“Š Clear Division:

Feature Nebius AI Modal.com
OCR Processing ❌ βœ…
Document Analysis βœ… ❌
Vector Search ❌ βœ…
Query Enhancement βœ… ❌
Batch Processing ❌ βœ…
Embeddings βœ… βœ…*
Research Synthesis βœ… ❌

*Modal only for batch embeddings, Nebius for real-time

Nebius = "The Brain" (AI intelligence)
Modal = "The Engine" (computational power)

Intelligent Fallbacks

  • Modal Unavailable: Local processing for text files, basic search
  • Nebius Unavailable: Mock embeddings, simplified analysis
  • Network Issues: Cached results and offline functionality

πŸ† Track 3: Agentic Demo Showcase Features

πŸ€– "Show us the most incredible things that your agents can do!"

KnowledgeBridge demonstrates sophisticated multi-agent systems in action:

🧠 Autonomous Agent Workflows

  • Smart Agent Coordination: Multiple specialized agents work together to fulfill complex research tasks
  • Adaptive Agent Behavior: Agents dynamically adjust strategies based on query complexity and source availability
  • Multi-Modal Agent Processing: Different agent types (search, analysis, validation) collaborate seamlessly
  • Intelligent Agent Fallbacks: Backup agents activate automatically when primary agents encounter issues

πŸ” Real-Time Agent Decision Making

  • Query Analysis Agents: Instantly determine optimal search strategies across 4+ sources
  • Load Balancing Agents: Distribute workload intelligently based on API response times and rate limits
  • Quality Control Agents: Evaluate and filter results in real-time for relevance and authenticity
  • Synthesis Agents: Combine disparate information sources into coherent, actionable insights

πŸ“Š Advanced Agent Orchestration

  • Parallel Agent Execution: Simultaneous deployment of search agents across GitHub, Wikipedia, ArXiv
  • Agent Communication Protocols: Real-time coordination between agents for optimal resource utilization
  • Adaptive Agent Learning: Agents improve performance based on user interactions and feedback
  • Error Recovery Agents: Autonomous problem-solving when individual agents encounter failures

πŸ›‘οΈ Production-Grade Agent Infrastructure

  • Security Agent Monitoring: Continuous protection against abuse with intelligent rate limiting
  • Validation Agent Networks: Multi-layer content verification and URL authenticity checking
  • Performance Agent Optimization: Automatic scaling and resource management for enterprise workloads
  • Resilience Agent Systems: Graceful degradation and fault tolerance across all agent operations

⚑ Agent Performance Metrics

  • Sub-second Agent Response: Query analysis and routing in <100ms
  • Concurrent Agent Processing: 4+ agents working simultaneously on complex research tasks
  • Intelligent Agent Caching: Smart result storage and retrieval for enhanced performance
  • Scalable Agent Architecture: Horizontal scaling support for enterprise deployment

πŸ“„ License

MIT License - see LICENSE file for details.

πŸ”— Related Resources

AI Services

Frontend Technologies

AI Models


πŸš€ Agents-MCP-Hackathon Submission Summary

KnowledgeBridge showcases the incredible power of AI agents through:

πŸ€– Multi-Agent Orchestration - Coordinated intelligence across search, analysis, and synthesis agents
πŸ” Real-Time Decision Making - Agents adapt strategies and optimize performance dynamically
πŸ“Š Advanced Agent Workflows - Complex multi-step processes handled autonomously
πŸ›‘οΈ Production-Ready Agent Infrastructure - Enterprise-grade security and resilience

Track 3: Agentic Demo Showcase - Demonstrating what happens when sophisticated AI agents work together to revolutionize knowledge discovery and research workflows.

Built for the Hugging Face Agents-MCP-Hackathon πŸ†

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference