metadata
title: Markit_v2
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
hf_oauth: true

Document to Markdown Converter with RAG Chat

Author: Anse Min | 🤗 Hugging Face Space | GitHub | LinkedIn

A Hugging Face Space that converts documents in a range of formats to Markdown and lets you chat with them using retrieval-augmented generation (RAG).

🎥 Demo Video

Markit Demo Video

▢️ Watch Full Demo (YouTube)

Complete walkthrough of Markit's flagship features including multi-document processing, RAG chat, and advanced retrieval strategies


🎬 Live Demos

1. Multi-Document Processing (Flagship Feature)

Multi-Document Processing Demo

What it does: Process up to 5 files simultaneously (20MB combined) with 4 intelligent processing types:

  • 🔗 Combined: Merge documents with smart duplicate removal
  • 📑 Individual: Separate sections per document with clear organization
  • 📈 Summary: Executive overview + detailed analysis of all documents
  • ⚖️ Comparison: Cross-document analysis with similarities/differences tables

Why it matters: Multi-document processing compares and contrasts information across files, handles mixed file types in one batch, and recognizes relationships that span document boundaries.

Multi-Document Processing Types

Multi-document processing with 4 processing types: Combined, Individual, Summary, and Comparison

2. Single Document Conversion Flow

Single Document Conversion Demo

What it does: Convert PDFs, Office documents, images, and more to Markdown using 5 powerful parsers:

  • Gemini Flash: AI-powered understanding with high accuracy
  • Mistral OCR: Fastest processing with document understanding
  • Docling: Open source with advanced PDF table recognition
  • GOT-OCR: Mathematical/scientific documents to LaTeX
  • MarkItDown: High accuracy for CSV/XML and broad format support

Why it matters: Tables are preserved as Markdown tables, which gives the RAG system better-structured context than standard PDF text extraction.

Parser Selection Guide

Choose the right parser for your specific needs and document types

3. RAG Chat System in Action

RAG Chat System Demo

What it does: Chat with your converted documents using 4 advanced retrieval strategies:

  • 🎯 Similarity: Traditional semantic similarity using embeddings
  • πŸ”€ MMR: Diverse results with reduced redundancy
  • πŸ” BM25: Traditional keyword-based retrieval
  • πŸ”— Hybrid: Combines semantic + keyword search (recommended)

Why it matters: Ask for markdown tables in chat responses (impossible with standard PDF RAG), get streaming responses with document context, and easily clear data directly from the interface.
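
To make the MMR strategy above more concrete, here is a minimal pure-Python sketch of maximal marginal relevance. It assumes chunks are already embedded as plain lists of floats; the names `cosine`, `mmr_select`, and `lambda_mult` are illustrative and are not taken from the Markit codebase, whose actual retrieval code may work differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Pick up to k document indices by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already selected documents.
    (Hypothetical helper for illustration only.)
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_mult` close to 1.0 the result matches plain similarity search; lowering it pushes near-duplicate chunks down the list, which is the "reduced redundancy" behavior described above.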

RAG Retrieval Strategies

Advanced RAG system with 4 retrieval strategies for optimal document search

4. Query Ranker Analysis

Query Ranker Demo

What it does: Interactive document search with:

  • Real-time ranking of document chunks with confidence scores
  • Method comparison to test different retrieval strategies
  • Adjustable results (1-10) with responsive slider control
  • Transparent scoring with actual ChromaDB similarity scores

Why it matters: Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies.
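
The ranking loop described above (score every chunk, sort, return an adjustable top 1-10) can be sketched as follows, using BM25 as the scoring method. This is an illustration under stated assumptions, not Markit's implementation: `bm25_rank` is a hypothetical name, tokenization is a plain whitespace split, and `k1` and `b` use commonly cited default values.

```python
import math
from collections import Counter


def bm25_rank(query, chunks, top_n=5, k1=1.5, b=0.75):
    """Return (score, chunk) pairs for the top_n chunks, best first."""
    top_n = max(1, min(10, top_n))  # clamp to the 1-10 slider range
    tokenized = [chunk.lower().split() for chunk in chunks]
    if not tokenized:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: number of chunks that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)

    scored = []
    for chunk, tokens in zip(chunks, tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Returning the raw score next to each chunk is what makes method comparison possible: the same query can be run through different scorers and the resulting rankings placed side by side.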

5. GOT-OCR LaTeX Processing

GOT-OCR LaTeX Demo

What it does: Advanced LaTeX processing for mathematical and scientific documents:

  • Native LaTeX output with no LLM conversion for maximum accuracy
  • Mathpix rendering using the same library as the official GOT-OCR demo
  • RAG-compatible chunking that preserves LaTeX structures and mathematical tables
  • Professional display with proper mathematical formatting

Why it matters: Perfect for research papers, scientific documents, and academic content with complex equations and structured data.

🎯 System Overview

Overall System Workflow

Complete workflow from document upload to intelligent RAG chat interaction

🔧 Environment Setup

Required API Keys

GOOGLE_API_KEY=your_gemini_api_key_here    # For Gemini Flash parser and RAG chat
OPENAI_API_KEY=your_openai_api_key_here    # For embeddings and AI descriptions  
MISTRAL_API_KEY=your_mistral_api_key_here  # For Mistral OCR parser (optional)
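
A startup check along the following lines can catch missing keys early. It is only a sketch: `missing_keys` is a hypothetical helper, and treating the Google and OpenAI keys as required and the Mistral key as optional is read off the comments above rather than from the application code.

```python
import os

# Required vs. optional split inferred from the comments in the block above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY")
OPTIONAL_KEYS = ("MISTRAL_API_KEY",)


def missing_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_keys()` with no argument inspects the real environment; passing a dict makes the check easy to test.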

Key Configuration Options

DEBUG=true                        # Enable debug logging
MAX_FILE_SIZE=10485760           # 10MB per file limit
MAX_BATCH_FILES=5                # Maximum files for multi-document processing
MAX_BATCH_SIZE=20971520          # 20MB combined limit for batch processing
CHUNK_SIZE=1000                  # Document chunk size for Markdown content
RETRIEVAL_K=4                    # Number of documents to retrieve for RAG
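
As a rough illustration of how the three size limits interact, the sketch below reads them from the environment and checks a batch against them. Only the variable names and default values come from the configuration above; `validate_batch` and its messages are hypothetical and not taken from the Markit codebase.

```python
import os

# Defaults mirror the values listed above; the environment can override them.
MAX_FILE_SIZE = int(os.environ.get("MAX_FILE_SIZE", 10 * 1024 * 1024))    # 10MB per file
MAX_BATCH_FILES = int(os.environ.get("MAX_BATCH_FILES", 5))               # files per batch
MAX_BATCH_SIZE = int(os.environ.get("MAX_BATCH_SIZE", 20 * 1024 * 1024))  # 20MB combined


def validate_batch(file_sizes):
    """Return a list of problems for a batch, given each file's size in bytes.

    Hypothetical helper for illustration; an empty list means the batch fits.
    """
    problems = []
    if len(file_sizes) > MAX_BATCH_FILES:
        problems.append(f"too many files: {len(file_sizes)} > {MAX_BATCH_FILES}")
    for index, size in enumerate(file_sizes):
        if size > MAX_FILE_SIZE:
            problems.append(
                f"file {index} exceeds per-file limit: {size} > {MAX_FILE_SIZE}"
            )
    total = sum(file_sizes)
    if total > MAX_BATCH_SIZE:
        problems.append(f"batch exceeds combined limit: {total} > {MAX_BATCH_SIZE}")
    return problems
```

Note that a batch can pass the per-file check and still fail the combined one: with the defaults, three 8MB files are each under 10MB but total 24MB, which exceeds the 20MB batch limit.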

🚀 Local Development

Quick Start

# Clone repository
git clone https://github.com/ansemin/Markit_v2
cd Markit_v2

# Create environment file
cp .env.example .env
# Edit .env with your API keys

# Install dependencies
pip install -r requirements.txt

# Run application (pick one of the following)
python app.py                    # Full environment setup (HF Spaces compatible)
python run_app.py               # Local development (faster startup)
python run_app.py --clear-data-and-run  # Testing with clean data

Data Management

Two ways to clear data:

  1. UI Method: Chat tab → "🗑️ Clear All Data" button (works in both local and HF Space)
  2. CLI Method: python run_app.py --clear-data-and-run

What gets cleared: Vector store embeddings, chat history, and session data

🔍 Technical Details

Retrieval Strategy Performance

| Method | Best For | Accuracy |
|--------|----------|----------|
| 🎯 Similarity | General semantic questions | Good |
| 🔀 MMR | Diverse perspectives | Good |
| 🔍 BM25 | Exact keyword searches | Medium |
| 🔗 Hybrid | Most queries (recommended) | Excellent |
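
This README does not say how the Hybrid method merges semantic and keyword results. One common technique for combining two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is an assumption, not a description of Markit's code, and the function name and chunk ids are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties break on doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["chunk-a", "chunk-b", "chunk-c"]  # e.g. from embedding similarity
keyword = ["chunk-b", "chunk-d", "chunk-a"]   # e.g. from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
```

Chunks that appear near the top of both lists accumulate the highest fused score, so a result only needs to rank well under one method to surface and ranks best when both methods agree.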

Core Technologies

  • Parsers: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
  • RAG System: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash
  • UI Framework: Gradio with modular component architecture
  • GPU Support: ZeroGPU integration for HF Spaces

Smart Content-Aware Chunking

  • Markdown chunking: Preserves tables and code blocks
  • LaTeX chunking: Preserves mathematical tables, environments, and structures
  • Automatic format detection: Optimal chunking strategy per document type
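
The idea behind table- and code-preserving Markdown chunking can be sketched in a few lines. This is a simplified illustration, not Markit's chunker: the function names are hypothetical, blocks are detected by blank lines and fence markers only, and `chunk_size` is assumed to be measured in characters (the README does not state the unit of `CHUNK_SIZE`).

```python
FENCE = "`" * 3  # Markdown code fence marker


def split_markdown_blocks(text):
    """Split Markdown into blocks, keeping fenced code and tables whole.

    Table rows are consecutive non-blank lines, so splitting on blank lines
    keeps a table together; blank lines inside a fence do not split.
    """
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            current.append(line)
        elif in_fence or stripped:
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_markdown(text, chunk_size=1000):
    """Greedily pack blocks into chunks of at most chunk_size characters.

    A single block larger than chunk_size becomes its own oversized chunk
    rather than being cut in the middle of a table or code block.
    """
    chunks, current = [], ""
    for block in split_markdown_blocks(text):
        candidate = block if not current else current + "\n\n" + block
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The same block-then-pack approach would extend to LaTeX by treating each `\begin{...}` to `\end{...}` environment as an unsplittable block.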

Credits


🚀 Try it live on Hugging Face Spaces