AnseMin committed on
Commit 2a9686e · 1 Parent(s): ab0e18b

Enhance README to reflect new features and improvements in document processing


- Updated GOT-OCR section to highlight native LaTeX output and Mathpix rendering capabilities.
- Introduced Smart Content-Aware Chunking for both Markdown and LaTeX, detailing preservation of structures and automatic format detection.
- Added new configuration options for LaTeX chunking in the application settings.
- Expanded usage instructions for GOT-OCR and other parsers, clarifying output formats.
- Included a new section on advanced LaTeX processing features and use cases for better user guidance.

Files changed (1)
  1. README.md +67 -6
README.md CHANGED
@@ -24,7 +24,7 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
   - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
-  - GOT-OCR: For image-based OCR with LaTeX support
+  - GOT-OCR: For image-based OCR with **native LaTeX output** and Mathpix rendering
   - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
   - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
 - **🆕 Intelligent Processing Types**:
@@ -42,7 +42,10 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 - **BM25 Keyword Search**: Traditional keyword-based retrieval
 - **Hybrid Search**: Combines semantic + keyword search for best accuracy
 - **Intelligent document retrieval** using vector embeddings
-- **Markdown-aware chunking** that preserves tables and code blocks
+- **🆕 Smart Content-Aware Chunking**:
+  - **Markdown chunking** that preserves tables and code blocks
+  - **LaTeX chunking** that preserves mathematical tables, environments, and structures
+  - **Automatic format detection** for optimal chunking strategy
 - **Streaming chat responses** for real-time interaction
 - **Chat history management** with session persistence
 - **Usage limits** to prevent abuse on public spaces
@@ -168,8 +171,10 @@ The application uses centralized configuration management. You can enhance funct
 - `VECTOR_STORE_PATH`: Path for vector database storage (default: ./data/vector_store)
 - `CHAT_HISTORY_PATH`: Path for chat history storage (default: ./data/chat_history)
 - `EMBEDDING_MODEL`: OpenAI embedding model (default: text-embedding-3-small)
-- `CHUNK_SIZE`: Document chunk size for RAG (default: 1000)
-- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
+- `CHUNK_SIZE`: Document chunk size for Markdown content (default: 1000)
+- `CHUNK_OVERLAP`: Overlap between chunks for Markdown (default: 200)
+- `LATEX_CHUNK_SIZE`: Document chunk size for LaTeX content (default: 1200)
+- `LATEX_CHUNK_OVERLAP`: Overlap between chunks for LaTeX (default: 150)
 - `MAX_MESSAGES_PER_SESSION`: Chat limit per session (default: 50)
 - `MAX_MESSAGES_PER_HOUR`: Chat limit per hour (default: 100)
 - `RETRIEVAL_K`: Number of documents to retrieve (default: 4)
@@ -199,7 +204,9 @@ The application uses centralized configuration management. You can enhance funct
 - **"Gemini Flash"** for AI-powered text extraction
 4. Select an OCR method based on your chosen parser
 5. Click "Convert"
-6. View the Markdown output and download the converted file
+6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
+7. **For other parsers**: View the Markdown output
+8. Download the converted file (.tex for GOT-OCR, .md for others)
 
 #### 📂 **Multi-Document Processing** (NEW!)
 1. Go to the **"Document Converter"** tab
@@ -319,10 +326,58 @@ The application uses centralized configuration management. You can enhance funct
 - **Enhanced Error Messages**: Detailed error reporting for debugging
 - **Centralized Logging**: Configurable logging levels and output formats
 
+## 📄 GOT-OCR LaTeX Processing
+
+Markit v2 features **advanced LaTeX processing** for GOT-OCR results, providing proper handling of mathematical and tabular content:
+
+### **🎯 Key Features:**
+
+#### **1. Native LaTeX Output**
+- **No LLM conversion**: GOT-OCR returns raw LaTeX directly for maximum accuracy
+- **Preserves mathematical structures**: Complex formulas, tables, and equations remain intact
+- **.tex file output**: Save files in proper LaTeX format for external use
+
+#### **2. Mathpix Markdown Rendering**
+- **Professional display**: Uses the Mathpix Markdown library (the same one as the official GOT-OCR demo)
+- **Complex table support**: Renders `\begin{tabular}`, `\multirow`, and `\multicolumn` properly
+- **Mathematical expressions**: Displays LaTeX math with proper formatting
+- **Base64 iframe embedding**: Secure, isolated rendering environment
+
+#### **3. RAG-Compatible LaTeX Chunking**
+- **LaTeX-aware chunker**: Specialized chunking preserves LaTeX structures
+- **Complete table preservation**: Entire `\begin{tabular}...\end{tabular}` blocks stay intact
+- **Environment detection**: Maintains `\begin{env}...\end{env}` pairs
+- **Intelligent separators**: Uses LaTeX commands (`\section`, `\title`) as break points
+
+#### **4. Enhanced Metadata**
+- **Content type tracking**: `content_type: "latex"` for proper handling
+- **Structure detection**: Identifies tables, environments, and mathematical content
+- **Auto-format detection**: GOT-OCR results automatically use the LaTeX chunker
+
+### **🔧 Technical Implementation:**
+
+```javascript
+// Mathpix rendering (inspired by the official GOT-OCR demo)
+const html = window.render(latexContent, {htmlTags: true});
+```
+
+```latex
+% LaTeX structures are preserved as whole chunks
+\begin{tabular}{|l|c|c|}
+\hline Disability & Participants & Results \\
+\hline Blind & 5 & $34.5\%, n=1$ \\
+\end{tabular}
+```
+
+### **📊 Use Cases:**
+- **Research papers**: Mathematical formulas and data tables
+- **Scientific documents**: Complex equations and statistical data
+- **Financial reports**: Tabular data with calculations
+- **Academic content**: Mixed text, math, and structured data
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
 - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
+- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
 - [Gradio](https://gradio.app/) for the UI framework
 
 ---
@@ -343,6 +398,7 @@ The system supports **four different retrieval methods** for optimal document se
 - **Best for**: General questions and semantic understanding
 - **Use case**: "What is the main topic of this document?"
 - **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
+- **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
 
 ### **2. 🔀 MMR (Maximal Marginal Relevance)**
 - **How it works**: Balances relevance with result diversity to reduce redundancy
@@ -469,7 +525,12 @@ markit_v2/
 
 ### 🧠 **RAG System Architecture:**
 - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
-- **Markdown-Aware Chunking** (`src/rag/chunking.py`): Preserves tables and code blocks as whole units
+- **🆕 Smart Content-Aware Chunking** (`src/rag/chunking.py`):
+  - **Unified chunker** supporting both Markdown and LaTeX content
+  - **Markdown chunking**: Preserves tables and code blocks as whole units
+  - **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
+  - **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
+  - **Enhanced metadata**: Content type tracking and structure detection
 - **🆕 Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
   - **Similarity Search**: Traditional semantic retrieval using embeddings
   - **MMR Support**: Maximal Marginal Relevance for diverse results
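As a rough illustration of the "automatic format detection" this commit describes, here is a minimal Python sketch. The function names, regex heuristics, and returned parameter dicts are illustrative assumptions (only the default sizes come from the README's documented `CHUNK_SIZE`/`LATEX_CHUNK_SIZE` settings); they are not the actual code in `src/rag/chunking.py`.

```python
import re

# Markers that suggest GOT-OCR LaTeX output rather than Markdown
# (illustrative heuristic, not the repository's real detection logic).
LATEX_PATTERNS = [
    r"\\begin\{[a-zA-Z*]+\}",   # environments such as \begin{tabular}
    r"\\section\{",
    r"\\title\{",
    r"\\multirow",
    r"\\multicolumn",
]

def detect_format(text: str) -> str:
    """Return 'latex' if the text looks like LaTeX, else 'markdown'."""
    for pattern in LATEX_PATTERNS:
        if re.search(pattern, text):
            return "latex"
    return "markdown"

def chunk_params(fmt: str) -> dict:
    # Defaults mirror the README: CHUNK_SIZE/CHUNK_OVERLAP for Markdown,
    # LATEX_CHUNK_SIZE/LATEX_CHUNK_OVERLAP for LaTeX.
    if fmt == "latex":
        return {"chunk_size": 1200, "chunk_overlap": 150}
    return {"chunk_size": 1000, "chunk_overlap": 200}
```

Dispatching on detected format like this keeps `\begin{...}...\end{...}` blocks under the larger LaTeX chunk size, while ordinary Markdown falls through to the standard splitter settings.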