Enhance README to reflect new features and improvements in document processing
- Updated GOT-OCR section to highlight native LaTeX output and Mathpix rendering capabilities.
- Introduced Smart Content-Aware Chunking for both Markdown and LaTeX, detailing preservation of structures and automatic format detection.
- Added new configuration options for LaTeX chunking in the application settings.
- Expanded usage instructions for GOT-OCR and other parsers, clarifying output formats.
- Included a new section on advanced LaTeX processing features and use cases for better user guidance.
README.md
CHANGED
@@ -24,7 +24,7 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
   - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
-  - GOT-OCR: For image-based OCR with LaTeX
+  - GOT-OCR: For image-based OCR with **native LaTeX output** and Mathpix rendering
   - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
   - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
 - **Intelligent Processing Types**:
@@ -42,7 +42,10 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 - **BM25 Keyword Search**: Traditional keyword-based retrieval
 - **Hybrid Search**: Combines semantic + keyword search for best accuracy
 - **Intelligent document retrieval** using vector embeddings
-
+- **Smart Content-Aware Chunking**:
+  - **Markdown chunking** that preserves tables and code blocks
+  - **LaTeX chunking** that preserves mathematical tables, environments, and structures
+  - **Automatic format detection** for optimal chunking strategy
 - **Streaming chat responses** for real-time interaction
 - **Chat history management** with session persistence
 - **Usage limits** to prevent abuse on public spaces
@@ -168,8 +171,10 @@ The application uses centralized configuration management. You can enhance funct
 - `VECTOR_STORE_PATH`: Path for vector database storage (default: ./data/vector_store)
 - `CHAT_HISTORY_PATH`: Path for chat history storage (default: ./data/chat_history)
 - `EMBEDDING_MODEL`: OpenAI embedding model (default: text-embedding-3-small)
-- `CHUNK_SIZE`: Document chunk size for
-- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
+- `CHUNK_SIZE`: Document chunk size for Markdown content (default: 1000)
+- `CHUNK_OVERLAP`: Overlap between chunks for Markdown (default: 200)
+- `LATEX_CHUNK_SIZE`: Document chunk size for LaTeX content (default: 1200)
+- `LATEX_CHUNK_OVERLAP`: Overlap between chunks for LaTeX (default: 150)
 - `MAX_MESSAGES_PER_SESSION`: Chat limit per session (default: 50)
 - `MAX_MESSAGES_PER_HOUR`: Chat limit per hour (default: 100)
 - `RETRIEVAL_K`: Number of documents to retrieve (default: 4)
@@ -199,7 +204,9 @@ The application uses centralized configuration management. You can enhance funct
   - **"Gemini Flash"** for AI-powered text extraction
 4. Select an OCR method based on your chosen parser
 5. Click "Convert"
-6. View the
+6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
+7. **For other parsers**: View the Markdown output
+8. Download the converted file (.tex for GOT-OCR, .md for others)
 
 #### **Multi-Document Processing** (NEW!)
 1. Go to the **"Document Converter"** tab
@@ -319,10 +326,58 @@ The application uses centralized configuration management. You can enhance funct
 - **Enhanced Error Messages**: Detailed error reporting for debugging
 - **Centralized Logging**: Configurable logging levels and output formats
 
+## GOT-OCR LaTeX Processing
+
+Markit v2 features **advanced LaTeX processing** for GOT-OCR results, providing proper mathematical and tabular content handling:
+
+### **Key Features:**
+
+#### **1. Native LaTeX Output**
+- **No LLM conversion**: GOT-OCR returns raw LaTeX directly for maximum accuracy
+- **Preserves mathematical structures**: Complex formulas, tables, and equations remain intact
+- **.tex file output**: Save files in proper LaTeX format for external use
+
+#### **2. Mathpix Markdown Rendering**
+- **Professional display**: Uses Mathpix Markdown library (same as official GOT-OCR demo)
+- **Complex table support**: Renders `\begin{tabular}`, `\multirow`, `\multicolumn` properly
+- **Mathematical expressions**: Displays LaTeX math with proper formatting
+- **Base64 iframe embedding**: Secure, isolated rendering environment
+
+#### **3. RAG-Compatible LaTeX Chunking**
+- **LaTeX-aware chunker**: Specialized chunking preserves LaTeX structures
+- **Complete table preservation**: Entire `\begin{tabular}...\end{tabular}` blocks stay intact
+- **Environment detection**: Maintains `\begin{env}...\end{env}` pairs
+- **Intelligent separators**: Uses LaTeX commands (`\section`, `\title`) as break points
+
+#### **4. Enhanced Metadata**
+- **Content type tracking**: `content_type: "latex"` for proper handling
+- **Structure detection**: Identifies tables, environments, and mathematical content
+- **Auto-format detection**: GOT-OCR results automatically use LaTeX chunker
+
+### **Technical Implementation:**
+
+```javascript
+// Mathpix rendering (inspired by official GOT-OCR demo)
+const html = window.render(latexContent, {htmlTags: true});
+
+// LaTeX structure preservation
+\begin{tabular}{|l|c|c|}
+\hline Disability & Participants & Results \\
+\hline Blind & 5 & $34.5\%, n=1$ \\
+\end{tabular}
+```
+
+### **Use Cases:**
+- **Research papers**: Mathematical formulas and data tables
+- **Scientific documents**: Complex equations and statistical data
+- **Financial reports**: Tabular data with calculations
+- **Academic content**: Mixed text, math, and structured data
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
 - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
+- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
 - [Gradio](https://gradio.app/) for the UI framework
 
 ---
@@ -343,6 +398,7 @@ The system supports **four different retrieval methods** for optimal document se
 - **Best for**: General questions and semantic understanding
 - **Use case**: "What is the main topic of this document?"
 - **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
+- **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
 
 ### **2. MMR (Maximal Marginal Relevance)**
 - **How it works**: Balances relevance with result diversity to reduce redundancy
@@ -469,7 +525,12 @@ markit_v2/
 
 ### **RAG System Architecture:**
 - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
-
+- **Smart Content-Aware Chunking** (`src/rag/chunking.py`):
+  - **Unified chunker** supporting both Markdown and LaTeX content
+  - **Markdown chunking**: Preserves tables and code blocks as whole units
+  - **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
+  - **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
+  - **Enhanced metadata**: Content type tracking and structure detection
 - **Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
   - **Similarity Search**: Traditional semantic retrieval using embeddings
   - **MMR Support**: Maximal Marginal Relevance for diverse results
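The LaTeX-aware chunking this commit describes (whole `\begin{env}...\end{env}` blocks kept intact, `\section` used as a break point) could be sketched roughly as follows. This is a minimal illustration, not the repository's actual `src/rag/chunking.py`; the function name, the greedy packing, and the omission of overlap handling are all simplifications.

```python
import re

def chunk_latex(text: str, chunk_size: int = 1200) -> list[str]:
    """Split LaTeX into chunks, keeping \\begin{...}...\\end{...} blocks whole.

    Illustrative sketch: a real chunker would also handle overlap and
    nested environments.
    """
    # Split the text into segments: whole environments, and the plain
    # text between them (further split before each \section command).
    env = re.compile(r"\\begin\{(\w+)\}.*?\\end\{\1\}", re.DOTALL)
    segments, pos = [], 0
    for m in env.finditer(text):
        if m.start() > pos:
            segments.extend(re.split(r"(?=\\section\b)", text[pos:m.start()]))
        segments.append(m.group(0))  # environment kept as one unit
        pos = m.end()
    segments.extend(re.split(r"(?=\\section\b)", text[pos:]))

    # Greedily pack segments into chunks of at most chunk_size characters;
    # an oversized environment becomes its own (oversized) chunk rather
    # than being split.
    chunks, current = [], ""
    for seg in segments:
        if current and len(current) + len(seg) > chunk_size:
            chunks.append(current)
            current = seg
        else:
            current += seg
    if current.strip():
        chunks.append(current)
    return chunks
```

Because environments are emitted as single segments, a `\begin{tabular}...\end{tabular}` table can never be cut in half, which is the property the RAG chunking section claims.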
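The automatic format detection and the chunking environment variables added by this commit could be wired together as below. The variable names and defaults come from the README diff; `detect_format` and `choose_chunk_params` are hypothetical helpers for illustration, not the app's actual API.

```python
import os

# Defaults mirror the README in this commit.
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
LATEX_CHUNK_SIZE = int(os.getenv("LATEX_CHUNK_SIZE", "1200"))
LATEX_CHUNK_OVERLAP = int(os.getenv("LATEX_CHUNK_OVERLAP", "150"))

def detect_format(parser_name: str) -> str:
    """GOT-OCR emits native LaTeX; every other parser emits Markdown."""
    return "latex" if parser_name == "GOT-OCR" else "markdown"

def choose_chunk_params(parser_name: str) -> tuple[int, int]:
    """Pick (chunk_size, chunk_overlap) based on the detected format."""
    if detect_format(parser_name) == "latex":
        return LATEX_CHUNK_SIZE, LATEX_CHUNK_OVERLAP
    return CHUNK_SIZE, CHUNK_OVERLAP
```

This is the "GOT-OCR results → LaTeX chunker, others → Markdown chunker" dispatch in its simplest possible form.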
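The "Base64 iframe embedding" bullet could look something like this on the Python side: the LaTeX is embedded in a small HTML page that calls `window.render(...)` (the same call shown in the README's JavaScript snippet), and that page is base64-encoded into a sandboxed `<iframe>` data URI for display in Gradio. The CDN URL for mathpix-markdown-it and the helper name are assumptions, not the app's verified implementation.

```python
import base64
import json

def latex_preview_iframe(latex: str) -> str:
    """Wrap LaTeX in a sandboxed base64 data-URI iframe (hypothetical sketch)."""
    page = f"""<!DOCTYPE html>
<html><head>
<script src="https://cdn.jsdelivr.net/npm/mathpix-markdown-it/es5/bundle.js"></script>
</head><body>
<div id="out"></div>
<script>
  // Same call as the README snippet: render LaTeX via mathpix-markdown-it.
  document.getElementById("out").innerHTML =
      window.render({json.dumps(latex)}, {{htmlTags: true}});
</script>
</body></html>"""
    encoded = base64.b64encode(page.encode("utf-8")).decode("ascii")
    return (f'<iframe src="data:text/html;base64,{encoded}" '
            'sandbox="allow-scripts" width="100%" height="400"></iframe>')
```

Encoding the whole page into the `src` attribute is what isolates the Mathpix renderer from the surrounding UI: the iframe has no access to the parent page, which is the "secure, isolated rendering environment" the README claims.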