Implement multi-document processing capabilities and enhance UI
- Introduced support for processing up to 5 files simultaneously with a combined size limit of 20MB.
- Added new processing types: Combined, Individual, Summary, and Comparison for enhanced document analysis.
- Updated the Gemini Flash parser to handle multiple documents and format outputs based on processing type.
- Enhanced the UI to dynamically display processing options and real-time validation for file uploads.
- Unified document conversion method to streamline single and multi-file processing.
- Improved error handling and logging for batch processing operations.
- README.md +105 -239
- src/core/config.py +9 -0
- src/parsers/gemini_flash_parser.py +219 -0
- src/services/document_service.py +246 -2
- src/ui/ui.py +138 -11
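Taken together, the changes below route every conversion through one service-level call. The following is a minimal usage sketch of that entry point; the file names and the parser/OCR option strings are placeholder assumptions, while the class, method, and flag-setter names come from the diffs that follow.

```python
# Minimal usage sketch of the unified conversion entry point added in this commit.
# Assumptions: example paths/option strings are placeholders; a threading.Event is
# used as the cancellation flag, mirroring how the UI passes its flag through.
import threading

from src.services.document_service import DocumentService

service = DocumentService()
service.set_cancellation_flag(threading.Event())

# One file takes the existing convert_document() path; 2-5 files take the new batch path.
content, output_path = service.convert_documents(
    file_paths=["proposal.pdf", "budget.xlsx", "timeline.docx"],
    parser_name="Gemini Flash",
    ocr_method_name="None",
    output_format="Markdown",
    processing_type="combined",  # or "individual", "summary", "comparison"
)
print(output_path)  # e.g. Combined_3_Documents_<timestamp>.md in the temp directory
```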
README.md
CHANGED
@@ -20,11 +20,17 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 
 ### Document Conversion
 - Convert PDFs, Office documents, images, and more to Markdown
+- **Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
   - Docling: For advanced PDF understanding with table structure recognition
   - GOT-OCR: For image-based OCR with LaTeX support
-  - Gemini Flash: For AI-powered text extraction
+  - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
+- **Intelligent Processing Types**:
+  - **Combined**: Merge documents into unified content with duplicate removal
+  - **Individual**: Separate sections per document with clear organization
+  - **Summary**: Executive overview + detailed analysis of all documents
+  - **Comparison**: Cross-document analysis with similarities/differences tables
 - Download converted documents as Markdown files
 
 ### RAG Chat with Documents
@@ -40,34 +46,68 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 
 ### User Interface
 - **Dual-tab interface**: Document Converter + Chat
+- **Unified File Input**: Single interface handles both single and multiple file uploads
+- **Dynamic Processing Options**: Multi-document processing type selector appears automatically
+- **Real-time Validation**: Live feedback on file count, size limits, and processing mode
 - **Real-time status monitoring** for RAG system with environment detection
 - **Auto-ingestion** of converted documents into chat system
 - **Enhanced status display**: Shows vector store document count, chat history files, and environment type
 - **Data management controls**: Clear All Data button with comprehensive feedback
 - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
+- **Smart Output Naming**: Batch processing creates descriptive filenames (e.g., "Combined_3_Documents_20240125.md")
 - Clean, responsive UI with modern styling
 
-[Microsoft's MarkItDown](https://github.com/microsoft/markitdown) library supports a wide range of file formats:
-[IBM's Docling](https://github.com/DS4SD/docling) provides advanced document understanding with:
-- **Advanced PDF parsing** with layout understanding, reading order, and table structure recognition
-- **Multiple OCR engines** including EasyOCR and Tesseract
-- **Document format support**: PDF, DOCX, XLSX, PPTX, HTML, Images (PNG, JPG, TIFF, BMP, WEBP), CSV
-- **Local execution** for sensitive data processing
-- **Formula and code understanding** with enrichment features
-- **Picture classification** and description capabilities
+## Supported Libraries
+
+**MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
+
+**Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis.
+
+**Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
+
+## Multi-Document Processing
+
+### **What makes this special?**
+Markit v2 introduces **industry-leading multi-document processing** powered by Google's Gemini Flash 2.5, enabling intelligent analysis across multiple documents simultaneously.
+
+### **Key Capabilities:**
+- **Cross-Document Analysis**: Compare and contrast information across different files
+- **Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
+- **Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
+- **Contextual Understanding**: Recognizes relationships and patterns across document boundaries
+- **Single API Call Processing**: Efficient batch processing using Gemini's native multi-document support
+
+### **Processing Types Explained:**
+
+#### **Combined Processing**
+- **Purpose**: Create one unified, cohesive document from multiple sources
+- **Best for**: Related documents that should be read as one complete resource
+- **Intelligence**: Removes redundant information while preserving all critical content
+- **Example**: Merge project proposal + budget + timeline into one comprehensive document
+
+#### **Individual Processing**
+- **Purpose**: Convert each document separately but organize them in one output
+- **Best for**: Different documents you want in one place for easy reference
+- **Intelligence**: Maintains original structure while creating clear organization
+- **Example**: Meeting agenda + presentation + notes → organized sections
+
+#### **Summary Processing**
+- **Purpose**: Executive overview + detailed analysis
+- **Best for**: Complex document sets needing high-level insights
+- **Intelligence**: Cross-document pattern recognition and key insight extraction
+- **Example**: Research papers → executive summary + detailed analysis of each paper
+
+#### **Comparison Processing**
+- **Purpose**: Analyze differences, similarities, and relationships
+- **Best for**: Multiple proposals, document versions, or conflicting sources
+- **Intelligence**: Creates comparison tables and identifies discrepancies/alignments
+- **Example**: Contract versions → side-by-side analysis with change identification
+
+### **Technical Advantages:**
+- **Native Multimodal Support**: Processes text + images in same workflow
+- **Advanced Reasoning**: Understands context and relationships between documents
+- **Efficient Processing**: Single Gemini API call vs. multiple individual calls
+- **Format Agnostic**: Works across all supported file types seamlessly
 
 ## Environment Variables
 
@@ -81,6 +121,8 @@ The application uses centralized configuration management. You can enhance funct
 ### **Configuration Options:**
 - `DEBUG`: Set to `true` for debug mode with verbose logging
 - `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
+- `MAX_BATCH_FILES`: Maximum files for multi-document processing (default: 5)
+- `MAX_BATCH_SIZE`: Maximum combined size for batch processing (default: 20MB)
 - `TEMP_DIR`: Directory for temporary files (default: ./temp)
 - `TESSERACT_PATH`: Custom path to Tesseract executable
 - `TESSDATA_PATH`: Path to Tesseract language data
@@ -118,15 +160,40 @@ The application uses centralized configuration management. You can enhance funct
 ## Usage
 
 ### Document Conversion
+
+#### **Single Document Processing**
 1. Go to the **"Document Converter"** tab
-2.
+2. Upload a single file
 3. Choose your preferred parser:
    - **"MarkItDown"** for comprehensive document conversion
    - **"Docling"** for advanced PDF understanding and table extraction
+   - **"Gemini Flash"** for AI-powered text extraction
 4. Select an OCR method based on your chosen parser
 5. Click "Convert"
 6. View the Markdown output and download the converted file
+
+#### **Multi-Document Processing** (NEW!)
+1. Go to the **"Document Converter"** tab
+2. Upload **2-5 files** (up to 20MB combined)
+3. **Processing type selector appears automatically**
+4. Choose your processing type:
+   - **Combined**: Merge all documents into unified content with smart duplicate removal
+   - **Individual**: Keep documents separate with clear section headers
+   - **Summary**: Executive overview + detailed analysis of each document
+   - **Comparison**: Side-by-side analysis with similarities/differences tables
+5. Choose your preferred parser (recommend **Gemini Flash** for best multi-document results)
+6. Click "Convert"
+7. Get intelligent cross-document analysis and download enhanced output
+
+#### **Multi-Document Tips**
+- **Mixed file types work great**: Upload PDF + images, Word docs + PDFs, etc.
+- **Gemini Flash excels at**: Cross-document reasoning, duplicate detection, and format analysis
+- **Perfect for**: Comparing document versions, analyzing related reports, consolidating research
+- **Real-time validation**: UI shows file count, size limits, and processing mode
+
+#### **RAG Integration**
+- **All converted documents are automatically added to the RAG system** for chat functionality
+- Multi-document processing creates richer context for chat interactions
 
 ### Chat with Documents
 1. Go to the **"Chat with Documents"** tab
@@ -165,54 +232,30 @@ The application uses centralized configuration management. You can enhance funct
 # For local development (faster startup)
 python run_app.py
 
 # For testing with clean data
 python run_app.py --clear-data-and-run
 
-python run_app.py --
+# Show all available options
+python run_app.py --help
 ```
 
-### **Data Management
-For local development and testing, you can easily clear all stored data:
-# Clear all data and exit (useful for quick cleanup)
-python run_app.py --clear-data
-python run_app.py --clear-data-and-run
+### **Data Management:**
+
+**Two ways to clear data:**
+
+1. **Command-line** (for development):
+   - `python run_app.py --clear-data-and-run` - Clear data then start app
+   - `python run_app.py --clear-data` - Clear data and exit
+
+2. **In-app UI** (for users):
+   - Go to "Chat with Documents" tab → Click "Clear All Data" button
+   - Automatically detects environment (local vs HF Space)
+   - Provides detailed feedback and starts new session
 
 **What gets cleared:**
+- `data/chat_history/*` - All saved chat sessions
 - `data/vector_store/*` - All document embeddings and vector database
 
-This is particularly useful when:
-- Testing new RAG features with fresh data
-- Clearing old chat sessions and documents
-- Resetting the system to a clean state
-- Debugging document ingestion issues
-
-### **In-App Data Clearing:**
-In addition to command-line data clearing, you can also clear data directly from the web interface:
-
-1. Go to the **"Chat with Documents"** tab
-2. Click the **"Clear All Data"** button in the control panel
-3. All vector store documents and chat history will be cleared
-4. A new chat session will automatically start
-5. The status panel will update to reflect the cleared state
-
-**Features of in-app clearing:**
-- **Environment Detection**: Automatically works in both local and HF Space environments
-- **Comprehensive Clearing**: Removes both vector store documents and chat history files
-- **Smart Path Resolution**: Uses `/tmp/data/*` for HF Spaces, `./data/*` for local development
-- **User Feedback**: Shows detailed results of what was cleared
-- **Auto-Session Reset**: Starts fresh chat session after clearing
-- **Safe Operation**: Handles errors gracefully and provides status updates
-
 ### **Development Features:**
 - **Automatic Environment Setup**: Dependencies are checked and installed automatically
 - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
@@ -225,191 +268,14 @@ In addition to command-line data clearing, you can also clear data directly from
 - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
 - [Gradio](https://gradio.app/) for the UI framework
 
-[](https://huggingface.co/spaces/Ansemin101/Markit_v2)
+---
 
 **Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
 
+**Project Links:**
+- [GitHub Repository](https://github.com/ansemin/Markit_v2)
+- [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
+
-## Overview
-Markit is a powerful tool that converts various document formats (PDF, DOCX, images, etc.) to Markdown format. It uses different parsing engines and OCR methods to extract text from documents and convert them to clean, readable Markdown formats.
-
-## Key Features
-- **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
-- **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
-- **Advanced Parsing Engines**:
-  - **MarkItDown**: Comprehensive document conversion (PDFs, Office docs, images, audio, etc.)
-  - **Docling**: Advanced PDF understanding with table structure, layout analysis, and multiple OCR engines
-  - **Gemini Flash**: AI-powered conversion using Google's Gemini API
-  - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
-  - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model for image-to-text conversion
-- **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
-- **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
-- **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
-- **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
-
-## System Architecture
-
-The application is built with a clean, layered architecture following modern software engineering principles:
-
-### **Core Architecture Components:**
-- **Entry Point** (`app.py`): HF Spaces-compatible application launcher with environment setup
-- **Configuration Layer** (`src/core/config.py`): Centralized configuration management with validation
-- **Service Layer** (`src/services/`): Business logic for document processing and external services
-- **Core Engine** (`src/core/`): Document conversion workflows and utilities
-- **Parser Registry** (`src/parsers/`): Extensible parser system with standardized interfaces
-- **UI Layer** (`src/ui/`): Gradio-based web interface with enhanced error handling
-
-### **Key Architectural Features:**
-- **Separation of Concerns**: Clean boundaries between UI, business logic, and core utilities
-- **Centralized Configuration**: All settings, API keys, and validation in one place
-- **Custom Exception Hierarchy**: Proper error handling with user-friendly messages
-- **Plugin Architecture**: Easy addition of new document parsers
-- **HF Spaces Optimized**: Maintains compatibility with Hugging Face deployment requirements
-
-## Installation
-
-### For Local Development
-1. Clone the repository
-2. Install dependencies:
-```bash
-pip install -r requirements.txt
-```
-3. Install Tesseract OCR (required for OCR functionality):
-   - Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
-   - Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
-   - macOS: `brew install tesseract`
-
-4. Run the application:
-```bash
-python app.py
-```
-
-### API Keys Setup
-
-#### Gemini Flash Parser
-To use the Gemini Flash parser, you need to:
-1. Install the Google Generative AI client: `pip install google-genai`
-2. Set the API key environment variable:
-```bash
-# On Windows
-set GOOGLE_API_KEY=your_api_key_here
-
-# On Linux/Mac
-export GOOGLE_API_KEY=your_api_key_here
-```
-3. Alternatively, create a `.env` file in the project root with:
-```
-GOOGLE_API_KEY=your_api_key_here
-```
-4. Get your Gemini API key from [Google AI Studio](https://aistudio.google.com/app/apikey)
-
-#### GOT-OCR Parser
-The GOT-OCR parser requires:
-1. CUDA-capable GPU with sufficient memory
-2. The following dependencies will be installed automatically:
-```bash
-torch
-torchvision
-git+https://github.com/huggingface/transformers.git@main # Latest transformers from GitHub
-accelerate
-verovio
-numpy==1.26.3 # Specific version required
-opencv-python
-```
-3. Note that GOT-OCR only supports JPG and PNG image formats
-4. In HF Spaces, the integration with ZeroGPU is automatic and optimized for Stateless GPU environments
-
-## Deploying to Hugging Face Spaces
-
-### Environment Configuration
-1. Go to your Space settings: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME/settings`
-2. Add the following repository secrets:
-   - Name: `GOOGLE_API_KEY`
-   - Value: Your Gemini API key
-
-### Space Configuration
-Ensure your Hugging Face Space configuration includes:
-```yaml
-build:
-  dockerfile: Dockerfile
-  python_version: "3.10"
-  system_packages:
-    - "tesseract-ocr"
-    - "libtesseract-dev"
-```
-
-## How to Use
-
-### Document Conversion
-1. Upload your document using the file uploader
-2. Select a parser provider:
-   - **MarkItDown**: Best for comprehensive document conversion (supports PDFs, Office docs, images, audio, etc.)
-   - **Docling**: Best for advanced PDF understanding with table structure recognition and layout analysis
-   - **Gemini Flash**: Best for AI-powered conversions (requires API key)
-   - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
-   - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model (requires API key)
-3. Choose an OCR option based on your selected parser:
-   - **None**: No OCR processing (for documents with selectable text)
-   - **Tesseract**: Basic OCR using Tesseract
-   - **Advanced**: Enhanced OCR with layout preservation (available with specific parsers)
-   - **Plain Text**: For GOT-OCR, extracts raw text without formatting
-   - **Formatted Text**: For GOT-OCR, preserves formatting and converts to Markdown
-4. Select your desired output format:
-   - **Markdown**: Clean, readable markdown format
-   - **JSON**: Structured data representation
-   - **Text**: Plain text extraction
-   - **Document Tags**: XML-like structure tags
-5. Click "Convert" to process your document
-6. Navigate through pages using the navigation buttons for multi-page documents
-7. Download the converted content in your selected format
-
-## Configuration & Error Handling
-
-### **Automatic Configuration:**
-The application includes intelligent configuration management that:
-- Validates API keys and reports availability at startup
-- Checks for required dependencies and installs them automatically
-- Provides helpful warnings for missing optional components
-- Reports which parsers are available based on current configuration
-
-### **Enhanced Error Handling:**
-- **User-Friendly Messages**: Clear error descriptions instead of technical stack traces
-- **File Validation**: Automatic checking of file size and format compatibility
-- **Parser Availability**: Real-time detection of which parsers can be used
-- **Graceful Degradation**: Application continues working even if some parsers are unavailable
-
-## Troubleshooting
-
-### OCR Issues
-- Ensure Tesseract is properly installed and in your system PATH
-- Check the TESSDATA_PREFIX environment variable is set correctly
-- Verify language files are available in the tessdata directory
-
-### Gemini Flash Parser Issues
-- Confirm your API key is set correctly as an environment variable
-- Check for API usage limits or restrictions
-- Verify the document format is supported by the Gemini API
-
-### GOT-OCR Parser Issues
-- Ensure you have a CUDA-capable GPU with sufficient memory
-- Verify that all required dependencies are installed correctly
-- Remember that GOT-OCR only supports JPG and PNG image formats
-- If you encounter CUDA out-of-memory errors, try using a smaller image
-- In Hugging Face Spaces with Stateless GPU, ensure the `spaces` module is imported before any CUDA initialization
-- If you see errors about "CUDA must not be initialized in the main process", verify the import order in your app.py
-- If you encounter "cannot pickle '_thread.lock' object" errors, this indicates thread locks are being passed to the GPU function
-- The GOT-OCR parser has been optimized for ZeroGPU in Stateless GPU environments with proper serialization handling
-- For local development, the parser will fall back to CPU processing if GPU is not available
-
-### General Issues
-- Check the console logs for error messages
-- Ensure all dependencies are installed correctly
-- For large documents, try processing fewer pages at a time
 
 ## Development Guide
 
@@ -450,7 +316,7 @@ markit_v2/
 │   │   ├── docling_parser.py        # Docling parser with advanced PDF understanding
 │   │   ├── got_ocr_parser.py        # GOT-OCR parser for images
 │   │   ├── mistral_ocr_parser.py    # Mistral OCR parser
-│   │   └── gemini_flash_parser.py   # Gemini Flash parser
+│   │   └── gemini_flash_parser.py   # Enhanced Gemini Flash parser with multi-document processing
 │   ├── rag/                         # RAG (Retrieval-Augmented Generation) system
 │   │   ├── __init__.py              # Package initialization
 │   │   ├── embeddings.py            # OpenAI embedding model management
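The `MAX_BATCH_FILES` and `MAX_BATCH_SIZE` variables documented in the README are read when the application configuration is loaded. A small sketch of overriding the defaults for a local run follows; the override values are illustrative only.

```python
# Sketch: overriding the new batch limits via environment variables.
# Assumption: the variables must be set before src.core.config builds its AppConfig,
# since they are read in __post_init__ (see the config.py diff below).
import os

os.environ["MAX_BATCH_FILES"] = "3"                   # default: 5 files
os.environ["MAX_BATCH_SIZE"] = str(10 * 1024 * 1024)  # default: 20MB combined, value in bytes

from src.core.config import config  # noqa: E402 - import after the overrides are in place
```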
src/core/config.py
CHANGED
@@ -139,11 +139,20 @@ class AppConfig:
     allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx", ".docx", ".pptx", ".html", ".xhtml", ".md", ".csv")
     temp_dir: str = "./temp"
 
+    # Multi-document batch processing settings
+    max_batch_files: int = 5
+    max_batch_size: int = 20 * 1024 * 1024  # 20MB combined
+    batch_processing_types: tuple = ("combined", "individual", "summary", "comparison")
+
     def __post_init__(self):
         """Load application configuration from environment variables."""
         self.debug = os.getenv("DEBUG", "false").lower() == "true"
         self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
         self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
+
+        # Load batch processing configuration
+        self.max_batch_files = int(os.getenv("MAX_BATCH_FILES", self.max_batch_files))
+        self.max_batch_size = int(os.getenv("MAX_BATCH_SIZE", self.max_batch_size))
 
 
 class Config:
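A sketch of how the new `AppConfig` fields could back the limit checks performed elsewhere in this commit. The `config.app` attribute path is an assumption (it mirrors the `config.model` access used in the parser diff) and may differ from the real `Config` layout.

```python
# Sketch: validating a batch request against the new AppConfig fields.
# Assumption: the AppConfig instance hangs off the shared config object as `config.app`.
from src.core.config import config


def check_batch_request(file_sizes, processing_type):
    app_cfg = config.app
    if not file_sizes:
        raise ValueError("No files provided")
    if len(file_sizes) > app_cfg.max_batch_files:
        raise ValueError(f"At most {app_cfg.max_batch_files} files per batch")
    if sum(file_sizes) > app_cfg.max_batch_size:
        raise ValueError("Combined upload exceeds the batch size limit")
    if processing_type not in app_cfg.batch_processing_types:
        raise ValueError(f"Unknown processing type: {processing_type}")
```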
src/parsers/gemini_flash_parser.py
CHANGED
@@ -111,6 +111,225 @@ class GeminiFlashParser(DocumentParser):
             print(error_message)
             return f"# Error\n\n{error_message}\n\nPlease check your API key and try again."
 
+    def parse_multiple(self, file_paths: List[Union[str, Path]], processing_type: str = "combined", original_filenames: Optional[List[str]] = None, **kwargs) -> str:
+        """Parse multiple documents using Gemini Flash 2.0."""
+        if not GEMINI_AVAILABLE:
+            raise ImportError(
+                "The Google Gemini API client is not installed. "
+                "Please install it with 'pip install google-genai'."
+            )
+
+        if not api_key:
+            raise ValueError(
+                "GOOGLE_API_KEY environment variable is not set. "
+                "Please set it to your Gemini API key."
+            )
+
+        try:
+            # Convert to Path objects and validate
+            path_objects = [Path(fp) for fp in file_paths]
+            self._validate_batch_files(path_objects)
+
+            # Check for cancellation
+            if self._check_cancellation():
+                return "Conversion cancelled."
+
+            # Create client
+            client = genai.Client(api_key=api_key)
+
+            # Create contents for API call
+            contents = self._create_batch_contents(path_objects, processing_type, original_filenames)
+
+            # Check for cancellation before API call
+            if self._check_cancellation():
+                return "Conversion cancelled."
+
+            # Generate the response
+            response = client.models.generate_content(
+                model=config.model.gemini_model,
+                contents=contents,
+                config={
+                    "temperature": config.model.temperature,
+                    "top_p": 0.95,
+                    "top_k": 40,
+                    "max_output_tokens": config.model.max_tokens,
+                }
+            )
+
+            # Format the output based on processing type
+            formatted_output = self._format_batch_output(response.text, path_objects, processing_type, original_filenames)
+
+            return formatted_output
+
+        except Exception as e:
+            error_message = f"Error parsing multiple documents with Gemini Flash: {str(e)}"
+            print(error_message)
+            return f"# Error\n\n{error_message}\n\nPlease check your API key and try again."
+
+    def _validate_batch_files(self, file_paths: List[Path]) -> None:
+        """Validate batch of files for multi-document processing."""
+        # Check file count limit
+        if len(file_paths) == 0:
+            raise ValueError("No files provided for processing")
+        if len(file_paths) > 5:
+            raise ValueError("Maximum 5 files allowed for batch processing")
+
+        # Check individual files and calculate total size
+        total_size = 0
+        for file_path in file_paths:
+            if not file_path.exists():
+                raise ValueError(f"File not found: {file_path}")
+
+            file_size = file_path.stat().st_size
+            total_size += file_size
+
+            # Check individual file size (reasonable limit per file)
+            if file_size > 10 * 1024 * 1024:  # 10MB per file
+                raise ValueError(f"Individual file size exceeds 10MB: {file_path.name}")
+
+        # Check combined size limit
+        if total_size > 20 * 1024 * 1024:  # 20MB total
+            raise ValueError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
+
+        # Validate file types
+        for file_path in file_paths:
+            file_extension = file_path.suffix.lower()
+            if self._get_mime_type(file_extension) == "application/octet-stream":
+                raise ValueError(f"Unsupported file type: {file_path.name}")
+
+    def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
+        """Create contents list for batch API call."""
+        # Create the prompt based on processing type
+        prompt = self._create_batch_prompt(file_paths, processing_type, original_filenames)
+
+        # Start with the prompt
+        contents = [prompt]
+
+        # Add each file as a content part
+        for file_path in file_paths:
+            file_content = file_path.read_bytes()
+            mime_type = self._get_mime_type(file_path.suffix.lower())
+
+            contents.append(
+                genai.types.Part.from_bytes(
+                    data=file_content,
+                    mime_type=mime_type
+                )
+            )
+
+        return contents
+
+    def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        """Create appropriate prompt for batch processing."""
+        # Use original filenames if provided, otherwise use temp file names
+        if original_filenames:
+            file_names = original_filenames
+        else:
+            file_names = [fp.name for fp in file_paths]
+        file_list = "\n".join([f"- {name}" for name in file_names])
+
+        base_prompt = f"""I will provide you with {len(file_paths)} documents to process:
+{file_list}
+
+"""
+
+        if processing_type == "combined":
+            return base_prompt + """Please convert all documents to a single, cohesive markdown document.
+Merge the content logically, remove duplicate information, and create a unified structure with clear headings.
+Preserve important formatting, tables, lists, and structure from all documents.
+For images, include brief descriptions in markdown image syntax.
+Return only the combined markdown content, no other text."""
+
+        elif processing_type == "individual":
+            return base_prompt + """Please convert each document to markdown format and present them as separate sections.
+For each document, create a clear section header with the document name.
+Preserve the structure, headings, lists, tables, and formatting within each section.
+For images, include brief descriptions in markdown image syntax.
+Return the content in this format:
+
+# Document 1: [filename]
+[converted content]
+
+# Document 2: [filename]
+[converted content]
+
+Return only the markdown content, no other text."""
+
+        elif processing_type == "summary":
+            return base_prompt + """Please create a comprehensive analysis with two parts:
+
+1. EXECUTIVE SUMMARY: A concise overview summarizing the key points from all documents
+2. DETAILED SECTIONS: Individual converted sections for each document
+
+Structure the output as:
+
+# Executive Summary
+[Brief summary of key findings and themes across all documents]
+
+# Detailed Analysis
+
+## Document 1: [filename]
+[converted content]
+
+## Document 2: [filename]
+[converted content]
+
+Preserve formatting, tables, lists, and structure throughout.
+For images, include brief descriptions in markdown image syntax.
+Return only the markdown content, no other text."""
+
+        elif processing_type == "comparison":
+            return base_prompt + """Please create a comparative analysis of these documents:
+
+1. Create a comparison table highlighting key differences and similarities
+2. Provide individual document summaries
+3. Include a section on cross-document insights
+
+Structure the output as:
+
+# Document Comparison Analysis
+
+## Comparison Table
+| Aspect | Document 1 | Document 2 | Document 3 | ... |
+|--------|------------|------------|------------|-----|
+| [Key aspects found across documents] | | | | |
+
+## Individual Document Summaries
+
+### Document 1: [filename]
+[Key points and content summary]
+
+### Document 2: [filename]
+[Key points and content summary]
+
+## Cross-Document Insights
+[Analysis of patterns, contradictions, or complementary information across documents]
+
+Preserve important formatting and structure.
+For images, include brief descriptions in markdown image syntax.
+Return only the markdown content, no other text."""
+
+        else:
+            # Fallback to combined
+            return self._create_batch_prompt(file_paths, "combined")
+
+    def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        """Format the batch processing output."""
+        # Add metadata header using original filenames if provided
+        if original_filenames:
+            file_names = original_filenames
+        else:
+            file_names = [fp.name for fp in file_paths]
+
+        header = f"""<!-- Multi-Document Processing Results -->
+<!-- Processing Type: {processing_type} -->
+<!-- Files Processed: {len(file_paths)} -->
+<!-- File Names: {', '.join(file_names)} -->
+
+"""
+
+        return header + response_text
+
     def _get_mime_type(self, file_extension: str) -> str:
         """Get the MIME type for a file extension."""
         mime_types = {
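For reference, a sketch of exercising the new `parse_multiple` path directly on the parser. It assumes `GOOGLE_API_KEY` is set and uses placeholder file names; in the app, `DocumentService` also installs a cancellation flag and supplies the original filenames for you.

```python
# Sketch: calling the multi-document parser directly (placeholder file names).
from src.parsers.gemini_flash_parser import GeminiFlashParser

parser = GeminiFlashParser()
markdown = parser.parse_multiple(
    file_paths=["reports/q1.pdf", "reports/q2.pdf"],
    processing_type="comparison",  # combined | individual | summary | comparison
    original_filenames=["Q1 report.pdf", "Q2 report.pdf"],  # shown in the prompt and metadata header
)
print(markdown.splitlines()[0])  # "<!-- Multi-Document Processing Results -->"
```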
src/services/document_service.py
CHANGED
@@ -7,7 +7,7 @@ import time
 import os
 import threading
 from pathlib import Path
-from typing import Optional, Tuple, Any
+from typing import Optional, Tuple, Any, List
 
 from src.core.config import config
 from src.core.exceptions import (
@@ -269,4 +269,248 @@
         if self._check_cancellation() and output_path:
             self._safe_delete_file(output_path)
 
         self._conversion_in_progress = False
+
+    def convert_documents(
+        self,
+        file_paths: List[str],
+        parser_name: str,
+        ocr_method_name: str,
+        output_format: str,
+        processing_type: str = "combined"
+    ) -> Tuple[str, Optional[str]]:
+        """
+        Unified method to convert single or multiple documents.
+
+        Args:
+            file_paths: List of paths to input files (can be single file)
+            parser_name: Name of the parser to use
+            ocr_method_name: Name of the OCR method to use
+            output_format: Output format (Markdown, JSON, Text, Document Tags)
+            processing_type: Type of multi-document processing (combined, individual, summary, comparison)
+
+        Returns:
+            Tuple of (content, output_file_path)
+
+        Raises:
+            DocumentProcessingError: For general processing errors
+            FileSizeLimitError: When file(s) are too large
+            UnsupportedFileTypeError: For unsupported file types
+            ConversionError: When conversion fails or is cancelled
+        """
+        if not file_paths:
+            raise DocumentProcessingError("No files provided")
+
+        # Route to appropriate processing method
+        if len(file_paths) == 1:
+            # Single file processing - use existing method
+            return self.convert_document(
+                file_paths[0], parser_name, ocr_method_name, output_format
+            )
+        else:
+            # Multi-file processing - use new batch method
+            return self._convert_multiple_documents(
+                file_paths, parser_name, ocr_method_name, output_format, processing_type
+            )
+
+    def _convert_multiple_documents(
+        self,
+        file_paths: List[str],
+        parser_name: str,
+        ocr_method_name: str,
+        output_format: str,
+        processing_type: str
+    ) -> Tuple[str, Optional[str]]:
+        """
+        Convert multiple documents using batch processing.
+
+        Args:
+            file_paths: List of paths to input files
+            parser_name: Name of the parser to use
+            ocr_method_name: Name of the OCR method to use
+            output_format: Output format (Markdown, JSON, Text, Document Tags)
+            processing_type: Type of multi-document processing
+
+        Returns:
+            Tuple of (content, output_file_path)
+        """
+        self._conversion_in_progress = True
+        temp_inputs = []
+        output_path = None
+        self._original_file_paths = file_paths  # Store original paths for filename reference
+
+        try:
+            # Validate all files first
+            for file_path in file_paths:
+                self._validate_file(file_path)
+
+            if self._check_cancellation():
+                raise ConversionError("Conversion cancelled")
+
+            # Create temporary files with English names
+            for original_path in file_paths:
+                temp_path = self._create_temp_file(original_path)
+                temp_inputs.append(temp_path)
+
+            if self._check_cancellation():
+                raise ConversionError("Conversion cancelled during file preparation")
+
+            # Process documents using parser factory with multi-document support
+            start_time = time.time()
+            content = self._process_multiple_with_parser(
+                temp_inputs, parser_name, ocr_method_name, output_format, processing_type
+            )
+
+            if content == "Conversion cancelled.":
+                raise ConversionError("Conversion cancelled by parser")
+
+            duration = time.time() - start_time
+            logging.info(f"Multiple documents processed in {duration:.2f} seconds")
+
+            if self._check_cancellation():
+                raise ConversionError("Conversion cancelled")
+
+            # Create output file with batch naming
+            output_path = self._create_batch_output_file(
+                content, output_format, file_paths, processing_type
+            )
+
+            return content, output_path
+
+        except (DocumentProcessingError, FileSizeLimitError, UnsupportedFileTypeError, ConversionError):
+            # Re-raise our custom exceptions
+            for temp_path in temp_inputs:
+                self._safe_delete_file(temp_path)
+            self._safe_delete_file(output_path)
+            raise
+        except Exception as e:
+            # Wrap unexpected exceptions
+            for temp_path in temp_inputs:
+                self._safe_delete_file(temp_path)
+            self._safe_delete_file(output_path)
+            raise DocumentProcessingError(f"Unexpected error during batch conversion: {str(e)}")
+        finally:
+            # Clean up temp input files
+            for temp_path in temp_inputs:
+                self._safe_delete_file(temp_path)
+
+            # Clean up output file if cancelled
+            if self._check_cancellation() and output_path:
+                self._safe_delete_file(output_path)
+
+            self._conversion_in_progress = False
+
+    def _process_multiple_with_parser(
+        self,
+        temp_file_paths: List[str],
+        parser_name: str,
+        ocr_method_name: str,
+        output_format: str,
+        processing_type: str
+    ) -> str:
+        """Process multiple documents using the parser factory."""
+        try:
+            # Get parser instance
+            from src.parsers.parser_registry import ParserRegistry
+            parser_class = ParserRegistry.get_parser_class(parser_name)
+
+            if not parser_class:
+                raise DocumentProcessingError(f"Parser '{parser_name}' not found")
+
+            parser_instance = parser_class()
+            parser_instance.set_cancellation_flag(self._cancellation_flag)
+
+            # Check if parser supports multi-document processing
+            if hasattr(parser_instance, 'parse_multiple'):
+                # Use multi-document parsing with original filenames for reference
+                return parser_instance.parse_multiple(
+                    file_paths=temp_file_paths,
+                    processing_type=processing_type,
+                    ocr_method=ocr_method_name,
+                    output_format=output_format.lower(),
+                    original_filenames=[Path(fp).name for fp in self._original_file_paths]
+                )
+            else:
+                # Fallback: process individually and combine
+                results = []
+                for i, file_path in enumerate(temp_file_paths):
+                    if self._check_cancellation():
+                        return "Conversion cancelled."
+
+                    result = parser_instance.parse(
+                        file_path=file_path,
+                        ocr_method=ocr_method_name
+                    )
+
+                    # Add section header for individual results using original filename
+                    original_filename = Path(self._original_file_paths[i]).name
+                    results.append(f"# Document {i+1}: {original_filename}\n\n{result}")
+
+                return "\n\n---\n\n".join(results)
+
+        except Exception as e:
+            raise DocumentProcessingError(f"Error processing multiple documents: {str(e)}")
+
+    def _create_batch_output_file(
+        self,
+        content: str,
+        output_format: str,
+        original_file_paths: List[str],
+        processing_type: str
+    ) -> str:
+        """Create output file for batch processing with descriptive naming."""
+        # Determine file extension
+        format_extensions = {
+            "markdown": ".md",
+            "json": ".json",
+            "text": ".txt",
+            "document tags": ".doctags"
+        }
+        ext = format_extensions.get(output_format.lower(), ".txt")
+
+        if self._check_cancellation():
+            raise ConversionError("Conversion cancelled before output file creation")
+
+        # Create descriptive filename for batch processing
+        file_count = len(original_file_paths)
+        timestamp = time.strftime("%Y%m%d_%H%M%S")
+
+        if processing_type == "combined":
+            filename = f"Combined_{file_count}_Documents_{timestamp}{ext}"
+        elif processing_type == "individual":
+            filename = f"Individual_Sections_{file_count}_Files_{timestamp}{ext}"
+        elif processing_type == "summary":
+            filename = f"Summary_Analysis_{file_count}_Files_{timestamp}{ext}"
+        elif processing_type == "comparison":
+            filename = f"Comparison_Analysis_{file_count}_Files_{timestamp}{ext}"
+        else:
+            filename = f"Batch_Processing_{file_count}_Files_{timestamp}{ext}"
+
+        # Create output file in temp directory
+        temp_dir = tempfile.gettempdir()
+        tmp_path = os.path.join(temp_dir, filename)
+
+        # Handle filename conflicts
+        counter = 1
+        base_path = tmp_path
+        while os.path.exists(tmp_path):
+            name_part = filename.replace(ext, f"_{counter}{ext}")
+            tmp_path = os.path.join(temp_dir, name_part)
+            counter += 1
+
+        # Write content to file
+        try:
+            with open(tmp_path, "w", encoding="utf-8") as f:
+                # Write in chunks with cancellation checks
+                chunk_size = 10000  # characters
+                for i in range(0, len(content), chunk_size):
+                    if self._check_cancellation():
+                        self._safe_delete_file(tmp_path)
+                        raise ConversionError("Conversion cancelled during output file writing")
+
+                    f.write(content[i:i+chunk_size])
+        except Exception as e:
+            self._safe_delete_file(tmp_path)
+            raise ConversionError(f"Failed to write batch output file: {str(e)}")
+
+        return tmp_path
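The unified method documents its exception contract in the docstring above. A brief sketch of mapping those exception types to user-facing messages follows; file names and option strings are placeholders, and the exception classes are the ones imported at the top of this module.

```python
# Sketch: surfacing the documented exception hierarchy from convert_documents().
from src.core.exceptions import (
    ConversionError,
    DocumentProcessingError,
    FileSizeLimitError,
    UnsupportedFileTypeError,
)
from src.services.document_service import DocumentService

service = DocumentService()
try:
    content, path = service.convert_documents(
        ["a.pdf", "b.docx", "c.png"], "Gemini Flash", "None", "Markdown", "summary"
    )
except (FileSizeLimitError, UnsupportedFileTypeError) as exc:
    print(f"Rejected before conversion: {exc}")
except ConversionError as exc:
    print(f"Conversion failed or was cancelled: {exc}")
except DocumentProcessingError as exc:
    print(f"Processing error: {exc}")
```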
src/ui/ui.py
CHANGED
@@ -45,6 +45,51 @@ def monitor_cancellation():
|
|
45 |
time.sleep(0.1) # Check every 100ms
|
46 |
logger.info("Cancellation monitor thread ending")
|
47 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
48 |
def validate_file_for_parser(file_path, parser_name):
|
49 |
"""Validate if the file type is supported by the selected parser."""
|
50 |
if not file_path:
|
@@ -112,8 +157,48 @@ def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format
|
|
112 |
|
113 |
return thread, results
|
114 |
|
115 |
-
def
|
116 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
117 |
global conversion_cancelled
|
118 |
|
119 |
# Check if we should cancel before starting
|
@@ -121,16 +206,31 @@ def handle_convert(file_path, parser_name, ocr_method_name, output_format, is_ca
|
|
121 |
logger.info("Conversion cancelled before starting")
|
122 |
return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
|
123 |
|
124 |
-
# Validate
|
125 |
-
|
126 |
-
|
127 |
-
logger.error(
|
128 |
return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
|
129 |
|
130 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
131 |
|
132 |
# Start the conversion in a separate thread
|
133 |
-
thread, results =
|
134 |
|
135 |
# Start the monitoring thread
|
136 |
monitor_thread = threading.Thread(target=monitor_cancellation)
|
@@ -763,8 +863,27 @@ def create_ui():
|
|
763 |
# State to store the output format (fixed to Markdown)
|
764 |
output_format_state = gr.State("Markdown")
|
765 |
|
766 |
-
#
|
767 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
768 |
|
769 |
# Provider and OCR options below the file input
|
770 |
with gr.Row(elem_classes=["provider-options-row"]):
|
@@ -805,6 +924,14 @@ def create_ui():
|
|
805 |
cancel_button = gr.Button("Cancel", variant="stop", visible=False)
|
806 |
|
807 |
# Event handlers for document converter
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
808 |
provider_dropdown.change(
|
809 |
lambda p: gr.Dropdown(
|
810 |
choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
|
@@ -845,7 +972,7 @@ def create_ui():
|
|
845 |
queue=False # Execute immediately
|
846 |
).then(
|
847 |
fn=handle_convert,
|
848 |
-
inputs=[
|
849 |
outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
|
850 |
)
|
851 |
|
|
|
@@ -45,6 +45,51 @@ def monitor_cancellation():
45 |           time.sleep(0.1)  # Check every 100ms
46 |       logger.info("Cancellation monitor thread ending")
47 |
48 | + def update_ui_for_file_count(files):
49 | +     """Update UI components based on the number of files uploaded."""
50 | +     if not files or len(files) == 0:
51 | +         return (
52 | +             gr.update(visible=False),  # processing_type_selector
53 | +             "<div style='color: #666; font-style: italic;'>Upload documents to begin</div>"  # file_status_text
54 | +         )
55 | +
56 | +     if len(files) == 1:
57 | +         file_name = files[0].name if hasattr(files[0], 'name') else str(files[0])
58 | +         return (
59 | +             gr.update(visible=False),  # processing_type_selector (hidden for single file)
60 | +             f"<div style='color: #2563eb; font-weight: 500;'>📄 Single document: {file_name}</div>"
61 | +         )
62 | +     else:
63 | +         # Calculate total size for validation display
64 | +         total_size = 0
65 | +         try:
66 | +             for file in files:
67 | +                 if hasattr(file, 'size'):
68 | +                     total_size += file.size
69 | +                 elif hasattr(file, 'name'):
70 | +                     # For file paths, get size from filesystem
71 | +                     total_size += Path(file.name).stat().st_size
72 | +         except:
73 | +             pass  # Size calculation is optional for display
74 | +
75 | +         size_display = f" ({total_size / (1024*1024):.1f}MB)" if total_size > 0 else ""
76 | +
77 | +         # Check if within limits
78 | +         if len(files) > 5:
79 | +             status_color = "#dc2626"  # red
80 | +             status_text = f"⚠️ Too many files: {len(files)}/5 (max 5 files allowed)"
81 | +         elif total_size > 20 * 1024 * 1024:  # 20MB
82 | +             status_color = "#dc2626"  # red
83 | +             status_text = f"⚠️ Files too large{size_display} (max 20MB combined)"
84 | +         else:
85 | +             status_color = "#059669"  # green
86 | +             status_text = f"📚 Batch mode: {len(files)} files{size_display}"
87 | +
88 | +         return (
89 | +             gr.update(visible=True),  # processing_type_selector (visible for multiple files)
90 | +             f"<div style='color: {status_color}; font-weight: 500;'>{status_text}</div>"
91 | +         )
92 | +
93 |   def validate_file_for_parser(file_path, parser_name):
94 |       """Validate if the file type is supported by the selected parser."""
95 |       if not file_path:
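For reference, a hypothetical way to exercise the new upload-status logic outside the app; the `SimpleNamespace` stand-ins and the import path are assumptions, not part of the diff:

```python
# Illustrative only: poke update_ui_for_file_count with stand-in file objects.
from types import SimpleNamespace

from src.ui.ui import update_ui_for_file_count  # assumed import path

one_file = [SimpleNamespace(name="example data.pdf", size=2 * 1024 * 1024)]          # 2 MB
too_many = [SimpleNamespace(name=f"doc{i}.pdf", size=1024 * 1024) for i in range(6)]  # 6 files

# Single file: the processing-type selector stays hidden.
selector_update, status_html = update_ui_for_file_count(one_file)

# Six files: over the 5-file limit, so the status HTML switches to the red warning.
selector_update, status_html = update_ui_for_file_count(too_many)
print(status_html)
```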
@@ -112,8 +157,48 @@ def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format
157 |
158 |       return thread, results
159 |
115 | - def …
116 | -     """ …
160 | + def run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type):
161 | +     """Run the conversion in a separate thread for multiple files."""
162 | +     import threading
163 | +     from src.services.document_service import DocumentService
164 | +
165 | +     # Results will be shared between threads
166 | +     results = {"content": None, "download_file": None, "error": None}
167 | +
168 | +     def conversion_worker():
169 | +         try:
170 | +             logger.info(f"Starting multi-file conversion thread for {len(file_paths)} files")
171 | +
172 | +             # Use the new document service unified method
173 | +             document_service = DocumentService()
174 | +             document_service.set_cancellation_flag(conversion_cancelled)
175 | +
176 | +             # Call the unified convert_documents method
177 | +             content, output_file = document_service.convert_documents(
178 | +                 file_paths=file_paths,
179 | +                 parser_name=parser_name,
180 | +                 ocr_method_name=ocr_method_name,
181 | +                 output_format=output_format,
182 | +                 processing_type=processing_type
183 | +             )
184 | +
185 | +             logger.info(f"Multi-file conversion completed successfully for {len(file_paths)} files")
186 | +             results["content"] = content
187 | +             results["download_file"] = output_file
188 | +
189 | +         except Exception as e:
190 | +             logger.error(f"Error during multi-file conversion: {str(e)}")
191 | +             results["error"] = str(e)
192 | +
193 | +     # Create and start the thread
194 | +     thread = threading.Thread(target=conversion_worker)
195 | +     thread.daemon = True
196 | +     thread.start()
197 | +
198 | +     return thread, results
199 | +
200 | + def handle_convert(files, parser_name, ocr_method_name, output_format, processing_type, is_cancelled):
201 | +     """Handle file conversion for single or multiple files."""
202 |       global conversion_cancelled
203 |
204 |       # Check if we should cancel before starting
@@ -121,16 +206,31 @@ def handle_convert(file_path, parser_name, ocr_method_name, output_format, is_ca
206 |           logger.info("Conversion cancelled before starting")
207 |           return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
208 |
124 | -     # Validate …
125 | -
126 | -
127 | -     logger.error( …
209 | +     # Validate files input
210 | +     if not files or len(files) == 0:
211 | +         error_msg = "No files uploaded. Please upload at least one document."
212 | +         logger.error(error_msg)
213 |           return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
214 |
130 | -
215 | +     # Convert Gradio file objects to file paths
216 | +     file_paths = []
217 | +     for file in files:
218 | +         if hasattr(file, 'name'):
219 | +             file_paths.append(file.name)
220 | +         else:
221 | +             file_paths.append(str(file))
222 | +
223 | +     # Validate file types for the selected parser
224 | +     for file_path in file_paths:
225 | +         is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
226 | +         if not is_valid:
227 | +             logger.error(f"File validation error: {error_msg}")
228 | +             return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
229 | +
230 | +     logger.info(f"Starting conversion of {len(file_paths)} file(s) with cancellation flag cleared")
231 |
232 |       # Start the conversion in a separate thread
133 | -     thread, results = …
233 | +     thread, results = run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type)
234 |
235 |       # Start the monitoring thread
236 |       monitor_thread = threading.Thread(target=monitor_cancellation)
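The worker thread reports back through the shared `results` dict rather than a return value, so a caller polls the thread and reads the dict once it finishes. A rough sketch of how that pair might be consumed; the parser and OCR names below are placeholders, the real choices come from `ParserRegistry`:

```python
import time

# Hypothetical call from within the UI module, where logger and the
# cancellation flag are already defined.
thread, results = run_conversion_thread_multi(
    file_paths=["report.pdf", "notes.docx"],
    parser_name="MarkItDown",     # placeholder parser label
    ocr_method_name="Default",    # placeholder OCR label
    output_format="Markdown",
    processing_type="combined",
)

while thread.is_alive():
    time.sleep(0.1)  # same 100 ms cadence the cancellation monitor uses

if results["error"]:
    print(f"Conversion failed: {results['error']}")
else:
    print(results["content"][:200])                      # preview of the combined Markdown
    print(f"Download file: {results['download_file']}")
```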
@@ -763,8 +863,27 @@ def create_ui():
863 |       # State to store the output format (fixed to Markdown)
864 |       output_format_state = gr.State("Markdown")
865 |
766 | -     # …
767 | -
866 | +     # Multi-file input (supports single and multiple files)
867 | +     files_input = gr.Files(
868 | +         label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
869 | +         file_count="multiple",
870 | +         file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
871 | +     )
872 | +
873 | +     # Processing type selector (visible only for multiple files)
874 | +     processing_type_selector = gr.Radio(
875 | +         choices=["combined", "individual", "summary", "comparison"],
876 | +         value="combined",
877 | +         label="Multi-Document Processing Type",
878 | +         info="How to process multiple documents together",
879 | +         visible=False
880 | +     )
881 | +
882 | +     # Status text to show file count and processing mode
883 | +     file_status_text = gr.HTML(
884 | +         value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
885 | +         label=""
886 | +     )
887 |
888 |       # Provider and OCR options below the file input
889 |       with gr.Row(elem_classes=["provider-options-row"]):
@@ -805,6 +924,14 @@
924 |       cancel_button = gr.Button("Cancel", variant="stop", visible=False)
925 |
926 |       # Event handlers for document converter
927 | +
928 | +     # Update UI when files are uploaded/changed
929 | +     files_input.change(
930 | +         fn=update_ui_for_file_count,
931 | +         inputs=[files_input],
932 | +         outputs=[processing_type_selector, file_status_text]
933 | +     )
934 | +
935 |       provider_dropdown.change(
936 |           lambda p: gr.Dropdown(
937 |               choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
@@ -845,7 +972,7 @@
972 |           queue=False  # Execute immediately
973 |       ).then(
974 |           fn=handle_convert,
848 | -         inputs=[ …
975 | +         inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state, processing_type_selector, cancel_requested],
976 |           outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
977 |       )
978 |
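This last hunk only touches the `.then()` step, so the initiating `convert_button.click(...)` call is not shown. A hedged sketch of how such a two-step chain is typically wired in Gradio, with component names copied from the diff and a hypothetical first-step handler:

```python
import gradio as gr

# Assumed first step: the diff does not show this handler, only the .then() that follows it.
def show_converting_state():
    """Flip button visibility while the conversion thread runs."""
    return gr.update(visible=False), gr.update(visible=True)

convert_button.click(
    fn=show_converting_state,
    outputs=[convert_button, cancel_button],
    queue=False,  # execute immediately, as in the diff
).then(
    fn=handle_convert,
    inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state,
            processing_type_selector, cancel_requested],
    outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread],
)
```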