AnseMin committed on
Commit
111954a
·
1 Parent(s): 35912f6

Implement multi-document processing capabilities and enhance UI

- Introduced support for processing up to 5 files simultaneously with a combined size limit of 20MB.
- Added new processing types: Combined, Individual, Summary, and Comparison for enhanced document analysis.
- Updated the Gemini Flash parser to handle multiple documents and format outputs based on processing type.
- Enhanced the UI to dynamically display processing options and real-time validation for file uploads.
- Unified the document conversion method to streamline single- and multi-file processing.
- Improved error handling and logging for batch processing operations.

README.md CHANGED
@@ -20,11 +20,17 @@ A Hugging Face Space that converts various document formats to Markdown and lets
20
 
21
  ### Document Conversion
22
  - Convert PDFs, Office documents, images, and more to Markdown
 
23
  - Multiple parser options:
24
  - MarkItDown: For comprehensive document conversion
25
  - Docling: For advanced PDF understanding with table structure recognition
26
  - GOT-OCR: For image-based OCR with LaTeX support
27
- - Gemini Flash: For AI-powered text extraction
28
  - Download converted documents as Markdown files
29
 
30
 ### 🤖 RAG Chat with Documents
@@ -40,34 +46,68 @@ A Hugging Face Space that converts various document formats to Markdown and lets
40
 
41
  ### User Interface
42
  - **Dual-tab interface**: Document Converter + Chat
 
 
 
43
  - **Real-time status monitoring** for RAG system with environment detection
44
  - **Auto-ingestion** of converted documents into chat system
45
  - **Enhanced status display**: Shows vector store document count, chat history files, and environment type
46
  - **Data management controls**: Clear All Data button with comprehensive feedback
47
 - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
 
48
  - Clean, responsive UI with modern styling
49
 
50
- ## Using MarkItDown & Docling
51
 
52
- This app integrates multiple powerful document conversion libraries:
53
 
54
- ### MarkItDown
55
- [Microsoft's MarkItDown](https://github.com/microsoft/markitdown) library supports a wide range of file formats:
56
 
57
- ### Docling
58
- [IBM's Docling](https://github.com/DS4SD/docling) provides advanced document understanding with:
59
- - **Advanced PDF parsing** with layout understanding, reading order, and table structure recognition
60
- - **Multiple OCR engines** including EasyOCR and Tesseract
61
- - **Document format support**: PDF, DOCX, XLSX, PPTX, HTML, Images (PNG, JPG, TIFF, BMP, WEBP), CSV
62
- - **Local execution** for sensitive data processing
63
- - **Formula and code understanding** with enrichment features
64
- - **Picture classification** and description capabilities
65
 
66
- ### MarkItDown Features
67
- - PDF, PowerPoint (PPTX), Word (DOCX), Excel (XLSX)
68
- - Images (JPG, PNG), Audio files (with transcription)
69
- - HTML, Text-based formats (CSV, JSON, XML)
70
- - ZIP files, YouTube URLs, EPubs, and more!
71
 
72
  ## Environment Variables
73
 
@@ -81,6 +121,8 @@ The application uses centralized configuration management. You can enhance funct
81
 ### ⚙️ **Configuration Options:**
82
  - `DEBUG`: Set to `true` for debug mode with verbose logging
83
  - `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
 
 
84
  - `TEMP_DIR`: Directory for temporary files (default: ./temp)
85
  - `TESSERACT_PATH`: Custom path to Tesseract executable
86
  - `TESSDATA_PATH`: Path to Tesseract language data
@@ -118,15 +160,40 @@ The application uses centralized configuration management. You can enhance funct
118
  ## Usage
119
 
120
  ### Document Conversion
 
 
121
  1. Go to the **"Document Converter"** tab
122
- 2. Select a file to upload
123
  3. Choose your preferred parser:
124
  - **"MarkItDown"** for comprehensive document conversion
125
  - **"Docling"** for advanced PDF understanding and table extraction
 
126
  4. Select an OCR method based on your chosen parser
127
  5. Click "Convert"
128
  6. View the Markdown output and download the converted file
129
- 7. **Documents are automatically added to the RAG system** for chat functionality
130
 
131
 ### 🤖 Chat with Documents
132
  1. Go to the **"Chat with Documents"** tab
@@ -165,54 +232,30 @@ The application uses centralized configuration management. You can enhance funct
165
  # For local development (faster startup)
166
  python run_app.py
167
 
168
- # For testing with clean data (clears chat history and vector store)
169
  python run_app.py --clear-data-and-run
170
 
171
- # To only clear data without running the app
172
- python run_app.py --clear-data
173
  ```
174
 
175
- ### 🧹 **Data Management for Testing:**
176
- For local development and testing, you can easily clear all stored data:
177
 
178
- ```bash
179
- # Clear all data and exit (useful for quick cleanup)
180
- python run_app.py --clear-data
181
 
182
- # Clear all data then run the app (useful for fresh testing)
183
- python run_app.py --clear-data-and-run
 
184
 
185
- # Show all available options
186
- python run_app.py --help
187
- ```
 
188
 
189
  **What gets cleared:**
190
- - `data/chat_history/*` - All saved chat sessions
191
  - `data/vector_store/*` - All document embeddings and vector database
192
 
193
- This is particularly useful when:
194
- - Testing new RAG features with fresh data
195
- - Clearing old chat sessions and documents
196
- - Resetting the system to a clean state
197
- - Debugging document ingestion issues
198
-
199
- ### 🗑️ **In-App Data Clearing:**
200
- In addition to command-line data clearing, you can also clear data directly from the web interface:
201
-
202
- 1. Go to the **"Chat with Documents"** tab
203
- 2. Click the **"🗑️ Clear All Data"** button in the control panel
204
- 3. All vector store documents and chat history will be cleared
205
- 4. A new chat session will automatically start
206
- 5. The status panel will update to reflect the cleared state
207
-
208
- **Features of in-app clearing:**
209
- - **Environment Detection**: Automatically works in both local and HF Space environments
210
- - **Comprehensive Clearing**: Removes both vector store documents and chat history files
211
- - **Smart Path Resolution**: Uses `/tmp/data/*` for HF Spaces, `./data/*` for local development
212
- - **User Feedback**: Shows detailed results of what was cleared
213
- - **Auto-Session Reset**: Starts fresh chat session after clearing
214
- - **Safe Operation**: Handles errors gracefully and provides status updates
215
-
216
 ### 🧪 **Development Features:**
217
  - **Automatic Environment Setup**: Dependencies are checked and installed automatically
218
  - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
@@ -225,191 +268,14 @@ In addition to command-line data clearing, you can also clear data directly from
225
  - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
226
  - [Gradio](https://gradio.app/) for the UI framework
227
 
228
- # Markit: Document to Markdown Converter
229
-
230
- [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Ansemin101/Markit_v2)
231
 
232
  **Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
233
 
234
- ## Project Links
235
- - **GitHub Repository**: [github.com/ansemin/Markit_v2](https://github.com/ansemin/Markit_v2)
236
- - **Hugging Face Space**: [huggingface.co/spaces/Ansemin101/Markit_v2](https://huggingface.co/spaces/Ansemin101/Markit_v2)
237
-
238
- ## Overview
239
- Markit is a powerful tool that converts various document formats (PDF, DOCX, images, etc.) to Markdown format. It uses different parsing engines and OCR methods to extract text from documents and convert them to clean, readable Markdown formats.
240
-
241
- ## Key Features
242
- - **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
243
- - **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
244
- - **Advanced Parsing Engines**:
245
- - **MarkItDown**: Comprehensive document conversion (PDFs, Office docs, images, audio, etc.)
246
- - **Docling**: Advanced PDF understanding with table structure, layout analysis, and multiple OCR engines
247
- - **Gemini Flash**: AI-powered conversion using Google's Gemini API
248
- - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
249
- - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model for image-to-text conversion
250
- - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
251
- - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
252
- - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
253
- - **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
254
-
255
- ## System Architecture
256
-
257
- The application is built with a clean, layered architecture following modern software engineering principles:
258
-
259
- ### πŸ—οΈ **Core Architecture Components:**
260
- - **Entry Point** (`app.py`): HF Spaces-compatible application launcher with environment setup
261
- - **Configuration Layer** (`src/core/config.py`): Centralized configuration management with validation
262
- - **Service Layer** (`src/services/`): Business logic for document processing and external services
263
- - **Core Engine** (`src/core/`): Document conversion workflows and utilities
264
- - **Parser Registry** (`src/parsers/`): Extensible parser system with standardized interfaces
265
- - **UI Layer** (`src/ui/`): Gradio-based web interface with enhanced error handling
266
-
267
- ### 🎯 **Key Architectural Features:**
268
- - **Separation of Concerns**: Clean boundaries between UI, business logic, and core utilities
269
- - **Centralized Configuration**: All settings, API keys, and validation in one place
270
- - **Custom Exception Hierarchy**: Proper error handling with user-friendly messages
271
- - **Plugin Architecture**: Easy addition of new document parsers
272
- - **HF Spaces Optimized**: Maintains compatibility with Hugging Face deployment requirements
273
-
274
- ## Installation
275
-
276
- ### For Local Development
277
- 1. Clone the repository
278
- 2. Install dependencies:
279
- ```bash
280
- pip install -r requirements.txt
281
- ```
282
- 3. Install Tesseract OCR (required for OCR functionality):
283
- - Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
284
- - Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
285
- - macOS: `brew install tesseract`
286
-
287
- 4. Run the application:
288
- ```bash
289
- python app.py
290
- ```
291
-
292
- ### API Keys Setup
293
-
294
- #### Gemini Flash Parser
295
- To use the Gemini Flash parser, you need to:
296
- 1. Install the Google Generative AI client: `pip install google-genai`
297
- 2. Set the API key environment variable:
298
- ```bash
299
- # On Windows
300
- set GOOGLE_API_KEY=your_api_key_here
301
-
302
- # On Linux/Mac
303
- export GOOGLE_API_KEY=your_api_key_here
304
- ```
305
- 3. Alternatively, create a `.env` file in the project root with:
306
- ```
307
- GOOGLE_API_KEY=your_api_key_here
308
- ```
309
- 4. Get your Gemini API key from [Google AI Studio](https://aistudio.google.com/app/apikey)
310
-
311
- #### GOT-OCR Parser
312
- The GOT-OCR parser requires:
313
- 1. CUDA-capable GPU with sufficient memory
314
- 2. The following dependencies will be installed automatically:
315
- ```bash
316
- torch
317
- torchvision
318
- git+https://github.com/huggingface/transformers.git@main # Latest transformers from GitHub
319
- accelerate
320
- verovio
321
- numpy==1.26.3 # Specific version required
322
- opencv-python
323
- ```
324
- 3. Note that GOT-OCR only supports JPG and PNG image formats
325
- 4. In HF Spaces, the integration with ZeroGPU is automatic and optimized for Stateless GPU environments
326
-
327
- ## Deploying to Hugging Face Spaces
328
-
329
- ### Environment Configuration
330
- 1. Go to your Space settings: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME/settings`
331
- 2. Add the following repository secrets:
332
- - Name: `GOOGLE_API_KEY`
333
- - Value: Your Gemini API key
334
-
335
- ### Space Configuration
336
- Ensure your Hugging Face Space configuration includes:
337
- ```yaml
338
- build:
339
- dockerfile: Dockerfile
340
- python_version: "3.10"
341
- system_packages:
342
- - "tesseract-ocr"
343
- - "libtesseract-dev"
344
- ```
345
 
346
- ## How to Use
347
-
348
- ### Document Conversion
349
- 1. Upload your document using the file uploader
350
- 2. Select a parser provider:
351
- - **MarkItDown**: Best for comprehensive document conversion (supports PDFs, Office docs, images, audio, etc.)
352
- - **Docling**: Best for advanced PDF understanding with table structure recognition and layout analysis
353
- - **Gemini Flash**: Best for AI-powered conversions (requires API key)
354
- - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
355
- - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model (requires API key)
356
- 3. Choose an OCR option based on your selected parser:
357
- - **None**: No OCR processing (for documents with selectable text)
358
- - **Tesseract**: Basic OCR using Tesseract
359
- - **Advanced**: Enhanced OCR with layout preservation (available with specific parsers)
360
- - **Plain Text**: For GOT-OCR, extracts raw text without formatting
361
- - **Formatted Text**: For GOT-OCR, preserves formatting and converts to Markdown
362
- 4. Select your desired output format:
363
- - **Markdown**: Clean, readable markdown format
364
- - **JSON**: Structured data representation
365
- - **Text**: Plain text extraction
366
- - **Document Tags**: XML-like structure tags
367
- 5. Click "Convert" to process your document
368
- 6. Navigate through pages using the navigation buttons for multi-page documents
369
- 7. Download the converted content in your selected format
370
-
371
- ## Configuration & Error Handling
372
-
373
- ### 🔧 **Automatic Configuration:**
374
- The application includes intelligent configuration management that:
375
- - Validates API keys and reports availability at startup
376
- - Checks for required dependencies and installs them automatically
377
- - Provides helpful warnings for missing optional components
378
- - Reports which parsers are available based on current configuration
379
-
380
- ### 🛡️ **Enhanced Error Handling:**
381
- - **User-Friendly Messages**: Clear error descriptions instead of technical stack traces
382
- - **File Validation**: Automatic checking of file size and format compatibility
383
- - **Parser Availability**: Real-time detection of which parsers can be used
384
- - **Graceful Degradation**: Application continues working even if some parsers are unavailable
385
-
386
- ## Troubleshooting
387
-
388
- ### OCR Issues
389
- - Ensure Tesseract is properly installed and in your system PATH
390
- - Check the TESSDATA_PREFIX environment variable is set correctly
391
- - Verify language files are available in the tessdata directory
392
-
393
- ### Gemini Flash Parser Issues
394
- - Confirm your API key is set correctly as an environment variable
395
- - Check for API usage limits or restrictions
396
- - Verify the document format is supported by the Gemini API
397
-
398
- ### GOT-OCR Parser Issues
399
- - Ensure you have a CUDA-capable GPU with sufficient memory
400
- - Verify that all required dependencies are installed correctly
401
- - Remember that GOT-OCR only supports JPG and PNG image formats
402
- - If you encounter CUDA out-of-memory errors, try using a smaller image
403
- - In Hugging Face Spaces with Stateless GPU, ensure the `spaces` module is imported before any CUDA initialization
404
- - If you see errors about "CUDA must not be initialized in the main process", verify the import order in your app.py
405
- - If you encounter "cannot pickle '_thread.lock' object" errors, this indicates thread locks are being passed to the GPU function
406
- - The GOT-OCR parser has been optimized for ZeroGPU in Stateless GPU environments with proper serialization handling
407
- - For local development, the parser will fall back to CPU processing if GPU is not available
408
-
409
- ### General Issues
410
- - Check the console logs for error messages
411
- - Ensure all dependencies are installed correctly
412
- - For large documents, try processing fewer pages at a time
413
 
414
  ## Development Guide
415
 
@@ -450,7 +316,7 @@ markit_v2/
450
 │ │ ├── docling_parser.py # 🆕 Docling parser with advanced PDF understanding
451
 │ │ ├── got_ocr_parser.py # GOT-OCR parser for images
452
 │ │ ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
453
- │ │ └── gemini_flash_parser.py # Gemini Flash parser
454
 │ ├── rag/ # 🆕 RAG (Retrieval-Augmented Generation) system
455
 │ │ ├── __init__.py # Package initialization
456
 │ │ ├── embeddings.py # OpenAI embedding model management
 
20
 
21
  ### Document Conversion
22
  - Convert PDFs, Office documents, images, and more to Markdown
23
+ - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
24
  - Multiple parser options:
25
  - MarkItDown: For comprehensive document conversion
26
  - Docling: For advanced PDF understanding with table structure recognition
27
  - GOT-OCR: For image-based OCR with LaTeX support
28
+ - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
29
+ - **🆕 Intelligent Processing Types**:
30
+ - **Combined**: Merge documents into unified content with duplicate removal
31
+ - **Individual**: Separate sections per document with clear organization
32
+ - **Summary**: Executive overview + detailed analysis of all documents
33
+ - **Comparison**: Cross-document analysis with similarities/differences tables
34
  - Download converted documents as Markdown files
35
 
36
 ### 🤖 RAG Chat with Documents
 
46
 
47
  ### User Interface
48
  - **Dual-tab interface**: Document Converter + Chat
49
+ - **🆕 Unified File Input**: Single interface handles both single and multiple file uploads
50
+ - **🆕 Dynamic Processing Options**: Multi-document processing type selector appears automatically
51
+ - **🆕 Real-time Validation**: Live feedback on file count, size limits, and processing mode
52
  - **Real-time status monitoring** for RAG system with environment detection
53
  - **Auto-ingestion** of converted documents into chat system
54
  - **Enhanced status display**: Shows vector store document count, chat history files, and environment type
55
  - **Data management controls**: Clear All Data button with comprehensive feedback
56
 - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
57
+ - **🆕 Smart Output Naming**: Batch processing creates descriptive filenames (e.g., "Combined_3_Documents_20240125.md")
58
  - Clean, responsive UI with modern styling
59
 
60
+ ## Supported Libraries
61
 
62
+ **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
63
 
64
+ **Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis.
 
65
 
66
+ **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
67
 
68
+ ## 🚀 Multi-Document Processing
69
+
70
+ ### **What makes this special?**
71
+ Markit v2 introduces **industry-leading multi-document processing** powered by Google's Gemini Flash 2.5, enabling intelligent analysis across multiple documents simultaneously.
72
+
73
+ ### **Key Capabilities:**
74
+ - **📊 Cross-Document Analysis**: Compare and contrast information across different files
75
+ - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
76
+ - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
77
+ - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
78
+ - **⚡ Single API Call Processing**: Efficient batch processing using Gemini's native multi-document support (see the sketch at the end of this section)
79
+
80
+ ### **Processing Types Explained:**
81
+
82
+ #### 🔗 **Combined Processing**
83
+ - **Purpose**: Create one unified, cohesive document from multiple sources
84
+ - **Best for**: Related documents that should be read as one complete resource
85
+ - **Intelligence**: Removes redundant information while preserving all critical content
86
+ - **Example**: Merge project proposal + budget + timeline into one comprehensive document
87
+
88
+ #### 📑 **Individual Processing**
89
+ - **Purpose**: Convert each document separately but organize them in one output
90
+ - **Best for**: Different documents you want in one place for easy reference
91
+ - **Intelligence**: Maintains original structure while creating clear organization
92
+ - **Example**: Meeting agenda + presentation + notes → organized sections
93
+
94
+ #### 📈 **Summary Processing**
95
+ - **Purpose**: Executive overview + detailed analysis
96
+ - **Best for**: Complex document sets needing high-level insights
97
+ - **Intelligence**: Cross-document pattern recognition and key insight extraction
98
+ - **Example**: Research papers → executive summary + detailed analysis of each paper
99
+
100
+ #### ⚖️ **Comparison Processing**
101
+ - **Purpose**: Analyze differences, similarities, and relationships
102
+ - **Best for**: Multiple proposals, document versions, or conflicting sources
103
+ - **Intelligence**: Creates comparison tables and identifies discrepancies/alignments
104
+ - **Example**: Contract versions → side-by-side analysis with change identification
105
+
106
+ ### **Technical Advantages:**
107
+ - **Native Multimodal Support**: Processes text + images in same workflow
108
+ - **Advanced Reasoning**: Understands context and relationships between documents
109
+ - **Efficient Processing**: Single Gemini API call vs. multiple individual calls
110
+ - **Format Agnostic**: Works across all supported file types seamlessly
111
 
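+ A minimal sketch of the single-call pattern described above, assuming the `google-genai` client and a `GOOGLE_API_KEY` in the environment (file paths and the model name are illustrative):
+
+ ```python
+ from pathlib import Path
+ from google import genai
+
+ client = genai.Client()  # picks up GOOGLE_API_KEY automatically
+
+ # One request carries the instruction plus every document as a binary part
+ parts = [
+     genai.types.Part.from_bytes(data=Path(p).read_bytes(), mime_type="application/pdf")
+     for p in ["proposal.pdf", "budget.pdf"]
+ ]
+ response = client.models.generate_content(
+     model="gemini-2.0-flash",  # placeholder; the app reads the model name from its config
+     contents=["Merge these documents into one markdown file."] + parts,
+ )
+ print(response.text)
+ ```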
112
  ## Environment Variables
113
 
 
121
 ### ⚙️ **Configuration Options:**
122
  - `DEBUG`: Set to `true` for debug mode with verbose logging
123
  - `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
124
+ - `MAX_BATCH_FILES`: Maximum files for multi-document processing (default: 5)
125
+ - `MAX_BATCH_SIZE`: Maximum combined size for batch processing (default: 20MB)
126
  - `TEMP_DIR`: Directory for temporary files (default: ./temp)
127
  - `TESSERACT_PATH`: Custom path to Tesseract executable
128
  - `TESSDATA_PATH`: Path to Tesseract language data
 
160
  ## Usage
161
 
162
  ### Document Conversion
163
+
164
+ #### 📄 **Single Document Processing**
165
  1. Go to the **"Document Converter"** tab
166
+ 2. Upload a single file
167
  3. Choose your preferred parser:
168
  - **"MarkItDown"** for comprehensive document conversion
169
  - **"Docling"** for advanced PDF understanding and table extraction
170
+ - **"Gemini Flash"** for AI-powered text extraction
171
  4. Select an OCR method based on your chosen parser
172
  5. Click "Convert"
173
  6. View the Markdown output and download the converted file
174
+
175
+ #### 📂 **Multi-Document Processing** (NEW!)
176
+ 1. Go to the **"Document Converter"** tab
177
+ 2. Upload **2-5 files** (up to 20MB combined)
178
+ 3. **Processing type selector appears automatically**
179
+ 4. Choose your processing type:
180
+ - **Combined**: Merge all documents into unified content with smart duplicate removal
181
+ - **Individual**: Keep documents separate with clear section headers
182
+ - **Summary**: Executive overview + detailed analysis of each document
183
+ - **Comparison**: Side-by-side analysis with similarities/differences tables
184
+ 5. Choose your preferred parser (recommend **Gemini Flash** for best multi-document results)
185
+ 6. Click "Convert"
186
+ 7. Review the cross-document analysis and download the enhanced output
187
+
188
+ #### 💡 **Multi-Document Tips**
189
+ - **Mixed file types work great**: Upload PDF + images, Word docs + PDFs, etc.
190
+ - **Gemini Flash excels at**: Cross-document reasoning, duplicate detection, and format analysis
191
+ - **Perfect for**: Comparing document versions, analyzing related reports, consolidating research
192
+ - **Real-time validation**: UI shows file count, size limits, and processing mode
193
+
194
+ #### 🤖 **RAG Integration**
195
+ - **All converted documents are automatically added to the RAG system** for chat functionality
196
+ - Multi-document processing creates richer context for chat interactions
197
 
198
 ### 🤖 Chat with Documents
199
  1. Go to the **"Chat with Documents"** tab
 
232
  # For local development (faster startup)
233
  python run_app.py
234
 
235
+ # For testing with clean data
236
  python run_app.py --clear-data-and-run
237
 
238
+ # Show all available options
239
+ python run_app.py --help
240
  ```
241
 
242
+ ### 🧹 **Data Management:**
 
243
 
244
+ **Two ways to clear data:**
 
 
245
 
246
+ 1. **Command-line** (for development):
247
+ - `python run_app.py --clear-data-and-run` - Clear data then start app
248
+ - `python run_app.py --clear-data` - Clear data and exit
249
 
250
+ 2. **In-app UI** (for users):
251
+ - Go to "Chat with Documents" tab → Click "🗑️ Clear All Data" button
252
+ - Automatically detects environment (local vs HF Space)
253
+ - Provides detailed feedback and starts new session
254
 
255
  **What gets cleared:**
256
+ - `data/chat_history/*` - All saved chat sessions
257
  - `data/vector_store/*` - All document embeddings and vector database
258
 
259
 ### 🧪 **Development Features:**
260
  - **Automatic Environment Setup**: Dependencies are checked and installed automatically
261
  - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
 
268
  - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
269
  - [Gradio](https://gradio.app/) for the UI framework
270
 
271
+ ---
 
 
272
 
273
  **Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
274
 
275
+ **Project Links:**
276
+ - [GitHub Repository](https://github.com/ansemin/Markit_v2)
277
+ - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
278
 
279
 
280
  ## Development Guide
281
 
 
316
 │ │ ├── docling_parser.py # 🆕 Docling parser with advanced PDF understanding
317
 │ │ ├── got_ocr_parser.py # GOT-OCR parser for images
318
 │ │ ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
319
+ │ │ └── gemini_flash_parser.py # 🆕 Enhanced Gemini Flash parser with multi-document processing
320
 │ ├── rag/ # 🆕 RAG (Retrieval-Augmented Generation) system
321
 │ │ ├── __init__.py # Package initialization
322
 │ │ ├── embeddings.py # OpenAI embedding model management
src/core/config.py CHANGED
@@ -139,11 +139,20 @@ class AppConfig:
139
  allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx", ".docx", ".pptx", ".html", ".xhtml", ".md", ".csv")
140
  temp_dir: str = "./temp"
141
 
142
  def __post_init__(self):
143
  """Load application configuration from environment variables."""
144
  self.debug = os.getenv("DEBUG", "false").lower() == "true"
145
  self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
146
  self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
147
 
148
 
149
  class Config:
 
139
  allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx", ".docx", ".pptx", ".html", ".xhtml", ".md", ".csv")
140
  temp_dir: str = "./temp"
141
 
142
+ # Multi-document batch processing settings
143
+ max_batch_files: int = 5
144
+ max_batch_size: int = 20 * 1024 * 1024 # 20MB combined
145
+ batch_processing_types: tuple = ("combined", "individual", "summary", "comparison")
146
+
147
  def __post_init__(self):
148
  """Load application configuration from environment variables."""
149
  self.debug = os.getenv("DEBUG", "false").lower() == "true"
150
  self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
151
  self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
152
+
153
+ # Load batch processing configuration
154
+ self.max_batch_files = int(os.getenv("MAX_BATCH_FILES", self.max_batch_files))
155
+ self.max_batch_size = int(os.getenv("MAX_BATCH_SIZE", self.max_batch_size))
156
 
157
 
158
  class Config:
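The new limits honor environment overrides. A quick sanity check (a sketch: it assumes the module-level `config` object exposes these settings as `config.app`, which may differ in the actual codebase):

```python
import os

# Hypothetical overrides; set them before the config module is first imported.
os.environ["MAX_BATCH_FILES"] = "3"
os.environ["MAX_BATCH_SIZE"] = str(10 * 1024 * 1024)  # 10 MB combined

from src.core.config import config

print(config.app.max_batch_files)  # expected: 3
print(config.app.max_batch_size)   # expected: 10485760
```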
src/parsers/gemini_flash_parser.py CHANGED
@@ -111,6 +111,225 @@ class GeminiFlashParser(DocumentParser):
111
  print(error_message)
112
  return f"# Error\n\n{error_message}\n\nPlease check your API key and try again."
113
 
114
  def _get_mime_type(self, file_extension: str) -> str:
115
  """Get the MIME type for a file extension."""
116
  mime_types = {
 
111
  print(error_message)
112
  return f"# Error\n\n{error_message}\n\nPlease check your API key and try again."
113
 
114
+ def parse_multiple(self, file_paths: List[Union[str, Path]], processing_type: str = "combined", original_filenames: Optional[List[str]] = None, **kwargs) -> str:
115
+ """Parse multiple documents using Gemini Flash 2.0."""
116
+ if not GEMINI_AVAILABLE:
117
+ raise ImportError(
118
+ "The Google Gemini API client is not installed. "
119
+ "Please install it with 'pip install google-genai'."
120
+ )
121
+
122
+ if not api_key:
123
+ raise ValueError(
124
+ "GOOGLE_API_KEY environment variable is not set. "
125
+ "Please set it to your Gemini API key."
126
+ )
127
+
128
+ try:
129
+ # Convert to Path objects and validate
130
+ path_objects = [Path(fp) for fp in file_paths]
131
+ self._validate_batch_files(path_objects)
132
+
133
+ # Check for cancellation
134
+ if self._check_cancellation():
135
+ return "Conversion cancelled."
136
+
137
+ # Create client
138
+ client = genai.Client(api_key=api_key)
139
+
140
+ # Create contents for API call
141
+ contents = self._create_batch_contents(path_objects, processing_type, original_filenames)
142
+
143
+ # Check for cancellation before API call
144
+ if self._check_cancellation():
145
+ return "Conversion cancelled."
146
+
147
+ # Generate the response
148
+ response = client.models.generate_content(
149
+ model=config.model.gemini_model,
150
+ contents=contents,
151
+ config={
152
+ "temperature": config.model.temperature,
153
+ "top_p": 0.95,
154
+ "top_k": 40,
155
+ "max_output_tokens": config.model.max_tokens,
156
+ }
157
+ )
158
+
159
+ # Format the output based on processing type
160
+ formatted_output = self._format_batch_output(response.text, path_objects, processing_type, original_filenames)
161
+
162
+ return formatted_output
163
+
164
+ except Exception as e:
165
+ error_message = f"Error parsing multiple documents with Gemini Flash: {str(e)}"
166
+ print(error_message)
167
+ return f"# Error\n\n{error_message}\n\nPlease check your API key and try again."
168
+
169
+ def _validate_batch_files(self, file_paths: List[Path]) -> None:
170
+ """Validate batch of files for multi-document processing."""
171
+ # Check file count limit
172
+ if len(file_paths) == 0:
173
+ raise ValueError("No files provided for processing")
174
+ if len(file_paths) > 5:
175
+ raise ValueError("Maximum 5 files allowed for batch processing")
176
+
177
+ # Check individual files and calculate total size
178
+ total_size = 0
179
+ for file_path in file_paths:
180
+ if not file_path.exists():
181
+ raise ValueError(f"File not found: {file_path}")
182
+
183
+ file_size = file_path.stat().st_size
184
+ total_size += file_size
185
+
186
+ # Check individual file size (reasonable limit per file)
187
+ if file_size > 10 * 1024 * 1024: # 10MB per file
188
+ raise ValueError(f"Individual file size exceeds 10MB: {file_path.name}")
189
+
190
+ # Check combined size limit
191
+ if total_size > 20 * 1024 * 1024: # 20MB total
192
+ raise ValueError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
193
+
194
+ # Validate file types
195
+ for file_path in file_paths:
196
+ file_extension = file_path.suffix.lower()
197
+ if self._get_mime_type(file_extension) == "application/octet-stream":
198
+ raise ValueError(f"Unsupported file type: {file_path.name}")
199
+
200
+ def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
201
+ """Create contents list for batch API call."""
202
+ # Create the prompt based on processing type
203
+ prompt = self._create_batch_prompt(file_paths, processing_type, original_filenames)
204
+
205
+ # Start with the prompt
206
+ contents = [prompt]
207
+
208
+ # Add each file as a content part
209
+ for file_path in file_paths:
210
+ file_content = file_path.read_bytes()
211
+ mime_type = self._get_mime_type(file_path.suffix.lower())
212
+
213
+ contents.append(
214
+ genai.types.Part.from_bytes(
215
+ data=file_content,
216
+ mime_type=mime_type
217
+ )
218
+ )
219
+
220
+ return contents
221
+
222
+ def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
223
+ """Create appropriate prompt for batch processing."""
224
+ # Use original filenames if provided, otherwise use temp file names
225
+ if original_filenames:
226
+ file_names = original_filenames
227
+ else:
228
+ file_names = [fp.name for fp in file_paths]
229
+ file_list = "\n".join([f"- {name}" for name in file_names])
230
+
231
+ base_prompt = f"""I will provide you with {len(file_paths)} documents to process:
232
+ {file_list}
233
+
234
+ """
235
+
236
+ if processing_type == "combined":
237
+ return base_prompt + """Please convert all documents to a single, cohesive markdown document.
238
+ Merge the content logically, remove duplicate information, and create a unified structure with clear headings.
239
+ Preserve important formatting, tables, lists, and structure from all documents.
240
+ For images, include brief descriptions in markdown image syntax.
241
+ Return only the combined markdown content, no other text."""
242
+
243
+ elif processing_type == "individual":
244
+ return base_prompt + """Please convert each document to markdown format and present them as separate sections.
245
+ For each document, create a clear section header with the document name.
246
+ Preserve the structure, headings, lists, tables, and formatting within each section.
247
+ For images, include brief descriptions in markdown image syntax.
248
+ Return the content in this format:
249
+
250
+ # Document 1: [filename]
251
+ [converted content]
252
+
253
+ # Document 2: [filename]
254
+ [converted content]
255
+
256
+ Return only the markdown content, no other text."""
257
+
258
+ elif processing_type == "summary":
259
+ return base_prompt + """Please create a comprehensive analysis with two parts:
260
+
261
+ 1. EXECUTIVE SUMMARY: A concise overview summarizing the key points from all documents
262
+ 2. DETAILED SECTIONS: Individual converted sections for each document
263
+
264
+ Structure the output as:
265
+
266
+ # Executive Summary
267
+ [Brief summary of key findings and themes across all documents]
268
+
269
+ # Detailed Analysis
270
+
271
+ ## Document 1: [filename]
272
+ [converted content]
273
+
274
+ ## Document 2: [filename]
275
+ [converted content]
276
+
277
+ Preserve formatting, tables, lists, and structure throughout.
278
+ For images, include brief descriptions in markdown image syntax.
279
+ Return only the markdown content, no other text."""
280
+
281
+ elif processing_type == "comparison":
282
+ return base_prompt + """Please create a comparative analysis of these documents:
283
+
284
+ 1. Create a comparison table highlighting key differences and similarities
285
+ 2. Provide individual document summaries
286
+ 3. Include a section on cross-document insights
287
+
288
+ Structure the output as:
289
+
290
+ # Document Comparison Analysis
291
+
292
+ ## Comparison Table
293
+ | Aspect | Document 1 | Document 2 | Document 3 | ... |
294
+ |--------|------------|------------|------------|-----|
295
+ | [Key aspects found across documents] | | | | |
296
+
297
+ ## Individual Document Summaries
298
+
299
+ ### Document 1: [filename]
300
+ [Key points and content summary]
301
+
302
+ ### Document 2: [filename]
303
+ [Key points and content summary]
304
+
305
+ ## Cross-Document Insights
306
+ [Analysis of patterns, contradictions, or complementary information across documents]
307
+
308
+ Preserve important formatting and structure.
309
+ For images, include brief descriptions in markdown image syntax.
310
+ Return only the markdown content, no other text."""
311
+
312
+ else:
313
+ # Fallback to combined
314
+ return self._create_batch_prompt(file_paths, "combined", original_filenames)
315
+
316
+ def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
317
+ """Format the batch processing output."""
318
+ # Add metadata header using original filenames if provided
319
+ if original_filenames:
320
+ file_names = original_filenames
321
+ else:
322
+ file_names = [fp.name for fp in file_paths]
323
+
324
+ header = f"""<!-- Multi-Document Processing Results -->
325
+ <!-- Processing Type: {processing_type} -->
326
+ <!-- Files Processed: {len(file_paths)} -->
327
+ <!-- File Names: {', '.join(file_names)} -->
328
+
329
+ """
330
+
331
+ return header + response_text
332
+
333
  def _get_mime_type(self, file_extension: str) -> str:
334
  """Get the MIME type for a file extension."""
335
  mime_types = {
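A minimal sketch of calling the new batch entry point directly (paths and filenames are illustrative; assumes `GOOGLE_API_KEY` is set and `google-genai` is installed):

```python
from src.parsers.gemini_flash_parser import GeminiFlashParser

parser = GeminiFlashParser()
markdown = parser.parse_multiple(
    ["/tmp/doc_1.pdf", "/tmp/doc_2.pdf"],                    # files sent to the API
    processing_type="comparison",                            # combined | individual | summary | comparison
    original_filenames=["Q1 Report.pdf", "Q2 Report.pdf"],   # shown in the prompt and metadata header
)
# The output opens with the metadata comment block added by _format_batch_output
print(markdown.splitlines()[0])  # "<!-- Multi-Document Processing Results -->"
```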
src/services/document_service.py CHANGED
@@ -7,7 +7,7 @@ import time
7
  import os
8
  import threading
9
  from pathlib import Path
10
- from typing import Optional, Tuple, Any
11
 
12
  from src.core.config import config
13
  from src.core.exceptions import (
@@ -269,4 +269,248 @@ class DocumentService:
269
  if self._check_cancellation() and output_path:
270
  self._safe_delete_file(output_path)
271
 
272
- self._conversion_in_progress = False
7
  import os
8
  import threading
9
  from pathlib import Path
10
+ from typing import Optional, Tuple, Any, List
11
 
12
  from src.core.config import config
13
  from src.core.exceptions import (
 
269
  if self._check_cancellation() and output_path:
270
  self._safe_delete_file(output_path)
271
 
272
+ self._conversion_in_progress = False
273
+
274
+ def convert_documents(
275
+ self,
276
+ file_paths: List[str],
277
+ parser_name: str,
278
+ ocr_method_name: str,
279
+ output_format: str,
280
+ processing_type: str = "combined"
281
+ ) -> Tuple[str, Optional[str]]:
282
+ """
283
+ Unified method to convert single or multiple documents.
284
+
285
+ Args:
286
+ file_paths: List of paths to input files (can be single file)
287
+ parser_name: Name of the parser to use
288
+ ocr_method_name: Name of the OCR method to use
289
+ output_format: Output format (Markdown, JSON, Text, Document Tags)
290
+ processing_type: Type of multi-document processing (combined, individual, summary, comparison)
291
+
292
+ Returns:
293
+ Tuple of (content, output_file_path)
294
+
295
+ Raises:
296
+ DocumentProcessingError: For general processing errors
297
+ FileSizeLimitError: When file(s) are too large
298
+ UnsupportedFileTypeError: For unsupported file types
299
+ ConversionError: When conversion fails or is cancelled
300
+ """
301
+ if not file_paths:
302
+ raise DocumentProcessingError("No files provided")
303
+
304
+ # Route to appropriate processing method
305
+ if len(file_paths) == 1:
306
+ # Single file processing - use existing method
307
+ return self.convert_document(
308
+ file_paths[0], parser_name, ocr_method_name, output_format
309
+ )
310
+ else:
311
+ # Multi-file processing - use new batch method
312
+ return self._convert_multiple_documents(
313
+ file_paths, parser_name, ocr_method_name, output_format, processing_type
314
+ )
315
+
316
+ def _convert_multiple_documents(
317
+ self,
318
+ file_paths: List[str],
319
+ parser_name: str,
320
+ ocr_method_name: str,
321
+ output_format: str,
322
+ processing_type: str
323
+ ) -> Tuple[str, Optional[str]]:
324
+ """
325
+ Convert multiple documents using batch processing.
326
+
327
+ Args:
328
+ file_paths: List of paths to input files
329
+ parser_name: Name of the parser to use
330
+ ocr_method_name: Name of the OCR method to use
331
+ output_format: Output format (Markdown, JSON, Text, Document Tags)
332
+ processing_type: Type of multi-document processing
333
+
334
+ Returns:
335
+ Tuple of (content, output_file_path)
336
+ """
337
+ self._conversion_in_progress = True
338
+ temp_inputs = []
339
+ output_path = None
340
+ self._original_file_paths = file_paths # Store original paths for filename reference
341
+
342
+ try:
343
+ # Validate all files first
344
+ for file_path in file_paths:
345
+ self._validate_file(file_path)
346
+
347
+ if self._check_cancellation():
348
+ raise ConversionError("Conversion cancelled")
349
+
350
+ # Create temporary files with English names
351
+ for original_path in file_paths:
352
+ temp_path = self._create_temp_file(original_path)
353
+ temp_inputs.append(temp_path)
354
+
355
+ if self._check_cancellation():
356
+ raise ConversionError("Conversion cancelled during file preparation")
357
+
358
+ # Process documents using parser factory with multi-document support
359
+ start_time = time.time()
360
+ content = self._process_multiple_with_parser(
361
+ temp_inputs, parser_name, ocr_method_name, output_format, processing_type
362
+ )
363
+
364
+ if content == "Conversion cancelled.":
365
+ raise ConversionError("Conversion cancelled by parser")
366
+
367
+ duration = time.time() - start_time
368
+ logging.info(f"Multiple documents processed in {duration:.2f} seconds")
369
+
370
+ if self._check_cancellation():
371
+ raise ConversionError("Conversion cancelled")
372
+
373
+ # Create output file with batch naming
374
+ output_path = self._create_batch_output_file(
375
+ content, output_format, file_paths, processing_type
376
+ )
377
+
378
+ return content, output_path
379
+
380
+ except (DocumentProcessingError, FileSizeLimitError, UnsupportedFileTypeError, ConversionError):
381
+ # Re-raise our custom exceptions
382
+ for temp_path in temp_inputs:
383
+ self._safe_delete_file(temp_path)
384
+ self._safe_delete_file(output_path)
385
+ raise
386
+ except Exception as e:
387
+ # Wrap unexpected exceptions
388
+ for temp_path in temp_inputs:
389
+ self._safe_delete_file(temp_path)
390
+ self._safe_delete_file(output_path)
391
+ raise DocumentProcessingError(f"Unexpected error during batch conversion: {str(e)}")
392
+ finally:
393
+ # Clean up temp input files
394
+ for temp_path in temp_inputs:
395
+ self._safe_delete_file(temp_path)
396
+
397
+ # Clean up output file if cancelled
398
+ if self._check_cancellation() and output_path:
399
+ self._safe_delete_file(output_path)
400
+
401
+ self._conversion_in_progress = False
402
+
403
+ def _process_multiple_with_parser(
404
+ self,
405
+ temp_file_paths: List[str],
406
+ parser_name: str,
407
+ ocr_method_name: str,
408
+ output_format: str,
409
+ processing_type: str
410
+ ) -> str:
411
+ """Process multiple documents using the parser factory."""
412
+ try:
413
+ # Get parser instance
414
+ from src.parsers.parser_registry import ParserRegistry
415
+ parser_class = ParserRegistry.get_parser_class(parser_name)
416
+
417
+ if not parser_class:
418
+ raise DocumentProcessingError(f"Parser '{parser_name}' not found")
419
+
420
+ parser_instance = parser_class()
421
+ parser_instance.set_cancellation_flag(self._cancellation_flag)
422
+
423
+ # Check if parser supports multi-document processing
424
+ if hasattr(parser_instance, 'parse_multiple'):
425
+ # Use multi-document parsing with original filenames for reference
426
+ return parser_instance.parse_multiple(
427
+ file_paths=temp_file_paths,
428
+ processing_type=processing_type,
429
+ ocr_method=ocr_method_name,
430
+ output_format=output_format.lower(),
431
+ original_filenames=[Path(fp).name for fp in self._original_file_paths]
432
+ )
433
+ else:
434
+ # Fallback: process individually and combine
435
+ results = []
436
+ for i, file_path in enumerate(temp_file_paths):
437
+ if self._check_cancellation():
438
+ return "Conversion cancelled."
439
+
440
+ result = parser_instance.parse(
441
+ file_path=file_path,
442
+ ocr_method=ocr_method_name
443
+ )
444
+
445
+ # Add section header for individual results using original filename
446
+ original_filename = Path(self._original_file_paths[i]).name
447
+ results.append(f"# Document {i+1}: {original_filename}\n\n{result}")
448
+
449
+ return "\n\n---\n\n".join(results)
450
+
451
+ except Exception as e:
452
+ raise DocumentProcessingError(f"Error processing multiple documents: {str(e)}")
453
+
454
+ def _create_batch_output_file(
455
+ self,
456
+ content: str,
457
+ output_format: str,
458
+ original_file_paths: List[str],
459
+ processing_type: str
460
+ ) -> str:
461
+ """Create output file for batch processing with descriptive naming."""
462
+ # Determine file extension
463
+ format_extensions = {
464
+ "markdown": ".md",
465
+ "json": ".json",
466
+ "text": ".txt",
467
+ "document tags": ".doctags"
468
+ }
469
+ ext = format_extensions.get(output_format.lower(), ".txt")
470
+
471
+ if self._check_cancellation():
472
+ raise ConversionError("Conversion cancelled before output file creation")
473
+
474
+ # Create descriptive filename for batch processing
475
+ file_count = len(original_file_paths)
476
+ timestamp = time.strftime("%Y%m%d_%H%M%S")
477
+
478
+ if processing_type == "combined":
479
+ filename = f"Combined_{file_count}_Documents_{timestamp}{ext}"
480
+ elif processing_type == "individual":
481
+ filename = f"Individual_Sections_{file_count}_Files_{timestamp}{ext}"
482
+ elif processing_type == "summary":
483
+ filename = f"Summary_Analysis_{file_count}_Files_{timestamp}{ext}"
484
+ elif processing_type == "comparison":
485
+ filename = f"Comparison_Analysis_{file_count}_Files_{timestamp}{ext}"
486
+ else:
487
+ filename = f"Batch_Processing_{file_count}_Files_{timestamp}{ext}"
488
+
489
+ # Create output file in temp directory
490
+ temp_dir = tempfile.gettempdir()
491
+ tmp_path = os.path.join(temp_dir, filename)
492
+
493
+ # Handle filename conflicts
494
+ counter = 1
495
+ base_path = tmp_path
496
+ while os.path.exists(tmp_path):
497
+ name_part = filename.replace(ext, f"_{counter}{ext}")
498
+ tmp_path = os.path.join(temp_dir, name_part)
499
+ counter += 1
500
+
501
+ # Write content to file
502
+ try:
503
+ with open(tmp_path, "w", encoding="utf-8") as f:
504
+ # Write in chunks with cancellation checks
505
+ chunk_size = 10000 # characters
506
+ for i in range(0, len(content), chunk_size):
507
+ if self._check_cancellation():
508
+ self._safe_delete_file(tmp_path)
509
+ raise ConversionError("Conversion cancelled during output file writing")
510
+
511
+ f.write(content[i:i+chunk_size])
512
+ except Exception as e:
513
+ self._safe_delete_file(tmp_path)
514
+ raise ConversionError(f"Failed to write batch output file: {str(e)}")
515
+
516
+ return tmp_path
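A sketch of the unified entry point: a one-element list routes to the existing convert_document(), while two or more files take the new batch path (parser and option names are illustrative):

```python
from src.services.document_service import DocumentService

service = DocumentService()
content, output_path = service.convert_documents(
    file_paths=["notes.docx", "slides.pptx"],
    parser_name="Gemini Flash",      # assumed registry name
    ocr_method_name="None",
    output_format="Markdown",
    processing_type="summary",
)
print(output_path)  # e.g. <temp dir>/Summary_Analysis_2_Files_<timestamp>.md
```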
src/ui/ui.py CHANGED
@@ -45,6 +45,51 @@ def monitor_cancellation():
45
  time.sleep(0.1) # Check every 100ms
46
  logger.info("Cancellation monitor thread ending")
47
 
48
  def validate_file_for_parser(file_path, parser_name):
49
  """Validate if the file type is supported by the selected parser."""
50
  if not file_path:
@@ -112,8 +157,48 @@ def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format
112
 
113
  return thread, results
114
 
115
- def handle_convert(file_path, parser_name, ocr_method_name, output_format, is_cancelled):
116
- """Handle file conversion."""
117
  global conversion_cancelled
118
 
119
  # Check if we should cancel before starting
@@ -121,16 +206,31 @@ def handle_convert(file_path, parser_name, ocr_method_name, output_format, is_ca
121
  logger.info("Conversion cancelled before starting")
122
  return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
123
 
124
- # Validate file type for the selected parser
125
- is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
126
- if not is_valid:
127
- logger.error(f"File validation error: {error_msg}")
128
  return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
129
 
130
- logger.info("Starting conversion with cancellation flag cleared")
131
 
132
  # Start the conversion in a separate thread
133
- thread, results = run_conversion_thread(file_path, parser_name, ocr_method_name, output_format)
134
 
135
  # Start the monitoring thread
136
  monitor_thread = threading.Thread(target=monitor_cancellation)
@@ -763,8 +863,27 @@ def create_ui():
763
  # State to store the output format (fixed to Markdown)
764
  output_format_state = gr.State("Markdown")
765
 
766
- # File input first
767
- file_input = gr.File(label="Upload Document", type="filepath")
768
 
769
  # Provider and OCR options below the file input
770
  with gr.Row(elem_classes=["provider-options-row"]):
@@ -805,6 +924,14 @@ def create_ui():
805
  cancel_button = gr.Button("Cancel", variant="stop", visible=False)
806
 
807
  # Event handlers for document converter
808
  provider_dropdown.change(
809
  lambda p: gr.Dropdown(
810
  choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
@@ -845,7 +972,7 @@ def create_ui():
845
  queue=False # Execute immediately
846
  ).then(
847
  fn=handle_convert,
848
- inputs=[file_input, provider_dropdown, ocr_dropdown, output_format_state, cancel_requested],
849
  outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
850
  )
851
 
 
45
  time.sleep(0.1) # Check every 100ms
46
  logger.info("Cancellation monitor thread ending")
47
 
48
+ def update_ui_for_file_count(files):
49
+ """Update UI components based on the number of files uploaded."""
50
+ if not files or len(files) == 0:
51
+ return (
52
+ gr.update(visible=False), # processing_type_selector
53
+ "<div style='color: #666; font-style: italic;'>Upload documents to begin</div>" # file_status_text
54
+ )
55
+
56
+ if len(files) == 1:
57
+ file_name = files[0].name if hasattr(files[0], 'name') else str(files[0])
58
+ return (
59
+ gr.update(visible=False), # processing_type_selector (hidden for single file)
60
+ f"<div style='color: #2563eb; font-weight: 500;'>πŸ“„ Single document: {file_name}</div>"
61
+ )
62
+ else:
63
+ # Calculate total size for validation display
64
+ total_size = 0
65
+ try:
66
+ for file in files:
67
+ if hasattr(file, 'size'):
68
+ total_size += file.size
69
+ elif hasattr(file, 'name'):
70
+ # For file paths, get size from filesystem
71
+ total_size += Path(file.name).stat().st_size
72
+ except Exception:
73
+ pass # Size calculation is optional for display
74
+
75
+ size_display = f" ({total_size / (1024*1024):.1f}MB)" if total_size > 0 else ""
76
+
77
+ # Check if within limits
78
+ if len(files) > 5:
79
+ status_color = "#dc2626" # red
80
+ status_text = f"⚠️ Too many files: {len(files)}/5 (max 5 files allowed)"
81
+ elif total_size > 20 * 1024 * 1024: # 20MB
82
+ status_color = "#dc2626" # red
83
+ status_text = f"⚠️ Files too large{size_display} (max 20MB combined)"
84
+ else:
85
+ status_color = "#059669" # green
86
+ status_text = f"πŸ“‚ Batch mode: {len(files)} files{size_display}"
87
+
88
+ return (
89
+ gr.update(visible=True), # processing_type_selector (visible for multiple files)
90
+ f"<div style='color: {status_color}; font-weight: 500;'>{status_text}</div>"
91
+ )
92
+
93
  def validate_file_for_parser(file_path, parser_name):
94
  """Validate if the file type is supported by the selected parser."""
95
  if not file_path:
 
157
 
158
  return thread, results
159
 
160
+ def run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type):
161
+ """Run the conversion in a separate thread for multiple files."""
162
+ import threading
163
+ from src.services.document_service import DocumentService
164
+
165
+ # Results will be shared between threads
166
+ results = {"content": None, "download_file": None, "error": None}
167
+
168
+ def conversion_worker():
169
+ try:
170
+ logger.info(f"Starting multi-file conversion thread for {len(file_paths)} files")
171
+
172
+ # Use the new document service unified method
173
+ document_service = DocumentService()
174
+ document_service.set_cancellation_flag(conversion_cancelled)
175
+
176
+ # Call the unified convert_documents method
177
+ content, output_file = document_service.convert_documents(
178
+ file_paths=file_paths,
179
+ parser_name=parser_name,
180
+ ocr_method_name=ocr_method_name,
181
+ output_format=output_format,
182
+ processing_type=processing_type
183
+ )
184
+
185
+ logger.info(f"Multi-file conversion completed successfully for {len(file_paths)} files")
186
+ results["content"] = content
187
+ results["download_file"] = output_file
188
+
189
+ except Exception as e:
190
+ logger.error(f"Error during multi-file conversion: {str(e)}")
191
+ results["error"] = str(e)
192
+
193
+ # Create and start the thread
194
+ thread = threading.Thread(target=conversion_worker)
195
+ thread.daemon = True
196
+ thread.start()
197
+
198
+ return thread, results
199
+
200
+ def handle_convert(files, parser_name, ocr_method_name, output_format, processing_type, is_cancelled):
201
+ """Handle file conversion for single or multiple files."""
202
  global conversion_cancelled
203
 
204
  # Check if we should cancel before starting
 
206
  logger.info("Conversion cancelled before starting")
207
  return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
208
 
209
+ # Validate files input
210
+ if not files or len(files) == 0:
211
+ error_msg = "No files uploaded. Please upload at least one document."
212
+ logger.error(error_msg)
213
  return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
214
 
215
+ # Convert Gradio file objects to file paths
216
+ file_paths = []
217
+ for file in files:
218
+ if hasattr(file, 'name'):
219
+ file_paths.append(file.name)
220
+ else:
221
+ file_paths.append(str(file))
222
+
223
+ # Validate file types for the selected parser
224
+ for file_path in file_paths:
225
+ is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
226
+ if not is_valid:
227
+ logger.error(f"File validation error: {error_msg}")
228
+ return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
229
+
230
+ logger.info(f"Starting conversion of {len(file_paths)} file(s) with cancellation flag cleared")
231
 
232
  # Start the conversion in a separate thread
233
+ thread, results = run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type)
234
 
235
  # Start the monitoring thread
236
  monitor_thread = threading.Thread(target=monitor_cancellation)
 
863
  # State to store the output format (fixed to Markdown)
864
  output_format_state = gr.State("Markdown")
865
 
866
+ # Multi-file input (supports single and multiple files)
867
+ files_input = gr.Files(
868
+ label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
869
+ file_count="multiple",
870
+ file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
871
+ )
872
+
873
+ # Processing type selector (visible only for multiple files)
874
+ processing_type_selector = gr.Radio(
875
+ choices=["combined", "individual", "summary", "comparison"],
876
+ value="combined",
877
+ label="Multi-Document Processing Type",
878
+ info="How to process multiple documents together",
879
+ visible=False
880
+ )
881
+
882
+ # Status text to show file count and processing mode
883
+ file_status_text = gr.HTML(
884
+ value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
885
+ label=""
886
+ )
887
 
888
  # Provider and OCR options below the file input
889
  with gr.Row(elem_classes=["provider-options-row"]):
 
924
  cancel_button = gr.Button("Cancel", variant="stop", visible=False)
925
 
926
  # Event handlers for document converter
927
+
928
+ # Update UI when files are uploaded/changed
929
+ files_input.change(
930
+ fn=update_ui_for_file_count,
931
+ inputs=[files_input],
932
+ outputs=[processing_type_selector, file_status_text]
933
+ )
934
+
935
  provider_dropdown.change(
936
  lambda p: gr.Dropdown(
937
  choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
 
972
  queue=False # Execute immediately
973
  ).then(
974
  fn=handle_convert,
975
+ inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state, processing_type_selector, cancel_requested],
976
  outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
977
  )
978