AnseMin committed on
Commit 6ea41ec · 1 Parent(s): 2a9686e

Refactor UI components for modular architecture and enhance functionality

- Introduced a modular UI structure with dedicated components for Document Converter, Chat Interface, and Query Ranker.
- Updated README to reflect the new modular UI architecture and its components.
- Implemented content formatting utilities for Markdown and LaTeX rendering.
- Enhanced file validation and threading utilities for improved user experience.
- Added comprehensive styles for a cohesive UI design across components.
- Established a test suite for the new UI components to ensure functionality and reliability.
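
The file validation mentioned above is called from `document_converter.py` as `is_valid, error_msg = validate_file_for_parser(file_path, parser_name)`. A minimal sketch of a validator with that return shape might look like the following. The extension table `ASSUMED_PARSER_EXTENSIONS` and the error wording are assumptions made for illustration, not the contents of `file_validation.py`.

```python
from pathlib import Path

# Hypothetical mapping from parser name to accepted extensions. Assumption:
# the project keeps its own table, which may differ from this one.
ASSUMED_PARSER_EXTENSIONS = {
    "GOT-OCR": {".png", ".jpg", ".jpeg"},
    "MarkItDown": {".pdf", ".docx", ".pptx", ".xlsx", ".txt", ".md", ".html"},
}


def validate_file_for_parser(file_path, parser_name):
    """Return (is_valid, error_msg); error_msg is empty when the file is accepted."""
    allowed = ASSUMED_PARSER_EXTENSIONS.get(parser_name)
    if allowed is None:
        # Unknown parser: accept and let the parser report its own errors.
        return True, ""

    suffix = Path(file_path).suffix.lower()
    if suffix in allowed:
        return True, ""

    shown = suffix or "no extension"
    expected = ", ".join(sorted(allowed))
    return False, f"{parser_name} does not accept '{shown}' files; expected one of: {expected}"


if __name__ == "__main__":
    print(validate_file_for_parser("scan.PNG", "GOT-OCR"))
    print(validate_file_for_parser("report.docx", "GOT-OCR"))
```

Returning a tuple rather than raising lets the caller turn the message straight into a UI error string, which matches how `handle_convert` uses the result.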

README.md CHANGED
@@ -498,9 +498,24 @@ markit_v2/
 │   │   ├── memory.py               # Chat history and session management
 │   │   ├── chat_service.py         # RAG chat service with Gemini 2.5 Flash
 │   │   └── ingestion.py            # Document ingestion pipeline
-│   └── ui/                         # User interface layer
+│   └── ui/                         # 🆕 Modular user interface layer
 │       ├── __init__.py             # Package initialization
-│       └── ui.py                   # 🆕 Gradio UI with three tabs (Converter + Chat + Query Ranker)
+│       ├── ui.py                   # Main UI orchestrator (~60 lines)
+│       ├── components/             # UI components
+│       │   ├── __init__.py         # Package initialization
+│       │   ├── document_converter.py   # Document converter tab (~200 lines)
+│       │   ├── chat_interface.py   # Chat interface tab (~180 lines)
+│       │   └── query_ranker.py     # Query ranker tab (~200 lines)
+│       ├── formatters/             # Content formatting utilities
+│       │   ├── __init__.py         # Package initialization
+│       │   └── content_formatters.py   # Markdown/LaTeX formatters (~150 lines)
+│       ├── styles/                 # UI styling
+│       │   ├── __init__.py         # Package initialization
+│       │   └── ui_styles.py        # CSS styles and themes (~800 lines)
+│       └── utils/                  # UI utility functions
+│           ├── __init__.py         # Package initialization
+│           ├── file_validation.py  # File validation utilities (~80 lines)
+│           └── threading_utils.py  # Threading utilities (~40 lines)
 ├── documents/                      # Documentation and examples (gitignored)
 ├── tessdata/                       # Tesseract OCR data (gitignored)
 └── tests/                          # 🆕 Test suite for Phase 1 RAG implementation
@@ -522,6 +537,11 @@ markit_v2/
 - **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
 - **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
 - **🆕 Query Ranker Interface**: Dedicated transparency tool for document search and ranking
+- **🆕 Modular UI Architecture**: Component-based UI with clear separation of concerns
+  - **UI Components**: Individual tab components for focused functionality
+  - **Content Formatters**: Specialized markdown and LaTeX rendering utilities
+  - **UI Styles**: Centralized CSS styling system with responsive design
+  - **UI Utils**: File validation and threading utilities for better code organization
 
 ### 🧠 **RAG System Architecture:**
 - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
src/ui/components/__init__.py ADDED
@@ -0,0 +1 @@
+"""UI Components package - Modular UI components for the Markit application."""
src/ui/components/chat_interface.py ADDED
@@ -0,0 +1,279 @@
+"""Chat interface UI component and logic."""
+
+import gradio as gr
+import logging
+
+from src.core.logging_config import get_logger
+from src.rag import rag_chat_service
+from src.services.data_clearing_service import data_clearing_service
+
+logger = get_logger(__name__)
+
+
+def handle_chat_message(message, history):
+    """Handle a new chat message with streaming response."""
+    if not message or not message.strip():
+        return "", history, gr.update()
+
+    try:
+        # Add user message to history
+        history = history or []
+        history.append({"role": "user", "content": message})
+
+        # Add assistant message placeholder
+        history.append({"role": "assistant", "content": ""})
+
+        # Get response from RAG service
+        response_text = ""
+        for chunk in rag_chat_service.chat_stream(message):
+            response_text += chunk
+            # Update the last message in history with the current response
+            history[-1]["content"] = response_text
+            # Update status in real-time during streaming
+            updated_status = get_chat_status()
+            yield "", history, updated_status
+
+        logger.info(f"Chat response completed for message: {message[:50]}...")
+
+        # Final status update after message completion
+        final_status = get_chat_status()
+        yield "", history, final_status
+
+    except Exception as e:
+        error_msg = f"Error generating response: {str(e)}"
+        logger.error(error_msg)
+        if history and len(history) > 0:
+            history[-1]["content"] = f"❌ {error_msg}"
+        else:
+            history = [
+                {"role": "user", "content": message},
+                {"role": "assistant", "content": f"❌ {error_msg}"}
+            ]
+        # Update status even on error
+        error_status = get_chat_status()
+        yield "", history, error_status
+
+
+def start_new_chat_session():
+    """Start a new chat session."""
+    try:
+        session_id = rag_chat_service.start_new_session()
+        logger.info(f"Started new chat session: {session_id}")
+        return [], f"✅ New chat session started: {session_id}"
+    except Exception as e:
+        error_msg = f"Error starting new session: {str(e)}"
+        logger.error(error_msg)
+        return [], f"❌ {error_msg}"
+
+
+def handle_clear_all_data():
+    """Handle clearing all RAG data (vector store + chat history)."""
+    try:
+        # Clear all data using the data clearing service
+        success, message, stats = data_clearing_service.clear_all_data()
+
+        if success:
+            # Reset chat session after clearing data
+            session_id = rag_chat_service.start_new_session()
+
+            # Get updated status
+            updated_status = get_chat_status()
+
+            # Create success message with stats
+            if stats.get("total_cleared_documents", 0) > 0 or stats.get("total_cleared_files", 0) > 0:
+                clear_msg = f"✅ {message}"
+                session_msg = f"🆕 Started new session: {session_id}"
+                combined_msg = f'{clear_msg}<br/><div class="session-info">{session_msg}</div>'
+            else:
+                combined_msg = f'ℹ️ {message}<br/><div class="session-info">🆕 Started new session: {session_id}</div>'
+
+            logger.info(f"Data cleared successfully: {message}")
+
+            return [], combined_msg, updated_status
+        else:
+            error_msg = f"❌ {message}"
+            logger.error(f"Data clearing failed: {message}")
+
+            # Still get updated status even on error
+            updated_status = get_chat_status()
+
+            return None, f'<div class="session-info">{error_msg}</div>', updated_status
+
+    except Exception as e:
+        error_msg = f"Error clearing data: {str(e)}"
+        logger.error(error_msg)
+
+        # Get current status
+        current_status = get_chat_status()
+
+        return None, f'<div class="session-info">❌ {error_msg}</div>', current_status
+
+
+def get_chat_status():
+    """Get current chat system status."""
+    try:
+        # Check ingestion status
+        from src.rag import document_ingestion_service
+        from src.services.data_clearing_service import data_clearing_service
+
+        ingestion_status = document_ingestion_service.get_ingestion_status()
+
+        # Check usage stats
+        usage_stats = rag_chat_service.get_usage_stats()
+
+        # Get data status for additional context
+        data_status = data_clearing_service.get_data_status()
+
+        # Get environment info
+        import os
+        env_type = "Hugging Face Space" if os.getenv("SPACE_ID") else "Local Development"
+
+        # Modern status card design with better styling
+        status_html = f"""
+        <div class="status-card">
+            <div class="status-header">
+                <h3>💬 Chat System Status</h3>
+                <div class="status-indicator {'status-ready' if ingestion_status.get('system_ready', False) else 'status-not-ready'}">
+                    {'🟢 READY' if ingestion_status.get('system_ready', False) else '🔴 NOT READY'}
+                </div>
+            </div>
+
+            <div class="status-grid">
+                <div class="status-item">
+                    <div class="status-label">Vector Store Docs</div>
+                    <div class="status-value">{data_status.get('vector_store', {}).get('document_count', 0)}</div>
+                </div>
+                <div class="status-item">
+                    <div class="status-label">Chat History Files</div>
+                    <div class="status-value">{data_status.get('chat_history', {}).get('file_count', 0)}</div>
+                </div>
+                <div class="status-item">
+                    <div class="status-label">Session Usage</div>
+                    <div class="status-value">{usage_stats.get('session_messages', 0)}/{usage_stats.get('session_limit', 50)}</div>
+                </div>
+                <div class="status-item">
+                    <div class="status-label">Environment</div>
+                    <div class="status-value">{'HF Space' if data_status.get('environment') == 'hf_space' else 'Local'}</div>
+                </div>
+            </div>
+
+            <div class="status-services">
+                <div class="service-status {'service-ready' if ingestion_status.get('embedding_model_available', False) else 'service-error'}">
+                    <span class="service-icon">🧠</span>
+                    <span>Embedding Model</span>
+                    <span class="service-indicator">{'✅' if ingestion_status.get('embedding_model_available', False) else '❌'}</span>
+                </div>
+                <div class="service-status {'service-ready' if ingestion_status.get('vector_store_available', False) else 'service-error'}">
+                    <span class="service-icon">🗄️</span>
+                    <span>Vector Store</span>
+                    <span class="service-indicator">{'✅' if ingestion_status.get('vector_store_available', False) else '❌'}</span>
+                </div>
+            </div>
+        </div>
+        """
+
+        return status_html
+
+    except Exception as e:
+        error_msg = f"Error getting chat status: {str(e)}"
+        logger.error(error_msg)
+        return f"""
+        <div class="status-card status-error">
+            <div class="status-header">
+                <h3>❌ System Error</h3>
+            </div>
+            <p class="error-message">{error_msg}</p>
+        </div>
+        """
+
+
+def create_chat_interface_tab():
+    """Create the chat interface tab UI."""
+    with gr.TabItem("💬 Chat with Documents"):
+        with gr.Column(elem_classes=["chat-tab-container"]):
+            # Header
+            gr.HTML("""
+            <div class="chat-header">
+                <h2>💬 Chat with your converted documents</h2>
+                <p>Ask questions about your documents using advanced RAG technology</p>
+            </div>
+            """)
+
+            # Status monitoring
+            status_display = gr.HTML(value=get_chat_status())
+
+            # Control buttons
+            with gr.Row(elem_classes=["control-buttons"]):
+                refresh_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
+                new_session_btn = gr.Button("🆕 New Session", elem_classes=["control-btn", "btn-new-session"])
+                clear_data_btn = gr.Button("🗑️ Clear All Data", elem_classes=["control-btn", "btn-clear-data"], variant="stop")
+
+            # Chat interface
+            with gr.Column(elem_classes=["chat-main-container"]):
+                chatbot = gr.Chatbot(
+                    elem_classes=["chat-container"],
+                    height=500,
+                    show_label=False,
+                    show_share_button=False,
+                    bubble_full_width=False,
+                    type="messages",
+                    placeholder="Start a conversation by asking questions about your documents..."
+                )
+
+                with gr.Row(elem_classes=["input-row"]):
+                    msg_input = gr.Textbox(
+                        placeholder="Ask questions about your documents...",
+                        show_label=False,
+                        scale=5,
+                        lines=1,
+                        max_lines=3,
+                        elem_classes=["message-input"]
+                    )
+                    send_btn = gr.Button("Submit", elem_classes=["send-button"], scale=0)
+
+            # Session info display
+            session_info = gr.HTML(
+                value='<div class="session-info">No active session - Click "New Session" to start</div>'
+            )
+
+            # Event handlers for chat
+            def clear_input():
+                return ""
+
+            # Send message when button clicked or Enter pressed
+            msg_input.submit(
+                fn=handle_chat_message,
+                inputs=[msg_input, chatbot],
+                outputs=[msg_input, chatbot, status_display]
+            )
+
+            send_btn.click(
+                fn=handle_chat_message,
+                inputs=[msg_input, chatbot],
+                outputs=[msg_input, chatbot, status_display]
+            )
+
+            # Control button handlers
+            refresh_btn.click(
+                fn=get_chat_status,
+                inputs=[],
+                outputs=[status_display]
+            )
+
+            # New session handler with improved feedback
+            def enhanced_new_session():
+                history, info = start_new_chat_session()
+                session_html = f'<div class="session-info">{info}</div>'
+                updated_status = get_chat_status()
+                return history, session_html, updated_status
+
+            new_session_btn.click(
+                fn=enhanced_new_session,
+                inputs=[],
+                outputs=[chatbot, session_info, status_display]
+            )
+
+            clear_data_btn.click(
+                handle_clear_all_data,
+                outputs=[chatbot, session_info, status_display]
+            )
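
The streaming pattern used by `handle_chat_message` (append a user turn, append an empty assistant turn, then grow the assistant turn chunk by chunk while yielding) can be tried in isolation. The sketch below uses only the standard library. `fake_chat_stream` is a hypothetical stand-in for `rag_chat_service.chat_stream`, and `stream_reply` is an illustrative name, not a function from this commit.

```python
def fake_chat_stream(message):
    """Hypothetical stand-in for rag_chat_service.chat_stream: yields text chunks."""
    for word in ("Echo:", message):
        yield word + " "


def stream_reply(message, history, chat_stream=fake_chat_stream):
    """Yield (textbox_value, history) pairs as the assistant reply grows."""
    history = list(history or [])
    if not message or not message.strip():
        # In a generator, outputs have to be yielded. A value passed to
        # `return` lands on StopIteration and a plain for-loop ignores it.
        yield "", history
        return

    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": ""})

    reply = ""
    for chunk in chat_stream(message):
        reply += chunk
        # Overwrite the placeholder so each yield carries the text so far.
        history[-1]["content"] = reply
        yield "", history


if __name__ == "__main__":
    for _, snapshot in stream_reply("hello", []):
        print(snapshot[-1]["content"])
```

Note the early exit: because the function contains `yield`, the empty-input branch yields its outputs and then returns, rather than returning them directly.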
src/ui/components/document_converter.py ADDED
@@ -0,0 +1,340 @@
+"""Document converter UI component and logic."""
+
+import threading
+import time
+import gradio as gr
+import logging
+from pathlib import Path
+
+from src.core.converter import convert_file
+from src.core.logging_config import get_logger
+from src.services.document_service import DocumentService
+from src.rag import document_ingestion_service
+from src.ui.utils.file_validation import validate_file_for_parser
+from src.ui.utils.threading_utils import (
+    conversion_cancelled,
+    monitor_cancellation,
+    reset_cancellation,
+    set_cancellation
+)
+from src.ui.formatters.content_formatters import format_markdown_content, format_latex_content
+
+logger = get_logger(__name__)
+
+
+def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format):
+    """Run the conversion in a separate thread and return the thread object"""
+    # Reset the cancellation flag
+    reset_cancellation()
+
+    # Create a container for the results
+    results = {"content": None, "download_file": None, "error": None}
+
+    def conversion_worker():
+        try:
+            content, download_file = convert_file(file_path, parser_name, ocr_method_name, output_format)
+            results["content"] = content
+            results["download_file"] = download_file
+        except Exception as e:
+            logger.error(f"Error during conversion: {str(e)}")
+            results["error"] = str(e)
+
+    # Create and start the thread
+    thread = threading.Thread(target=conversion_worker)
+    thread.daemon = True
+    thread.start()
+
+    return thread, results
+
+
+def run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type):
+    """Run the conversion in a separate thread for multiple files."""
+    # Results will be shared between threads
+    results = {"content": None, "download_file": None, "error": None}
+
+    def conversion_worker():
+        try:
+            logger.info(f"Starting multi-file conversion thread for {len(file_paths)} files")
+
+            # Use the new document service unified method
+            document_service = DocumentService()
+            document_service.set_cancellation_flag(conversion_cancelled)
+
+            # Call the unified convert_documents method
+            content, output_file = document_service.convert_documents(
+                file_paths=file_paths,
+                parser_name=parser_name,
+                ocr_method_name=ocr_method_name,
+                output_format=output_format,
+                processing_type=processing_type
+            )
+
+            logger.info(f"Multi-file conversion completed successfully for {len(file_paths)} files")
+            results["content"] = content
+            results["download_file"] = output_file
+
+        except Exception as e:
+            logger.error(f"Error during multi-file conversion: {str(e)}")
+            results["error"] = str(e)
+
+    # Create and start the thread
+    thread = threading.Thread(target=conversion_worker)
+    thread.daemon = True
+    thread.start()
+
+    return thread, results
+
+
+def handle_convert(files, parser_name, ocr_method_name, output_format, processing_type, is_cancelled):
+    """Handle file conversion for single or multiple files."""
+    # Check if we should cancel before starting
+    if is_cancelled:
+        logger.info("Conversion cancelled before starting")
+        return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+    # Validate files input
+    if not files or len(files) == 0:
+        error_msg = "No files uploaded. Please upload at least one document."
+        logger.error(error_msg)
+        return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+    # Convert Gradio file objects to file paths
+    file_paths = []
+    for file in files:
+        if hasattr(file, 'name'):
+            file_paths.append(file.name)
+        else:
+            file_paths.append(str(file))
+
+    # Validate file types for the selected parser
+    for file_path in file_paths:
+        is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
+        if not is_valid:
+            logger.error(f"File validation error: {error_msg}")
+            return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+    logger.info(f"Starting conversion of {len(file_paths)} file(s) with cancellation flag cleared")
+
+    # Start the conversion in a separate thread
+    thread, results = run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type)
+
+    # Start the monitoring thread
+    monitor_thread = threading.Thread(target=monitor_cancellation)
+    monitor_thread.daemon = True
+    monitor_thread.start()
+
+    # Wait for the thread to complete or be cancelled
+    while thread.is_alive():
+        # Check if cancellation was requested
+        if conversion_cancelled.is_set():
+            logger.info("Cancellation detected, waiting for thread to finish")
+            # Give the thread a chance to clean up
+            thread.join(timeout=0.5)
+            if thread.is_alive():
+                logger.warning("Thread did not finish within timeout")
+            return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+        # Sleep briefly to avoid busy waiting
+        time.sleep(0.1)
+
+    # Thread has completed, check results
+    if results["error"]:
+        return f"Error: {results['error']}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+    content = results["content"]
+    download_file = results["download_file"]
+
+    # If conversion returned a cancellation message
+    if content == "Conversion cancelled.":
+        logger.info("Converter returned cancellation message")
+        return content, None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+    # Format the content based on parser type
+    if "GOT-OCR" in parser_name:
+        # For GOT-OCR, display as LaTeX
+        formatted_content = format_latex_content(str(content))
+        html_output = f"<div class='output-container'>{formatted_content}</div>"
+    else:
+        # For other parsers, display as Markdown
+        formatted_content = format_markdown_content(str(content))
+        html_output = f"<div class='output-container'>{formatted_content}</div>"
+
+    logger.info("Conversion completed successfully")
+
+    # Auto-ingest the converted document for RAG
+    try:
+        # For multi-file conversion, use the first file for metadata
+        file_path = file_paths[0] if file_paths else None
+
+        # Read original file content for proper deduplication hashing
+        original_file_content = None
+        if file_path and Path(file_path).exists():
+            try:
+                with open(file_path, 'rb') as f:
+                    original_file_content = f.read().decode('utf-8', errors='ignore')
+            except Exception as e:
+                logger.warning(f"Could not read original file content: {e}")
+
+        conversion_result = {
+            "markdown_content": content,
+            "original_filename": Path(file_path).name if file_path else "unknown",
+            "conversion_method": parser_name,
+            "file_size": Path(file_path).stat().st_size if file_path and Path(file_path).exists() else 0,
+            "conversion_time": 0,  # Could be tracked if needed
+            "original_file_content": original_file_content
+        }
+
+        success, ingestion_msg, stats = document_ingestion_service.ingest_from_conversion_result(conversion_result)
+        if success:
+            logger.info(f"Document auto-ingested for RAG: {ingestion_msg}")
+        else:
+            logger.warning(f"Document ingestion failed: {ingestion_msg}")
+    except Exception as e:
+        logger.error(f"Error during auto-ingestion: {e}")
+
+    return html_output, download_file, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+
+
+
+
+def create_document_converter_tab():
+    """Create the document converter tab UI."""
+    with gr.TabItem("📄 Document Converter"):
+        with gr.Column(elem_classes=["chat-tab-container"]):
+            # Modern header matching other tabs
+            gr.HTML("""
+            <div class="chat-header">
+                <h2>📄 Document Converter</h2>
+                <p>Convert documents to Markdown format with advanced OCR and AI processing</p>
+            </div>
+            """)
+
+            # State to track if cancellation is requested
+            cancel_requested = gr.State(False)
+            # State to store the conversion thread
+            conversion_thread = gr.State(None)
+            # State to store the output format (fixed to Markdown)
+            output_format_state = gr.State("Markdown")
+
+            # Multi-file input (supports single and multiple files)
+            files_input = gr.Files(
+                label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
+                file_count="multiple",
+                file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
+            )
+
+            # Processing type selector (visible only for multiple files)
+            processing_type_selector = gr.Radio(
+                choices=["combined", "individual", "summary", "comparison"],
+                value="combined",
+                label="Multi-Document Processing Type",
+                info="How to process multiple documents together",
+                visible=False
+            )
+
+            # Status text to show file count and processing mode
+            file_status_text = gr.HTML(
+                value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
+                label=""
+            )
+
+            # Provider and OCR options below the file input
+            with gr.Row(elem_classes=["provider-options-row"]):
+                with gr.Column(scale=1):
+                    from src.parsers.parser_registry import ParserRegistry
+                    parser_names = ParserRegistry.get_parser_names()
+
+                    # Make MarkItDown the default parser if available
+                    default_parser = next((p for p in parser_names if p == "MarkItDown"), parser_names[0] if parser_names else "PyPdfium")
+
+                    provider_dropdown = gr.Dropdown(
+                        label="Provider",
+                        choices=parser_names,
+                        value=default_parser,
+                        interactive=True
+                    )
+                with gr.Column(scale=1):
+                    default_ocr_options = ParserRegistry.get_ocr_options(default_parser)
+                    default_ocr = default_ocr_options[0] if default_ocr_options else "No OCR"
+
+                    ocr_dropdown = gr.Dropdown(
+                        label="OCR Options",
+                        choices=default_ocr_options,
+                        value=default_ocr,
+                        interactive=True
+                    )
+
+            # Processing controls row with consistent styling
+            with gr.Row(elem_classes=["control-buttons"]):
+                convert_button = gr.Button("🚀 Convert", elem_classes=["control-btn", "btn-primary"])
+                cancel_button = gr.Button("⏹️ Cancel", elem_classes=["control-btn", "btn-clear-data"], visible=False)
+
+            # Simple output container with just one scrollbar
+            file_display = gr.HTML(
+                value="<div class='output-container'></div>",
+                label="Converted Content"
+            )
+
+            file_download = gr.File(label="Download File")
+
+            # Event handlers
+            from src.ui.utils.file_validation import update_ui_for_file_count
+
+            # Update UI when files are uploaded
+            files_input.change(
+                fn=update_ui_for_file_count,
+                inputs=[files_input],
+                outputs=[processing_type_selector, file_status_text]
+            )
+
+            provider_dropdown.change(
+                lambda p: gr.Dropdown(
+                    choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
+                    value="Plain Text" if "GOT-OCR" in p else (ParserRegistry.get_ocr_options(p)[0] if ParserRegistry.get_ocr_options(p) else None)
+                ),
+                inputs=[provider_dropdown],
+                outputs=[ocr_dropdown]
+            )
+
+            # Reset cancel flag when starting conversion
+            def start_conversion():
+                from src.ui.utils.threading_utils import conversion_cancelled
+                conversion_cancelled.clear()
+                logger.info("Starting conversion with cancellation flag cleared")
+                return gr.update(visible=False), gr.update(visible=True), False
+
+            # Set cancel flag and terminate thread when cancel button is clicked
+            def request_cancellation(thread):
+                from src.ui.utils.threading_utils import conversion_cancelled
+                conversion_cancelled.set()
+                logger.info("Cancel button clicked, cancellation flag set")
+
+                # Try to join the thread with a timeout
+                if thread is not None:
+                    logger.info(f"Attempting to join conversion thread: {thread}")
+                    thread.join(timeout=0.5)
+                    if thread.is_alive():
+                        logger.warning("Thread did not finish within timeout")
+
+                # Add immediate feedback to the user
+                return gr.update(visible=True), gr.update(visible=False), True, None
+
+            # Start conversion sequence
+            convert_button.click(
+                fn=start_conversion,
+                inputs=[],
+                outputs=[convert_button, cancel_button, cancel_requested],
+                queue=False  # Execute immediately
+            ).then(
+                fn=handle_convert,
+                inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state, processing_type_selector, cancel_requested],
+                outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
+            )
+
+            # Handle cancel button click
+            cancel_button.click(
+                fn=request_cancellation,
+                inputs=[conversion_thread],
+                outputs=[convert_button, cancel_button, cancel_requested, conversion_thread],
+                queue=False  # Execute immediately
+            )
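
The cancellation flow in `handle_convert` (start a daemon worker, poll a shared `threading.Event`, give the worker a short grace period once the flag is set) can be sketched with the standard library alone. In the sketch below, `run_with_cancellation` and `slow_job` are hypothetical names chosen for illustration; the real worker is `DocumentService.convert_documents`, whose internals are not shown in this diff.

```python
import threading
import time


def run_with_cancellation(work, cancel_event, poll_interval=0.05, grace=0.5):
    """Run work(cancel_event) in a daemon thread.

    Returns (status, results) where status is "done", "error" or "cancelled".
    """
    results = {"content": None, "error": None}

    def worker():
        try:
            results["content"] = work(cancel_event)
        except Exception as exc:
            # Record the failure so the caller can report it.
            results["error"] = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    # Poll instead of a bare join() so a cancel request is noticed promptly.
    while thread.is_alive() and not cancel_event.is_set():
        time.sleep(poll_interval)

    if cancel_event.is_set():
        # Cooperative cancellation: wait briefly, then stop waiting.
        thread.join(timeout=grace)
        return "cancelled", results

    return ("error" if results["error"] else "done"), results


def slow_job(cancel_event):
    """Hypothetical cooperative job that checks the flag between steps."""
    for _ in range(20):
        if cancel_event.is_set():
            return None
        time.sleep(0.01)
    return "converted text"


if __name__ == "__main__":
    status, _ = run_with_cancellation(slow_job, threading.Event())
    print(status)
```

Python threads cannot be killed from outside, so this only works when the job checks the event itself. A worker that ignores the flag keeps running in the background after the caller has returned "cancelled".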
src/ui/components/query_ranker.py ADDED
@@ -0,0 +1,387 @@
+"""Query ranker UI component and logic."""
+
+import gradio as gr
+import logging
+
+from src.core.logging_config import get_logger
+from src.rag.vector_store import vector_store_manager
+from src.rag import document_ingestion_service
+
+logger = get_logger(__name__)
+
+
+def handle_query_search(query, method, k_value):
+    """Handle query search and return formatted results."""
+    if not query or not query.strip():
+        return """
+        <div class="ranker-container">
+            <div class="ranker-placeholder">
+                <h3>🔍 Query Ranker</h3>
+                <p>Enter a search query to find relevant document chunks with similarity scores.</p>
+            </div>
+        </div>
+        """
+
+    try:
+        logger.info(f"Query search: '{query[:50]}...' using method: {method}")
+
+        # Get results based on method
+        results = []
+        if method == "similarity":
+            retriever = vector_store_manager.get_retriever("similarity", {"k": k_value})
+            docs = retriever.invoke(query)
+            # Try to get actual similarity scores
+            try:
+                vector_store = vector_store_manager.get_vector_store()
+                if hasattr(vector_store, 'similarity_search_with_score'):
+                    docs_with_scores = vector_store.similarity_search_with_score(query, k=k_value)
+                    for i, (doc, score) in enumerate(docs_with_scores):
+                        similarity_score = max(0, 1 - score) if score is not None else 0.8
+                        results.append(_format_ranker_result(doc, similarity_score, i + 1))
+                else:
+                    # Fallback without scores
+                    for i, doc in enumerate(docs):
+                        score = 0.85 - (i * 0.05)
+                        results.append(_format_ranker_result(doc, score, i + 1))
+            except Exception as e:
+                logger.warning(f"Could not get similarity scores: {e}")
+                for i, doc in enumerate(docs):
+                    score = 0.85 - (i * 0.05)
+                    results.append(_format_ranker_result(doc, score, i + 1))
+
+        elif method == "mmr":
+            retriever = vector_store_manager.get_retriever("mmr", {"k": k_value, "fetch_k": k_value * 2, "lambda_mult": 0.5})
+            docs = retriever.invoke(query)
+            for i, doc in enumerate(docs):
+                results.append(_format_ranker_result(doc, None, i + 1))  # No score for MMR
+
+        elif method == "bm25":
+            retriever = vector_store_manager.get_bm25_retriever(k=k_value)
+            docs = retriever.invoke(query)
+            for i, doc in enumerate(docs):
+                results.append(_format_ranker_result(doc, None, i + 1))  # No score for BM25
+
+        elif method == "hybrid":
+            retriever = vector_store_manager.get_hybrid_retriever(k=k_value, semantic_weight=0.7, keyword_weight=0.3)
+            docs = retriever.invoke(query)
+            for i, doc in enumerate(docs):
+                results.append(_format_ranker_result(doc, None, i + 1))  # No score for hybrid
+
+        logger.info(f"Retrieved {len(results)} results for query using {method}")
+        return _format_ranker_results_html(results, query, method)
+
+    except Exception as e:
+        error_msg = f"Error during search: {str(e)}"
+        logger.error(error_msg)
+        return f"""
+        <div class="ranker-container">
+            <div class="ranker-error">
+                <h3>❌ Search Error</h3>
+                <p>{error_msg}</p>
+                <p class="error-hint">Make sure documents are uploaded and the system is ready.</p>
+            </div>
+        </div>
+        """
+
+
87
+ def _format_ranker_result(doc, score, rank):
88
+ """Format a single search result."""
89
+ # Extract metadata
90
+ metadata = doc.metadata
91
+ source = metadata.get("source", "Unknown")
92
+ page = metadata.get("page", "N/A")
93
+ chunk_id = metadata.get("chunk_id", "Unknown")
94
+
95
+ # Calculate content length and create indicator
96
+ content_length = len(doc.page_content)
97
+ if content_length < 200:
98
+ length_indicator = f"📝 {content_length} chars"
99
+ elif content_length < 500:
100
+ length_indicator = f"📄 {content_length} chars"
101
+ else:
102
+ length_indicator = f"📚 {content_length} chars"
103
+
104
+ # Confidence label is a rank-based heuristic (top ranks are labelled High); it is not derived from the score
105
+ if rank <= 2:
106
+ confidence = "High"
107
+ confidence_color = "#28a745"
108
+ confidence_icon = "🔥"
109
+ elif rank <= 4:
110
+ confidence = "Medium"
111
+ confidence_color = "#ffc107"
112
+ confidence_icon = "⭐"
113
+ else:
114
+ confidence = "Low"
115
+ confidence_color = "#6c757d"
116
+ confidence_icon = "💡"
117
+
118
+ result = {
119
+ "rank": rank,
120
+ "content": doc.page_content,
121
+ "source": source,
122
+ "page": page,
123
+ "chunk_id": chunk_id,
124
+ "length_indicator": length_indicator,
125
+ "has_score": score is not None,
126
+ "confidence": confidence,
127
+ "confidence_color": confidence_color,
128
+ "confidence_icon": confidence_icon
129
+ }
130
+
131
+ # Only add score if we have a real score (similarity search only)
132
+ if score is not None:
133
+ result["score"] = round(score, 3)
134
+
135
+ return result
136
+
137
+
138
+ def _format_ranker_results_html(results, query, method):
139
+ """Format search results as HTML."""
140
+ if not results:
141
+ return """
142
+ <div class="ranker-container">
143
+ <div class="ranker-no-results">
144
+ <h3>🔍 No Results Found</h3>
145
+ <p>No relevant documents found for your query.</p>
146
+ <p class="no-results-hint">Try different keywords or check if documents are uploaded.</p>
147
+ </div>
148
+ </div>
149
+ """
150
+
151
+ # Method display names
152
+ method_labels = {
153
+ "similarity": "🎯 Similarity Search",
154
+ "mmr": "🔀 MMR (Diverse)",
155
+ "bm25": "🔍 BM25 (Keywords)",
156
+ "hybrid": "🔗 Hybrid (Recommended)"
157
+ }
158
+ method_display = method_labels.get(method, method)
159
+
160
+ import html  # stdlib; escapes the query, source and chunk text interpolated below
161
+ html_parts = [f"""
162
+ <div class="ranker-container">
163
+ <div class="ranker-header">
164
+ <div class="ranker-title">
165
+ <h3>🔍 Search Results</h3>
166
+ <div class="query-display">"{html.escape(query)}"</div>
167
+ </div>
168
+ <div class="ranker-meta">
169
+ <span class="method-badge">{method_display}</span>
170
+ <span class="result-count">{len(results)} results</span>
171
+ </div>
172
+ </div>
173
+ """]
174
+
175
+ # Add results
176
+ for result in results:
177
+ rank_emoji = ["🥇", "🥈", "🥉"][result["rank"] - 1] if result["rank"] <= 3 else f"#{result['rank']}"
178
+
179
+ # Escape chunk text so it cannot inject markup into the results HTML
180
+ escaped_content = html.escape(result['content'])
181
+
182
+ # Build score info - always show confidence, only show score for similarity search
183
+ score_info_parts = [f"""
184
+ <span class="confidence-badge" style="color: {result['confidence_color']}">
185
+ {result['confidence_icon']} {result['confidence']}
186
+ </span>"""]
187
+
188
+ # Only add score value if we have real scores (similarity search)
189
+ if result.get('has_score', False):
190
+ score_info_parts.append(f'<span class="score-value">🎯 {result["score"]}</span>')
191
+
192
+ score_info_html = f"""
193
+ <div class="score-info">
194
+ {''.join(score_info_parts)}
195
+ </div>"""
196
+
197
+ html_parts.append(f"""
198
+ <div class="result-card">
199
+ <div class="result-header">
200
+ <div class="rank-info">
201
+ <span class="rank-badge">{rank_emoji} Rank {result['rank']}</span>
202
+ <span class="source-info">📄 {html.escape(str(result['source']))}</span>
203
+ {f"<span class='page-info'>Page {result['page']}</span>" if result['page'] != 'N/A' else ""}
204
+ <span class="length-info">{result['length_indicator']}</span>
205
+ </div>
206
+ {score_info_html}
207
+ </div>
208
+ <div class="result-content">
209
+ <div class="content-text">{escaped_content}</div>
210
+ </div>
211
+ </div>
212
+ """)
213
+
214
+ html_parts.append("</div>")
215
+
216
+ return "".join(html_parts)
217
+
218
+
219
+ def get_ranker_status():
220
+ """Get current ranker system status."""
221
+ try:
222
+ # Get collection info
223
+ collection_info = vector_store_manager.get_collection_info()
224
+ document_count = collection_info.get("document_count", 0)
225
+
226
+ # Get available methods
227
+ available_methods = ["similarity", "mmr", "bm25", "hybrid"]
228
+
229
+ # Check if system is ready
230
+ ingestion_status = document_ingestion_service.get_ingestion_status()
231
+ system_ready = ingestion_status.get('system_ready', False)
232
+
233
+ status_html = f"""
234
+ <div class="status-card">
235
+ <div class="status-header">
236
+ <h3>🔍 Query Ranker Status</h3>
237
+ <div class="status-indicator {'status-ready' if system_ready else 'status-not-ready'}">
238
+ {'🟢 READY' if system_ready else '🔴 NOT READY'}
239
+ </div>
240
+ </div>
241
+
242
+ <div class="status-grid">
243
+ <div class="status-item">
244
+ <div class="status-label">Available Documents</div>
245
+ <div class="status-value">{document_count}</div>
246
+ </div>
247
+ <div class="status-item">
248
+ <div class="status-label">Retrieval Methods</div>
249
+ <div class="status-value">{len(available_methods)}</div>
250
+ </div>
251
+ <div class="status-item">
252
+ <div class="status-label">Vector Store</div>
253
+ <div class="status-value">{'Ready' if system_ready else 'Not Ready'}</div>
254
+ </div>
255
+ </div>
256
+
257
+ <div class="ranker-methods">
258
+ <div class="methods-label">Available Methods:</div>
259
+ <div class="methods-list">
260
+ <span class="method-tag">🎯 Similarity</span>
261
+ <span class="method-tag">🔀 MMR</span>
262
+ <span class="method-tag">🔍 BM25</span>
263
+ <span class="method-tag">🔗 Hybrid</span>
264
+ </div>
265
+ </div>
266
+ </div>
267
+ """
268
+
269
+ return status_html
270
+
271
+ except Exception as e:
272
+ error_msg = f"Error getting ranker status: {str(e)}"
273
+ logger.error(error_msg)
274
+ return f"""
275
+ <div class="status-card status-error">
276
+ <div class="status-header">
277
+ <h3>❌ System Error</h3>
278
+ </div>
279
+ <p class="error-message">{error_msg}</p>
280
+ </div>
281
+ """
282
+
283
+
284
+ def create_query_ranker_tab():
285
+ """Create the query ranker tab UI."""
286
+ with gr.TabItem("🔍 Query Ranker"):
287
+ with gr.Column(elem_classes=["ranker-container"]):
288
+ # Header
289
+ gr.HTML("""
290
+ <div class="chat-header">
291
+ <h2>🔍 Query Ranker</h2>
292
+ <p>Search and rank document chunks with transparency into retrieval methods</p>
293
+ </div>
294
+ """)
295
+
296
+ # Status display
297
+ status_display = gr.HTML(value=get_ranker_status())
298
+
299
+ # Control buttons
300
+ with gr.Row(elem_classes=["control-buttons"]):
301
+ refresh_ranker_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
302
+ clear_results_btn = gr.Button("🗑️ Clear Results", elem_classes=["control-btn", "btn-clear-data"])
303
+
304
+ # Search controls
305
+ with gr.Column(elem_classes=["ranker-controls"]):
306
+ with gr.Row(elem_classes=["ranker-input-row"]):
307
+ query_input = gr.Textbox(
308
+ placeholder="Enter your search query...",
309
+ show_label=False,
310
+ elem_classes=["ranker-query-input"],
311
+ scale=4
312
+ )
313
+ search_btn = gr.Button("🔍 Search", elem_classes=["ranker-search-btn"], scale=0)
314
+
315
+ with gr.Row(elem_classes=["ranker-options-row"]):
316
+ method_dropdown = gr.Dropdown(
317
+ choices=[
318
+ ("🎯 Similarity Search", "similarity"),
319
+ ("🔀 MMR (Diverse)", "mmr"),
320
+ ("🔍 BM25 (Keywords)", "bm25"),
321
+ ("🔗 Hybrid (Recommended)", "hybrid")
322
+ ],
323
+ value="hybrid",
324
+ label="Retrieval Method",
325
+ scale=2
326
+ )
327
+ k_slider = gr.Slider(
328
+ minimum=1,
329
+ maximum=10,
330
+ value=5,
331
+ step=1,
332
+ label="Number of Results",
333
+ scale=1
334
+ )
335
+
336
+ # Results display
337
+ results_display = gr.HTML(
338
+ value=handle_query_search("", "hybrid", 5), # Initial placeholder
339
+ elem_classes=["ranker-results-container"]
340
+ )
341
+
342
+ # Event handlers
343
+ query_input.submit(
344
+ handle_query_search,
345
+ inputs=[query_input, method_dropdown, k_slider],
346
+ outputs=[results_display]
347
+ )
348
+
349
+ search_btn.click(
350
+ handle_query_search,
351
+ inputs=[query_input, method_dropdown, k_slider],
352
+ outputs=[results_display]
353
+ )
354
+
355
+ # Control button handlers
356
+ def clear_ranker_results():
357
+ """Clear the search results and reset to placeholder."""
358
+ return handle_query_search("", "hybrid", 5), ""
359
+
360
+ def refresh_ranker_status():
361
+ """Refresh the ranker status display."""
362
+ return get_ranker_status()
363
+
364
+ refresh_ranker_status_btn.click(
365
+ fn=refresh_ranker_status,
366
+ inputs=[],
367
+ outputs=[status_display]
368
+ )
369
+
370
+ clear_results_btn.click(
371
+ fn=clear_ranker_results,
372
+ inputs=[],
373
+ outputs=[results_display, query_input]
374
+ )
375
+
376
+ # Update results when method or k changes
377
+ method_dropdown.change(
378
+ fn=handle_query_search,
379
+ inputs=[query_input, method_dropdown, k_slider],
380
+ outputs=[results_display]
381
+ )
382
+
383
+ k_slider.change(
384
+ fn=handle_query_search,
385
+ inputs=[query_input, method_dropdown, k_slider],
386
+ outputs=[results_display]
387
+ )
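Note on the scoring logic in `handle_query_search` and `_format_ranker_result` above: the similarity branch converts the store's score with `max(0, 1 - score)`, and the confidence label depends only on rank. The sketch below isolates those two rules and adds HTML escaping of chunk text. The helper names (`distance_to_similarity`, `confidence_for_rank`, `render_snippet`) are hypothetical and not part of this commit, and the conversion assumes the vector store returns a distance where lower means closer.

```python
import html


def distance_to_similarity(distance, default=0.8):
    """Map a distance (lower is closer) to a similarity clamped at 0.

    Assumption: `distance` comes from a store that reports distances,
    not similarities. `default` mirrors the placeholder used in the diff.
    """
    if distance is None:
        return default
    return max(0.0, 1 - distance)


def confidence_for_rank(rank):
    """Rank-only label: ranks 1-2 are High, 3-4 Medium, the rest Low."""
    if rank <= 2:
        return "High"
    if rank <= 4:
        return "Medium"
    return "Low"


def render_snippet(text):
    """Escape chunk text before placing it inside the result card markup."""
    return f'<div class="content-text">{html.escape(text)}</div>'


if __name__ == "__main__":
    print(distance_to_similarity(0.25))
    print(confidence_for_rank(3))
    print(render_snippet("a < b"))
```

Keeping the label rank-based means MMR, BM25 and hybrid results still get a badge even though they carry no numeric score. The badge should therefore be read as a position hint rather than a calibrated confidence.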
src/ui/formatters/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Content Formatters package - Content formatting and rendering utilities."""
src/ui/formatters/content_formatters.py ADDED
@@ -0,0 +1,188 @@
1
+ """Content formatting and rendering utilities for the Markit application."""
2
+
3
+ import markdown
4
+ import json
5
+ import base64
6
+ import html
7
+ import logging
8
+
9
+ from src.core.logging_config import get_logger
10
+
11
+ logger = get_logger(__name__)
12
+
13
+
14
+ def format_markdown_content(content):
15
+ """Convert markdown content to HTML."""
16
+ if not content:
17
+ return content
18
+
19
+ # Convert the content to HTML using markdown library
20
+ html_content = markdown.markdown(str(content), extensions=['tables'])
21
+ return html_content
22
+
23
+
24
+ def render_latex_to_html(latex_content):
25
+ """Convert LaTeX content to HTML using Mathpix Markdown like GOT-OCR demo."""
26
+ # Clean up the content similar to GOT-OCR demo
27
+ content = latex_content.strip()
28
+ if content.endswith("<|im_end|>"):
29
+ content = content[:-len("<|im_end|>")]
30
+
31
+ # Fix unbalanced delimiters exactly like GOT-OCR demo
32
+ right_num = content.count("\\right")
33
+ left_num = content.count("\\left")
34
+
35
+ if right_num != left_num:
36
+ content = (
37
+ content.replace("\\left(", "(")
38
+ .replace("\\right)", ")")
39
+ .replace("\\left[", "[")
40
+ .replace("\\right]", "]")
41
+ .replace("\\left{", "{")
42
+ .replace("\\right}", "}")
43
+ .replace("\\left|", "|")
44
+ .replace("\\right|", "|")
45
+ .replace("\\left.", ".")
46
+ .replace("\\right.", ".")
47
+ )
48
+
49
+ # Process content like GOT-OCR demo: remove $ signs and replace quotes
50
+ content = content.replace('"', "``").replace("$", "")
51
+
52
+ # Split into lines and create JavaScript string like GOT-OCR demo
53
+ outputs_list = content.split("\n")
54
+ js_text_parts = []
55
+ for line in outputs_list:
56
+ # Escape backslashes and add line break
57
+ escaped_line = line.replace("\\", "\\\\")
58
+ js_text_parts.append(f'"{escaped_line}\\n"')
59
+
60
+ # Join with + like in GOT-OCR demo
61
+ js_text = " + ".join(js_text_parts)
62
+
63
+ # Create HTML using Mathpix Markdown like GOT-OCR demo
64
+ html_content = f"""<!DOCTYPE html>
65
+ <html lang="en" data-lt-installed="true">
66
+ <head>
67
+ <meta charset="UTF-8">
68
+ <title>LaTeX Content</title>
69
+ <script>
70
+ const text = {js_text};
71
+ </script>
72
+ <style>
73
+ #content {{
74
+ max-width: 800px;
75
+ margin: auto;
76
+ padding: 20px;
77
+ }}
78
+ body {{
79
+ font-family: 'Times New Roman', serif;
80
+ line-height: 1.6;
81
+ background-color: #ffffff;
82
+ color: #333;
83
+ }}
84
+ table {{
85
+ border-collapse: collapse;
86
+ width: 100%;
87
+ margin: 20px 0;
88
+ }}
89
+ td, th {{
90
+ border: 1px solid #333;
91
+ padding: 8px 12px;
92
+ text-align: center;
93
+ vertical-align: middle;
94
+ }}
95
+ </style>
96
+ <script>
97
+ let script = document.createElement('script');
98
+ script.src = "https://cdn.jsdelivr.net/npm/mathpix-markdown-it@1.3.6/es5/bundle.js";
99
+ document.head.append(script);
100
+ script.onload = function() {{
101
+ const isLoaded = window.loadMathJax();
102
+ if (isLoaded) {{
103
+ console.log('Styles loaded!')
104
+ }}
105
+ const el = window.document.getElementById('content-text');
106
+ if (el) {{
107
+ const options = {{
108
+ htmlTags: true
109
+ }};
110
+ const html = window.render(text, options);
111
+ el.outerHTML = html;
112
+ }}
113
+ }};
114
+ </script>
115
+ </head>
116
+ <body>
117
+ <div id="content">
118
+ <div id="content-text"></div>
119
+ </div>
120
+ </body>
121
+ </html>"""
122
+
123
+ return html_content
124
+
125
+
126
+ def format_latex_content(content):
127
+ """Format LaTeX content for display in UI using MathJax rendering like GOT-OCR demo."""
128
+ if not content:
129
+ return content
130
+
131
+ try:
132
+ # Generate rendered HTML
133
+ rendered_html = render_latex_to_html(content)
134
+
135
+ # Encode for iframe display (similar to GOT-OCR demo)
136
+ encoded_html = base64.b64encode(rendered_html.encode("utf-8")).decode("utf-8")
137
+ iframe_src = f"data:text/html;base64,{encoded_html}"
138
+
139
+ # Create the display with both rendered and raw views
140
+ formatted_content = f"""
141
+ <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
142
+ <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
143
+ 📄 LaTeX Content (Rendered with MathJax)
144
+ </div>
145
+ <div style="padding: 0;">
146
+ <iframe src="{iframe_src}" width="100%" height="500px" style="border: none; border-radius: 0 0 8px 8px;"></iframe>
147
+ </div>
148
+ <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0; font-size: 12px; color: #6c757d; border-top: 1px solid #dee2e6;">
149
+ 💡 LaTeX content rendered with MathJax. Tables and formulas are displayed as they would appear in a LaTeX document.
150
+ </div>
151
+ <details style="margin: 0; border-top: 1px solid #dee2e6;">
152
+ <summary style="padding: 8px 15px; background-color: #e9ecef; cursor: pointer; font-size: 12px; color: #6c757d;">
153
+ 📝 View Raw LaTeX Source
154
+ </summary>
155
+ <div style="padding: 15px; background-color: #f8f9fa;">
156
+ <pre style="background-color: transparent; margin: 0; padding: 0;
157
+ font-family: 'Courier New', monospace; font-size: 12px; line-height: 1.4;
158
+ white-space: pre-wrap; word-wrap: break-word; color: #2c3e50; max-height: 200px; overflow-y: auto;">
159
+ {html.escape(str(content))}
160
+ </pre>
161
+ </div>
162
+ </details>
163
+ </div>
164
+ """
165
+
166
+ except Exception as e:
167
+ # Fallback to simple formatting if rendering fails
168
+ logger.error(f"Error rendering LaTeX content: {e}")
169
+ escaped_content = html.escape(str(content))
170
+ formatted_content = f"""
171
+ <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
172
+ <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
173
+ 📄 LaTeX Content (Fallback View)
174
+ </div>
175
+ <div style="padding: 15px;">
176
+ <pre style="background-color: transparent; margin: 0; padding: 0;
177
+ font-family: 'Courier New', monospace; font-size: 14px; line-height: 1.4;
178
+ white-space: pre-wrap; word-wrap: break-word; color: #2c3e50;">
179
+ {escaped_content}
180
+ </pre>
181
+ </div>
182
+ <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0 0 8px 8px; font-size: 12px; color: #6c757d;">
183
+ ⚠️ Rendering failed, showing raw LaTeX. Error: {str(e)}
184
+ </div>
185
+ </div>
186
+ """
187
+
188
+ return formatted_content
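Note on `render_latex_to_html` above: it does two string transformations before building the page. It drops `\left`/`\right` sizing commands when their counts differ, and it turns the content into a concatenation of JavaScript string literals, one per line. The sketch below reproduces those two steps in isolation so they can be checked without a browser. The function names (`strip_unbalanced_sizing`, `to_js_concat`) are hypothetical and do not appear in this commit.

```python
# Sizing commands and the bare delimiters that replace them (same pairs as the diff).
SIZING_PAIRS = [
    ("\\left(", "("), ("\\right)", ")"),
    ("\\left[", "["), ("\\right]", "]"),
    ("\\left{", "{"), ("\\right}", "}"),
    ("\\left|", "|"), ("\\right|", "|"),
    ("\\left.", "."), ("\\right.", "."),
]


def strip_unbalanced_sizing(content):
    """Remove \\left/\\right sizing commands only when their counts differ."""
    if content.count("\\left") == content.count("\\right"):
        return content
    for command, bare in SIZING_PAIRS:
        content = content.replace(command, bare)
    return content


def to_js_concat(content):
    """Build a JS expression: one double-quoted literal per line, joined with +."""
    parts = []
    for line in content.split("\n"):
        escaped = line.replace("\\", "\\\\")  # double each backslash for JS
        parts.append(f'"{escaped}\\n"')       # append a literal backslash-n
    return " + ".join(parts)


if __name__ == "__main__":
    print(strip_unbalanced_sizing("\\left( x + y"))
    print(to_js_concat("a\\frac{1}{2}\nb"))
```

Two caveats apply to this sketch and, as far as the diff shows, to the original as well. `count("\\left")` also matches commands such as `\leftarrow`, so the balance check is approximate. `to_js_concat` relies on double quotes having been replaced beforehand, as the original does with `content.replace('"', "``")`.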
src/ui/styles/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """UI Styles package - CSS styles and theme definitions."""
src/ui/styles/ui_styles.py ADDED
@@ -0,0 +1,770 @@
1
+ """CSS styles and theme definitions for the Markit UI."""
2
+
3
+ # Main CSS styles for the application
4
+ CSS_STYLES = """
5
+ /* Global styles */
6
+ .gradio-container {
7
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
8
+ }
9
+
10
+ /* Document converter styles */
11
+ .output-container {
12
+ max-height: 420px;
13
+ overflow-y: auto;
14
+ border: 1px solid #ddd;
15
+ padding: 10px;
16
+ }
17
+
18
+ .gradio-container .prose {
19
+ overflow: visible;
20
+ }
21
+
22
+ .processing-controls {
23
+ display: flex;
24
+ justify-content: center;
25
+ gap: 10px;
26
+ margin-top: 10px;
27
+ }
28
+
29
+ .provider-options-row {
30
+ margin-top: 15px;
31
+ margin-bottom: 15px;
32
+ }
33
+
34
+ /* Chat Tab Styles - Complete redesign */
35
+ .chat-tab-container {
36
+ max-width: 1200px;
37
+ margin: 0 auto;
38
+ padding: 20px;
39
+ }
40
+
41
+ .chat-header {
42
+ text-align: center;
43
+ margin-bottom: 30px;
44
+ padding: 20px;
45
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
46
+ border-radius: 15px;
47
+ color: white;
48
+ box-shadow: 0 4px 15px rgba(0,0,0,0.1);
49
+ }
50
+
51
+ .chat-header h2 {
52
+ margin: 0;
53
+ font-size: 1.8em;
54
+ font-weight: 600;
55
+ }
56
+
57
+ .chat-header p {
58
+ margin: 10px 0 0 0;
59
+ opacity: 0.9;
60
+ font-size: 1.1em;
61
+ }
62
+
63
+ /* Status Card Styling */
64
+ .status-card {
65
+ background: #ffffff;
66
+ border: 1px solid #e1e5e9;
67
+ border-radius: 12px;
68
+ padding: 20px;
69
+ margin-bottom: 25px;
70
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
71
+ transition: all 0.3s ease;
72
+ }
73
+
74
+ .status-card:hover {
75
+ box-shadow: 0 4px 20px rgba(0,0,0,0.1);
76
+ }
77
+
78
+ .status-header {
79
+ display: flex;
80
+ justify-content: space-between;
81
+ align-items: center;
82
+ margin-bottom: 20px;
83
+ padding-bottom: 15px;
84
+ border-bottom: 2px solid #f0f2f5;
85
+ }
86
+
87
+ .status-header h3 {
88
+ margin: 0;
89
+ color: #2c3e50;
90
+ font-size: 1.3em;
91
+ font-weight: 600;
92
+ }
93
+
94
+ .status-indicator {
95
+ padding: 8px 16px;
96
+ border-radius: 25px;
97
+ font-weight: 600;
98
+ font-size: 0.9em;
99
+ letter-spacing: 0.5px;
100
+ }
101
+
102
+ .status-ready {
103
+ background: #d4edda;
104
+ color: #155724;
105
+ border: 1px solid #c3e6cb;
106
+ }
107
+
108
+ .status-not-ready {
109
+ background: #f8d7da;
110
+ color: #721c24;
111
+ border: 1px solid #f5c6cb;
112
+ }
113
+
114
+ .status-grid {
115
+ display: grid;
116
+ grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
117
+ gap: 15px;
118
+ margin-bottom: 20px;
119
+ }
120
+
121
+ .status-item {
122
+ background: #f8f9fa;
123
+ padding: 15px;
124
+ border-radius: 8px;
125
+ text-align: center;
126
+ border: 1px solid #e9ecef;
127
+ }
128
+
129
+ .status-label {
130
+ font-size: 0.85em;
131
+ color: #6c757d;
132
+ margin-bottom: 5px;
133
+ font-weight: 500;
134
+ }
135
+
136
+ .status-value {
137
+ font-size: 1.4em;
138
+ font-weight: 700;
139
+ color: #495057;
140
+ }
141
+
142
+ .status-services {
143
+ display: flex;
144
+ gap: 15px;
145
+ flex-wrap: wrap;
146
+ }
147
+
148
+ .service-status {
149
+ display: flex;
150
+ align-items: center;
151
+ gap: 8px;
152
+ padding: 10px 15px;
153
+ border-radius: 8px;
154
+ font-weight: 500;
155
+ flex: 1;
156
+ min-width: 200px;
157
+ color: #2c3e50 !important;
158
+ }
159
+
160
+ .service-status span {
161
+ color: #2c3e50 !important;
162
+ }
163
+
164
+ .service-ready {
165
+ background: #d4edda;
166
+ color: #2c3e50 !important;
167
+ border: 1px solid #c3e6cb;
168
+ }
169
+
170
+ .service-ready span {
171
+ color: #2c3e50 !important;
172
+ }
173
+
174
+ .service-error {
175
+ background: #f8d7da;
176
+ color: #2c3e50 !important;
177
+ border: 1px solid #f5c6cb;
178
+ }
179
+
180
+ .service-error span {
181
+ color: #2c3e50 !important;
182
+ }
183
+
184
+ .service-icon {
185
+ font-size: 1.2em;
186
+ }
187
+
188
+ .service-indicator {
189
+ margin-left: auto;
190
+ }
191
+
192
+ .status-error {
193
+ border-color: #dc3545;
194
+ background: #f8d7da;
195
+ }
196
+
197
+ .error-message {
198
+ color: #721c24;
199
+ margin: 0;
200
+ font-weight: 500;
201
+ }
202
+
203
+ /* Control buttons styling */
204
+ .control-buttons {
205
+ display: flex;
206
+ gap: 12px;
207
+ justify-content: flex-end;
208
+ margin-bottom: 25px;
209
+ }
210
+
211
+ .control-btn {
212
+ padding: 10px 20px;
213
+ border-radius: 8px;
214
+ font-weight: 500;
215
+ transition: all 0.3s ease;
216
+ border: none;
217
+ cursor: pointer;
218
+ }
219
+
220
+ .btn-refresh {
221
+ background: #17a2b8;
222
+ color: white;
223
+ }
224
+
225
+ .btn-refresh:hover {
226
+ background: #138496;
227
+ transform: translateY(-1px);
228
+ }
229
+
230
+ .btn-new-session {
231
+ background: #28a745;
232
+ color: white;
233
+ }
234
+
235
+ .btn-new-session:hover {
236
+ background: #218838;
237
+ transform: translateY(-1px);
238
+ }
239
+
240
+ .btn-clear-data {
241
+ background: #dc3545;
242
+ color: white;
243
+ }
244
+
245
+ .btn-clear-data:hover {
246
+ background: #c82333;
247
+ transform: translateY(-1px);
248
+ }
249
+
250
+ .btn-primary {
251
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
252
+ color: white;
253
+ }
254
+
255
+ .btn-primary:hover {
256
+ transform: translateY(-1px);
257
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
258
+ }
259
+
260
+ /* Chat interface styling */
261
+ .chat-main-container {
262
+ background: #ffffff;
263
+ border-radius: 15px;
264
+ box-shadow: 0 4px 20px rgba(0,0,0,0.08);
265
+ overflow: hidden;
266
+ margin-bottom: 25px;
267
+ }
268
+
269
+ .chat-container {
270
+ background: #ffffff;
271
+ border-radius: 12px;
272
+ border: 1px solid #e1e5e9;
273
+ overflow: hidden;
274
+ }
275
+
276
+ /* Custom chatbot styling */
277
+ .gradio-chatbot {
278
+ border: none !important;
279
+ background: #ffffff;
280
+ }
281
+
282
+ .gradio-chatbot .message {
283
+ padding: 15px 20px;
284
+ margin: 10px;
285
+ border-radius: 12px;
286
+ }
287
+
288
+ .gradio-chatbot .message.user {
289
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
290
+ color: white;
291
+ margin-left: 50px;
292
+ }
293
+
294
+ .gradio-chatbot .message.assistant {
295
+ background: #f8f9fa;
296
+ border: 1px solid #e9ecef;
297
+ margin-right: 50px;
298
+ }
299
+
300
+ /* Input area styling */
301
+ .chat-input-container {
302
+ background: #ffffff;
303
+ padding: 20px;
304
+ border-top: 1px solid #e1e5e9;
305
+ border-radius: 0 0 15px 15px;
306
+ }
307
+
308
+ .input-row {
309
+ display: flex;
310
+ gap: 12px;
311
+ align-items: center;
312
+ }
313
+
314
+ .message-input {
315
+ flex: 1;
316
+ border: 2px solid #e1e5e9;
317
+ border-radius: 25px;
318
+ padding: 12px 20px;
319
+ font-size: 1em;
320
+ transition: all 0.3s ease;
321
+ resize: none;
322
+ max-height: 120px;
323
+ min-height: 48px;
324
+ }
325
+
326
+ .message-input:focus {
327
+ border-color: #667eea;
328
+ box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
329
+ outline: none;
330
+ }
331
+
332
+ .send-button {
333
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
334
+ color: white;
335
+ border: none;
336
+ border-radius: 12px;
337
+ padding: 12px 24px;
338
+ min-width: 80px;
339
+ height: 48px;
340
+ margin-right: 10px;
341
+ cursor: pointer;
342
+ transition: all 0.3s ease;
343
+ display: flex;
344
+ align-items: center;
345
+ justify-content: center;
346
+ font-size: 1em;
347
+ font-weight: 600;
348
+ letter-spacing: 0.5px;
349
+ }
350
+
351
+ .send-button:hover {
352
+ transform: scale(1.05);
353
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
354
+ }
355
+
356
+ /* Session info styling */
357
+ .session-info {
358
+ background: #e7f3ff;
359
+ border: 1px solid #b3d9ff;
360
+ border-radius: 8px;
361
+ padding: 15px;
362
+ color: #0056b3;
363
+ font-weight: 500;
364
+ text-align: center;
365
+ }
366
+
367
+ /* Responsive design */
368
+ @media (max-width: 768px) {
369
+ .chat-tab-container {
370
+ padding: 10px;
371
+ }
372
+
373
+ .status-grid {
374
+ grid-template-columns: repeat(2, 1fr);
375
+ }
376
+
377
+ .service-status {
378
+ min-width: 100%;
379
+ }
380
+
381
+ .control-buttons {
382
+ flex-direction: column;
383
+ gap: 8px;
384
+ }
385
+
386
+ .gradio-chatbot .message.user {
387
+ margin-left: 20px;
388
+ }
389
+
390
+ .gradio-chatbot .message.assistant {
391
+ margin-right: 20px;
392
+ }
393
+ }
394
+
395
+ /* Query Ranker Styles */
396
+ .ranker-container {
397
+ max-width: 1200px;
398
+ margin: 0 auto;
399
+ padding: 20px;
400
+ }
401
+
402
+ .ranker-placeholder {
403
+ text-align: center;
404
+ padding: 40px;
405
+ background: #f8f9fa;
406
+ border-radius: 12px;
407
+ border: 1px solid #e9ecef;
408
+ color: #6c757d;
409
+ }
410
+
411
+ .ranker-placeholder h3 {
412
+ color: #495057;
413
+ margin-bottom: 10px;
414
+ }
415
+
416
+ .ranker-error {
417
+ text-align: center;
418
+ padding: 30px;
419
+ background: #f8d7da;
420
+ border: 1px solid #f5c6cb;
421
+ border-radius: 12px;
422
+ color: #721c24;
423
+ }
424
+
425
+ .ranker-error h3 {
426
+ margin-bottom: 15px;
427
+ }
428
+
429
+ .error-hint {
430
+ font-style: italic;
431
+ margin-top: 10px;
432
+ opacity: 0.8;
433
+ }
434
+
435
+ .ranker-no-results {
436
+ text-align: center;
437
+ padding: 40px;
438
+ background: #ffffff;
439
+ border: 1px solid #e1e5e9;
440
+ border-radius: 12px;
441
+ color: #6c757d;
442
+ }
443
+
444
+ .ranker-no-results h3 {
445
+ color: #495057;
446
+ margin-bottom: 15px;
447
+ }
448
+
449
+ .no-results-hint {
450
+ font-style: italic;
451
+ margin-top: 10px;
452
+ opacity: 0.8;
453
+ }
454
+
455
+ .ranker-header {
456
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
457
+ color: white;
458
+ padding: 20px;
459
+ border-radius: 15px;
460
+ margin-bottom: 25px;
461
+ box-shadow: 0 4px 15px rgba(0,0,0,0.1);
462
+ }
463
+
464
+ .ranker-title h3 {
465
+ margin: 0 0 10px 0;
466
+ font-size: 1.4em;
467
+ font-weight: 600;
468
+ }
469
+
470
+ .query-display {
471
+ font-size: 1.1em;
472
+ opacity: 0.9;
473
+ font-style: italic;
474
+ margin-bottom: 15px;
475
+ }
476
+
477
+ .ranker-meta {
478
+ display: flex;
479
+ gap: 15px;
480
+ align-items: center;
481
+ flex-wrap: wrap;
482
+ }
483
+
484
+ .method-badge {
485
+ background: rgba(255, 255, 255, 0.2);
486
+ padding: 6px 12px;
487
+ border-radius: 20px;
488
+ font-weight: 500;
489
+ font-size: 0.9em;
490
+ }
491
+
492
+ .result-count {
493
+ background: rgba(255, 255, 255, 0.15);
494
+ padding: 6px 12px;
495
+ border-radius: 20px;
496
+ font-weight: 500;
497
+ font-size: 0.9em;
498
+ }
499
+
500
+ .result-card {
501
+ background: #ffffff;
502
+ border: 1px solid #e1e5e9;
503
+ border-radius: 12px;
504
+ margin-bottom: 20px;
505
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
506
+ transition: all 0.3s ease;
507
+ overflow: hidden;
508
+ }
509
+
510
+ .result-card:hover {
511
+ box-shadow: 0 4px 20px rgba(0,0,0,0.1);
512
+ transform: translateY(-2px);
513
+ }
514
+
515
+ .result-header {
516
+ display: flex;
517
+ justify-content: space-between;
518
+ align-items: center;
519
+ padding: 15px 20px;
520
+ background: #f8f9fa;
521
+ border-bottom: 1px solid #e9ecef;
522
+ }
523
+
524
+ .rank-info {
525
+ display: flex;
526
+ gap: 10px;
527
+ align-items: center;
528
+ flex-wrap: wrap;
529
+ }
530
+
531
+ .rank-badge {
532
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
533
+ color: white;
534
+ padding: 4px 10px;
535
+ border-radius: 15px;
536
+ font-weight: 600;
537
+ font-size: 0.85em;
538
+ }
539
+
540
+ .source-info {
541
+ background: #e9ecef;
542
+ color: #495057;
543
+ padding: 4px 8px;
544
+ border-radius: 10px;
545
+ font-size: 0.85em;
546
+ font-weight: 500;
547
+ }
548
+
549
+ .page-info {
550
+ background: #d1ecf1;
551
+ color: #0c5460;
552
+ padding: 4px 8px;
553
+ border-radius: 10px;
554
+ font-size: 0.85em;
555
+ }
556
+
557
+ .length-info {
558
+ background: #f8f9fa;
559
+ color: #6c757d;
560
+ padding: 4px 8px;
561
+ border-radius: 10px;
562
+ font-size: 0.85em;
563
+ }
564
+
565
+ .score-info {
566
+ display: flex;
567
+ gap: 10px;
568
+ align-items: center;
569
+ }
570
+
571
+ .confidence-badge {
572
+ padding: 4px 8px;
573
+ border-radius: 10px;
574
+ font-weight: 600;
575
+ font-size: 0.85em;
576
+ }
577
+
578
+ .score-value {
579
+ background: #2c3e50;
580
+ color: white;
581
+ padding: 6px 12px;
582
+ border-radius: 15px;
583
+ font-weight: 600;
584
+ font-size: 0.9em;
585
+ }
586
+
587
+ .result-content {
588
+ padding: 20px;
589
+ }
590
+
591
+ .content-text {
592
+ line-height: 1.6;
593
+ color: #2c3e50;
594
+ border-left: 3px solid #667eea;
595
+ padding-left: 15px;
596
+ background: #f8f9fa;
597
+ padding: 15px;
598
+ border-radius: 0 8px 8px 0;
599
+ max-height: 300px;
600
+ overflow-y: auto;
601
+ }
602
+
603
+ .result-actions {
604
+ display: flex;
605
+ gap: 10px;
606
+ padding: 15px 20px;
607
+ background: #f8f9fa;
608
+ border-top: 1px solid #e9ecef;
609
+ }
610
+
611
+ .action-btn {
612
+ padding: 8px 16px;
613
+ border: none;
614
+ border-radius: 8px;
615
+ font-weight: 500;
616
+ cursor: pointer;
617
+ transition: all 0.3s ease;
618
+ font-size: 0.9em;
619
+ display: flex;
620
+ align-items: center;
621
+ gap: 5px;
622
+ }
623
+
624
+ .copy-btn {
625
+ background: #17a2b8;
626
+ color: white;
627
+ }
628
+
629
+ .copy-btn:hover {
630
+ background: #138496;
631
+ transform: translateY(-1px);
632
+ }
633
+
634
+ .info-btn {
635
+ background: #6c757d;
636
+ color: white;
637
+ }
638
+
639
+ .info-btn:hover {
640
+ background: #5a6268;
641
+ transform: translateY(-1px);
642
+ }
643
+
644
+ .ranker-methods {
645
+ margin-top: 20px;
646
+ padding-top: 15px;
647
+ border-top: 1px solid #e9ecef;
648
+ }
649
+
650
+ .methods-label {
651
+ font-weight: 600;
652
+ color: #495057;
653
+ margin-bottom: 10px;
654
+ font-size: 0.9em;
655
+ }
656
+
657
+ .methods-list {
658
+ display: flex;
659
+ gap: 8px;
660
+ flex-wrap: wrap;
661
+ }
662
+
663
+ .method-tag {
664
+ background: #e9ecef;
665
+ color: #495057;
666
+ padding: 4px 10px;
667
+ border-radius: 12px;
668
+ font-size: 0.8em;
669
+ font-weight: 500;
670
+ }
671
+
672
+ /* Ranker controls styling */
673
+ .ranker-controls {
674
+ background: #ffffff;
675
+ border: 1px solid #e1e5e9;
676
+ border-radius: 12px;
677
+ padding: 20px;
678
+ margin-bottom: 25px;
679
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
680
+ }
681
+
682
+ .ranker-input-row {
683
+ display: flex;
684
+ gap: 15px;
685
+ align-items: end;
686
+ margin-bottom: 15px;
687
+ }
688
+
689
+ .ranker-query-input {
690
+ flex: 1;
691
+ border: 2px solid #e1e5e9;
692
+ border-radius: 25px;
693
+ padding: 12px 20px;
694
+ font-size: 1em;
695
+ transition: all 0.3s ease;
696
+ }
697
+
698
+ .ranker-query-input:focus {
699
+ border-color: #667eea;
700
+ box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
701
+ outline: none;
702
+ }
703
+
704
+ .ranker-search-btn {
705
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
706
+ color: white;
707
+ border: none;
708
+ border-radius: 12px;
709
+ padding: 12px 24px;
710
+ min-width: 100px;
711
+ cursor: pointer;
712
+ transition: all 0.3s ease;
713
+ font-weight: 600;
714
+ font-size: 1em;
715
+ }
716
+
717
+ .ranker-search-btn:hover {
718
+ transform: scale(1.05);
719
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
720
+ }
721
+
722
+ .ranker-options-row {
723
+ display: flex;
724
+ gap: 15px;
725
+ align-items: center;
726
+ }
727
+
728
+ /* Responsive design for ranker */
729
+ @media (max-width: 768px) {
730
+ .ranker-container {
731
+ padding: 10px;
732
+ }
733
+
734
+ .ranker-input-row {
735
+ flex-direction: column;
736
+ gap: 10px;
737
+ }
738
+
739
+ .ranker-options-row {
740
+ flex-direction: column;
741
+ gap: 10px;
742
+ align-items: stretch;
743
+ }
744
+
745
+ .ranker-meta {
746
+ justify-content: center;
747
+ }
748
+
749
+ .rank-info {
750
+ flex-direction: column;
751
+ gap: 5px;
752
+ align-items: flex-start;
753
+ }
754
+
755
+ .result-header {
756
+ flex-direction: column;
757
+ gap: 10px;
758
+ align-items: flex-start;
759
+ }
760
+
761
+ .score-info {
762
+ align-self: flex-end;
763
+ }
764
+
765
+ .result-actions {
766
+ flex-direction: column;
767
+ gap: 8px;
768
+ }
769
+ }
770
+ """
src/ui/ui.py CHANGED
@@ -1,24 +1,16 @@
 
 
1
  import gradio as gr
2
- import markdown
3
- import threading
4
- import time
5
  import logging
6
- from pathlib import Path
7
- from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
8
- from src.parsers.parser_registry import ParserRegistry
9
- from src.core.config import config
10
- from src.core.exceptions import (
11
- DocumentProcessingError,
12
- UnsupportedFileTypeError,
13
- FileSizeLimitError,
14
- ConfigurationError
15
- )
16
  from src.core.logging_config import get_logger
17
- from src.rag import rag_chat_service, document_ingestion_service
18
- from src.rag.vector_store import vector_store_manager
19
- from src.services.data_clearing_service import data_clearing_service
 
 
20
 
21
- # Use centralized logging
22
  logger = get_logger(__name__)
23
 
24
  # Import MarkItDown to check if it's available
@@ -30,1653 +22,14 @@ except ImportError:
30
  HAS_MARKITDOWN = False
31
  logger.warning("MarkItDown is not available")
32
 
33
- # Add a global variable to track cancellation state
34
- conversion_cancelled = threading.Event()
35
-
36
- # Pass the cancellation flag to the converter module
37
  set_cancellation_flag(conversion_cancelled)
38
 
39
- # Add a background thread to monitor cancellation
40
- def monitor_cancellation():
41
- """Background thread to monitor cancellation and update UI if needed"""
42
- logger.info("Starting cancellation monitor thread")
43
- while is_conversion_in_progress():
44
- if conversion_cancelled.is_set():
45
- logger.info("Cancellation detected by monitor thread")
46
- time.sleep(0.1) # Check every 100ms
47
- logger.info("Cancellation monitor thread ending")
48
-
49
- def update_ui_for_file_count(files):
50
- """Update UI components based on the number of files uploaded."""
51
- if not files or len(files) == 0:
52
- return (
53
- gr.update(visible=False), # processing_type_selector
54
- "<div style='color: #666; font-style: italic;'>Upload documents to begin</div>" # file_status_text
55
- )
56
-
57
- if len(files) == 1:
58
- file_name = files[0].name if hasattr(files[0], 'name') else str(files[0])
59
- return (
60
- gr.update(visible=False), # processing_type_selector (hidden for single file)
61
- f"<div style='color: #2563eb; font-weight: 500;'>📄 Single document: {file_name}</div>"
62
- )
63
- else:
64
- # Calculate total size for validation display
65
- total_size = 0
66
- try:
67
- for file in files:
68
- if hasattr(file, 'size'):
69
- total_size += file.size
70
- elif hasattr(file, 'name'):
71
- # For file paths, get size from filesystem
72
- total_size += Path(file.name).stat().st_size
73
- except:
74
- pass # Size calculation is optional for display
75
-
76
- size_display = f" ({total_size / (1024*1024):.1f}MB)" if total_size > 0 else ""
77
-
78
- # Check if within limits
79
- if len(files) > 5:
80
- status_color = "#dc2626" # red
81
- status_text = f"⚠️ Too many files: {len(files)}/5 (max 5 files allowed)"
82
- elif total_size > 20 * 1024 * 1024: # 20MB
83
- status_color = "#dc2626" # red
84
- status_text = f"⚠️ Files too large{size_display} (max 20MB combined)"
85
- else:
86
- status_color = "#059669" # green
87
- status_text = f"📂 Batch mode: {len(files)} files{size_display}"
88
-
89
- return (
90
- gr.update(visible=True), # processing_type_selector (visible for multiple files)
91
- f"<div style='color: {status_color}; font-weight: 500;'>{status_text}</div>"
92
- )
93
-
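The batch limits checked in the removed `update_ui_for_file_count` (at most 5 files, 20MB combined) can be isolated from the Gradio return values. The following is a minimal sketch, not the code that replaced it: `batch_status`, `MAX_FILES`, and `MAX_TOTAL_BYTES` are hypothetical names, and the function takes plain byte counts so it can be tested without uploaded files.

```python
# Hypothetical constants mirroring the hard-coded limits in the removed function.
MAX_FILES = 5
MAX_TOTAL_BYTES = 20 * 1024 * 1024  # 20MB combined


def batch_status(file_sizes):
    """Return (is_ok, message) for a list of file sizes in bytes."""
    count = len(file_sizes)
    total = sum(file_sizes)
    # Size is only shown when it could be determined, as in the original.
    size_display = f" ({total / (1024 * 1024):.1f}MB)" if total > 0 else ""
    if count > MAX_FILES:
        return False, f"Too many files: {count}/{MAX_FILES} (max {MAX_FILES} files allowed)"
    if total > MAX_TOTAL_BYTES:
        return False, f"Files too large{size_display} (max 20MB combined)"
    return True, f"Batch mode: {count} files{size_display}"


print(batch_status([1024 * 1024, 1024 * 1024]))
```

Keeping the check free of UI objects means the same function could back both the status text and a pre-conversion guard.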
94
- def validate_file_for_parser(file_path, parser_name):
95
- """Validate if the file type is supported by the selected parser."""
96
- if not file_path:
97
- return True, "" # No file selected yet
98
-
99
- try:
100
- file_path_obj = Path(file_path)
101
- file_ext = file_path_obj.suffix.lower()
102
-
103
- # Check file size
104
- if file_path_obj.exists():
105
- file_size = file_path_obj.stat().st_size
106
- if file_size > config.app.max_file_size:
107
- size_mb = file_size / (1024 * 1024)
108
- max_mb = config.app.max_file_size / (1024 * 1024)
109
- return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
110
-
111
- # Check file extension
112
- if file_ext not in config.app.allowed_extensions:
113
- return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
114
-
115
- # Parser-specific validation
116
- if "GOT-OCR" in parser_name:
117
- if file_ext not in ['.jpg', '.jpeg', '.png']:
118
- return False, "GOT-OCR only supports JPG and PNG formats."
119
-
120
- return True, ""
121
-
122
- except Exception as e:
123
- logger.error(f"Error validating file: {e}")
124
- return False, f"Error validating file: {e}"
125
-
126
- def format_markdown_content(content):
127
- if not content:
128
- return content
129
-
130
- # Convert the content to HTML using markdown library
131
- html_content = markdown.markdown(str(content), extensions=['tables'])
132
- return html_content
133
-
134
- def render_latex_to_html(latex_content):
135
- """Convert LaTeX content to HTML using Mathpix Markdown like GOT-OCR demo."""
136
- import json
137
-
138
- # Clean up the content similar to GOT-OCR demo
139
- content = latex_content.strip()
140
- if content.endswith("<|im_end|>"):
141
- content = content[:-len("<|im_end|>")]
142
-
143
- # Fix unbalanced delimiters exactly like GOT-OCR demo
144
- right_num = content.count("\\right")
145
- left_num = content.count("\\left")
146
-
147
- if right_num != left_num:
148
- content = (
149
- content.replace("\\left(", "(")
150
- .replace("\\right)", ")")
151
- .replace("\\left[", "[")
152
- .replace("\\right]", "]")
153
- .replace("\\left{", "{")
154
- .replace("\\right}", "}")
155
- .replace("\\left|", "|")
156
- .replace("\\right|", "|")
157
- .replace("\\left.", ".")
158
- .replace("\\right.", ".")
159
- )
160
-
161
- # Process content like GOT-OCR demo: remove $ signs and replace quotes
162
- content = content.replace('"', "``").replace("$", "")
163
-
164
- # Split into lines and create JavaScript string like GOT-OCR demo
165
- outputs_list = content.split("\n")
166
- js_text_parts = []
167
- for line in outputs_list:
168
- # Escape backslashes and add line break
169
- escaped_line = line.replace("\\", "\\\\")
170
- js_text_parts.append(f'"{escaped_line}\\n"')
171
-
172
- # Join with + like in GOT-OCR demo
173
- js_text = " + ".join(js_text_parts)
174
-
175
- # Create HTML using Mathpix Markdown like GOT-OCR demo
176
- html_content = f"""<!DOCTYPE html>
177
- <html lang="en" data-lt-installed="true">
178
- <head>
179
- <meta charset="UTF-8">
180
- <title>LaTeX Content</title>
181
- <script>
182
- const text = {js_text};
183
- </script>
184
- <style>
185
- #content {{
186
- max-width: 800px;
187
- margin: auto;
188
- padding: 20px;
189
- }}
190
- body {{
191
- font-family: 'Times New Roman', serif;
192
- line-height: 1.6;
193
- background-color: #ffffff;
194
- color: #333;
195
- }}
196
- table {{
197
- border-collapse: collapse;
198
- width: 100%;
199
- margin: 20px 0;
200
- }}
201
- td, th {{
202
- border: 1px solid #333;
203
- padding: 8px 12px;
204
- text-align: center;
205
- vertical-align: middle;
206
- }}
207
- </style>
208
- <script>
209
- let script = document.createElement('script');
210
- script.src = "https://cdn.jsdelivr.net/npm/mathpix-markdown-it@1.3.6/es5/bundle.js";
211
- document.head.append(script);
212
- script.onload = function() {{
213
- const isLoaded = window.loadMathJax();
214
- if (isLoaded) {{
215
- console.log('Styles loaded!')
216
- }}
217
- const el = window.document.getElementById('content-text');
218
- if (el) {{
219
- const options = {{
220
- htmlTags: true
221
- }};
222
- const html = window.render(text, options);
223
- el.outerHTML = html;
224
- }}
225
- }};
226
- </script>
227
- </head>
228
- <body>
229
- <div id="content">
230
- <div id="content-text"></div>
231
- </div>
232
- </body>
233
- </html>"""
234
-
235
- return html_content
236
-
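The removed `render_latex_to_html` imports `json` but builds the JavaScript string by hand, escaping only backslashes after replacing double quotes with backticks. One alternative, sketched here under the assumption that the text is embedded in an inline `<script>` element, is to let `json.dumps` produce the literal. `to_js_string_literal` is a hypothetical helper, not a function from the repository.

```python
import json


def to_js_string_literal(text):
    """Serialize text as a double-quoted JavaScript string literal."""
    # json.dumps escapes quotes, backslashes, and newlines; with the default
    # ensure_ascii=True it also escapes every non-ASCII character.
    literal = json.dumps(text)
    # Avoid closing an enclosing <script> element early.
    return literal.replace("</", "<\\/")


latex = 'x = \\left( \\frac{a}{b} \\right) "quoted"\n</script>'
print(f"const text = {to_js_string_literal(latex)};")
```

With this approach the quotes in the source would not need to be rewritten before embedding, since they are escaped rather than replaced.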
237
- def format_latex_content(content):
238
- """Format LaTeX content for display in UI using MathJax rendering like GOT-OCR demo."""
239
- if not content:
240
- return content
241
-
242
- try:
243
- # Generate rendered HTML
244
- rendered_html = render_latex_to_html(content)
245
-
246
- # Encode for iframe display (similar to GOT-OCR demo)
247
- import base64
248
- encoded_html = base64.b64encode(rendered_html.encode("utf-8")).decode("utf-8")
249
- iframe_src = f"data:text/html;base64,{encoded_html}"
250
-
251
- # Create the display with both rendered and raw views
252
- formatted_content = f"""
253
- <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
254
- <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
255
- 📄 LaTeX Content (Rendered with MathJax)
256
- </div>
257
- <div style="padding: 0;">
258
- <iframe src="{iframe_src}" width="100%" height="500px" style="border: none; border-radius: 0 0 8px 8px;"></iframe>
259
- </div>
260
- <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0; font-size: 12px; color: #6c757d; border-top: 1px solid #dee2e6;">
261
- 💡 LaTeX content rendered with MathJax. Tables and formulas are displayed as they would appear in a LaTeX document.
262
- </div>
263
- <details style="margin: 0; border-top: 1px solid #dee2e6;">
264
- <summary style="padding: 8px 15px; background-color: #e9ecef; cursor: pointer; font-size: 12px; color: #6c757d;">
265
- 📝 View Raw LaTeX Source
266
- </summary>
267
- <div style="padding: 15px; background-color: #f8f9fa;">
268
- <pre style="background-color: transparent; margin: 0; padding: 0;
269
- font-family: 'Courier New', monospace; font-size: 12px; line-height: 1.4;
270
- white-space: pre-wrap; word-wrap: break-word; color: #2c3e50; max-height: 200px; overflow-y: auto;">
271
- {content}
272
- </pre>
273
- </div>
274
- </details>
275
- </div>
276
- """
277
-
278
- except Exception as e:
279
- # Fallback to simple formatting if rendering fails
280
- import html
281
- escaped_content = html.escape(str(content))
282
- formatted_content = f"""
283
- <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
284
- <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
285
- 📄 LaTeX Content (Fallback View)
286
- </div>
287
- <div style="padding: 15px;">
288
- <pre style="background-color: transparent; margin: 0; padding: 0;
289
- font-family: 'Courier New', monospace; font-size: 14px; line-height: 1.4;
290
- white-space: pre-wrap; word-wrap: break-word; color: #2c3e50;">
291
- {escaped_content}
292
- </pre>
293
- </div>
294
- <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0 0 8px 8px; font-size: 12px; color: #6c757d;">
295
- ⚠️ Rendering failed, showing raw LaTeX. Error: {str(e)}
296
- </div>
297
- </div>
298
- """
299
-
300
- return formatted_content
301
-
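The removed `format_latex_content` embeds the rendered page in an iframe through a base64 `data:` URI, and its raw-source view appears to interpolate `{content}` without escaping (only the fallback path calls `html.escape`). A minimal sketch of both steps follows; `html_to_data_uri` and `raw_source_block` are hypothetical helper names introduced for illustration.

```python
import base64
import html


def html_to_data_uri(html_document):
    """Encode a complete HTML document as a data: URI for an iframe src."""
    encoded = base64.b64encode(html_document.encode("utf-8")).decode("ascii")
    return f"data:text/html;base64,{encoded}"


def raw_source_block(latex_source):
    """Wrap raw LaTeX in <pre>, escaping characters that HTML would interpret."""
    return f"<pre>{html.escape(latex_source)}</pre>"


uri = html_to_data_uri("<p>é</p>")
print(uri)
print(raw_source_block("a < b & c"))
```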
302
- # Function to run conversion in a separate thread
303
- def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format):
304
- """Run the conversion in a separate thread and return the thread object"""
305
- global conversion_cancelled
306
-
307
- # Reset the cancellation flag
308
- conversion_cancelled.clear()
309
-
310
- # Create a container for the results
311
- results = {"content": None, "download_file": None, "error": None}
312
-
313
- def conversion_worker():
314
- try:
315
- content, download_file = convert_file(file_path, parser_name, ocr_method_name, output_format)
316
- results["content"] = content
317
- results["download_file"] = download_file
318
- except Exception as e:
319
- logger.error(f"Error during conversion: {str(e)}")
320
- results["error"] = str(e)
321
-
322
- # Create and start the thread
323
- thread = threading.Thread(target=conversion_worker)
324
- thread.daemon = True
325
- thread.start()
326
-
327
- return thread, results
328
-
329
- def run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type):
330
- """Run the conversion in a separate thread for multiple files."""
331
- import threading
332
- from src.services.document_service import DocumentService
333
-
334
- # Results will be shared between threads
335
- results = {"content": None, "download_file": None, "error": None}
336
-
337
- def conversion_worker():
338
- try:
339
- logger.info(f"Starting multi-file conversion thread for {len(file_paths)} files")
340
-
341
- # Use the new document service unified method
342
- document_service = DocumentService()
343
- document_service.set_cancellation_flag(conversion_cancelled)
344
-
345
- # Call the unified convert_documents method
346
- content, output_file = document_service.convert_documents(
347
- file_paths=file_paths,
348
- parser_name=parser_name,
349
- ocr_method_name=ocr_method_name,
350
- output_format=output_format,
351
- processing_type=processing_type
352
- )
353
-
354
- logger.info(f"Multi-file conversion completed successfully for {len(file_paths)} files")
355
- results["content"] = content
356
- results["download_file"] = output_file
357
-
358
- except Exception as e:
359
- logger.error(f"Error during multi-file conversion: {str(e)}")
360
- results["error"] = str(e)
361
-
362
- # Create and start the thread
363
- thread = threading.Thread(target=conversion_worker)
364
- thread.daemon = True
365
- thread.start()
366
-
367
- return thread, results
368
-
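Both removed thread helpers share one pattern: a daemon worker writes into a shared `results` dict while a `threading.Event` signals cancellation. The sketch below shows that pattern in isolation, assuming the work can be split into items so the worker can check the flag between them. `run_with_cancellation` and `process` are hypothetical names; the real conversion calls are not reproduced.

```python
import threading


def run_with_cancellation(work_items, cancel_event, process):
    """Process items on a daemon thread, stopping early if cancel_event is set."""
    results = {"content": [], "error": None, "cancelled": False}

    def worker():
        try:
            for item in work_items:
                # Cooperative cancellation: check the flag between items.
                if cancel_event.is_set():
                    results["cancelled"] = True
                    return
                results["content"].append(process(item))
        except Exception as exc:
            results["error"] = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, results


cancel = threading.Event()
thread, results = run_with_cancellation([1, 2, 3], cancel, lambda x: x * 2)
thread.join()
print(results)
```

The caller reads `results` only after `join()` returns (or after it observes the cancellation flag), which is the same hand-off the removed `handle_convert` relies on.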
369
- def handle_convert(files, parser_name, ocr_method_name, output_format, processing_type, is_cancelled):
370
- """Handle file conversion for single or multiple files."""
371
- global conversion_cancelled
372
-
373
- # Check if we should cancel before starting
374
- if is_cancelled:
375
- logger.info("Conversion cancelled before starting")
376
- return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
377
-
378
- # Validate files input
379
- if not files or len(files) == 0:
380
- error_msg = "No files uploaded. Please upload at least one document."
381
- logger.error(error_msg)
382
- return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
383
-
384
- # Convert Gradio file objects to file paths
385
- file_paths = []
386
- for file in files:
387
- if hasattr(file, 'name'):
388
- file_paths.append(file.name)
389
- else:
390
- file_paths.append(str(file))
391
-
392
- # Validate file types for the selected parser
393
- for file_path in file_paths:
394
- is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
395
- if not is_valid:
396
- logger.error(f"File validation error: {error_msg}")
397
- return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
398
-
399
- logger.info(f"Starting conversion of {len(file_paths)} file(s) with cancellation flag cleared")
400
-
401
- # Start the conversion in a separate thread
402
- thread, results = run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type)
403
-
404
- # Start the monitoring thread
405
- monitor_thread = threading.Thread(target=monitor_cancellation)
406
- monitor_thread.daemon = True
407
- monitor_thread.start()
408
-
409
- # Wait for the thread to complete or be cancelled
410
- while thread.is_alive():
411
- # Check if cancellation was requested
412
- if conversion_cancelled.is_set():
413
- logger.info("Cancellation detected, waiting for thread to finish")
414
- # Give the thread a chance to clean up
415
- thread.join(timeout=0.5)
416
- if thread.is_alive():
417
- logger.warning("Thread did not finish within timeout")
418
- return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
419
-
420
- # Sleep briefly to avoid busy waiting
421
- time.sleep(0.1)
422
-
423
- # Thread has completed, check results
424
- if results["error"]:
425
- return f"Error: {results['error']}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
426
-
427
- content = results["content"]
428
- download_file = results["download_file"]
429
-
430
- # If conversion returned a cancellation message
431
- if content == "Conversion cancelled.":
432
- logger.info("Converter returned cancellation message")
433
- return content, None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
434
-
435
- # Format the content based on parser type
436
- if "GOT-OCR" in parser_name:
437
- # For GOT-OCR, display as LaTeX
438
- formatted_content = format_latex_content(str(content))
439
- html_output = f"<div class='output-container'>{formatted_content}</div>"
440
- else:
441
- # For other parsers, display as Markdown
442
- formatted_content = format_markdown_content(str(content))
443
- html_output = f"<div class='output-container'>{formatted_content}</div>"
444
-
445
- logger.info("Conversion completed successfully")
446
-
447
- # Auto-ingest the converted document for RAG
448
- try:
449
- # Read original file content for proper deduplication hashing
450
- original_file_content = None
451
- if file_path and Path(file_path).exists():
452
- try:
453
- with open(file_path, 'rb') as f:
454
- original_file_content = f.read().decode('utf-8', errors='ignore')
455
- except Exception as e:
456
- logger.warning(f"Could not read original file content: {e}")
457
-
458
- conversion_result = {
459
- "markdown_content": content,
460
- "original_filename": Path(file_path).name if file_path else "unknown",
461
- "conversion_method": parser_name,
462
- "file_size": Path(file_path).stat().st_size if file_path and Path(file_path).exists() else 0,
463
- "conversion_time": 0, # Could be tracked if needed
464
- "original_file_content": original_file_content
465
- }
466
-
467
- success, ingestion_msg, stats = document_ingestion_service.ingest_from_conversion_result(conversion_result)
468
- if success:
469
- logger.info(f"Document auto-ingested for RAG: {ingestion_msg}")
470
- else:
471
- logger.warning(f"Document ingestion failed: {ingestion_msg}")
472
- except Exception as e:
473
- logger.error(f"Error during auto-ingestion: {e}")
474
-
475
- return html_output, download_file, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
476
-
477
- def handle_chat_message(message, history):
478
- """Handle a new chat message with streaming response."""
479
- if not message or not message.strip():
480
- return "", history, gr.update()
481
-
482
- try:
483
- # Add user message to history
484
- history = history or []
485
- history.append({"role": "user", "content": message})
486
-
487
- # Add assistant message placeholder
488
- history.append({"role": "assistant", "content": ""})
489
-
490
- # Get response from RAG service
491
- response_text = ""
492
- for chunk in rag_chat_service.chat_stream(message):
493
- response_text += chunk
494
- # Update the last message in history with the current response
495
- history[-1]["content"] = response_text
496
- # Update status in real-time during streaming
497
- updated_status = get_chat_status()
498
- yield "", history, updated_status
499
-
500
- logger.info(f"Chat response completed for message: {message[:50]}...")
501
-
502
- # Final status update after message completion
503
- final_status = get_chat_status()
504
- yield "", history, final_status
505
-
506
- except Exception as e:
507
- error_msg = f"Error generating response: {str(e)}"
508
- logger.error(error_msg)
509
- if history and len(history) > 0:
510
- history[-1]["content"] = f"❌ {error_msg}"
511
- else:
512
- history = [
513
- {"role": "user", "content": message},
514
- {"role": "assistant", "content": f"❌ {error_msg}"}
515
- ]
516
- # Update status even on error
517
- error_status = get_chat_status()
518
- yield "", history, error_status
519
-
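The streaming loop in the removed `handle_chat_message` appends a user message and an empty assistant placeholder, then grows the placeholder as chunks arrive. A minimal sketch of that accumulation, with the RAG service replaced by any iterable of strings, might look like this. `stream_into_history` is a hypothetical name, and the status value yielded by the original is omitted.

```python
def stream_into_history(message, history, chunks):
    """Yield a fresh copy of the chat history after each streamed chunk."""
    history = [dict(entry) for entry in (history or [])]
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": ""})
    response_text = ""
    for chunk in chunks:
        response_text += chunk
        history[-1]["content"] = response_text
        # Copy so earlier snapshots are not mutated by later chunks.
        yield [dict(entry) for entry in history]


snapshots = list(stream_into_history("Hi", [], ["Hel", "lo"]))
print(snapshots[-1])
```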
520
- def start_new_chat_session():
521
- """Start a new chat session."""
522
- try:
523
- session_id = rag_chat_service.start_new_session()
524
- logger.info(f"Started new chat session: {session_id}")
525
- return [], f"✅ New chat session started: {session_id}"
526
- except Exception as e:
527
- error_msg = f"Error starting new session: {str(e)}"
528
- logger.error(error_msg)
529
- return [], f"❌ {error_msg}"
530
-
531
- def handle_clear_all_data():
532
- """Handle clearing all RAG data (vector store + chat history)."""
533
- try:
534
- # Clear all data using the data clearing service
535
- success, message, stats = data_clearing_service.clear_all_data()
536
-
537
- if success:
538
- # Reset chat session after clearing data
539
- session_id = rag_chat_service.start_new_session()
540
-
541
- # Get updated status
542
- updated_status = get_chat_status()
543
-
544
- # Create success message with stats
545
- if stats.get("total_cleared_documents", 0) > 0 or stats.get("total_cleared_files", 0) > 0:
546
- clear_msg = f"✅ {message}"
547
- session_msg = f"🆕 Started new session: {session_id}"
548
- combined_msg = f'{clear_msg}<br/><div class="session-info">{session_msg}</div>'
549
- else:
550
- combined_msg = f'ℹ️ {message}<br/><div class="session-info">🆕 Started new session: {session_id}</div>'
551
-
552
- logger.info(f"Data cleared successfully: {message}")
553
-
554
- return [], combined_msg, updated_status
555
- else:
556
- error_msg = f"❌ {message}"
557
- logger.error(f"Data clearing failed: {message}")
558
-
559
- # Still get updated status even on error
560
- updated_status = get_chat_status()
561
-
562
- return None, f'<div class="session-info">{error_msg}</div>', updated_status
563
-
564
- except Exception as e:
565
- error_msg = f"Error clearing data: {str(e)}"
566
- logger.error(error_msg)
567
-
568
- # Get current status
569
- current_status = get_chat_status()
570
-
571
- return None, f'<div class="session-info">❌ {error_msg}</div>', current_status
572
-
573
- def handle_query_search(query, method, k_value):
574
- """Handle query search and return formatted results."""
575
- if not query or not query.strip():
576
- return """
577
- <div class="ranker-container">
578
- <div class="ranker-placeholder">
579
- <h3>🔍 Query Ranker</h3>
580
- <p>Enter a search query to find relevant document chunks with similarity scores.</p>
581
- </div>
582
- </div>
583
- """
584
-
585
- try:
586
- logger.info(f"Query search: '{query[:50]}...' using method: {method}")
587
-
588
- # Get results based on method
589
- results = []
590
- if method == "similarity":
591
- retriever = vector_store_manager.get_retriever("similarity", {"k": k_value})
592
- docs = retriever.invoke(query)
593
- # Try to get actual similarity scores
594
- try:
595
- vector_store = vector_store_manager.get_vector_store()
596
- if hasattr(vector_store, 'similarity_search_with_score'):
597
- docs_with_scores = vector_store.similarity_search_with_score(query, k=k_value)
598
- for i, (doc, score) in enumerate(docs_with_scores):
599
- similarity_score = max(0, 1 - score) if score is not None else 0.8
600
- results.append(_format_ranker_result(doc, similarity_score, i + 1))
601
- else:
602
- # Fallback without scores
603
- for i, doc in enumerate(docs):
604
- score = 0.85 - (i * 0.05)
605
- results.append(_format_ranker_result(doc, score, i + 1))
606
- except Exception as e:
607
- logger.warning(f"Could not get similarity scores: {e}")
608
- for i, doc in enumerate(docs):
609
- score = 0.85 - (i * 0.05)
610
- results.append(_format_ranker_result(doc, score, i + 1))
611
-
612
- elif method == "mmr":
613
- retriever = vector_store_manager.get_retriever("mmr", {"k": k_value, "fetch_k": k_value * 2, "lambda_mult": 0.5})
614
- docs = retriever.invoke(query)
615
- for i, doc in enumerate(docs):
616
- results.append(_format_ranker_result(doc, None, i + 1)) # No score for MMR
617
-
618
- elif method == "bm25":
619
- retriever = vector_store_manager.get_bm25_retriever(k=k_value)
620
- docs = retriever.invoke(query)
621
- for i, doc in enumerate(docs):
622
- results.append(_format_ranker_result(doc, None, i + 1)) # No score for BM25
623
-
624
- elif method == "hybrid":
625
- retriever = vector_store_manager.get_hybrid_retriever(k=k_value, semantic_weight=0.7, keyword_weight=0.3)
626
- docs = retriever.invoke(query)
627
- # Explicitly limit results to k_value since EnsembleRetriever may return more
628
- docs = docs[:k_value]
629
- for i, doc in enumerate(docs):
630
- results.append(_format_ranker_result(doc, None, i + 1)) # No score for Hybrid
631
-
632
- return _format_ranker_results_html(results, query, method)
633
-
634
- except Exception as e:
635
- error_msg = f"Error during search: {str(e)}"
636
- logger.error(error_msg)
637
- return f"""
638
- <div class="ranker-container">
639
- <div class="ranker-error">
640
- <h3>❌ Search Error</h3>
641
- <p>{error_msg}</p>
642
- <p class="error-hint">Please check if documents are uploaded and the system is ready.</p>
643
- </div>
644
- </div>
645
- """
646
-
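In the similarity branch of the removed `handle_query_search`, the value returned by `similarity_search_with_score` is treated as a distance and converted with `max(0, 1 - score)`, while the fallback assigns `0.85 - (i * 0.05)` by position. The sketch below restates both conversions with clamping on each side. It assumes smaller scores mean closer matches, as the original code does; `distance_to_similarity` and `fallback_score` are hypothetical names.

```python
def distance_to_similarity(distance, default=0.8):
    """Map a distance (smaller is closer) to a similarity clamped to [0, 1]."""
    if distance is None:
        return default
    return min(1.0, max(0.0, 1.0 - distance))


def fallback_score(position, start=0.85, step=0.05):
    """Placeholder score by result position, never dropping below zero."""
    return max(0.0, start - position * step)


print(distance_to_similarity(0.25), fallback_score(0))
```

The positional fallback is only a display placeholder; it carries no information about actual relevance.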
647
- def _format_ranker_result(doc, score, rank):
648
- """Format a single document result for the ranker."""
649
- metadata = doc.metadata or {}
650
-
651
- # Extract metadata
652
- source = metadata.get("source", "Unknown Document")
653
- page = metadata.get("page", "N/A")
654
- chunk_id = metadata.get("chunk_id", f"chunk_{rank}")
655
-
656
- # Content length indicator
657
- content_length = len(doc.page_content)
658
- if content_length < 200:
659
- length_indicator = "📄 Short"
660
- elif content_length < 500:
661
- length_indicator = "📄 Medium"
662
- else:
663
- length_indicator = "📄 Long"
664
-
665
- # Rank-based confidence levels (applies to all methods)
666
- if rank <= 3:
667
- confidence = "High"
668
- confidence_color = "#22c55e"
669
- confidence_icon = "🟢"
670
- elif rank <= 6:
671
- confidence = "Medium"
672
- confidence_color = "#f59e0b"
673
- confidence_icon = "🟡"
674
- else:
675
- confidence = "Low"
676
- confidence_color = "#ef4444"
677
- confidence_icon = "🔴"
678
-
679
- result = {
680
- "rank": rank,
681
- "content": doc.page_content,
682
- "source": source,
683
- "page": page,
684
- "chunk_id": chunk_id,
685
- "length_indicator": length_indicator,
686
- "has_score": score is not None,
687
- "confidence": confidence,
688
- "confidence_color": confidence_color,
689
- "confidence_icon": confidence_icon
690
- }
691
-
692
- # Only add score if we have a real score (similarity search only)
693
- if score is not None:
694
- result["score"] = round(score, 3)
695
-
696
- return result
697
-
698
- def _format_ranker_results_html(results, query, method):
699
- """Format search results as HTML."""
700
- if not results:
701
- return """
702
- <div class="ranker-container">
703
- <div class="ranker-no-results">
704
- <h3>🔍 No Results Found</h3>
705
- <p>No relevant documents found for your query.</p>
706
- <p class="no-results-hint">Try different keywords or check if documents are uploaded.</p>
707
- </div>
708
- </div>
709
- """
710
-
711
- # Method display names
712
- method_labels = {
713
- "similarity": "🎯 Similarity Search",
714
- "mmr": "🔀 MMR (Diverse)",
715
- "bm25": "🔍 BM25 (Keywords)",
716
- "hybrid": "🔗 Hybrid (Recommended)"
717
- }
718
- method_display = method_labels.get(method, method)
719
-
720
- # Start building HTML
721
- html_parts = [f"""
722
- <div class="ranker-container">
723
- <div class="ranker-header">
724
- <div class="ranker-title">
725
- <h3>🔍 Search Results</h3>
726
- <div class="query-display">"{query}"</div>
727
- </div>
728
- <div class="ranker-meta">
729
- <span class="method-badge">{method_display}</span>
730
- <span class="result-count">{len(results)} results</span>
731
- </div>
732
- </div>
733
- """]
734
-
735
- # Add results
736
- for result in results:
737
- rank_emoji = ["🥇", "🥈", "🥉"][result["rank"] - 1] if result["rank"] <= 3 else f"#{result['rank']}"
738
-
739
- # Escape content for safe HTML inclusion and JavaScript
740
- escaped_content = result['content'].replace('"', '&quot;').replace("'", "&#39;").replace('\n', '\\n')
741
-
742
- # Build score info - always show confidence, only show score for similarity search
743
- score_info_parts = [f"""
744
- <span class="confidence-badge" style="color: {result['confidence_color']}">
745
- {result['confidence_icon']} {result['confidence']}
746
- </span>"""]
747
-
748
- # Only add score value if we have real scores (similarity search)
749
- if result.get('has_score', False):
750
- score_info_parts.append(f'<span class="score-value">🎯 {result["score"]}</span>')
751
-
752
- score_info_html = f"""
753
- <div class="score-info">
754
- {''.join(score_info_parts)}
755
- </div>"""
756
-
757
- html_parts.append(f"""
758
- <div class="result-card">
759
- <div class="result-header">
760
- <div class="rank-info">
761
- <span class="rank-badge">{rank_emoji} Rank {result['rank']}</span>
762
- <span class="source-info">📄 {result['source']}</span>
763
- {f"<span class='page-info'>Page {result['page']}</span>" if result['page'] != 'N/A' else ""}
764
- <span class="length-info">{result['length_indicator']}</span>
765
- </div>
766
- {score_info_html}
767
- </div>
768
- <div class="result-content">
769
- <div class="content-text">{result['content']}</div>
770
- </div>
771
- </div>
772
- """)
773
-
774
- html_parts.append("</div>")
775
-
776
- return "".join(html_parts)
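Review note: in the removed `_format_ranker_results_html`, `escaped_content` is computed but never used, and the card body interpolates `result['content']` directly. A minimal sketch of one way the replacement formatter could escape chunk text, assuming the standard-library `html.escape` is acceptable here; `render_result_content` is a hypothetical helper name, not something in this commit:

```python
import html


def render_result_content(content: str) -> str:
    """Return a result-card body with the chunk text HTML-escaped.

    Hypothetical helper for illustration only: it shows escaping
    document-derived text before interpolating it into markup.
    """
    # html.escape converts &, < and >; quote=True also converts " and '.
    safe = html.escape(content, quote=True)
    return f'<div class="content-text">{safe}</div>'
```

Escaping at the point of interpolation keeps the stored chunk text unchanged, so other consumers of the result dictionary are unaffected.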
777
-
778
- def get_ranker_status():
779
- """Get current ranker system status."""
780
- try:
781
- # Get collection info
782
- collection_info = vector_store_manager.get_collection_info()
783
- document_count = collection_info.get("document_count", 0)
784
-
785
- # Get available methods
786
- available_methods = ["similarity", "mmr", "bm25", "hybrid"]
787
-
788
- # Check if system is ready
789
- ingestion_status = document_ingestion_service.get_ingestion_status()
790
- system_ready = ingestion_status.get('system_ready', False)
791
-
792
- status_html = f"""
793
- <div class="status-card">
794
- <div class="status-header">
795
- <h3>🔍 Query Ranker Status</h3>
796
- <div class="status-indicator {'status-ready' if system_ready else 'status-not-ready'}">
797
- {'🟢 READY' if system_ready else '🔴 NOT READY'}
798
- </div>
799
- </div>
800
-
801
- <div class="status-grid">
802
- <div class="status-item">
803
- <div class="status-label">Available Documents</div>
804
- <div class="status-value">{document_count}</div>
805
- </div>
806
- <div class="status-item">
807
- <div class="status-label">Retrieval Methods</div>
808
- <div class="status-value">{len(available_methods)}</div>
809
- </div>
810
- <div class="status-item">
811
- <div class="status-label">Vector Store</div>
812
- <div class="status-value">{'Ready' if system_ready else 'Not Ready'}</div>
813
- </div>
814
- </div>
815
-
816
- <div class="ranker-methods">
817
- <div class="methods-label">Available Methods:</div>
818
- <div class="methods-list">
819
- <span class="method-tag">🎯 Similarity</span>
820
- <span class="method-tag">🔀 MMR</span>
821
- <span class="method-tag">🔍 BM25</span>
822
- <span class="method-tag">🔗 Hybrid</span>
823
- </div>
824
- </div>
825
- </div>
826
- """
827
-
828
- return status_html
829
-
830
- except Exception as e:
831
- error_msg = f"Error getting ranker status: {str(e)}"
832
- logger.error(error_msg)
833
- return f"""
834
- <div class="status-card status-error">
835
- <div class="status-header">
836
- <h3>❌ System Error</h3>
837
- </div>
838
- <p class="error-message">{error_msg}</p>
839
- </div>
840
- """
841
-
842
- def get_chat_status():
843
- """Get current chat system status."""
844
- try:
845
- # Check ingestion status
846
- ingestion_status = document_ingestion_service.get_ingestion_status()
847
-
848
- # Check usage stats
849
- usage_stats = rag_chat_service.get_usage_stats()
850
-
851
- # Get data status for additional context
852
- data_status = data_clearing_service.get_data_status()
853
-
854
- # Modern status card design with better styling
855
- status_html = f"""
856
- <div class="status-card">
857
- <div class="status-header">
858
- <h3>💬 Chat System Status</h3>
859
- <div class="status-indicator {'status-ready' if ingestion_status.get('system_ready', False) else 'status-not-ready'}">
860
- {'🟢 READY' if ingestion_status.get('system_ready', False) else '🔴 NOT READY'}
861
- </div>
862
- </div>
863
-
864
- <div class="status-grid">
865
- <div class="status-item">
866
- <div class="status-label">Vector Store Docs</div>
867
- <div class="status-value">{data_status.get('vector_store', {}).get('document_count', 0)}</div>
868
- </div>
869
- <div class="status-item">
870
- <div class="status-label">Chat History Files</div>
871
- <div class="status-value">{data_status.get('chat_history', {}).get('file_count', 0)}</div>
872
- </div>
873
- <div class="status-item">
874
- <div class="status-label">Session Usage</div>
875
- <div class="status-value">{usage_stats.get('session_messages', 0)}/{usage_stats.get('session_limit', 50)}</div>
876
- </div>
877
- <div class="status-item">
878
- <div class="status-label">Environment</div>
879
- <div class="status-value">{'HF Space' if data_status.get('environment') == 'hf_space' else 'Local'}</div>
880
- </div>
881
- </div>
882
-
883
- <div class="status-services">
884
- <div class="service-status {'service-ready' if ingestion_status.get('embedding_model_available', False) else 'service-error'}">
885
- <span class="service-icon">🧠</span>
886
- <span>Embedding Model</span>
887
- <span class="service-indicator">{'✅' if ingestion_status.get('embedding_model_available', False) else '❌'}</span>
888
- </div>
889
- <div class="service-status {'service-ready' if ingestion_status.get('vector_store_available', False) else 'service-error'}">
890
- <span class="service-icon">🗄️</span>
891
- <span>Vector Store</span>
892
- <span class="service-indicator">{'✅' if ingestion_status.get('vector_store_available', False) else '❌'}</span>
893
- </div>
894
- </div>
895
- </div>
896
- """
897
-
898
- return status_html
899
-
900
- except Exception as e:
901
- error_msg = f"Error getting chat status: {str(e)}"
902
- logger.error(error_msg)
903
- return f"""
904
- <div class="status-card status-error">
905
- <div class="status-header">
906
- <h3>❌ System Error</h3>
907
- </div>
908
- <p class="error-message">{error_msg}</p>
909
- </div>
910
- """
911
 
912
  def create_ui():
913
- with gr.Blocks(css="""
914
- /* Global styles */
915
- .gradio-container {
916
- font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
917
- }
918
-
919
- /* Document converter styles */
920
- .output-container {
921
- max-height: 420px;
922
- overflow-y: auto;
923
- border: 1px solid #ddd;
924
- padding: 10px;
925
- }
926
-
927
- .gradio-container .prose {
928
- overflow: visible;
929
- }
930
-
931
- .processing-controls {
932
- display: flex;
933
- justify-content: center;
934
- gap: 10px;
935
- margin-top: 10px;
936
- }
937
-
938
- .provider-options-row {
939
- margin-top: 15px;
940
- margin-bottom: 15px;
941
- }
942
-
943
- /* Chat Tab Styles - Complete redesign */
944
- .chat-tab-container {
945
- max-width: 1200px;
946
- margin: 0 auto;
947
- padding: 20px;
948
- }
949
-
950
- .chat-header {
951
- text-align: center;
952
- margin-bottom: 30px;
953
- padding: 20px;
954
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
955
- border-radius: 15px;
956
- color: white;
957
- box-shadow: 0 4px 15px rgba(0,0,0,0.1);
958
- }
959
-
960
- .chat-header h2 {
961
- margin: 0;
962
- font-size: 1.8em;
963
- font-weight: 600;
964
- }
965
-
966
- .chat-header p {
967
- margin: 10px 0 0 0;
968
- opacity: 0.9;
969
- font-size: 1.1em;
970
- }
971
-
972
- /* Status Card Styling */
973
- .status-card {
974
- background: #ffffff;
975
- border: 1px solid #e1e5e9;
976
- border-radius: 12px;
977
- padding: 20px;
978
- margin-bottom: 25px;
979
- box-shadow: 0 2px 10px rgba(0,0,0,0.05);
980
- transition: all 0.3s ease;
981
- }
982
-
983
- .status-card:hover {
984
- box-shadow: 0 4px 20px rgba(0,0,0,0.1);
985
- }
986
-
987
- .status-header {
988
- display: flex;
989
- justify-content: space-between;
990
- align-items: center;
991
- margin-bottom: 20px;
992
- padding-bottom: 15px;
993
- border-bottom: 2px solid #f0f2f5;
994
- }
995
-
996
- .status-header h3 {
997
- margin: 0;
998
- color: #2c3e50;
999
- font-size: 1.3em;
1000
- font-weight: 600;
1001
- }
1002
-
1003
- .status-indicator {
1004
- padding: 8px 16px;
1005
- border-radius: 25px;
1006
- font-weight: 600;
1007
- font-size: 0.9em;
1008
- letter-spacing: 0.5px;
1009
- }
1010
-
1011
- .status-ready {
1012
- background: #d4edda;
1013
- color: #155724;
1014
- border: 1px solid #c3e6cb;
1015
- }
1016
-
1017
- .status-not-ready {
1018
- background: #f8d7da;
1019
- color: #721c24;
1020
- border: 1px solid #f5c6cb;
1021
- }
1022
-
1023
- .status-grid {
1024
- display: grid;
1025
- grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
1026
- gap: 15px;
1027
- margin-bottom: 20px;
1028
- }
1029
-
1030
- .status-item {
1031
- background: #f8f9fa;
1032
- padding: 15px;
1033
- border-radius: 8px;
1034
- text-align: center;
1035
- border: 1px solid #e9ecef;
1036
- }
1037
-
1038
- .status-label {
1039
- font-size: 0.85em;
1040
- color: #6c757d;
1041
- margin-bottom: 5px;
1042
- font-weight: 500;
1043
- }
1044
-
1045
- .status-value {
1046
- font-size: 1.4em;
1047
- font-weight: 700;
1048
- color: #495057;
1049
- }
1050
-
1051
- .status-services {
1052
- display: flex;
1053
- gap: 15px;
1054
- flex-wrap: wrap;
1055
- }
1056
-
1057
- .service-status {
1058
- display: flex;
1059
- align-items: center;
1060
- gap: 8px;
1061
- padding: 10px 15px;
1062
- border-radius: 8px;
1063
- font-weight: 500;
1064
- flex: 1;
1065
- min-width: 200px;
1066
- color: #2c3e50 !important;
1067
- }
1068
-
1069
- .service-status span {
1070
- color: #2c3e50 !important;
1071
- }
1072
-
1073
- .service-ready {
1074
- background: #d4edda;
1075
- color: #2c3e50 !important;
1076
- border: 1px solid #c3e6cb;
1077
- }
1078
-
1079
- .service-ready span {
1080
- color: #2c3e50 !important;
1081
- }
1082
-
1083
- .service-error {
1084
- background: #f8d7da;
1085
- color: #2c3e50 !important;
1086
- border: 1px solid #f5c6cb;
1087
- }
1088
-
1089
- .service-error span {
1090
- color: #2c3e50 !important;
1091
- }
1092
-
1093
- .service-icon {
1094
- font-size: 1.2em;
1095
- }
1096
-
1097
- .service-indicator {
1098
- margin-left: auto;
1099
- }
1100
-
1101
- .status-error {
1102
- border-color: #dc3545;
1103
- background: #f8d7da;
1104
- }
1105
-
1106
- .error-message {
1107
- color: #721c24;
1108
- margin: 0;
1109
- font-weight: 500;
1110
- }
1111
-
1112
- /* Control buttons styling */
1113
- .control-buttons {
1114
- display: flex;
1115
- gap: 12px;
1116
- justify-content: flex-end;
1117
- margin-bottom: 25px;
1118
- }
1119
-
1120
- .control-btn {
1121
- padding: 10px 20px;
1122
- border-radius: 8px;
1123
- font-weight: 500;
1124
- transition: all 0.3s ease;
1125
- border: none;
1126
- cursor: pointer;
1127
- }
1128
-
1129
- .btn-refresh {
1130
- background: #17a2b8;
1131
- color: white;
1132
- }
1133
-
1134
- .btn-refresh:hover {
1135
- background: #138496;
1136
- transform: translateY(-1px);
1137
- }
1138
-
1139
- .btn-new-session {
1140
- background: #28a745;
1141
- color: white;
1142
- }
1143
-
1144
- .btn-new-session:hover {
1145
- background: #218838;
1146
- transform: translateY(-1px);
1147
- }
1148
-
1149
- .btn-clear-data {
1150
- background: #dc3545;
1151
- color: white;
1152
- }
1153
-
1154
- .btn-clear-data:hover {
1155
- background: #c82333;
1156
- transform: translateY(-1px);
1157
- }
1158
-
1159
- .btn-primary {
1160
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1161
- color: white;
1162
- }
1163
-
1164
- .btn-primary:hover {
1165
- transform: translateY(-1px);
1166
- box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1167
- }
1168
-
1169
- /* Chat interface styling */
1170
- .chat-main-container {
1171
- background: #ffffff;
1172
- border-radius: 15px;
1173
- box-shadow: 0 4px 20px rgba(0,0,0,0.08);
1174
- overflow: hidden;
1175
- margin-bottom: 25px;
1176
- }
1177
-
1178
- .chat-container {
1179
- background: #ffffff;
1180
- border-radius: 12px;
1181
- border: 1px solid #e1e5e9;
1182
- overflow: hidden;
1183
- }
1184
-
1185
- /* Custom chatbot styling */
1186
- .gradio-chatbot {
1187
- border: none !important;
1188
- background: #ffffff;
1189
- }
1190
-
1191
- .gradio-chatbot .message {
1192
- padding: 15px 20px;
1193
- margin: 10px;
1194
- border-radius: 12px;
1195
- }
1196
-
1197
- .gradio-chatbot .message.user {
1198
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1199
- color: white;
1200
- margin-left: 50px;
1201
- }
1202
-
1203
- .gradio-chatbot .message.assistant {
1204
- background: #f8f9fa;
1205
- border: 1px solid #e9ecef;
1206
- margin-right: 50px;
1207
- }
1208
-
1209
- /* Input area styling */
1210
- .chat-input-container {
1211
- background: #ffffff;
1212
- padding: 20px;
1213
- border-top: 1px solid #e1e5e9;
1214
- border-radius: 0 0 15px 15px;
1215
- }
1216
-
1217
- .input-row {
1218
- display: flex;
1219
- gap: 12px;
1220
- align-items: center;
1221
- }
1222
-
1223
- .message-input {
1224
- flex: 1;
1225
- border: 2px solid #e1e5e9;
1226
- border-radius: 25px;
1227
- padding: 12px 20px;
1228
- font-size: 1em;
1229
- transition: all 0.3s ease;
1230
- resize: none;
1231
- max-height: 120px;
1232
- min-height: 48px;
1233
- }
1234
-
1235
- .message-input:focus {
1236
- border-color: #667eea;
1237
- box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
1238
- outline: none;
1239
- }
1240
-
1241
- .send-button {
1242
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1243
- color: white;
1244
- border: none;
1245
- border-radius: 12px;
1246
- padding: 12px 24px;
1247
- min-width: 80px;
1248
- height: 48px;
1249
- margin-right: 10px;
1250
- cursor: pointer;
1251
- transition: all 0.3s ease;
1252
- display: flex;
1253
- align-items: center;
1254
- justify-content: center;
1255
- font-size: 1em;
1256
- font-weight: 600;
1257
- letter-spacing: 0.5px;
1258
- }
1259
-
1260
- .send-button:hover {
1261
- transform: scale(1.05);
1262
- box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1263
- }
1264
-
1265
- /* Session info styling */
1266
- .session-info {
1267
- background: #e7f3ff;
1268
- border: 1px solid #b3d9ff;
1269
- border-radius: 8px;
1270
- padding: 15px;
1271
- color: #0056b3;
1272
- font-weight: 500;
1273
- text-align: center;
1274
- }
1275
-
1276
- /* Responsive design */
1277
- @media (max-width: 768px) {
1278
- .chat-tab-container {
1279
- padding: 10px;
1280
- }
1281
-
1282
- .status-grid {
1283
- grid-template-columns: repeat(2, 1fr);
1284
- }
1285
-
1286
- .service-status {
1287
- min-width: 100%;
1288
- }
1289
-
1290
- .control-buttons {
1291
- flex-direction: column;
1292
- gap: 8px;
1293
- }
1294
-
1295
- .gradio-chatbot .message.user {
1296
- margin-left: 20px;
1297
- }
1298
-
1299
- .gradio-chatbot .message.assistant {
1300
- margin-right: 20px;
1301
- }
1302
- }
1303
-
1304
- /* Query Ranker Styles */
1305
- .ranker-container {
1306
- max-width: 1200px;
1307
- margin: 0 auto;
1308
- padding: 20px;
1309
- }
1310
-
1311
- .ranker-placeholder {
1312
- text-align: center;
1313
- padding: 40px;
1314
- background: #f8f9fa;
1315
- border-radius: 12px;
1316
- border: 1px solid #e9ecef;
1317
- color: #6c757d;
1318
- }
1319
-
1320
- .ranker-placeholder h3 {
1321
- color: #495057;
1322
- margin-bottom: 10px;
1323
- }
1324
-
1325
- .ranker-error {
1326
- text-align: center;
1327
- padding: 30px;
1328
- background: #f8d7da;
1329
- border: 1px solid #f5c6cb;
1330
- border-radius: 12px;
1331
- color: #721c24;
1332
- }
1333
-
1334
- .ranker-error h3 {
1335
- margin-bottom: 15px;
1336
- }
1337
-
1338
- .error-hint {
1339
- font-style: italic;
1340
- margin-top: 10px;
1341
- opacity: 0.8;
1342
- }
1343
-
1344
- .ranker-no-results {
1345
- text-align: center;
1346
- padding: 40px;
1347
- background: #ffffff;
1348
- border: 1px solid #e1e5e9;
1349
- border-radius: 12px;
1350
- color: #6c757d;
1351
- }
1352
-
1353
- .ranker-no-results h3 {
1354
- color: #495057;
1355
- margin-bottom: 15px;
1356
- }
1357
-
1358
- .no-results-hint {
1359
- font-style: italic;
1360
- margin-top: 10px;
1361
- opacity: 0.8;
1362
- }
1363
-
1364
- .ranker-header {
1365
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1366
- color: white;
1367
- padding: 20px;
1368
- border-radius: 15px;
1369
- margin-bottom: 25px;
1370
- box-shadow: 0 4px 15px rgba(0,0,0,0.1);
1371
- }
1372
-
1373
- .ranker-title h3 {
1374
- margin: 0 0 10px 0;
1375
- font-size: 1.4em;
1376
- font-weight: 600;
1377
- }
1378
-
1379
- .query-display {
1380
- font-size: 1.1em;
1381
- opacity: 0.9;
1382
- font-style: italic;
1383
- margin-bottom: 15px;
1384
- }
1385
-
1386
- .ranker-meta {
1387
- display: flex;
1388
- gap: 15px;
1389
- align-items: center;
1390
- flex-wrap: wrap;
1391
- }
1392
-
1393
- .method-badge {
1394
- background: rgba(255, 255, 255, 0.2);
1395
- padding: 6px 12px;
1396
- border-radius: 20px;
1397
- font-weight: 500;
1398
- font-size: 0.9em;
1399
- }
1400
-
1401
- .result-count {
1402
- background: rgba(255, 255, 255, 0.15);
1403
- padding: 6px 12px;
1404
- border-radius: 20px;
1405
- font-weight: 500;
1406
- font-size: 0.9em;
1407
- }
1408
-
1409
- .result-card {
1410
- background: #ffffff;
1411
- border: 1px solid #e1e5e9;
1412
- border-radius: 12px;
1413
- margin-bottom: 20px;
1414
- box-shadow: 0 2px 10px rgba(0,0,0,0.05);
1415
- transition: all 0.3s ease;
1416
- overflow: hidden;
1417
- }
1418
-
1419
- .result-card:hover {
1420
- box-shadow: 0 4px 20px rgba(0,0,0,0.1);
1421
- transform: translateY(-2px);
1422
- }
1423
-
1424
- .result-header {
1425
- display: flex;
1426
- justify-content: space-between;
1427
- align-items: center;
1428
- padding: 15px 20px;
1429
- background: #f8f9fa;
1430
- border-bottom: 1px solid #e9ecef;
1431
- }
1432
-
1433
- .rank-info {
1434
- display: flex;
1435
- gap: 10px;
1436
- align-items: center;
1437
- flex-wrap: wrap;
1438
- }
1439
-
1440
- .rank-badge {
1441
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1442
- color: white;
1443
- padding: 4px 10px;
1444
- border-radius: 15px;
1445
- font-weight: 600;
1446
- font-size: 0.85em;
1447
- }
1448
-
1449
- .source-info {
1450
- background: #e9ecef;
1451
- color: #495057;
1452
- padding: 4px 8px;
1453
- border-radius: 10px;
1454
- font-size: 0.85em;
1455
- font-weight: 500;
1456
- }
1457
-
1458
- .page-info {
1459
- background: #d1ecf1;
1460
- color: #0c5460;
1461
- padding: 4px 8px;
1462
- border-radius: 10px;
1463
- font-size: 0.85em;
1464
- }
1465
-
1466
- .length-info {
1467
- background: #f8f9fa;
1468
- color: #6c757d;
1469
- padding: 4px 8px;
1470
- border-radius: 10px;
1471
- font-size: 0.85em;
1472
- }
1473
-
1474
- .score-info {
1475
- display: flex;
1476
- gap: 10px;
1477
- align-items: center;
1478
- }
1479
-
1480
- .confidence-badge {
1481
- padding: 4px 8px;
1482
- border-radius: 10px;
1483
- font-weight: 600;
1484
- font-size: 0.85em;
1485
- }
1486
-
1487
- .score-value {
1488
- background: #2c3e50;
1489
- color: white;
1490
- padding: 6px 12px;
1491
- border-radius: 15px;
1492
- font-weight: 600;
1493
- font-size: 0.9em;
1494
- }
1495
-
1496
- .result-content {
1497
- padding: 20px;
1498
- }
1499
-
1500
- .content-text {
1501
- line-height: 1.6;
1502
- color: #2c3e50;
1503
- border-left: 3px solid #667eea;
1504
- padding-left: 15px;
1505
- background: #f8f9fa;
1506
- padding: 15px;
1507
- border-radius: 0 8px 8px 0;
1508
- max-height: 300px;
1509
- overflow-y: auto;
1510
- }
1511
-
1512
- .result-actions {
1513
- display: flex;
1514
- gap: 10px;
1515
- padding: 15px 20px;
1516
- background: #f8f9fa;
1517
- border-top: 1px solid #e9ecef;
1518
- }
1519
-
1520
- .action-btn {
1521
- padding: 8px 16px;
1522
- border: none;
1523
- border-radius: 8px;
1524
- font-weight: 500;
1525
- cursor: pointer;
1526
- transition: all 0.3s ease;
1527
- font-size: 0.9em;
1528
- display: flex;
1529
- align-items: center;
1530
- gap: 5px;
1531
- }
1532
-
1533
- .copy-btn {
1534
- background: #17a2b8;
1535
- color: white;
1536
- }
1537
-
1538
- .copy-btn:hover {
1539
- background: #138496;
1540
- transform: translateY(-1px);
1541
- }
1542
-
1543
- .info-btn {
1544
- background: #6c757d;
1545
- color: white;
1546
- }
1547
-
1548
- .info-btn:hover {
1549
- background: #5a6268;
1550
- transform: translateY(-1px);
1551
- }
1552
-
1553
- .ranker-methods {
1554
- margin-top: 20px;
1555
- padding-top: 15px;
1556
- border-top: 1px solid #e9ecef;
1557
- }
1558
-
1559
- .methods-label {
1560
- font-weight: 600;
1561
- color: #495057;
1562
- margin-bottom: 10px;
1563
- font-size: 0.9em;
1564
- }
1565
-
1566
- .methods-list {
1567
- display: flex;
1568
- gap: 8px;
1569
- flex-wrap: wrap;
1570
- }
1571
-
1572
- .method-tag {
1573
- background: #e9ecef;
1574
- color: #495057;
1575
- padding: 4px 10px;
1576
- border-radius: 12px;
1577
- font-size: 0.8em;
1578
- font-weight: 500;
1579
- }
1580
-
1581
- /* Ranker controls styling */
1582
- .ranker-controls {
1583
- background: #ffffff;
1584
- border: 1px solid #e1e5e9;
1585
- border-radius: 12px;
1586
- padding: 20px;
1587
- margin-bottom: 25px;
1588
- box-shadow: 0 2px 10px rgba(0,0,0,0.05);
1589
- }
1590
-
1591
- .ranker-input-row {
1592
- display: flex;
1593
- gap: 15px;
1594
- align-items: end;
1595
- margin-bottom: 15px;
1596
- }
1597
-
1598
- .ranker-query-input {
1599
- flex: 1;
1600
- border: 2px solid #e1e5e9;
1601
- border-radius: 25px;
1602
- padding: 12px 20px;
1603
- font-size: 1em;
1604
- transition: all 0.3s ease;
1605
- }
1606
-
1607
- .ranker-query-input:focus {
1608
- border-color: #667eea;
1609
- box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
1610
- outline: none;
1611
- }
1612
-
1613
- .ranker-search-btn {
1614
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1615
- color: white;
1616
- border: none;
1617
- border-radius: 12px;
1618
- padding: 12px 24px;
1619
- min-width: 100px;
1620
- cursor: pointer;
1621
- transition: all 0.3s ease;
1622
- font-weight: 600;
1623
- font-size: 1em;
1624
- }
1625
-
1626
- .ranker-search-btn:hover {
1627
- transform: scale(1.05);
1628
- box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1629
- }
1630
-
1631
- .ranker-options-row {
1632
- display: flex;
1633
- gap: 15px;
1634
- align-items: center;
1635
- }
1636
-
1637
- /* Responsive design for ranker */
1638
- @media (max-width: 768px) {
1639
- .ranker-container {
1640
- padding: 10px;
1641
- }
1642
-
1643
- .ranker-input-row {
1644
- flex-direction: column;
1645
- gap: 10px;
1646
- }
1647
-
1648
- .ranker-options-row {
1649
- flex-direction: column;
1650
- gap: 10px;
1651
- align-items: stretch;
1652
- }
1653
-
1654
- .ranker-meta {
1655
- justify-content: center;
1656
- }
1657
-
1658
- .rank-info {
1659
- flex-direction: column;
1660
- gap: 5px;
1661
- align-items: flex-start;
1662
- }
1663
-
1664
- .result-header {
1665
- flex-direction: column;
1666
- gap: 10px;
1667
- align-items: flex-start;
1668
- }
1669
-
1670
- .score-info {
1671
- align-self: flex-end;
1672
- }
1673
-
1674
- .result-actions {
1675
- flex-direction: column;
1676
- gap: 8px;
1677
- }
1678
- }
1679
- """) as demo:
1680
  # Modern title with better styling
1681
  gr.Markdown("""
1682
  # 🚀 Markit
@@ -1684,352 +37,21 @@ def create_ui():
1684
  """)
1685
 
1686
  with gr.Tabs():
1687
- # Document Converter Tab
1688
- with gr.TabItem("📄 Document Converter"):
1689
- with gr.Column(elem_classes=["chat-tab-container"]):
1690
- # Modern header matching other tabs
1691
- gr.HTML("""
1692
- <div class="chat-header">
1693
- <h2>📄 Document Converter</h2>
1694
- <p>Convert documents to Markdown format with advanced OCR and AI processing</p>
1695
- </div>
1696
- """)
1697
-
1698
- # State to track if cancellation is requested
1699
- cancel_requested = gr.State(False)
1700
- # State to store the conversion thread
1701
- conversion_thread = gr.State(None)
1702
- # State to store the output format (fixed to Markdown)
1703
- output_format_state = gr.State("Markdown")
1704
-
1705
- # Multi-file input (supports single and multiple files)
1706
- files_input = gr.Files(
1707
- label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
1708
- file_count="multiple",
1709
- file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
1710
- )
1711
-
1712
- # Processing type selector (visible only for multiple files)
1713
- processing_type_selector = gr.Radio(
1714
- choices=["combined", "individual", "summary", "comparison"],
1715
- value="combined",
1716
- label="Multi-Document Processing Type",
1717
- info="How to process multiple documents together",
1718
- visible=False
1719
- )
1720
-
1721
- # Status text to show file count and processing mode
1722
- file_status_text = gr.HTML(
1723
- value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
1724
- label=""
1725
- )
1726
-
1727
- # Provider and OCR options below the file input
1728
- with gr.Row(elem_classes=["provider-options-row"]):
1729
- with gr.Column(scale=1):
1730
- parser_names = ParserRegistry.get_parser_names()
1731
-
1732
- # Make MarkItDown the default parser if available
1733
- default_parser = next((p for p in parser_names if p == "MarkItDown"), parser_names[0] if parser_names else "PyPdfium")
1734
-
1735
- provider_dropdown = gr.Dropdown(
1736
- label="Provider",
1737
- choices=parser_names,
1738
- value=default_parser,
1739
- interactive=True
1740
- )
1741
- with gr.Column(scale=1):
1742
- default_ocr_options = ParserRegistry.get_ocr_options(default_parser)
1743
- default_ocr = default_ocr_options[0] if default_ocr_options else "No OCR"
1744
-
1745
- ocr_dropdown = gr.Dropdown(
1746
- label="OCR Options",
1747
- choices=default_ocr_options,
1748
- value=default_ocr,
1749
- interactive=True
1750
- )
1751
-
1752
- # Processing controls row with consistent styling
1753
- with gr.Row(elem_classes=["control-buttons"]):
1754
- convert_button = gr.Button("🚀 Convert", elem_classes=["control-btn", "btn-primary"])
1755
- cancel_button = gr.Button("⏹️ Cancel", elem_classes=["control-btn", "btn-clear-data"], visible=False)
1756
-
1757
- # Simple output container with just one scrollbar
1758
- file_display = gr.HTML(
1759
- value="<div class='output-container'></div>",
1760
- label="Converted Content"
1761
- )
1762
-
1763
- file_download = gr.File(label="Download File")
1764
-
1765
- # Event handlers for document converter
1766
-
1767
- # Update UI when files are uploaded/changed
1768
- files_input.change(
1769
- fn=update_ui_for_file_count,
1770
- inputs=[files_input],
1771
- outputs=[processing_type_selector, file_status_text]
1772
- )
1773
-
1774
- provider_dropdown.change(
1775
- lambda p: gr.Dropdown(
1776
- choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
1777
- value="Plain Text" if "GOT-OCR" in p else (ParserRegistry.get_ocr_options(p)[0] if ParserRegistry.get_ocr_options(p) else None)
1778
- ),
1779
- inputs=[provider_dropdown],
1780
- outputs=[ocr_dropdown]
1781
- )
1782
-
1783
- # Reset cancel flag when starting conversion
1784
- def start_conversion():
1785
- global conversion_cancelled
1786
- conversion_cancelled.clear()
1787
- logger.info("Starting conversion with cancellation flag cleared")
1788
- return gr.update(visible=False), gr.update(visible=True), False
1789
-
1790
- # Set cancel flag and terminate thread when cancel button is clicked
1791
- def request_cancellation(thread):
1792
- global conversion_cancelled
1793
- conversion_cancelled.set()
1794
- logger.info("Cancel button clicked, cancellation flag set")
1795
-
1796
- # Try to join the thread with a timeout
1797
- if thread is not None:
1798
- logger.info(f"Attempting to join conversion thread: {thread}")
1799
- thread.join(timeout=0.5)
1800
- if thread.is_alive():
1801
- logger.warning("Thread did not finish within timeout")
1802
-
1803
- # Add immediate feedback to the user
1804
- return gr.update(visible=True), gr.update(visible=False), True, None
1805
-
1806
- # Start conversion sequence
1807
- convert_button.click(
1808
- fn=start_conversion,
1809
- inputs=[],
1810
- outputs=[convert_button, cancel_button, cancel_requested],
1811
- queue=False # Execute immediately
1812
- ).then(
1813
- fn=handle_convert,
1814
- inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state, processing_type_selector, cancel_requested],
1815
- outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
1816
- )
1817
-
1818
- # Handle cancel button click
1819
- cancel_button.click(
1820
- fn=request_cancellation,
1821
- inputs=[conversion_thread],
1822
- outputs=[convert_button, cancel_button, cancel_requested, conversion_thread],
1823
- queue=False # Execute immediately
1824
- )
1825
-
1826
- # Chat Tab - Completely redesigned
1827
- with gr.TabItem("💬 Chat with Documents"):
1828
- with gr.Column(elem_classes=["chat-tab-container"]):
1829
- # Modern header
1830
- gr.HTML("""
1831
- <div class="chat-header">
1832
- <h2>💬 Chat with your converted documents</h2>
1833
- <p>Ask questions about your documents using advanced RAG technology</p>
1834
- </div>
1835
- """)
1836
-
1837
- # Status section with modern design
1838
- status_display = gr.HTML(value=get_chat_status())
1839
-
1840
- # Control buttons
1841
- with gr.Row(elem_classes=["control-buttons"]):
1842
- refresh_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
1843
- new_session_btn = gr.Button("🆕 New Session", elem_classes=["control-btn", "btn-new-session"])
1844
- clear_data_btn = gr.Button("🗑️ Clear All Data", elem_classes=["control-btn", "btn-clear-data"], variant="stop")
1845
-
1846
- # Main chat interface
1847
- with gr.Column(elem_classes=["chat-main-container"]):
1848
- chatbot = gr.Chatbot(
1849
- elem_classes=["chat-container"],
1850
- height=500,
1851
- show_label=False,
1852
- show_share_button=False,
1853
- bubble_full_width=False,
1854
- type="messages",
1855
- placeholder="Start a conversation by asking questions about your documents..."
1856
- )
1857
-
1858
- # Input area
1859
- with gr.Row(elem_classes=["input-row"]):
1860
- msg_input = gr.Textbox(
1861
- placeholder="Ask questions about your documents...",
1862
- show_label=False,
1863
- scale=5,
1864
- lines=1,
1865
- max_lines=3,
1866
- elem_classes=["message-input"]
1867
- )
1868
- send_btn = gr.Button("Submit", elem_classes=["send-button"], scale=0)
1869
-
1870
- # Session info with better styling
1871
- session_info = gr.HTML(
1872
- value='<div class="session-info">No active session - Click "New Session" to start</div>'
1873
- )
1874
-
1875
- # Event handlers for chat
1876
- def clear_input():
1877
- return ""
1878
-
1879
- # Send message when button clicked or Enter pressed
1880
- msg_input.submit(
1881
- fn=handle_chat_message,
1882
- inputs=[msg_input, chatbot],
1883
- outputs=[msg_input, chatbot, status_display]
1884
- )
1885
-
1886
- send_btn.click(
1887
- fn=handle_chat_message,
1888
- inputs=[msg_input, chatbot],
1889
- outputs=[msg_input, chatbot, status_display]
1890
- )
1891
-
1892
- # New session handler with improved feedback
1893
- def enhanced_new_session():
1894
- history, info = start_new_chat_session()
1895
- session_html = f'<div class="session-info">{info}</div>'
1896
- updated_status = get_chat_status()
1897
- return history, session_html, updated_status
1898
-
1899
- new_session_btn.click(
1900
- fn=enhanced_new_session,
1901
- inputs=[],
1902
- outputs=[chatbot, session_info, status_display]
1903
- )
1904
-
1905
- # Refresh status handler
1906
- refresh_status_btn.click(
1907
- fn=get_chat_status,
1908
- inputs=[],
1909
- outputs=[status_display]
1910
- )
1911
-
1912
- # Clear all data handler
1913
- clear_data_btn.click(
1914
- fn=handle_clear_all_data,
1915
- inputs=[],
1916
- outputs=[chatbot, session_info, status_display]
1917
- )
1918
-
1919
- # Query Ranker Tab
1920
- with gr.TabItem("🔍 Query Ranker"):
1921
- with gr.Column(elem_classes=["ranker-container"]):
1922
- # Modern header
1923
- gr.HTML("""
1924
- <div class="chat-header">
1925
- <h2>🔍 Query Ranker</h2>
1926
- <p>Search and rank document chunks with similarity scores</p>
1927
- </div>
1928
- """)
1929
-
1930
- # Status section
1931
- ranker_status_display = gr.HTML(value=get_ranker_status())
1932
-
1933
- # Control buttons
1934
- with gr.Row(elem_classes=["control-buttons"]):
1935
- refresh_ranker_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
1936
- clear_results_btn = gr.Button("🗑️ Clear Results", elem_classes=["control-btn", "btn-clear-data"])
1937
-
1938
- # Search controls
1939
- with gr.Column(elem_classes=["ranker-controls"]):
1940
- with gr.Row(elem_classes=["ranker-input-row"]):
1941
- query_input = gr.Textbox(
1942
- placeholder="Enter your search query...",
1943
- show_label=False,
1944
- elem_classes=["ranker-query-input"],
1945
- scale=4
1946
- )
1947
- search_btn = gr.Button("🔍 Search", elem_classes=["ranker-search-btn"], scale=0)
1948
-
1949
- with gr.Row(elem_classes=["ranker-options-row"]):
1950
- method_dropdown = gr.Dropdown(
1951
- choices=[
1952
- ("🎯 Similarity Search", "similarity"),
1953
- ("🔀 MMR (Diverse)", "mmr"),
1954
- ("🔍 BM25 (Keywords)", "bm25"),
1955
- ("🔗 Hybrid (Recommended)", "hybrid")
1956
- ],
1957
- value="hybrid",
1958
- label="Retrieval Method",
1959
- scale=2
1960
- )
1961
- k_slider = gr.Slider(
1962
- minimum=1,
1963
- maximum=10,
1964
- value=5,
1965
- step=1,
1966
- label="Number of Results",
1967
- scale=1
1968
- )
1969
-
1970
- # Results display
1971
- results_display = gr.HTML(
1972
- value=handle_query_search("", "hybrid", 5), # Initial placeholder
1973
- elem_classes=["ranker-results-container"]
1974
- )
1975
-
1976
- # Event handlers for Query Ranker
1977
- def clear_ranker_results():
1978
- """Clear the search results and reset to placeholder."""
1979
- return handle_query_search("", "hybrid", 5), ""
1980
-
1981
- def refresh_ranker_status():
1982
- """Refresh the ranker status display."""
1983
- return get_ranker_status()
1984
-
1985
- # Search functionality
1986
- query_input.submit(
1987
- fn=handle_query_search,
1988
- inputs=[query_input, method_dropdown, k_slider],
1989
- outputs=[results_display]
1990
- )
1991
-
1992
- search_btn.click(
1993
- fn=handle_query_search,
1994
- inputs=[query_input, method_dropdown, k_slider],
1995
- outputs=[results_display]
1996
- )
1997
-
1998
- # Control button handlers
1999
- refresh_ranker_status_btn.click(
2000
- fn=refresh_ranker_status,
2001
- inputs=[],
2002
- outputs=[ranker_status_display]
2003
- )
2004
-
2005
- clear_results_btn.click(
2006
- fn=clear_ranker_results,
2007
- inputs=[],
2008
- outputs=[results_display, query_input]
2009
- )
2010
-
2011
- # Update results when method or k changes
2012
- method_dropdown.change(
2013
- fn=handle_query_search,
2014
- inputs=[query_input, method_dropdown, k_slider],
2015
- outputs=[results_display]
2016
- )
2017
-
2018
- k_slider.change(
2019
- fn=handle_query_search,
2020
- inputs=[query_input, method_dropdown, k_slider],
2021
- outputs=[results_display]
2022
- )
2023
-
2024
  return demo
2025
 
2026
 
2027
- def launch_ui(server_name="0.0.0.0", server_port=7860, share=False):
 
 
2028
  demo = create_ui()
2029
- demo.launch(
 
2030
  server_name=server_name,
2031
  server_port=server_port,
2032
- root_path="",
2033
- show_error=True,
2034
- share=share
2035
- )
 
1
+ """Main UI orchestrator - Refactored modular interface for Markit application."""
2
+
3
  import gradio as gr
4
  import logging
5
+
6
+ from src.core.converter import set_cancellation_flag
7
  from src.core.logging_config import get_logger
8
+ from src.ui.styles.ui_styles import CSS_STYLES
9
+ from src.ui.components.document_converter import create_document_converter_tab
10
+ from src.ui.components.chat_interface import create_chat_interface_tab
11
+ from src.ui.components.query_ranker import create_query_ranker_tab
12
+ from src.ui.utils.threading_utils import get_cancellation_event
13
 
 
14
  logger = get_logger(__name__)
15
 
16
  # Import MarkItDown to check if it's available
 
22
  HAS_MARKITDOWN = False
23
  logger.warning("MarkItDown is not available")
24
 
25
+ # Initialize global cancellation event and pass to converter module
26
+ conversion_cancelled = get_cancellation_event()
 
 
27
  set_cancellation_flag(conversion_cancelled)
28
 
29
 
30
  def create_ui():
31
+ """Create the main Gradio interface with all tabs."""
32
+ with gr.Blocks(css=CSS_STYLES) as demo:
33
  # Modern title with better styling
34
  gr.Markdown("""
35
  # 🚀 Markit
 
37
  """)
38
 
39
  with gr.Tabs():
40
+ # Create all tabs using component functions
41
+ create_document_converter_tab()
42
+ create_chat_interface_tab()
43
+ create_query_ranker_tab()
44
+
45
  return demo
46
 
47
 
48
+ def launch_ui(share=False, server_name="0.0.0.0", server_port=7860):
49
+ """Launch the Gradio interface."""
50
+ logger.info("Creating and launching UI...")
51
  demo = create_ui()
52
+ return demo.launch(
53
+ share=share,
54
  server_name=server_name,
55
  server_port=server_port,
56
+ show_error=True
57
+ )
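
Note on `get_cancellation_event`: the refactored `ui.py` takes its cancellation flag from `src.ui.utils.threading_utils`, but that helper's body is not shown in this part of the diff. A minimal sketch of what such a helper could look like, assuming it hands out one module-level `threading.Event` so the UI and the converter share the same flag. The names `_CANCELLATION_EVENT`, `request_cancellation` and `reset_cancellation` are hypothetical and only illustrate the pattern:

```python
import threading

# Hypothetical sketch: a single module-level Event shared by every caller.
_CANCELLATION_EVENT = threading.Event()


def get_cancellation_event() -> threading.Event:
    """Return the process-wide cancellation event (same object on every call)."""
    return _CANCELLATION_EVENT


def request_cancellation() -> None:
    """Signal running workers that they should stop (illustrative helper)."""
    _CANCELLATION_EVENT.set()


def reset_cancellation() -> None:
    """Clear the flag before a new conversion starts (illustrative helper)."""
    _CANCELLATION_EVENT.clear()
```

Returning the same object on every call is what would let `set_cancellation_flag(conversion_cancelled)` and the tab components observe one shared state.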
 
 
src/ui/ui_backup.py ADDED
@@ -0,0 +1,2035 @@
1
+ import gradio as gr
2
+ import markdown
3
+ import threading
4
+ import time
5
+ import logging
6
+ from pathlib import Path
7
+ from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
8
+ from src.parsers.parser_registry import ParserRegistry
9
+ from src.core.config import config
10
+ from src.core.exceptions import (
11
+ DocumentProcessingError,
12
+ UnsupportedFileTypeError,
13
+ FileSizeLimitError,
14
+ ConfigurationError
15
+ )
16
+ from src.core.logging_config import get_logger
17
+ from src.rag import rag_chat_service, document_ingestion_service
18
+ from src.rag.vector_store import vector_store_manager
19
+ from src.services.data_clearing_service import data_clearing_service
20
+
21
+ # Use centralized logging
22
+ logger = get_logger(__name__)
23
+
24
+ # Import MarkItDown to check if it's available
25
+ try:
26
+ from markitdown import MarkItDown
27
+ HAS_MARKITDOWN = True
28
+ logger.info("MarkItDown is available for use")
29
+ except ImportError:
30
+ HAS_MARKITDOWN = False
31
+ logger.warning("MarkItDown is not available")
32
+
33
+ # Add a global variable to track cancellation state
34
+ conversion_cancelled = threading.Event()
35
+
36
+ # Pass the cancellation flag to the converter module
37
+ set_cancellation_flag(conversion_cancelled)
38
+
39
+ # Add a background thread to monitor cancellation
40
+ def monitor_cancellation():
41
+ """Background thread to monitor cancellation and update UI if needed"""
42
+ logger.info("Starting cancellation monitor thread")
43
+ while is_conversion_in_progress():
44
+ if conversion_cancelled.is_set():
45
+ logger.info("Cancellation detected by monitor thread")
46
+ time.sleep(0.1) # Check every 100ms
47
+ logger.info("Cancellation monitor thread ending")
48
+
49
+ def update_ui_for_file_count(files):
50
+ """Update UI components based on the number of files uploaded."""
51
+ if not files or len(files) == 0:
52
+ return (
53
+ gr.update(visible=False), # processing_type_selector
54
+ "<div style='color: #666; font-style: italic;'>Upload documents to begin</div>" # file_status_text
55
+ )
56
+
57
+ if len(files) == 1:
58
+ file_name = files[0].name if hasattr(files[0], 'name') else str(files[0])
59
+ return (
60
+ gr.update(visible=False), # processing_type_selector (hidden for single file)
61
+ f"<div style='color: #2563eb; font-weight: 500;'>📄 Single document: {file_name}</div>"
62
+ )
63
+ else:
64
+ # Calculate total size for validation display
65
+ total_size = 0
66
+ try:
67
+ for file in files:
68
+ if hasattr(file, 'size'):
69
+ total_size += file.size
70
+ elif hasattr(file, 'name'):
71
+ # For file paths, get size from filesystem
72
+ total_size += Path(file.name).stat().st_size
73
+ except Exception:
74
+ pass # Size calculation is optional for display
75
+
76
+ size_display = f" ({total_size / (1024*1024):.1f}MB)" if total_size > 0 else ""
77
+
78
+ # Check if within limits
79
+ if len(files) > 5:
80
+ status_color = "#dc2626" # red
81
+ status_text = f"⚠️ Too many files: {len(files)}/5 (max 5 files allowed)"
82
+ elif total_size > 20 * 1024 * 1024: # 20MB
83
+ status_color = "#dc2626" # red
84
+ status_text = f"⚠️ Files too large{size_display} (max 20MB combined)"
85
+ else:
86
+ status_color = "#059669" # green
87
+ status_text = f"📂 Batch mode: {len(files)} files{size_display}"
88
+
89
+ return (
90
+ gr.update(visible=True), # processing_type_selector (visible for multiple files)
91
+ f"<div style='color: {status_color}; font-weight: 500;'>{status_text}</div>"
92
+ )
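
Note on the batch limits: `update_ui_for_file_count` mixes the limit checks with HTML generation. A minimal sketch of the same two checks as a pure function, assuming the limits are the ones quoted in the status messages (5 files, 20MB combined). The names `MAX_BATCH_FILES`, `MAX_BATCH_BYTES` and `check_batch_limits` are hypothetical and do not appear in the commit:

```python
MAX_BATCH_FILES = 5                 # assumed from the "max 5 files allowed" message
MAX_BATCH_BYTES = 20 * 1024 * 1024  # assumed from the "max 20MB combined" message


def check_batch_limits(sizes):
    """Return (ok, message) for a list of file sizes given in bytes."""
    count = len(sizes)
    total = sum(sizes)
    total_mb = total / (1024 * 1024)
    if count > MAX_BATCH_FILES:
        return False, f"Too many files: {count}/{MAX_BATCH_FILES}"
    if total > MAX_BATCH_BYTES:
        return False, f"Files too large ({total_mb:.1f}MB, max 20MB combined)"
    return True, f"Batch mode: {count} files ({total_mb:.1f}MB)"
```

Keeping the check free of Gradio objects would make it straightforward to unit test, which fits the commit's stated goal of a test suite for the UI components.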
93
+
94
+ def validate_file_for_parser(file_path, parser_name):
95
+ """Validate if the file type is supported by the selected parser."""
96
+ if not file_path:
97
+ return True, "" # No file selected yet
98
+
99
+ try:
100
+ file_path_obj = Path(file_path)
101
+ file_ext = file_path_obj.suffix.lower()
102
+
103
+ # Check file size
104
+ if file_path_obj.exists():
105
+ file_size = file_path_obj.stat().st_size
106
+ if file_size > config.app.max_file_size:
107
+ size_mb = file_size / (1024 * 1024)
108
+ max_mb = config.app.max_file_size / (1024 * 1024)
109
+ return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
110
+
111
+ # Check file extension
112
+ if file_ext not in config.app.allowed_extensions:
113
+ return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
114
+
115
+ # Parser-specific validation
116
+ if "GOT-OCR" in parser_name:
117
+ if file_ext not in ['.jpg', '.jpeg', '.png']:
118
+ return False, "GOT-OCR only supports JPG and PNG formats."
119
+
120
+ return True, ""
121
+
122
+ except Exception as e:
123
+ logger.error(f"Error validating file: {e}")
124
+ return False, f"Error validating file: {e}"
125
+
126
+ def format_markdown_content(content):
127
+ if not content:
128
+ return content
129
+
130
+ # Convert the content to HTML using markdown library
131
+ html_content = markdown.markdown(str(content), extensions=['tables'])
132
+ return html_content
133
+
134
+ def render_latex_to_html(latex_content):
135
+ """Convert LaTeX content to HTML using Mathpix Markdown like GOT-OCR demo."""
136
+ import json
137
+
138
+ # Clean up the content similar to GOT-OCR demo
139
+ content = latex_content.strip()
140
+ if content.endswith("<|im_end|>"):
141
+ content = content[:-len("<|im_end|>")]
142
+
143
+ # Fix unbalanced delimiters exactly like GOT-OCR demo
144
+ right_num = content.count("\\right")
145
+ left_num = content.count("\\left")
146
+
147
+ if right_num != left_num:
148
+ content = (
149
+ content.replace("\\left(", "(")
150
+ .replace("\\right)", ")")
151
+ .replace("\\left[", "[")
152
+ .replace("\\right]", "]")
153
+ .replace("\\left{", "{")
154
+ .replace("\\right}", "}")
155
+ .replace("\\left|", "|")
156
+ .replace("\\right|", "|")
157
+ .replace("\\left.", ".")
158
+ .replace("\\right.", ".")
159
+ )
160
+
161
+ # Process content like GOT-OCR demo: remove $ signs and replace quotes
162
+ content = content.replace('"', "``").replace("$", "")
163
+
164
+ # Split into lines and create JavaScript string like GOT-OCR demo
165
+ outputs_list = content.split("\n")
166
+ js_text_parts = []
167
+ for line in outputs_list:
168
+ # Escape backslashes and add line break
169
+ escaped_line = line.replace("\\", "\\\\")
170
+ js_text_parts.append(f'"{escaped_line}\\n"')
171
+
172
+ # Join with + like in GOT-OCR demo
173
+ js_text = " + ".join(js_text_parts)
174
+
175
+ # Create HTML using Mathpix Markdown like GOT-OCR demo
176
+ html_content = f"""<!DOCTYPE html>
177
+ <html lang="en" data-lt-installed="true">
178
+ <head>
179
+ <meta charset="UTF-8">
180
+ <title>LaTeX Content</title>
181
+ <script>
182
+ const text = {js_text};
183
+ </script>
184
+ <style>
185
+ #content {{
186
+ max-width: 800px;
187
+ margin: auto;
188
+ padding: 20px;
189
+ }}
190
+ body {{
191
+ font-family: 'Times New Roman', serif;
192
+ line-height: 1.6;
193
+ background-color: #ffffff;
194
+ color: #333;
195
+ }}
196
+ table {{
197
+ border-collapse: collapse;
198
+ width: 100%;
199
+ margin: 20px 0;
200
+ }}
201
+ td, th {{
202
+ border: 1px solid #333;
203
+ padding: 8px 12px;
204
+ text-align: center;
205
+ vertical-align: middle;
206
+ }}
207
+ </style>
208
+ <script>
209
+ let script = document.createElement('script');
210
+ script.src = "https://cdn.jsdelivr.net/npm/mathpix-markdown-it@1.3.6/es5/bundle.js";
211
+ document.head.append(script);
212
+ script.onload = function() {{
213
+ const isLoaded = window.loadMathJax();
214
+ if (isLoaded) {{
215
+ console.log('Styles loaded!')
216
+ }}
217
+ const el = window.document.getElementById('content-text');
218
+ if (el) {{
219
+ const options = {{
220
+ htmlTags: true
221
+ }};
222
+ const html = window.render(text, options);
223
+ el.outerHTML = html;
224
+ }}
225
+ }};
226
+ </script>
227
+ </head>
228
+ <body>
229
+ <div id="content">
230
+ <div id="content-text"></div>
231
+ </div>
232
+ </body>
233
+ </html>"""
234
+
235
+ return html_content
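
Note on the JavaScript string construction: `render_latex_to_html` imports `json` but builds the `text` constant by hand, replacing double quotes with two backticks and doubling backslashes, following the GOT-OCR demo. An alternative sketch, not what this commit does, is to let `json.dumps` produce the string literal, since a JSON string is also a valid JavaScript string literal. The function names below are hypothetical:

```python
import json


def to_js_string_literal(text: str) -> str:
    """Encode text as a JavaScript string literal that is safe inside <script>."""
    literal = json.dumps(text)  # escapes quotes, backslashes and newlines
    # Keep a literal "</script>" in the text from closing the script element early.
    return literal.replace("</", "<\\/")


def build_script_tag(text: str) -> str:
    """Return a <script> element that assigns the text to a constant."""
    return f"<script>const text = {to_js_string_literal(text)};</script>"
```

With this approach the original quotes survive unchanged, so the quote replacement step would no longer be needed for escaping purposes. Whether the `$` stripping is still wanted depends on how Mathpix Markdown should treat the input.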
236
+
237
+ def format_latex_content(content):
238
+ """Format LaTeX content for display in UI using MathJax rendering like GOT-OCR demo."""
239
+ if not content:
240
+ return content
241
+
242
+ try:
243
+ # Generate rendered HTML
244
+ rendered_html = render_latex_to_html(content)
245
+
246
+ # Encode for iframe display (similar to GOT-OCR demo)
247
+ import base64
248
+ encoded_html = base64.b64encode(rendered_html.encode("utf-8")).decode("utf-8")
249
+ iframe_src = f"data:text/html;base64,{encoded_html}"
250
+
251
+ # Create the display with both rendered and raw views
252
+ formatted_content = f"""
253
+ <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
254
+ <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
255
+ 📄 LaTeX Content (Rendered with MathJax)
256
+ </div>
257
+ <div style="padding: 0;">
258
+ <iframe src="{iframe_src}" width="100%" height="500px" style="border: none; border-radius: 0 0 8px 8px;"></iframe>
259
+ </div>
260
+ <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0; font-size: 12px; color: #6c757d; border-top: 1px solid #dee2e6;">
261
+ 💡 LaTeX content rendered with MathJax. Tables and formulas are displayed as they would appear in a LaTeX document.
262
+ </div>
263
+ <details style="margin: 0; border-top: 1px solid #dee2e6;">
264
+ <summary style="padding: 8px 15px; background-color: #e9ecef; cursor: pointer; font-size: 12px; color: #6c757d;">
265
+ 📝 View Raw LaTeX Source
266
+ </summary>
267
+ <div style="padding: 15px; background-color: #f8f9fa;">
268
+ <pre style="background-color: transparent; margin: 0; padding: 0;
269
+ font-family: 'Courier New', monospace; font-size: 12px; line-height: 1.4;
270
+ white-space: pre-wrap; word-wrap: break-word; color: #2c3e50; max-height: 200px; overflow-y: auto;">
271
+ {content}
272
+ </pre>
273
+ </div>
274
+ </details>
275
+ </div>
276
+ """
277
+
278
+ except Exception as e:
279
+ # Fallback to simple formatting if rendering fails
280
+ import html
281
+ escaped_content = html.escape(str(content))
282
+ formatted_content = f"""
283
+ <div style="background-color: #f8f9fa; border-radius: 8px; border: 1px solid #e9ecef; margin: 10px 0;">
284
+ <div style="background-color: #e9ecef; padding: 10px; border-radius: 8px 8px 0 0; font-weight: bold; color: #495057;">
285
+ 📄 LaTeX Content (Fallback View)
286
+ </div>
287
+ <div style="padding: 15px;">
288
+ <pre style="background-color: transparent; margin: 0; padding: 0;
289
+ font-family: 'Courier New', monospace; font-size: 14px; line-height: 1.4;
290
+ white-space: pre-wrap; word-wrap: break-word; color: #2c3e50;">
291
+ {escaped_content}
292
+ </pre>
293
+ </div>
294
+ <div style="background-color: #e9ecef; padding: 8px 15px; border-radius: 0 0 8px 8px; font-size: 12px; color: #6c757d;">
295
+ ⚠️ Rendering failed, showing raw LaTeX. Error: {str(e)}
296
+ </div>
297
+ </div>
298
+ """
299
+
300
+ return formatted_content
301
+
302
+ # Function to run conversion in a separate thread
303
+ def run_conversion_thread(file_path, parser_name, ocr_method_name, output_format):
304
+ """Run the conversion in a separate thread and return the thread object"""
305
+ global conversion_cancelled
306
+
307
+ # Reset the cancellation flag
308
+ conversion_cancelled.clear()
309
+
310
+ # Create a container for the results
311
+ results = {"content": None, "download_file": None, "error": None}
312
+
313
+ def conversion_worker():
314
+ try:
315
+ content, download_file = convert_file(file_path, parser_name, ocr_method_name, output_format)
316
+ results["content"] = content
317
+ results["download_file"] = download_file
318
+ except Exception as e:
319
+ logger.error(f"Error during conversion: {str(e)}")
320
+ results["error"] = str(e)
321
+
322
+ # Create and start the thread
323
+ thread = threading.Thread(target=conversion_worker)
324
+ thread.daemon = True
325
+ thread.start()
326
+
327
+ return thread, results
328
+
329
+ def run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type):
330
+ """Run the conversion in a separate thread for multiple files."""
331
+ import threading
332
+ from src.services.document_service import DocumentService
333
+
334
+ # Results will be shared between threads
335
+ results = {"content": None, "download_file": None, "error": None}
336
+
337
+ def conversion_worker():
338
+ try:
339
+ logger.info(f"Starting multi-file conversion thread for {len(file_paths)} files")
340
+
341
+ # Use the new document service unified method
342
+ document_service = DocumentService()
343
+ document_service.set_cancellation_flag(conversion_cancelled)
344
+
345
+ # Call the unified convert_documents method
346
+ content, output_file = document_service.convert_documents(
347
+ file_paths=file_paths,
348
+ parser_name=parser_name,
349
+ ocr_method_name=ocr_method_name,
350
+ output_format=output_format,
351
+ processing_type=processing_type
352
+ )
353
+
354
+ logger.info(f"Multi-file conversion completed successfully for {len(file_paths)} files")
355
+ results["content"] = content
356
+ results["download_file"] = output_file
357
+
358
+ except Exception as e:
359
+ logger.error(f"Error during multi-file conversion: {str(e)}")
360
+ results["error"] = str(e)
361
+
362
+ # Create and start the thread
363
+ thread = threading.Thread(target=conversion_worker)
364
+ thread.daemon = True
365
+ thread.start()
366
+
367
+ return thread, results
368
+
369
+ def handle_convert(files, parser_name, ocr_method_name, output_format, processing_type, is_cancelled):
370
+ """Handle file conversion for single or multiple files."""
371
+ global conversion_cancelled
372
+
373
+ # Check if we should cancel before starting
374
+ if is_cancelled:
375
+ logger.info("Conversion cancelled before starting")
376
+ return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
377
+
378
+ # Validate files input
379
+ if not files or len(files) == 0:
380
+ error_msg = "No files uploaded. Please upload at least one document."
381
+ logger.error(error_msg)
382
+ return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
383
+
384
+ # Convert Gradio file objects to file paths
385
+ file_paths = []
386
+ for file in files:
387
+ if hasattr(file, 'name'):
388
+ file_paths.append(file.name)
389
+ else:
390
+ file_paths.append(str(file))
391
+
392
+ # Validate file types for the selected parser
393
+ for file_path in file_paths:
394
+ is_valid, error_msg = validate_file_for_parser(file_path, parser_name)
395
+ if not is_valid:
396
+ logger.error(f"File validation error: {error_msg}")
397
+ return f"Error: {error_msg}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
398
+
399
+ logger.info(f"Starting conversion of {len(file_paths)} file(s) with cancellation flag cleared")
400
+
401
+ # Start the conversion in a separate thread
402
+ thread, results = run_conversion_thread_multi(file_paths, parser_name, ocr_method_name, output_format, processing_type)
403
+
404
+ # Start the monitoring thread
405
+ monitor_thread = threading.Thread(target=monitor_cancellation)
406
+ monitor_thread.daemon = True
407
+ monitor_thread.start()
408
+
409
+ # Wait for the thread to complete or be cancelled
410
+ while thread.is_alive():
411
+ # Check if cancellation was requested
412
+ if conversion_cancelled.is_set():
413
+ logger.info("Cancellation detected, waiting for thread to finish")
414
+ # Give the thread a chance to clean up
415
+ thread.join(timeout=0.5)
416
+ if thread.is_alive():
417
+ logger.warning("Thread did not finish within timeout")
418
+ return "Conversion cancelled.", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
419
+
420
+ # Sleep briefly to avoid busy waiting
421
+ time.sleep(0.1)
422
+
423
+ # Thread has completed, check results
424
+ if results["error"]:
425
+ return f"Error: {results['error']}", None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
426
+
427
+ content = results["content"]
428
+ download_file = results["download_file"]
429
+
430
+ # If conversion returned a cancellation message
431
+ if content == "Conversion cancelled.":
432
+ logger.info("Converter returned cancellation message")
433
+ return content, None, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
434
+
435
+ # Format the content based on parser type
436
+ if "GOT-OCR" in parser_name:
437
+ # For GOT-OCR, display as LaTeX
438
+ formatted_content = format_latex_content(str(content))
439
+ html_output = f"<div class='output-container'>{formatted_content}</div>"
440
+ else:
441
+ # For other parsers, display as Markdown
442
+ formatted_content = format_markdown_content(str(content))
443
+ html_output = f"<div class='output-container'>{formatted_content}</div>"
444
+
445
+ logger.info("Conversion completed successfully")
446
+
447
+ # Auto-ingest the converted document for RAG
448
+ try:
449
+ # Read original file content for proper deduplication hashing
450
+ original_file_content = None
451
+ if file_path and Path(file_path).exists():
452
+ try:
453
+ with open(file_path, 'rb') as f:
454
+ original_file_content = f.read().decode('utf-8', errors='ignore')
455
+ except Exception as e:
456
+ logger.warning(f"Could not read original file content: {e}")
457
+
458
+ conversion_result = {
459
+ "markdown_content": content,
460
+ "original_filename": Path(file_path).name if file_path else "unknown",
461
+ "conversion_method": parser_name,
462
+ "file_size": Path(file_path).stat().st_size if file_path and Path(file_path).exists() else 0,
463
+ "conversion_time": 0, # Could be tracked if needed
464
+ "original_file_content": original_file_content
465
+ }
466
+
467
+ success, ingestion_msg, stats = document_ingestion_service.ingest_from_conversion_result(conversion_result)
468
+ if success:
469
+ logger.info(f"Document auto-ingested for RAG: {ingestion_msg}")
470
+ else:
471
+ logger.warning(f"Document ingestion failed: {ingestion_msg}")
472
+ except Exception as e:
473
+ logger.error(f"Error during auto-ingestion: {e}")
474
+
475
+ return html_output, download_file, gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
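
Note on the worker and polling pattern: `run_conversion_thread_multi` and the `while thread.is_alive()` loop in `handle_convert` together implement "run in a daemon thread, poll a cancellation event, then read a shared results dict". A minimal stdlib-only sketch of that pattern, with hypothetical names (`run_in_worker`, `wait_for_worker`) and a generic `task` standing in for the document conversion call:

```python
import threading
import time


def run_in_worker(task, *args):
    """Start task(*args) in a daemon thread; return (thread, results dict)."""
    results = {"value": None, "error": None}

    def worker():
        try:
            results["value"] = task(*args)
        except Exception as exc:  # captured so the caller can report it
            results["error"] = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, results


def wait_for_worker(thread, results, cancel_event, poll_interval=0.05):
    """Poll until the worker ends or cancel_event is set.

    Returns ("cancelled", None), ("error", message) or ("ok", value).
    """
    while thread.is_alive():
        if cancel_event.is_set():
            thread.join(timeout=0.5)  # brief grace period for cleanup
            return "cancelled", None
        time.sleep(poll_interval)  # avoid busy waiting
    if results["error"] is not None:
        return "error", results["error"]
    return "ok", results["value"]
```

As in the diff, setting the event does not stop the thread by itself. The task has to check the event and return early, otherwise the daemon thread keeps running after the caller has reported cancellation.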
476
+
477
+ def handle_chat_message(message, history):
478
+ """Handle a new chat message with streaming response."""
479
+ if not message or not message.strip():
480
+ return "", history, gr.update()
481
+
482
+ try:
483
+ # Add user message to history
484
+ history = history or []
485
+ history.append({"role": "user", "content": message})
486
+
487
+ # Add assistant message placeholder
488
+ history.append({"role": "assistant", "content": ""})
489
+
490
+ # Get response from RAG service
491
+ response_text = ""
492
+ for chunk in rag_chat_service.chat_stream(message):
493
+ response_text += chunk
494
+ # Update the last message in history with the current response
495
+ history[-1]["content"] = response_text
496
+ # Update status in real-time during streaming
497
+ updated_status = get_chat_status()
498
+ yield "", history, updated_status
499
+
500
+ logger.info(f"Chat response completed for message: {message[:50]}...")
501
+
502
+ # Final status update after message completion
503
+ final_status = get_chat_status()
504
+ yield "", history, final_status
505
+
506
+ except Exception as e:
507
+ error_msg = f"Error generating response: {str(e)}"
508
+ logger.error(error_msg)
509
+ if history and len(history) > 0:
510
+ history[-1]["content"] = f"❌ {error_msg}"
511
+ else:
512
+ history = [
513
+ {"role": "user", "content": message},
514
+ {"role": "assistant", "content": f"❌ {error_msg}"}
515
+ ]
516
+ # Update status even on error
517
+ error_status = get_chat_status()
518
+ yield "", history, error_status
519
+
520
+ def start_new_chat_session():
521
+ """Start a new chat session."""
522
+ try:
523
+ session_id = rag_chat_service.start_new_session()
524
+ logger.info(f"Started new chat session: {session_id}")
525
+ return [], f"✅ New chat session started: {session_id}"
526
+ except Exception as e:
527
+ error_msg = f"Error starting new session: {str(e)}"
528
+ logger.error(error_msg)
529
+ return [], f"❌ {error_msg}"
530
+
531
+ def handle_clear_all_data():
532
+ """Handle clearing all RAG data (vector store + chat history)."""
533
+ try:
534
+ # Clear all data using the data clearing service
535
+ success, message, stats = data_clearing_service.clear_all_data()
536
+
537
+ if success:
538
+ # Reset chat session after clearing data
539
+ session_id = rag_chat_service.start_new_session()
540
+
541
+ # Get updated status
542
+ updated_status = get_chat_status()
543
+
544
+ # Create success message with stats
545
+ if stats.get("total_cleared_documents", 0) > 0 or stats.get("total_cleared_files", 0) > 0:
546
+ clear_msg = f"✅ {message}"
547
+ session_msg = f"🆕 Started new session: {session_id}"
548
+ combined_msg = f'{clear_msg}<br/><div class="session-info">{session_msg}</div>'
549
+ else:
550
+ combined_msg = f'ℹ️ {message}<br/><div class="session-info">🆕 Started new session: {session_id}</div>'
551
+
552
+ logger.info(f"Data cleared successfully: {message}")
553
+
554
+ return [], combined_msg, updated_status
555
+ else:
556
+ error_msg = f"❌ {message}"
557
+ logger.error(f"Data clearing failed: {message}")
558
+
559
+ # Still get updated status even on error
560
+ updated_status = get_chat_status()
561
+
562
+ return None, f'<div class="session-info">{error_msg}</div>', updated_status
563
+
564
+ except Exception as e:
565
+ error_msg = f"Error clearing data: {str(e)}"
566
+ logger.error(error_msg)
567
+
568
+ # Get current status
569
+ current_status = get_chat_status()
570
+
571
+ return None, f'<div class="session-info">❌ {error_msg}</div>', current_status
572
+
573
+ def handle_query_search(query, method, k_value):
574
+ """Handle query search and return formatted results."""
575
+ if not query or not query.strip():
576
+ return """
577
+ <div class="ranker-container">
578
+ <div class="ranker-placeholder">
579
+ <h3>🔍 Query Ranker</h3>
580
+ <p>Enter a search query to find relevant document chunks with similarity scores.</p>
581
+ </div>
582
+ </div>
583
+ """
584
+
585
+ try:
586
+ logger.info(f"Query search: '{query[:50]}...' using method: {method}")
587
+
588
+ # Get results based on method
589
+ results = []
590
+ if method == "similarity":
591
+ retriever = vector_store_manager.get_retriever("similarity", {"k": k_value})
592
+ docs = retriever.invoke(query)
593
+ # Try to get actual similarity scores
594
+ try:
595
+ vector_store = vector_store_manager.get_vector_store()
596
+ if hasattr(vector_store, 'similarity_search_with_score'):
597
+ docs_with_scores = vector_store.similarity_search_with_score(query, k=k_value)
598
+ for i, (doc, score) in enumerate(docs_with_scores):
599
+ similarity_score = max(0, 1 - score) if score is not None else 0.8
600
+ results.append(_format_ranker_result(doc, similarity_score, i + 1))
601
+ else:
602
+ # Fallback without scores
603
+ for i, doc in enumerate(docs):
604
+ score = 0.85 - (i * 0.05)
605
+ results.append(_format_ranker_result(doc, score, i + 1))
606
+ except Exception as e:
607
+ logger.warning(f"Could not get similarity scores: {e}")
608
+ for i, doc in enumerate(docs):
609
+ score = 0.85 - (i * 0.05)
610
+ results.append(_format_ranker_result(doc, score, i + 1))
611
+
612
+ elif method == "mmr":
613
+ retriever = vector_store_manager.get_retriever("mmr", {"k": k_value, "fetch_k": k_value * 2, "lambda_mult": 0.5})
614
+ docs = retriever.invoke(query)
615
+ for i, doc in enumerate(docs):
616
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for MMR
617
+
618
+ elif method == "bm25":
619
+ retriever = vector_store_manager.get_bm25_retriever(k=k_value)
620
+ docs = retriever.invoke(query)
621
+ for i, doc in enumerate(docs):
622
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for BM25
623
+
624
+ elif method == "hybrid":
625
+ retriever = vector_store_manager.get_hybrid_retriever(k=k_value, semantic_weight=0.7, keyword_weight=0.3)
626
+ docs = retriever.invoke(query)
627
+ # Explicitly limit results to k_value since EnsembleRetriever may return more
628
+ docs = docs[:k_value]
629
+ for i, doc in enumerate(docs):
630
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for Hybrid
631
+
632
+ return _format_ranker_results_html(results, query, method)
633
+
634
+ except Exception as e:
635
+ error_msg = f"Error during search: {str(e)}"
636
+ logger.error(error_msg)
637
+ return f"""
638
+ <div class="ranker-container">
639
+ <div class="ranker-error">
640
+ <h3>❌ Search Error</h3>
641
+ <p>{error_msg}</p>
642
+ <p class="error-hint">Please check if documents are uploaded and the system is ready.</p>
643
+ </div>
644
+ </div>
645
+ """
646
+
647
+ def _format_ranker_result(doc, score, rank):
648
+ """Format a single document result for the ranker."""
649
+ metadata = doc.metadata or {}
650
+
651
+ # Extract metadata
652
+ source = metadata.get("source", "Unknown Document")
653
+ page = metadata.get("page", "N/A")
654
+ chunk_id = metadata.get("chunk_id", f"chunk_{rank}")
655
+
656
+ # Content length indicator
657
+ content_length = len(doc.page_content)
658
+ if content_length < 200:
659
+ length_indicator = "📄 Short"
660
+ elif content_length < 500:
661
+ length_indicator = "📄 Medium"
662
+ else:
663
+ length_indicator = "📄 Long"
664
+
665
+ # Rank-based confidence levels (applies to all methods)
666
+ if rank <= 3:
667
+ confidence = "High"
668
+ confidence_color = "#22c55e"
669
+ confidence_icon = "🟢"
670
+ elif rank <= 6:
671
+ confidence = "Medium"
672
+ confidence_color = "#f59e0b"
673
+ confidence_icon = "🟡"
674
+ else:
675
+ confidence = "Low"
676
+ confidence_color = "#ef4444"
677
+ confidence_icon = "🔴"
678
+
679
+ result = {
680
+ "rank": rank,
681
+ "content": doc.page_content,
682
+ "source": source,
683
+ "page": page,
684
+ "chunk_id": chunk_id,
685
+ "length_indicator": length_indicator,
686
+ "has_score": score is not None,
687
+ "confidence": confidence,
688
+ "confidence_color": confidence_color,
689
+ "confidence_icon": confidence_icon
690
+ }
691
+
692
+ # Only add score if we have a real score (similarity search only)
693
+ if score is not None:
694
+ result["score"] = round(score, 3)
695
+
696
+ return result
697
+
698
+ def _format_ranker_results_html(results, query, method):
699
+ """Format search results as HTML."""
700
+ if not results:
701
+ return """
702
+ <div class="ranker-container">
703
+ <div class="ranker-no-results">
704
+ <h3>🔍 No Results Found</h3>
705
+ <p>No relevant documents found for your query.</p>
706
+ <p class="no-results-hint">Try different keywords or check if documents are uploaded.</p>
707
+ </div>
708
+ </div>
709
+ """
710
+
711
+ # Method display names
712
+ method_labels = {
713
+ "similarity": "🎯 Similarity Search",
714
+ "mmr": "🔀 MMR (Diverse)",
715
+ "bm25": "🔍 BM25 (Keywords)",
716
+ "hybrid": "🔗 Hybrid (Recommended)"
717
+ }
718
+ method_display = method_labels.get(method, method)
719
+
720
+ # Start building HTML
721
+ html_parts = [f"""
722
+ <div class="ranker-container">
723
+ <div class="ranker-header">
724
+ <div class="ranker-title">
725
+ <h3>🔍 Search Results</h3>
726
+ <div class="query-display">"{query.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')}"</div>
727
+ </div>
728
+ <div class="ranker-meta">
729
+ <span class="method-badge">{method_display}</span>
730
+ <span class="result-count">{len(results)} results</span>
731
+ </div>
732
+ </div>
733
+ """]
734
+
735
+ # Add results
736
+ for result in results:
737
+ rank_emoji = ["🥇", "🥈", "🥉"][result["rank"] - 1] if result["rank"] <= 3 else f"#{result['rank']}"
738
+
739
+ # Escape content for safe HTML inclusion and JavaScript
740
+ escaped_content = result['content'].replace('"', '&quot;').replace("'", "&#39;").replace('\n', '\\n')
741
+
742
+ # Build score info - always show confidence, only show score for similarity search
743
+ score_info_parts = [f"""
744
+ <span class="confidence-badge" style="color: {result['confidence_color']}">
745
+ {result['confidence_icon']} {result['confidence']}
746
+ </span>"""]
747
+
748
+ # Only add score value if we have real scores (similarity search)
749
+ if result.get('has_score', False):
750
+ score_info_parts.append(f'<span class="score-value">🎯 {result["score"]}</span>')
751
+
752
+ score_info_html = f"""
753
+ <div class="score-info">
754
+ {''.join(score_info_parts)}
755
+ </div>"""
756
+
757
+ html_parts.append(f"""
758
+ <div class="result-card">
759
+ <div class="result-header">
760
+ <div class="rank-info">
761
+ <span class="rank-badge">{rank_emoji} Rank {result['rank']}</span>
762
+ <span class="source-info">📄 {result['source']}</span>
763
+ {f"<span class='page-info'>Page {result['page']}</span>" if result['page'] != 'N/A' else ""}
764
+ <span class="length-info">{result['length_indicator']}</span>
765
+ </div>
766
+ {score_info_html}
767
+ </div>
768
+ <div class="result-content">
769
+ <div class="content-text">{result['content'].replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')}</div>
770
+ </div>
771
+ </div>
772
+ """)
773
+
774
+ html_parts.append("</div>")
775
+
776
+ return "".join(html_parts)
777
+
778
+ def get_ranker_status():
779
+ """Get current ranker system status."""
780
+ try:
781
+ # Get collection info
782
+ collection_info = vector_store_manager.get_collection_info()
783
+ document_count = collection_info.get("document_count", 0)
784
+
785
+ # Get available methods
786
+ available_methods = ["similarity", "mmr", "bm25", "hybrid"]
787
+
788
+ # Check if system is ready
789
+ ingestion_status = document_ingestion_service.get_ingestion_status()
790
+ system_ready = ingestion_status.get('system_ready', False)
791
+
792
+ status_html = f"""
793
+ <div class="status-card">
794
+ <div class="status-header">
795
+ <h3>🔍 Query Ranker Status</h3>
796
+ <div class="status-indicator {'status-ready' if system_ready else 'status-not-ready'}">
797
+ {'🟢 READY' if system_ready else '🔴 NOT READY'}
798
+ </div>
799
+ </div>
800
+
801
+ <div class="status-grid">
802
+ <div class="status-item">
803
+ <div class="status-label">Available Documents</div>
804
+ <div class="status-value">{document_count}</div>
805
+ </div>
806
+ <div class="status-item">
807
+ <div class="status-label">Retrieval Methods</div>
808
+ <div class="status-value">{len(available_methods)}</div>
809
+ </div>
810
+ <div class="status-item">
811
+ <div class="status-label">Vector Store</div>
812
+ <div class="status-value">{'Ready' if system_ready else 'Not Ready'}</div>
813
+ </div>
814
+ </div>
815
+
816
+ <div class="ranker-methods">
817
+ <div class="methods-label">Available Methods:</div>
818
+ <div class="methods-list">
819
+ <span class="method-tag">🎯 Similarity</span>
820
+ <span class="method-tag">🔀 MMR</span>
821
+ <span class="method-tag">🔍 BM25</span>
822
+ <span class="method-tag">🔗 Hybrid</span>
823
+ </div>
824
+ </div>
825
+ </div>
826
+ """
827
+
828
+ return status_html
829
+
830
+ except Exception as e:
831
+ error_msg = f"Error getting ranker status: {str(e)}"
832
+ logger.error(error_msg)
833
+ return f"""
834
+ <div class="status-card status-error">
835
+ <div class="status-header">
836
+ <h3>❌ System Error</h3>
837
+ </div>
838
+ <p class="error-message">{error_msg}</p>
839
+ </div>
840
+ """
841
+
842
+ def get_chat_status():
843
+ """Get current chat system status."""
844
+ try:
845
+ # Check ingestion status
846
+ ingestion_status = document_ingestion_service.get_ingestion_status()
847
+
848
+ # Check usage stats
849
+ usage_stats = rag_chat_service.get_usage_stats()
850
+
851
+ # Get data status for additional context
852
+ data_status = data_clearing_service.get_data_status()
853
+
854
+ # Modern status card design with better styling
855
+ status_html = f"""
856
+ <div class="status-card">
857
+ <div class="status-header">
858
+ <h3>💬 Chat System Status</h3>
859
+ <div class="status-indicator {'status-ready' if ingestion_status.get('system_ready', False) else 'status-not-ready'}">
860
+ {'🟢 READY' if ingestion_status.get('system_ready', False) else '🔴 NOT READY'}
861
+ </div>
862
+ </div>
863
+
864
+ <div class="status-grid">
865
+ <div class="status-item">
866
+ <div class="status-label">Vector Store Docs</div>
867
+ <div class="status-value">{data_status.get('vector_store', {}).get('document_count', 0)}</div>
868
+ </div>
869
+ <div class="status-item">
870
+ <div class="status-label">Chat History Files</div>
871
+ <div class="status-value">{data_status.get('chat_history', {}).get('file_count', 0)}</div>
872
+ </div>
873
+ <div class="status-item">
874
+ <div class="status-label">Session Usage</div>
875
+ <div class="status-value">{usage_stats.get('session_messages', 0)}/{usage_stats.get('session_limit', 50)}</div>
876
+ </div>
877
+ <div class="status-item">
878
+ <div class="status-label">Environment</div>
879
+ <div class="status-value">{'HF Space' if data_status.get('environment') == 'hf_space' else 'Local'}</div>
880
+ </div>
881
+ </div>
882
+
883
+ <div class="status-services">
884
+ <div class="service-status {'service-ready' if ingestion_status.get('embedding_model_available', False) else 'service-error'}">
885
+ <span class="service-icon">🧠</span>
886
+ <span>Embedding Model</span>
887
+ <span class="service-indicator">{'✅' if ingestion_status.get('embedding_model_available', False) else '❌'}</span>
888
+ </div>
889
+ <div class="service-status {'service-ready' if ingestion_status.get('vector_store_available', False) else 'service-error'}">
890
+ <span class="service-icon">🗄️</span>
891
+ <span>Vector Store</span>
892
+ <span class="service-indicator">{'✅' if ingestion_status.get('vector_store_available', False) else '❌'}</span>
893
+ </div>
894
+ </div>
895
+ </div>
896
+ """
897
+
898
+ return status_html
899
+
900
+ except Exception as e:
901
+ error_msg = f"Error getting chat status: {str(e)}"
902
+ logger.error(error_msg)
903
+ return f"""
904
+ <div class="status-card status-error">
905
+ <div class="status-header">
906
+ <h3>❌ System Error</h3>
907
+ </div>
908
+ <p class="error-message">{error_msg}</p>
909
+ </div>
910
+ """
911
+
912
+ def create_ui():
913
+ with gr.Blocks(css="""
914
+ /* Global styles */
915
+ .gradio-container {
916
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
917
+ }
918
+
919
+ /* Document converter styles */
920
+ .output-container {
921
+ max-height: 420px;
922
+ overflow-y: auto;
923
+ border: 1px solid #ddd;
924
+ padding: 10px;
925
+ }
926
+
927
+ .gradio-container .prose {
928
+ overflow: visible;
929
+ }
930
+
931
+ .processing-controls {
932
+ display: flex;
933
+ justify-content: center;
934
+ gap: 10px;
935
+ margin-top: 10px;
936
+ }
937
+
938
+ .provider-options-row {
939
+ margin-top: 15px;
940
+ margin-bottom: 15px;
941
+ }
942
+
943
+ /* Chat Tab Styles - Complete redesign */
944
+ .chat-tab-container {
945
+ max-width: 1200px;
946
+ margin: 0 auto;
947
+ padding: 20px;
948
+ }
949
+
950
+ .chat-header {
951
+ text-align: center;
952
+ margin-bottom: 30px;
953
+ padding: 20px;
954
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
955
+ border-radius: 15px;
956
+ color: white;
957
+ box-shadow: 0 4px 15px rgba(0,0,0,0.1);
958
+ }
959
+
960
+ .chat-header h2 {
961
+ margin: 0;
962
+ font-size: 1.8em;
963
+ font-weight: 600;
964
+ }
965
+
966
+ .chat-header p {
967
+ margin: 10px 0 0 0;
968
+ opacity: 0.9;
969
+ font-size: 1.1em;
970
+ }
971
+
972
+ /* Status Card Styling */
973
+ .status-card {
974
+ background: #ffffff;
975
+ border: 1px solid #e1e5e9;
976
+ border-radius: 12px;
977
+ padding: 20px;
978
+ margin-bottom: 25px;
979
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
980
+ transition: all 0.3s ease;
981
+ }
982
+
983
+ .status-card:hover {
984
+ box-shadow: 0 4px 20px rgba(0,0,0,0.1);
985
+ }
986
+
987
+ .status-header {
988
+ display: flex;
989
+ justify-content: space-between;
990
+ align-items: center;
991
+ margin-bottom: 20px;
992
+ padding-bottom: 15px;
993
+ border-bottom: 2px solid #f0f2f5;
994
+ }
995
+
996
+ .status-header h3 {
997
+ margin: 0;
998
+ color: #2c3e50;
999
+ font-size: 1.3em;
1000
+ font-weight: 600;
1001
+ }
1002
+
1003
+ .status-indicator {
1004
+ padding: 8px 16px;
1005
+ border-radius: 25px;
1006
+ font-weight: 600;
1007
+ font-size: 0.9em;
1008
+ letter-spacing: 0.5px;
1009
+ }
1010
+
1011
+ .status-ready {
1012
+ background: #d4edda;
1013
+ color: #155724;
1014
+ border: 1px solid #c3e6cb;
1015
+ }
1016
+
1017
+ .status-not-ready {
1018
+ background: #f8d7da;
1019
+ color: #721c24;
1020
+ border: 1px solid #f5c6cb;
1021
+ }
1022
+
1023
+ .status-grid {
1024
+ display: grid;
1025
+ grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
1026
+ gap: 15px;
1027
+ margin-bottom: 20px;
1028
+ }
1029
+
1030
+ .status-item {
1031
+ background: #f8f9fa;
1032
+ padding: 15px;
1033
+ border-radius: 8px;
1034
+ text-align: center;
1035
+ border: 1px solid #e9ecef;
1036
+ }
1037
+
1038
+ .status-label {
1039
+ font-size: 0.85em;
1040
+ color: #6c757d;
1041
+ margin-bottom: 5px;
1042
+ font-weight: 500;
1043
+ }
1044
+
1045
+ .status-value {
1046
+ font-size: 1.4em;
1047
+ font-weight: 700;
1048
+ color: #495057;
1049
+ }
1050
+
1051
+ .status-services {
1052
+ display: flex;
1053
+ gap: 15px;
1054
+ flex-wrap: wrap;
1055
+ }
1056
+
1057
+ .service-status {
1058
+ display: flex;
1059
+ align-items: center;
1060
+ gap: 8px;
1061
+ padding: 10px 15px;
1062
+ border-radius: 8px;
1063
+ font-weight: 500;
1064
+ flex: 1;
1065
+ min-width: 200px;
1066
+ color: #2c3e50 !important;
1067
+ }
1068
+
1069
+ .service-status span {
1070
+ color: #2c3e50 !important;
1071
+ }
1072
+
1073
+ .service-ready {
1074
+ background: #d4edda;
1075
+ color: #2c3e50 !important;
1076
+ border: 1px solid #c3e6cb;
1077
+ }
1078
+
1079
+ .service-ready span {
1080
+ color: #2c3e50 !important;
1081
+ }
1082
+
1083
+ .service-error {
1084
+ background: #f8d7da;
1085
+ color: #2c3e50 !important;
1086
+ border: 1px solid #f5c6cb;
1087
+ }
1088
+
1089
+ .service-error span {
1090
+ color: #2c3e50 !important;
1091
+ }
1092
+
1093
+ .service-icon {
1094
+ font-size: 1.2em;
1095
+ }
1096
+
1097
+ .service-indicator {
1098
+ margin-left: auto;
1099
+ }
1100
+
1101
+ .status-error {
1102
+ border-color: #dc3545;
1103
+ background: #f8d7da;
1104
+ }
1105
+
1106
+ .error-message {
1107
+ color: #721c24;
1108
+ margin: 0;
1109
+ font-weight: 500;
1110
+ }
1111
+
1112
+ /* Control buttons styling */
1113
+ .control-buttons {
1114
+ display: flex;
1115
+ gap: 12px;
1116
+ justify-content: flex-end;
1117
+ margin-bottom: 25px;
1118
+ }
1119
+
1120
+ .control-btn {
1121
+ padding: 10px 20px;
1122
+ border-radius: 8px;
1123
+ font-weight: 500;
1124
+ transition: all 0.3s ease;
1125
+ border: none;
1126
+ cursor: pointer;
1127
+ }
1128
+
1129
+ .btn-refresh {
1130
+ background: #17a2b8;
1131
+ color: white;
1132
+ }
1133
+
1134
+ .btn-refresh:hover {
1135
+ background: #138496;
1136
+ transform: translateY(-1px);
1137
+ }
1138
+
1139
+ .btn-new-session {
1140
+ background: #28a745;
1141
+ color: white;
1142
+ }
1143
+
1144
+ .btn-new-session:hover {
1145
+ background: #218838;
1146
+ transform: translateY(-1px);
1147
+ }
1148
+
1149
+ .btn-clear-data {
1150
+ background: #dc3545;
1151
+ color: white;
1152
+ }
1153
+
1154
+ .btn-clear-data:hover {
1155
+ background: #c82333;
1156
+ transform: translateY(-1px);
1157
+ }
1158
+
1159
+ .btn-primary {
1160
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1161
+ color: white;
1162
+ }
1163
+
1164
+ .btn-primary:hover {
1165
+ transform: translateY(-1px);
1166
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1167
+ }
1168
+
1169
+ /* Chat interface styling */
1170
+ .chat-main-container {
1171
+ background: #ffffff;
1172
+ border-radius: 15px;
1173
+ box-shadow: 0 4px 20px rgba(0,0,0,0.08);
1174
+ overflow: hidden;
1175
+ margin-bottom: 25px;
1176
+ }
1177
+
1178
+ .chat-container {
1179
+ background: #ffffff;
1180
+ border-radius: 12px;
1181
+ border: 1px solid #e1e5e9;
1182
+ overflow: hidden;
1183
+ }
1184
+
1185
+ /* Custom chatbot styling */
1186
+ .gradio-chatbot {
1187
+ border: none !important;
1188
+ background: #ffffff;
1189
+ }
1190
+
1191
+ .gradio-chatbot .message {
1192
+ padding: 15px 20px;
1193
+ margin: 10px;
1194
+ border-radius: 12px;
1195
+ }
1196
+
1197
+ .gradio-chatbot .message.user {
1198
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1199
+ color: white;
1200
+ margin-left: 50px;
1201
+ }
1202
+
1203
+ .gradio-chatbot .message.assistant {
1204
+ background: #f8f9fa;
1205
+ border: 1px solid #e9ecef;
1206
+ margin-right: 50px;
1207
+ }
1208
+
1209
+ /* Input area styling */
1210
+ .chat-input-container {
1211
+ background: #ffffff;
1212
+ padding: 20px;
1213
+ border-top: 1px solid #e1e5e9;
1214
+ border-radius: 0 0 15px 15px;
1215
+ }
1216
+
1217
+ .input-row {
1218
+ display: flex;
1219
+ gap: 12px;
1220
+ align-items: center;
1221
+ }
1222
+
1223
+ .message-input {
1224
+ flex: 1;
1225
+ border: 2px solid #e1e5e9;
1226
+ border-radius: 25px;
1227
+ padding: 12px 20px;
1228
+ font-size: 1em;
1229
+ transition: all 0.3s ease;
1230
+ resize: none;
1231
+ max-height: 120px;
1232
+ min-height: 48px;
1233
+ }
1234
+
1235
+ .message-input:focus {
1236
+ border-color: #667eea;
1237
+ box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
1238
+ outline: none;
1239
+ }
1240
+
1241
+ .send-button {
1242
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1243
+ color: white;
1244
+ border: none;
1245
+ border-radius: 12px;
1246
+ padding: 12px 24px;
1247
+ min-width: 80px;
1248
+ height: 48px;
1249
+ margin-right: 10px;
1250
+ cursor: pointer;
1251
+ transition: all 0.3s ease;
1252
+ display: flex;
1253
+ align-items: center;
1254
+ justify-content: center;
1255
+ font-size: 1em;
1256
+ font-weight: 600;
1257
+ letter-spacing: 0.5px;
1258
+ }
1259
+
1260
+ .send-button:hover {
1261
+ transform: scale(1.05);
1262
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1263
+ }
1264
+
1265
+ /* Session info styling */
1266
+ .session-info {
1267
+ background: #e7f3ff;
1268
+ border: 1px solid #b3d9ff;
1269
+ border-radius: 8px;
1270
+ padding: 15px;
1271
+ color: #0056b3;
1272
+ font-weight: 500;
1273
+ text-align: center;
1274
+ }
1275
+
1276
+ /* Responsive design */
1277
+ @media (max-width: 768px) {
1278
+ .chat-tab-container {
1279
+ padding: 10px;
1280
+ }
1281
+
1282
+ .status-grid {
1283
+ grid-template-columns: repeat(2, 1fr);
1284
+ }
1285
+
1286
+ .service-status {
1287
+ min-width: 100%;
1288
+ }
1289
+
1290
+ .control-buttons {
1291
+ flex-direction: column;
1292
+ gap: 8px;
1293
+ }
1294
+
1295
+ .gradio-chatbot .message.user {
1296
+ margin-left: 20px;
1297
+ }
1298
+
1299
+ .gradio-chatbot .message.assistant {
1300
+ margin-right: 20px;
1301
+ }
1302
+ }
1303
+
1304
+ /* Query Ranker Styles */
1305
+ .ranker-container {
1306
+ max-width: 1200px;
1307
+ margin: 0 auto;
1308
+ padding: 20px;
1309
+ }
1310
+
1311
+ .ranker-placeholder {
1312
+ text-align: center;
1313
+ padding: 40px;
1314
+ background: #f8f9fa;
1315
+ border-radius: 12px;
1316
+ border: 1px solid #e9ecef;
1317
+ color: #6c757d;
1318
+ }
1319
+
1320
+ .ranker-placeholder h3 {
1321
+ color: #495057;
1322
+ margin-bottom: 10px;
1323
+ }
1324
+
1325
+ .ranker-error {
1326
+ text-align: center;
1327
+ padding: 30px;
1328
+ background: #f8d7da;
1329
+ border: 1px solid #f5c6cb;
1330
+ border-radius: 12px;
1331
+ color: #721c24;
1332
+ }
1333
+
1334
+ .ranker-error h3 {
1335
+ margin-bottom: 15px;
1336
+ }
1337
+
1338
+ .error-hint {
1339
+ font-style: italic;
1340
+ margin-top: 10px;
1341
+ opacity: 0.8;
1342
+ }
1343
+
1344
+ .ranker-no-results {
1345
+ text-align: center;
1346
+ padding: 40px;
1347
+ background: #ffffff;
1348
+ border: 1px solid #e1e5e9;
1349
+ border-radius: 12px;
1350
+ color: #6c757d;
1351
+ }
1352
+
1353
+ .ranker-no-results h3 {
1354
+ color: #495057;
1355
+ margin-bottom: 15px;
1356
+ }
1357
+
1358
+ .no-results-hint {
1359
+ font-style: italic;
1360
+ margin-top: 10px;
1361
+ opacity: 0.8;
1362
+ }
1363
+
1364
+ .ranker-header {
1365
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1366
+ color: white;
1367
+ padding: 20px;
1368
+ border-radius: 15px;
1369
+ margin-bottom: 25px;
1370
+ box-shadow: 0 4px 15px rgba(0,0,0,0.1);
1371
+ }
1372
+
1373
+ .ranker-title h3 {
1374
+ margin: 0 0 10px 0;
1375
+ font-size: 1.4em;
1376
+ font-weight: 600;
1377
+ }
1378
+
1379
+ .query-display {
1380
+ font-size: 1.1em;
1381
+ opacity: 0.9;
1382
+ font-style: italic;
1383
+ margin-bottom: 15px;
1384
+ }
1385
+
1386
+ .ranker-meta {
1387
+ display: flex;
1388
+ gap: 15px;
1389
+ align-items: center;
1390
+ flex-wrap: wrap;
1391
+ }
1392
+
1393
+ .method-badge {
1394
+ background: rgba(255, 255, 255, 0.2);
1395
+ padding: 6px 12px;
1396
+ border-radius: 20px;
1397
+ font-weight: 500;
1398
+ font-size: 0.9em;
1399
+ }
1400
+
1401
+ .result-count {
1402
+ background: rgba(255, 255, 255, 0.15);
1403
+ padding: 6px 12px;
1404
+ border-radius: 20px;
1405
+ font-weight: 500;
1406
+ font-size: 0.9em;
1407
+ }
1408
+
1409
+ .result-card {
1410
+ background: #ffffff;
1411
+ border: 1px solid #e1e5e9;
1412
+ border-radius: 12px;
1413
+ margin-bottom: 20px;
1414
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
1415
+ transition: all 0.3s ease;
1416
+ overflow: hidden;
1417
+ }
1418
+
1419
+ .result-card:hover {
1420
+ box-shadow: 0 4px 20px rgba(0,0,0,0.1);
1421
+ transform: translateY(-2px);
1422
+ }
1423
+
1424
+ .result-header {
1425
+ display: flex;
1426
+ justify-content: space-between;
1427
+ align-items: center;
1428
+ padding: 15px 20px;
1429
+ background: #f8f9fa;
1430
+ border-bottom: 1px solid #e9ecef;
1431
+ }
1432
+
1433
+ .rank-info {
1434
+ display: flex;
1435
+ gap: 10px;
1436
+ align-items: center;
1437
+ flex-wrap: wrap;
1438
+ }
1439
+
1440
+ .rank-badge {
1441
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1442
+ color: white;
1443
+ padding: 4px 10px;
1444
+ border-radius: 15px;
1445
+ font-weight: 600;
1446
+ font-size: 0.85em;
1447
+ }
1448
+
1449
+ .source-info {
1450
+ background: #e9ecef;
1451
+ color: #495057;
1452
+ padding: 4px 8px;
1453
+ border-radius: 10px;
1454
+ font-size: 0.85em;
1455
+ font-weight: 500;
1456
+ }
1457
+
1458
+ .page-info {
1459
+ background: #d1ecf1;
1460
+ color: #0c5460;
1461
+ padding: 4px 8px;
1462
+ border-radius: 10px;
1463
+ font-size: 0.85em;
1464
+ }
1465
+
1466
+ .length-info {
1467
+ background: #f8f9fa;
1468
+ color: #6c757d;
1469
+ padding: 4px 8px;
1470
+ border-radius: 10px;
1471
+ font-size: 0.85em;
1472
+ }
1473
+
1474
+ .score-info {
1475
+ display: flex;
1476
+ gap: 10px;
1477
+ align-items: center;
1478
+ }
1479
+
1480
+ .confidence-badge {
1481
+ padding: 4px 8px;
1482
+ border-radius: 10px;
1483
+ font-weight: 600;
1484
+ font-size: 0.85em;
1485
+ }
1486
+
1487
+ .score-value {
1488
+ background: #2c3e50;
1489
+ color: white;
1490
+ padding: 6px 12px;
1491
+ border-radius: 15px;
1492
+ font-weight: 600;
1493
+ font-size: 0.9em;
1494
+ }
1495
+
1496
+ .result-content {
1497
+ padding: 20px;
1498
+ }
1499
+
1500
+ .content-text {
1501
+ line-height: 1.6;
1502
+ color: #2c3e50;
1503
+ border-left: 3px solid #667eea;
1504
+ padding-left: 15px;
1505
+ background: #f8f9fa;
1506
+ padding: 15px;
1507
+ border-radius: 0 8px 8px 0;
1508
+ max-height: 300px;
1509
+ overflow-y: auto;
1510
+ }
1511
+
1512
+ .result-actions {
1513
+ display: flex;
1514
+ gap: 10px;
1515
+ padding: 15px 20px;
1516
+ background: #f8f9fa;
1517
+ border-top: 1px solid #e9ecef;
1518
+ }
1519
+
1520
+ .action-btn {
1521
+ padding: 8px 16px;
1522
+ border: none;
1523
+ border-radius: 8px;
1524
+ font-weight: 500;
1525
+ cursor: pointer;
1526
+ transition: all 0.3s ease;
1527
+ font-size: 0.9em;
1528
+ display: flex;
1529
+ align-items: center;
1530
+ gap: 5px;
1531
+ }
1532
+
1533
+ .copy-btn {
1534
+ background: #17a2b8;
1535
+ color: white;
1536
+ }
1537
+
1538
+ .copy-btn:hover {
1539
+ background: #138496;
1540
+ transform: translateY(-1px);
1541
+ }
1542
+
1543
+ .info-btn {
1544
+ background: #6c757d;
1545
+ color: white;
1546
+ }
1547
+
1548
+ .info-btn:hover {
1549
+ background: #5a6268;
1550
+ transform: translateY(-1px);
1551
+ }
1552
+
1553
+ .ranker-methods {
1554
+ margin-top: 20px;
1555
+ padding-top: 15px;
1556
+ border-top: 1px solid #e9ecef;
1557
+ }
1558
+
1559
+ .methods-label {
1560
+ font-weight: 600;
1561
+ color: #495057;
1562
+ margin-bottom: 10px;
1563
+ font-size: 0.9em;
1564
+ }
1565
+
1566
+ .methods-list {
1567
+ display: flex;
1568
+ gap: 8px;
1569
+ flex-wrap: wrap;
1570
+ }
1571
+
1572
+ .method-tag {
1573
+ background: #e9ecef;
1574
+ color: #495057;
1575
+ padding: 4px 10px;
1576
+ border-radius: 12px;
1577
+ font-size: 0.8em;
1578
+ font-weight: 500;
1579
+ }
1580
+
1581
+         /* Ranker controls styling */
+         .ranker-controls {
+             background: #ffffff;
+             border: 1px solid #e1e5e9;
+             border-radius: 12px;
+             padding: 20px;
+             margin-bottom: 25px;
+             box-shadow: 0 2px 10px rgba(0,0,0,0.05);
+         }
+
+         .ranker-input-row {
+             display: flex;
+             gap: 15px;
+             align-items: end;
+             margin-bottom: 15px;
+         }
+
+         .ranker-query-input {
+             flex: 1;
+             border: 2px solid #e1e5e9;
+             border-radius: 25px;
+             padding: 12px 20px;
+             font-size: 1em;
+             transition: all 0.3s ease;
+         }
+
+         .ranker-query-input:focus {
+             border-color: #667eea;
+             box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
+             outline: none;
+         }
+
+         .ranker-search-btn {
+             background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+             color: white;
+             border: none;
+             border-radius: 12px;
+             padding: 12px 24px;
+             min-width: 100px;
+             cursor: pointer;
+             transition: all 0.3s ease;
+             font-weight: 600;
+             font-size: 1em;
+         }
+
+         .ranker-search-btn:hover {
+             transform: scale(1.05);
+             box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
+         }
+
+         .ranker-options-row {
+             display: flex;
+             gap: 15px;
+             align-items: center;
+         }
+
+         /* Responsive design for ranker */
+         @media (max-width: 768px) {
+             .ranker-container {
+                 padding: 10px;
+             }
+
+             .ranker-input-row {
+                 flex-direction: column;
+                 gap: 10px;
+             }
+
+             .ranker-options-row {
+                 flex-direction: column;
+                 gap: 10px;
+                 align-items: stretch;
+             }
+
+             .ranker-meta {
+                 justify-content: center;
+             }
+
+             .rank-info {
+                 flex-direction: column;
+                 gap: 5px;
+                 align-items: flex-start;
+             }
+
+             .result-header {
+                 flex-direction: column;
+                 gap: 10px;
+                 align-items: flex-start;
+             }
+
+             .score-info {
+                 align-self: flex-end;
+             }
+
+             .result-actions {
+                 flex-direction: column;
+                 gap: 8px;
+             }
+         }
+     """) as demo:
+         # Modern title with better styling
+         gr.Markdown("""
+         # 🚀 Markit
+         ## Document to Markdown Converter with RAG Chat
+         """)
+
+         with gr.Tabs():
+             # Document Converter Tab
+             with gr.TabItem("📄 Document Converter"):
+                 with gr.Column(elem_classes=["chat-tab-container"]):
+                     # Modern header matching other tabs
+                     gr.HTML("""
+                     <div class="chat-header">
+                         <h2>📄 Document Converter</h2>
+                         <p>Convert documents to Markdown format with advanced OCR and AI processing</p>
+                     </div>
+                     """)
+
+                     # State to track if cancellation is requested
+                     cancel_requested = gr.State(False)
+                     # State to store the conversion thread
+                     conversion_thread = gr.State(None)
+                     # State to store the output format (fixed to Markdown)
+                     output_format_state = gr.State("Markdown")
+
+                     # Multi-file input (supports single and multiple files)
+                     files_input = gr.Files(
+                         label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
+                         file_count="multiple",
+                         file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
+                     )
+
+                     # Processing type selector (visible only for multiple files)
+                     processing_type_selector = gr.Radio(
+                         choices=["combined", "individual", "summary", "comparison"],
+                         value="combined",
+                         label="Multi-Document Processing Type",
+                         info="How to process multiple documents together",
+                         visible=False
+                     )
+
+                     # Status text to show file count and processing mode
+                     file_status_text = gr.HTML(
+                         value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
+                         label=""
+                     )
+
+                     # Provider and OCR options below the file input
+                     with gr.Row(elem_classes=["provider-options-row"]):
+                         with gr.Column(scale=1):
+                             parser_names = ParserRegistry.get_parser_names()
+
+                             # Make MarkItDown the default parser if available
+                             default_parser = next((p for p in parser_names if p == "MarkItDown"), parser_names[0] if parser_names else "PyPdfium")
+
+                             provider_dropdown = gr.Dropdown(
+                                 label="Provider",
+                                 choices=parser_names,
+                                 value=default_parser,
+                                 interactive=True
+                             )
+                         with gr.Column(scale=1):
+                             default_ocr_options = ParserRegistry.get_ocr_options(default_parser)
+                             default_ocr = default_ocr_options[0] if default_ocr_options else "No OCR"
+
+                             ocr_dropdown = gr.Dropdown(
+                                 label="OCR Options",
+                                 choices=default_ocr_options,
+                                 value=default_ocr,
+                                 interactive=True
+                             )
+
+                     # Processing controls row with consistent styling
+                     with gr.Row(elem_classes=["control-buttons"]):
+                         convert_button = gr.Button("🚀 Convert", elem_classes=["control-btn", "btn-primary"])
+                         cancel_button = gr.Button("⏹️ Cancel", elem_classes=["control-btn", "btn-clear-data"], visible=False)
+
+                     # Simple output container with just one scrollbar
+                     file_display = gr.HTML(
+                         value="<div class='output-container'></div>",
+                         label="Converted Content"
+                     )
+
+                     file_download = gr.File(label="Download File")
+
+                     # Event handlers for document converter
+
+                     # Update UI when files are uploaded/changed
+                     files_input.change(
+                         fn=update_ui_for_file_count,
+                         inputs=[files_input],
+                         outputs=[processing_type_selector, file_status_text]
+                     )
+
+                     provider_dropdown.change(
+                         lambda p: gr.Dropdown(
+                             choices=["Plain Text", "Formatted Text"] if "GOT-OCR" in p else ParserRegistry.get_ocr_options(p),
+                             value="Plain Text" if "GOT-OCR" in p else (ParserRegistry.get_ocr_options(p)[0] if ParserRegistry.get_ocr_options(p) else None)
+                         ),
+                         inputs=[provider_dropdown],
+                         outputs=[ocr_dropdown]
+                     )
+
+                     # Reset cancel flag when starting conversion
+                     def start_conversion():
+                         global conversion_cancelled
+                         conversion_cancelled.clear()
+                         logger.info("Starting conversion with cancellation flag cleared")
+                         return gr.update(visible=False), gr.update(visible=True), False
+
+                     # Set cancel flag and terminate thread when cancel button is clicked
+                     def request_cancellation(thread):
+                         global conversion_cancelled
+                         conversion_cancelled.set()
+                         logger.info("Cancel button clicked, cancellation flag set")
+
+                         # Try to join the thread with a timeout
+                         if thread is not None:
+                             logger.info(f"Attempting to join conversion thread: {thread}")
+                             thread.join(timeout=0.5)
+                             if thread.is_alive():
+                                 logger.warning("Thread did not finish within timeout")
+
+                         # Add immediate feedback to the user
+                         return gr.update(visible=True), gr.update(visible=False), True, None
+
+                     # Start conversion sequence
+                     convert_button.click(
+                         fn=start_conversion,
+                         inputs=[],
+                         outputs=[convert_button, cancel_button, cancel_requested],
+                         queue=False  # Execute immediately
+                     ).then(
+                         fn=handle_convert,
+                         inputs=[files_input, provider_dropdown, ocr_dropdown, output_format_state, processing_type_selector, cancel_requested],
+                         outputs=[file_display, file_download, convert_button, cancel_button, conversion_thread]
+                     )
+
+                     # Handle cancel button click
+                     cancel_button.click(
+                         fn=request_cancellation,
+                         inputs=[conversion_thread],
+                         outputs=[convert_button, cancel_button, cancel_requested, conversion_thread],
+                         queue=False  # Execute immediately
+                     )
+
+             # Chat Tab - Completely redesigned
+             with gr.TabItem("💬 Chat with Documents"):
+                 with gr.Column(elem_classes=["chat-tab-container"]):
+                     # Modern header
+                     gr.HTML("""
+                     <div class="chat-header">
+                         <h2>💬 Chat with your converted documents</h2>
+                         <p>Ask questions about your documents using advanced RAG technology</p>
+                     </div>
+                     """)
+
+                     # Status section with modern design
+                     status_display = gr.HTML(value=get_chat_status())
+
+                     # Control buttons
+                     with gr.Row(elem_classes=["control-buttons"]):
+                         refresh_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
+                         new_session_btn = gr.Button("🆕 New Session", elem_classes=["control-btn", "btn-new-session"])
+                         clear_data_btn = gr.Button("🗑️ Clear All Data", elem_classes=["control-btn", "btn-clear-data"], variant="stop")
+
+                     # Main chat interface
+                     with gr.Column(elem_classes=["chat-main-container"]):
+                         chatbot = gr.Chatbot(
+                             elem_classes=["chat-container"],
+                             height=500,
+                             show_label=False,
+                             show_share_button=False,
+                             bubble_full_width=False,
+                             type="messages",
+                             placeholder="Start a conversation by asking questions about your documents..."
+                         )
+
+                         # Input area
+                         with gr.Row(elem_classes=["input-row"]):
+                             msg_input = gr.Textbox(
+                                 placeholder="Ask questions about your documents...",
+                                 show_label=False,
+                                 scale=5,
+                                 lines=1,
+                                 max_lines=3,
+                                 elem_classes=["message-input"]
+                             )
+                             send_btn = gr.Button("Submit", elem_classes=["send-button"], scale=0)
+
+                         # Session info with better styling
+                         session_info = gr.HTML(
+                             value='<div class="session-info">No active session - Click "New Session" to start</div>'
+                         )
+
+                     # Event handlers for chat
+                     def clear_input():
+                         return ""
+
+                     # Send message when button clicked or Enter pressed
+                     msg_input.submit(
+                         fn=handle_chat_message,
+                         inputs=[msg_input, chatbot],
+                         outputs=[msg_input, chatbot, status_display]
+                     )
+
+                     send_btn.click(
+                         fn=handle_chat_message,
+                         inputs=[msg_input, chatbot],
+                         outputs=[msg_input, chatbot, status_display]
+                     )
+
+                     # New session handler with improved feedback
+                     def enhanced_new_session():
+                         history, info = start_new_chat_session()
+                         session_html = f'<div class="session-info">{info}</div>'
+                         updated_status = get_chat_status()
+                         return history, session_html, updated_status
+
+                     new_session_btn.click(
+                         fn=enhanced_new_session,
+                         inputs=[],
+                         outputs=[chatbot, session_info, status_display]
+                     )
+
+                     # Refresh status handler
+                     refresh_status_btn.click(
+                         fn=get_chat_status,
+                         inputs=[],
+                         outputs=[status_display]
+                     )
+
+                     # Clear all data handler
+                     clear_data_btn.click(
+                         fn=handle_clear_all_data,
+                         inputs=[],
+                         outputs=[chatbot, session_info, status_display]
+                     )
+
+             # Query Ranker Tab
+             with gr.TabItem("🔍 Query Ranker"):
+                 with gr.Column(elem_classes=["ranker-container"]):
+                     # Modern header
+                     gr.HTML("""
+                     <div class="chat-header">
+                         <h2>🔍 Query Ranker</h2>
+                         <p>Search and rank document chunks with similarity scores</p>
+                     </div>
+                     """)
+
+                     # Status section
+                     ranker_status_display = gr.HTML(value=get_ranker_status())
+
+                     # Control buttons
+                     with gr.Row(elem_classes=["control-buttons"]):
+                         refresh_ranker_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
+                         clear_results_btn = gr.Button("🗑️ Clear Results", elem_classes=["control-btn", "btn-clear-data"])
+
+                     # Search controls
+                     with gr.Column(elem_classes=["ranker-controls"]):
+                         with gr.Row(elem_classes=["ranker-input-row"]):
+                             query_input = gr.Textbox(
+                                 placeholder="Enter your search query...",
+                                 show_label=False,
+                                 elem_classes=["ranker-query-input"],
+                                 scale=4
+                             )
+                             search_btn = gr.Button("🔍 Search", elem_classes=["ranker-search-btn"], scale=0)
+
+                         with gr.Row(elem_classes=["ranker-options-row"]):
+                             method_dropdown = gr.Dropdown(
+                                 choices=[
+                                     ("🎯 Similarity Search", "similarity"),
+                                     ("🔀 MMR (Diverse)", "mmr"),
+                                     ("🔍 BM25 (Keywords)", "bm25"),
+                                     ("🔗 Hybrid (Recommended)", "hybrid")
+                                 ],
+                                 value="hybrid",
+                                 label="Retrieval Method",
+                                 scale=2
+                             )
+                             k_slider = gr.Slider(
+                                 minimum=1,
+                                 maximum=10,
+                                 value=5,
+                                 step=1,
+                                 label="Number of Results",
+                                 scale=1
+                             )
+
+                     # Results display
+                     results_display = gr.HTML(
+                         value=handle_query_search("", "hybrid", 5),  # Initial placeholder
+                         elem_classes=["ranker-results-container"]
+                     )
+
+                     # Event handlers for Query Ranker
+                     def clear_ranker_results():
+                         """Clear the search results and reset to placeholder."""
+                         return handle_query_search("", "hybrid", 5), ""
+
+                     def refresh_ranker_status():
+                         """Refresh the ranker status display."""
+                         return get_ranker_status()
+
+                     # Search functionality
+                     query_input.submit(
+                         fn=handle_query_search,
+                         inputs=[query_input, method_dropdown, k_slider],
+                         outputs=[results_display]
+                     )
+
+                     search_btn.click(
+                         fn=handle_query_search,
+                         inputs=[query_input, method_dropdown, k_slider],
+                         outputs=[results_display]
+                     )
+
+                     # Control button handlers
+                     refresh_ranker_status_btn.click(
+                         fn=refresh_ranker_status,
+                         inputs=[],
+                         outputs=[ranker_status_display]
+                     )
+
+                     clear_results_btn.click(
+                         fn=clear_ranker_results,
+                         inputs=[],
+                         outputs=[results_display, query_input]
+                     )
+
+                     # Update results when method or k changes
+                     method_dropdown.change(
+                         fn=handle_query_search,
+                         inputs=[query_input, method_dropdown, k_slider],
+                         outputs=[results_display]
+                     )
+
+                     k_slider.change(
+                         fn=handle_query_search,
+                         inputs=[query_input, method_dropdown, k_slider],
+                         outputs=[results_display]
+                     )
+
+     return demo
+
+
+ def launch_ui(server_name="0.0.0.0", server_port=7860, share=False):
+     demo = create_ui()
+     demo.launch(
+         server_name=server_name,
+         server_port=server_port,
+         root_path="",
+         show_error=True,
+         share=share
+     )
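The `provider_dropdown.change` lambda in this file can call `ParserRegistry.get_ocr_options(p)` up to three times for a single provider change. A minimal sketch of the same selection rule as a named helper with one lookup follows. `ocr_choices_for`, `fake_get_options`, and the option names in `_FAKE_OPTIONS` are hypothetical and exist only for illustration; the real registry call is passed in as a parameter, so nothing here assumes how `ParserRegistry` is implemented.

```python
from typing import Callable, List, Optional, Tuple

# Fixed choices the diff uses whenever the provider name contains "GOT-OCR".
GOT_OCR_CHOICES = ["Plain Text", "Formatted Text"]


def ocr_choices_for(
    provider: str,
    get_options: Callable[[str], List[str]],
) -> Tuple[List[str], Optional[str]]:
    """Return (choices, default value) for the OCR dropdown."""
    if "GOT-OCR" in provider:
        return list(GOT_OCR_CHOICES), GOT_OCR_CHOICES[0]
    options = list(get_options(provider))  # single registry lookup
    return options, (options[0] if options else None)


# Hypothetical stand-in for ParserRegistry.get_ocr_options.
_FAKE_OPTIONS = {"MarkItDown": ["No OCR", "Tesseract"]}


def fake_get_options(provider: str) -> List[str]:
    return _FAKE_OPTIONS.get(provider, [])


print(ocr_choices_for("MarkItDown", fake_get_options))
```

In the handler, the returned pair would feed `gr.Dropdown(choices=..., value=...)` exactly as the lambda does now. Keeping the rule in a plain function also makes it testable without building the UI.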
src/ui/utils/__init__.py ADDED
@@ -0,0 +1 @@
+ """UI Utils package - Utility functions for UI components."""
src/ui/utils/file_validation.py ADDED
@@ -0,0 +1,89 @@
+ """File validation utilities for the UI components."""
+
+ import gradio as gr
+ import logging
+ from pathlib import Path
+
+ from src.core.config import config
+ from src.core.logging_config import get_logger
+
+ logger = get_logger(__name__)
+
+
+ def update_ui_for_file_count(files):
+     """Update UI components based on the number of files uploaded."""
+     if not files or len(files) == 0:
+         return (
+             gr.update(visible=False),  # processing_type_selector
+             "<div style='color: #666; font-style: italic;'>Upload documents to begin</div>"  # file_status_text
+         )
+
+     if len(files) == 1:
+         file_name = files[0].name if hasattr(files[0], 'name') else str(files[0])
+         return (
+             gr.update(visible=False),  # processing_type_selector (hidden for single file)
+             f"<div style='color: #2563eb; font-weight: 500;'>📄 Single document: {file_name}</div>"
+         )
+     else:
+         # Calculate total size for validation display
+         total_size = 0
+         try:
+             for file in files:
+                 if hasattr(file, 'size'):
+                     total_size += file.size
+                 elif hasattr(file, 'name'):
+                     # For file paths, get size from filesystem
+                     total_size += Path(file.name).stat().st_size
+         except Exception:
+             pass  # Size calculation is optional for display
+
+         size_display = f" ({total_size / (1024*1024):.1f}MB)" if total_size > 0 else ""
+
+         # Check if within limits
+         if len(files) > 5:
+             status_color = "#dc2626"  # red
+             status_text = f"⚠️ Too many files: {len(files)}/5 (max 5 files allowed)"
+         elif total_size > 20 * 1024 * 1024:  # 20MB
+             status_color = "#dc2626"  # red
+             status_text = f"⚠️ Files too large{size_display} (max 20MB combined)"
+         else:
+             status_color = "#059669"  # green
+             status_text = f"📂 Batch mode: {len(files)} files{size_display}"
+
+         return (
+             gr.update(visible=True),  # processing_type_selector (visible for multiple files)
+             f"<div style='color: {status_color}; font-weight: 500;'>{status_text}</div>"
+         )
+
+
+ def validate_file_for_parser(file_path, parser_name):
+     """Validate if the file type is supported by the selected parser."""
+     if not file_path:
+         return True, ""  # No file selected yet
+
+     try:
+         file_path_obj = Path(file_path)
+         file_ext = file_path_obj.suffix.lower()
+
+         # Check file size
+         if file_path_obj.exists():
+             file_size = file_path_obj.stat().st_size
+             if file_size > config.app.max_file_size:
+                 size_mb = file_size / (1024 * 1024)
+                 max_mb = config.app.max_file_size / (1024 * 1024)
+                 return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
+
+         # Check file extension
+         if file_ext not in config.app.allowed_extensions:
+             return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
+
+         # Parser-specific validation
+         if "GOT-OCR" in parser_name:
+             if file_ext not in ['.jpg', '.jpeg', '.png']:
+                 return False, "GOT-OCR only supports JPG and PNG formats."
+
+         return True, ""
+
+     except Exception as e:
+         logger.error(f"Error validating file: {e}")
+         return False, f"Error validating file: {e}"
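The batch limits in `update_ui_for_file_count` (at most 5 files, at most 20MB combined) are written inline as literals. A minimal sketch of the same checks as a pure function, which can be unit-tested without Gradio, might look like this. `batch_status`, `MAX_FILES`, and `MAX_TOTAL_BYTES` are hypothetical names, and the messages are simplified versions of the ones in the diff.

```python
from typing import Tuple

# Limits taken from the literals in update_ui_for_file_count.
MAX_FILES = 5
MAX_TOTAL_BYTES = 20 * 1024 * 1024  # 20MB combined


def batch_status(file_count: int, total_bytes: int) -> Tuple[bool, str]:
    """Return (within_limits, message) for a multi-file upload."""
    size_mb = total_bytes / (1024 * 1024)
    if file_count > MAX_FILES:
        return False, f"Too many files: {file_count}/{MAX_FILES}"
    if total_bytes > MAX_TOTAL_BYTES:
        return False, f"Files too large ({size_mb:.1f}MB), max 20MB combined"
    return True, f"Batch mode: {file_count} files ({size_mb:.1f}MB)"


print(batch_status(3, 3 * 1024 * 1024))
```

As in the diff, the file-count check runs first and the size comparison is strict, so a batch of exactly 20MB is still accepted.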
src/ui/utils/threading_utils.py ADDED
@@ -0,0 +1,38 @@
+ """Threading utilities for UI components."""
+
+ import threading
+ import time
+ import logging
+
+ from src.core.converter import is_conversion_in_progress
+ from src.core.logging_config import get_logger
+
+ logger = get_logger(__name__)
+
+ # Global variable to track cancellation state
+ conversion_cancelled = threading.Event()
+
+
+ def monitor_cancellation():
+     """Background thread to monitor cancellation and update UI if needed"""
+     logger.info("Starting cancellation monitor thread")
+     while is_conversion_in_progress():
+         if conversion_cancelled.is_set():
+             logger.info("Cancellation detected by monitor thread")
+         time.sleep(0.1)  # Check every 100ms
+     logger.info("Cancellation monitor thread ending")
+
+
+ def get_cancellation_event():
+     """Get the global cancellation event."""
+     return conversion_cancelled
+
+
+ def reset_cancellation():
+     """Reset the cancellation event."""
+     conversion_cancelled.clear()
+
+
+ def set_cancellation():
+     """Set the cancellation event."""
+     conversion_cancelled.set()
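As far as this diff shows, `monitor_cancellation` only logs when the flag is set, so a conversion stops early only if the worker itself checks the event between units of work. A minimal sketch of that cooperative pattern with `threading.Event` follows. `convert_pages` and its uppercase "conversion" are hypothetical stand-ins, not the project's converter.

```python
import threading
from typing import Iterable, List


def convert_pages(pages: Iterable[str], cancel_event: threading.Event) -> List[str]:
    """Process pages one by one, stopping as soon as cancellation is requested."""
    converted: List[str] = []
    for page in pages:
        if cancel_event.is_set():  # cooperative check before each unit of work
            break
        converted.append(page.upper())  # stand-in for the real conversion step
    return converted


cancel = threading.Event()

# No cancellation requested: every page is processed.
full_run = convert_pages(["a", "b", "c"], cancel)

# Cancellation requested before the run starts: nothing is processed.
cancel.set()
cancelled_run = convert_pages(["a", "b", "c"], cancel)

print(full_run, cancelled_run)
```

The event is passed in as an argument rather than read from a module-level global, which keeps the worker easy to test and avoids the `global` statements used in the UI handlers.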