AnseMin committed on
Commit e8b3cf9 · unverified · 2 Parent(s): b2cd0ff 623ad58

Merge pull request #10 from ansemin/development


Enhance UI with new Query Ranker feature and improve document search …

Files changed (2)
  1. README.md +30 -2
  2. src/ui/ui.py +835 -65
README.md CHANGED
@@ -50,8 +50,19 @@ A Hugging Face Space that converts various document formats to Markdown and lets
50
  - **OpenAI embeddings** for accurate document retrieval
51
  - **🗑️ Clear All Data** button for easy data management in both local and HF Space environments
52
 
53
  ### User Interface
54
- - **Dual-tab interface**: Document Converter + Chat
55
  - **🆕 Unified File Input**: Single interface handles both single and multiple file uploads
56
  - **🆕 Dynamic Processing Options**: Multi-document processing type selector appears automatically
57
  - **🆕 Real-time Validation**: Live feedback on file count, size limits, and processing mode
@@ -61,6 +72,7 @@ A Hugging Face Space that converts various document formats to Markdown and lets
61
  - **Data management controls**: Clear All Data button with comprehensive feedback
62
  - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
63
  - **🆕 Smart Output Naming**: Batch processing creates descriptive filenames (e.g., "Combined_3_Documents_20240125.md")
 
64
  - Clean, responsive UI with modern styling
65
 
66
  ## Supported Libraries
@@ -228,11 +240,26 @@ The application uses centralized configuration management. You can enhance funct
228
  7. Use "🗑️ Clear All Data" to remove all documents and chat history
229
  8. Monitor your usage limits in the status panel
230
 
231
  #### 🔍 **Retrieval Strategy Guide:**
232
  - **For research papers**: Use MMR to get diverse perspectives
233
  - **For technical docs**: Use Hybrid for comprehensive coverage
234
  - **For specific facts**: Use Similarity for targeted results
235
  - **For broad topics**: Use Hybrid for balanced semantic + keyword matching
 
236
 
237
  ## Local Development
238
 
@@ -417,7 +444,7 @@ markit_v2/
417
  │ │ └── ingestion.py # Document ingestion pipeline
418
  │ └── ui/ # User interface layer
419
  │ ├── __init__.py # Package initialization
420
- │ └── ui.py # Gradio UI with dual tabs (Converter + Chat)
421
  ├── documents/ # Documentation and examples (gitignored)
422
  ├── tessdata/ # Tesseract OCR data (gitignored)
423
  └── tests/ # 🆕 Test suite for Phase 1 RAG implementation
@@ -438,6 +465,7 @@ markit_v2/
438
  - **Lightweight Launcher**: Quick development startup with `run_app.py`
439
  - **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
440
  - **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
 
441
 
442
  ### 🧠 **RAG System Architecture:**
443
  - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
 
50
  - **OpenAI embeddings** for accurate document retrieval
51
  - **🗑️ Clear All Data** button for easy data management in both local and HF Space environments
52
 
53
+ ### 🔍 Query Ranker (NEW!)
54
+ - **🆕 Third dedicated tab** for document search and ranking
55
+ - **Interactive query search** with real-time document chunk ranking
56
+ - **Multiple retrieval methods**: Similarity, MMR, BM25, and Hybrid search
57
+ - **Intelligent confidence scoring**: Rank-based confidence levels (High/Medium/Low)
58
+ - **Real similarity scores**: Actual ChromaDB similarity scores for similarity search
59
+ - **Transparent results**: Clear display of source documents, page numbers, and chunk lengths
60
+ - **Adjustable result count**: 1-10 results with responsive slider control
61
+ - **Method comparison**: Test different retrieval strategies on the same query
62
+ - **Modern card-based UI**: Clean, professional result display with hover effects
63
+
64
  ### User Interface
65
+ - **🆕 Three-tab interface**: Document Converter + Chat + Query Ranker
66
  - **🆕 Unified File Input**: Single interface handles both single and multiple file uploads
67
  - **🆕 Dynamic Processing Options**: Multi-document processing type selector appears automatically
68
  - **🆕 Real-time Validation**: Live feedback on file count, size limits, and processing mode
 
72
  - **Data management controls**: Clear All Data button with comprehensive feedback
73
  - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
74
  - **🆕 Smart Output Naming**: Batch processing creates descriptive filenames (e.g., "Combined_3_Documents_20240125.md")
75
+ - **🆕 Consistent modern styling**: All tabs share the same professional design theme
76
  - Clean, responsive UI with modern styling
77
 
78
  ## Supported Libraries
 
240
  7. Use "🗑️ Clear All Data" to remove all documents and chat history
241
  8. Monitor your usage limits in the status panel
242
 
243
+ ### 🔍 Query Ranker (NEW!)
244
+ 1. Go to the **"Query Ranker"** tab
245
+ 2. Check the system status to ensure documents are loaded
246
+ 3. **Enter your search query** in the search box
247
+ 4. **Choose your retrieval method**:
248
+ - **🎯 Similarity Search**: Semantic similarity with real scores
249
+ - **🔀 MMR (Diverse)**: Diverse results with reduced redundancy
250
+ - **🔍 BM25 (Keywords)**: Traditional keyword-based search
251
+ - **🔗 Hybrid (Recommended)**: Best overall accuracy combining semantic + keyword
252
+ 5. **Adjust result count** (1-10) using the slider
253
+ 6. **Review ranked results** with confidence levels and source information
254
+ 7. **Compare methods** by trying different retrieval strategies on the same query
255
+ 8. Use results to understand how your documents are chunked and ranked
256
+
257
  #### 🔍 **Retrieval Strategy Guide:**
258
  - **For research papers**: Use MMR to get diverse perspectives
259
  - **For technical docs**: Use Hybrid for comprehensive coverage
260
  - **For specific facts**: Use Similarity for targeted results
261
  - **For broad topics**: Use Hybrid for balanced semantic + keyword matching
262
+ - **For transparency**: Use Query Ranker to see exactly which chunks are being retrieved
263
 
264
  ## Local Development
265
 
 
444
  │ │ └── ingestion.py # Document ingestion pipeline
445
  │ └── ui/ # User interface layer
446
  │ ├── __init__.py # Package initialization
447
+ │ └── ui.py # 🆕 Gradio UI with three tabs (Converter + Chat + Query Ranker)
448
  ├── documents/ # Documentation and examples (gitignored)
449
  ├── tessdata/ # Tesseract OCR data (gitignored)
450
  └── tests/ # 🆕 Test suite for Phase 1 RAG implementation
 
465
  - **Lightweight Launcher**: Quick development startup with `run_app.py`
466
  - **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
467
  - **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
468
+ - **🆕 Query Ranker Interface**: Dedicated transparency tool for document search and ranking
469
 
470
  ### 🧠 **RAG System Architecture:**
471
  - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
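
Note on the README claims about "rank-based confidence levels" and "real similarity scores": in the `src/ui/ui.py` change in this commit, confidence depends only on the result's rank (1-3 High, 4-6 Medium, otherwise Low), and the displayed similarity is derived from the vector store's score as `max(0, 1 - score)`, with 0.8 substituted when no score is returned. The sketch below restates that logic in isolation. The function names `confidence_for_rank` and `similarity_from_distance` are illustrative and do not appear in the commit, and whether `1 - score` is a meaningful similarity depends on the distance metric the collection was configured with.

```python
from typing import Optional


def confidence_for_rank(rank: int) -> str:
    """Map a 1-based result rank to the confidence label shown on the card."""
    if rank <= 3:
        return "High"
    if rank <= 6:
        return "Medium"
    return "Low"


def similarity_from_distance(distance: Optional[float]) -> float:
    """Convert a distance-style score to a similarity clamped at 0.

    Mirrors the expression used in the diff: a missing score falls back to 0.8,
    and distances above 1 are clamped to a similarity of 0.
    """
    if distance is None:
        return 0.8
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    for rank, distance in enumerate([0.12, 0.31, 0.48, 1.2], start=1):
        print(rank, confidence_for_rank(rank), round(similarity_from_distance(distance), 3))
```

Because the label is a function of rank alone, the top three hits are always shown as "High" even when their scores are poor; the numeric score, where available, is the more informative signal.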
src/ui/ui.py CHANGED
@@ -15,6 +15,7 @@ from src.core.exceptions import (
15
  )
16
  from src.core.logging_config import get_logger
17
  from src.rag import rag_chat_service, document_ingestion_service
 
18
  from src.services.data_clearing_service import data_clearing_service
19
 
20
  # Use centralized logging
@@ -395,6 +396,275 @@ def handle_clear_all_data():
395
 
396
  return None, f'<div class="session-info">❌ {error_msg}</div>', current_status
397
 
398
  def get_chat_status():
399
  """Get current chat system status."""
400
  try:
@@ -712,6 +982,16 @@ def create_ui():
712
  transform: translateY(-1px);
713
  }
714
 
715
  /* Chat interface styling */
716
  .chat-main-container {
717
  background: #ffffff;
@@ -846,6 +1126,382 @@ def create_ui():
846
  margin-right: 20px;
847
  }
848
  }
849
  """) as demo:
850
  # Modern title with better styling
851
  gr.Markdown("""
@@ -856,72 +1512,81 @@ def create_ui():
856
  with gr.Tabs():
857
  # Document Converter Tab
858
  with gr.TabItem("📄 Document Converter"):
859
- # State to track if cancellation is requested
860
- cancel_requested = gr.State(False)
861
- # State to store the conversion thread
862
- conversion_thread = gr.State(None)
863
- # State to store the output format (fixed to Markdown)
864
- output_format_state = gr.State("Markdown")
865
 
866
- # Multi-file input (supports single and multiple files)
867
- files_input = gr.Files(
868
- label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
869
- file_count="multiple",
870
- file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
871
- )
872
-
873
- # Processing type selector (visible only for multiple files)
874
- processing_type_selector = gr.Radio(
875
- choices=["combined", "individual", "summary", "comparison"],
876
- value="combined",
877
- label="Multi-Document Processing Type",
878
- info="How to process multiple documents together",
879
- visible=False
880
- )
881
-
882
- # Status text to show file count and processing mode
883
- file_status_text = gr.HTML(
884
- value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
885
- label=""
886
- )
887
-
888
- # Provider and OCR options below the file input
889
- with gr.Row(elem_classes=["provider-options-row"]):
890
- with gr.Column(scale=1):
891
- parser_names = ParserRegistry.get_parser_names()
892
-
893
- # Make MarkItDown the default parser if available
894
- default_parser = next((p for p in parser_names if p == "MarkItDown"), parser_names[0] if parser_names else "PyPdfium")
895
-
896
- provider_dropdown = gr.Dropdown(
897
- label="Provider",
898
- choices=parser_names,
899
- value=default_parser,
900
- interactive=True
901
- )
902
- with gr.Column(scale=1):
903
- default_ocr_options = ParserRegistry.get_ocr_options(default_parser)
904
- default_ocr = default_ocr_options[0] if default_ocr_options else "No OCR"
905
-
906
- ocr_dropdown = gr.Dropdown(
907
- label="OCR Options",
908
- choices=default_ocr_options,
909
- value=default_ocr,
910
- interactive=True
911
- )
912
-
913
- # Simple output container with just one scrollbar
914
- file_display = gr.HTML(
915
- value="<div class='output-container'></div>",
916
- label="Converted Content"
917
- )
918
-
919
- file_download = gr.File(label="Download File")
920
-
921
- # Processing controls row
922
- with gr.Row(elem_classes=["processing-controls"]):
923
- convert_button = gr.Button("Convert", variant="primary")
924
- cancel_button = gr.Button("Cancel", variant="stop", visible=False)
925
 
926
  # Event handlers for document converter
927
 
@@ -1077,6 +1742,111 @@ def create_ui():
1077
  outputs=[chatbot, session_info, status_display]
1078
  )
1079
 
1080
  return demo
1081
 
1082
 
 
15
  )
16
  from src.core.logging_config import get_logger
17
  from src.rag import rag_chat_service, document_ingestion_service
18
+ from src.rag.vector_store import vector_store_manager
19
  from src.services.data_clearing_service import data_clearing_service
20
 
21
  # Use centralized logging
 
396
 
397
  return None, f'<div class="session-info">❌ {error_msg}</div>', current_status
398
 
399
+ def handle_query_search(query, method, k_value):
400
+ """Handle query search and return formatted results."""
401
+ if not query or not query.strip():
402
+ return """
403
+ <div class="ranker-container">
404
+ <div class="ranker-placeholder">
405
+ <h3>🔍 Query Ranker</h3>
406
+ <p>Enter a search query to find relevant document chunks with similarity scores.</p>
407
+ </div>
408
+ </div>
409
+ """
410
+
411
+ try:
412
+ logger.info(f"Query search: '{query[:50]}...' using method: {method}")
413
+
414
+ # Get results based on method
415
+ results = []
416
+ if method == "similarity":
417
+ retriever = vector_store_manager.get_retriever("similarity", {"k": k_value})
418
+ docs = retriever.invoke(query)
419
+ # Try to get actual similarity scores
420
+ try:
421
+ vector_store = vector_store_manager.get_vector_store()
422
+ if hasattr(vector_store, 'similarity_search_with_score'):
423
+ docs_with_scores = vector_store.similarity_search_with_score(query, k=k_value)
424
+ for i, (doc, score) in enumerate(docs_with_scores):
425
+ similarity_score = max(0, 1 - score) if score is not None else 0.8
426
+ results.append(_format_ranker_result(doc, similarity_score, i + 1))
427
+ else:
428
+ # Fallback without scores
429
+ for i, doc in enumerate(docs):
430
+ score = 0.85 - (i * 0.05)
431
+ results.append(_format_ranker_result(doc, score, i + 1))
432
+ except Exception as e:
433
+ logger.warning(f"Could not get similarity scores: {e}")
434
+ for i, doc in enumerate(docs):
435
+ score = 0.85 - (i * 0.05)
436
+ results.append(_format_ranker_result(doc, score, i + 1))
437
+
438
+ elif method == "mmr":
439
+ retriever = vector_store_manager.get_retriever("mmr", {"k": k_value, "fetch_k": k_value * 2, "lambda_mult": 0.5})
440
+ docs = retriever.invoke(query)
441
+ for i, doc in enumerate(docs):
442
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for MMR
443
+
444
+ elif method == "bm25":
445
+ retriever = vector_store_manager.get_bm25_retriever(k=k_value)
446
+ docs = retriever.invoke(query)
447
+ for i, doc in enumerate(docs):
448
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for BM25
449
+
450
+ elif method == "hybrid":
451
+ retriever = vector_store_manager.get_hybrid_retriever(k=k_value, semantic_weight=0.7, keyword_weight=0.3)
452
+ docs = retriever.invoke(query)
453
+ # Explicitly limit results to k_value since EnsembleRetriever may return more
454
+ docs = docs[:k_value]
455
+ for i, doc in enumerate(docs):
456
+ results.append(_format_ranker_result(doc, None, i + 1)) # No score for Hybrid
457
+
458
+ return _format_ranker_results_html(results, query, method)
459
+
460
+ except Exception as e:
461
+ error_msg = f"Error during search: {str(e)}"
462
+ logger.error(error_msg)
463
+ return f"""
464
+ <div class="ranker-container">
465
+ <div class="ranker-error">
466
+ <h3>❌ Search Error</h3>
467
+ <p>{error_msg}</p>
468
+ <p class="error-hint">Please check if documents are uploaded and the system is ready.</p>
469
+ </div>
470
+ </div>
471
+ """
472
+
473
+ def _format_ranker_result(doc, score, rank):
474
+ """Format a single document result for the ranker."""
475
+ metadata = doc.metadata or {}
476
+
477
+ # Extract metadata
478
+ source = metadata.get("source", "Unknown Document")
479
+ page = metadata.get("page", "N/A")
480
+ chunk_id = metadata.get("chunk_id", f"chunk_{rank}")
481
+
482
+ # Content length indicator
483
+ content_length = len(doc.page_content)
484
+ if content_length < 200:
485
+ length_indicator = "📄 Short"
486
+ elif content_length < 500:
487
+ length_indicator = "📄 Medium"
488
+ else:
489
+ length_indicator = "📄 Long"
490
+
491
+ # Rank-based confidence levels (applies to all methods)
492
+ if rank <= 3:
493
+ confidence = "High"
494
+ confidence_color = "#22c55e"
495
+ confidence_icon = "🟢"
496
+ elif rank <= 6:
497
+ confidence = "Medium"
498
+ confidence_color = "#f59e0b"
499
+ confidence_icon = "🟡"
500
+ else:
501
+ confidence = "Low"
502
+ confidence_color = "#ef4444"
503
+ confidence_icon = "🔴"
504
+
505
+ result = {
506
+ "rank": rank,
507
+ "content": doc.page_content,
508
+ "source": source,
509
+ "page": page,
510
+ "chunk_id": chunk_id,
511
+ "length_indicator": length_indicator,
512
+ "has_score": score is not None,
513
+ "confidence": confidence,
514
+ "confidence_color": confidence_color,
515
+ "confidence_icon": confidence_icon
516
+ }
517
+
518
+ # Only add score if we have a real score (similarity search only)
519
+ if score is not None:
520
+ result["score"] = round(score, 3)
521
+
522
+ return result
523
+
524
+ def _format_ranker_results_html(results, query, method):
525
+ """Format search results as HTML."""
526
+ if not results:
527
+ return """
528
+ <div class="ranker-container">
529
+ <div class="ranker-no-results">
530
+ <h3>🔍 No Results Found</h3>
531
+ <p>No relevant documents found for your query.</p>
532
+ <p class="no-results-hint">Try different keywords or check if documents are uploaded.</p>
533
+ </div>
534
+ </div>
535
+ """
536
+
537
+ # Method display names
538
+ method_labels = {
539
+ "similarity": "🎯 Similarity Search",
540
+ "mmr": "🔀 MMR (Diverse)",
541
+ "bm25": "🔍 BM25 (Keywords)",
542
+ "hybrid": "🔗 Hybrid (Recommended)"
543
+ }
544
+ method_display = method_labels.get(method, method)
545
+
546
+ # Start building HTML
547
+ html_parts = [f"""
548
+ <div class="ranker-container">
549
+ <div class="ranker-header">
550
+ <div class="ranker-title">
551
+ <h3>🔍 Search Results</h3>
552
+ <div class="query-display">"{query}"</div>
553
+ </div>
554
+ <div class="ranker-meta">
555
+ <span class="method-badge">{method_display}</span>
556
+ <span class="result-count">{len(results)} results</span>
557
+ </div>
558
+ </div>
559
+ """]
560
+
561
+ # Add results
562
+ for result in results:
563
+ rank_emoji = ["🥇", "🥈", "🥉"][result["rank"] - 1] if result["rank"] <= 3 else f"#{result['rank']}"
564
+
565
+ # Escape content for safe HTML inclusion and JavaScript
566
+ escaped_content = result['content'].replace('"', '&quot;').replace("'", "&#39;").replace('\n', '\\n')
567
+
568
+ # Build score info - always show confidence, only show score for similarity search
569
+ score_info_parts = [f"""
570
+ <span class="confidence-badge" style="color: {result['confidence_color']}">
571
+ {result['confidence_icon']} {result['confidence']}
572
+ </span>"""]
573
+
574
+ # Only add score value if we have real scores (similarity search)
575
+ if result.get('has_score', False):
576
+ score_info_parts.append(f'<span class="score-value">🎯 {result["score"]}</span>')
577
+
578
+ score_info_html = f"""
579
+ <div class="score-info">
580
+ {''.join(score_info_parts)}
581
+ </div>"""
582
+
583
+ html_parts.append(f"""
584
+ <div class="result-card">
585
+ <div class="result-header">
586
+ <div class="rank-info">
587
+ <span class="rank-badge">{rank_emoji} Rank {result['rank']}</span>
588
+ <span class="source-info">📄 {result['source']}</span>
589
+ {f"<span class='page-info'>Page {result['page']}</span>" if result['page'] != 'N/A' else ""}
590
+ <span class="length-info">{result['length_indicator']}</span>
591
+ </div>
592
+ {score_info_html}
593
+ </div>
594
+ <div class="result-content">
595
+ <div class="content-text">{result['content']}</div>
596
+ </div>
597
+ </div>
598
+ """)
599
+
600
+ html_parts.append("</div>")
601
+
602
+ return "".join(html_parts)
603
+
604
+ def get_ranker_status():
605
+ """Get current ranker system status."""
606
+ try:
607
+ # Get collection info
608
+ collection_info = vector_store_manager.get_collection_info()
609
+ document_count = collection_info.get("document_count", 0)
610
+
611
+ # Get available methods
612
+ available_methods = ["similarity", "mmr", "bm25", "hybrid"]
613
+
614
+ # Check if system is ready
615
+ ingestion_status = document_ingestion_service.get_ingestion_status()
616
+ system_ready = ingestion_status.get('system_ready', False)
617
+
618
+ status_html = f"""
619
+ <div class="status-card">
620
+ <div class="status-header">
621
+ <h3>🔍 Query Ranker Status</h3>
622
+ <div class="status-indicator {'status-ready' if system_ready else 'status-not-ready'}">
623
+ {'🟢 READY' if system_ready else '🔴 NOT READY'}
624
+ </div>
625
+ </div>
626
+
627
+ <div class="status-grid">
628
+ <div class="status-item">
629
+ <div class="status-label">Available Documents</div>
630
+ <div class="status-value">{document_count}</div>
631
+ </div>
632
+ <div class="status-item">
633
+ <div class="status-label">Retrieval Methods</div>
634
+ <div class="status-value">{len(available_methods)}</div>
635
+ </div>
636
+ <div class="status-item">
637
+ <div class="status-label">Vector Store</div>
638
+ <div class="status-value">{'Ready' if system_ready else 'Not Ready'}</div>
639
+ </div>
640
+ </div>
641
+
642
+ <div class="ranker-methods">
643
+ <div class="methods-label">Available Methods:</div>
644
+ <div class="methods-list">
645
+ <span class="method-tag">🎯 Similarity</span>
646
+ <span class="method-tag">🔀 MMR</span>
647
+ <span class="method-tag">🔍 BM25</span>
648
+ <span class="method-tag">🔗 Hybrid</span>
649
+ </div>
650
+ </div>
651
+ </div>
652
+ """
653
+
654
+ return status_html
655
+
656
+ except Exception as e:
657
+ error_msg = f"Error getting ranker status: {str(e)}"
658
+ logger.error(error_msg)
659
+ return f"""
660
+ <div class="status-card status-error">
661
+ <div class="status-header">
662
+ <h3>❌ System Error</h3>
663
+ </div>
664
+ <p class="error-message">{error_msg}</p>
665
+ </div>
666
+ """
667
+
668
  def get_chat_status():
669
  """Get current chat system status."""
670
  try:
 
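
Review note on the hunk above: `_format_ranker_results_html` computes `escaped_content` but then interpolates `result['content']` and `query` into the markup unescaped, so chunk text containing `<`, `>`, or `&` could break the card layout or inject markup. One possible fix is to escape with the standard library before interpolation. The sketch below is a minimal illustration, not the commit's code; `render_result_card` is a hypothetical helper and only the `content-text` and `query-display` class names are taken from the diff.

```python
import html


def render_result_card(content: str, query: str) -> str:
    """Build a result snippet with user-controlled text HTML-escaped.

    html.escape replaces &, <, > and (with quote=True) both quote characters,
    so the text is displayed literally instead of being parsed as markup.
    """
    safe_content = html.escape(content, quote=True)
    safe_query = html.escape(query, quote=True)
    return (
        f'<div class="query-display">"{safe_query}"</div>'
        f'<div class="content-text">{safe_content}</div>'
    )


if __name__ == "__main__":
    print(render_result_card("<script>alert(1)</script> & more", "a < b"))
```

The same treatment would apply to other interpolated values that originate from documents or users, such as the source filename and the error message.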
982
  transform: translateY(-1px);
983
  }
984
 
985
+ .btn-primary {
986
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
987
+ color: white;
988
+ }
989
+
990
+ .btn-primary:hover {
991
+ transform: translateY(-1px);
992
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
993
+ }
994
+
995
  /* Chat interface styling */
996
  .chat-main-container {
997
  background: #ffffff;
 
1126
  margin-right: 20px;
1127
  }
1128
  }
1129
+
1130
+ /* Query Ranker Styles */
1131
+ .ranker-container {
1132
+ max-width: 1200px;
1133
+ margin: 0 auto;
1134
+ padding: 20px;
1135
+ }
1136
+
1137
+ .ranker-placeholder {
1138
+ text-align: center;
1139
+ padding: 40px;
1140
+ background: #f8f9fa;
1141
+ border-radius: 12px;
1142
+ border: 1px solid #e9ecef;
1143
+ color: #6c757d;
1144
+ }
1145
+
1146
+ .ranker-placeholder h3 {
1147
+ color: #495057;
1148
+ margin-bottom: 10px;
1149
+ }
1150
+
1151
+ .ranker-error {
1152
+ text-align: center;
1153
+ padding: 30px;
1154
+ background: #f8d7da;
1155
+ border: 1px solid #f5c6cb;
1156
+ border-radius: 12px;
1157
+ color: #721c24;
1158
+ }
1159
+
1160
+ .ranker-error h3 {
1161
+ margin-bottom: 15px;
1162
+ }
1163
+
1164
+ .error-hint {
1165
+ font-style: italic;
1166
+ margin-top: 10px;
1167
+ opacity: 0.8;
1168
+ }
1169
+
1170
+ .ranker-no-results {
1171
+ text-align: center;
1172
+ padding: 40px;
1173
+ background: #ffffff;
1174
+ border: 1px solid #e1e5e9;
1175
+ border-radius: 12px;
1176
+ color: #6c757d;
1177
+ }
1178
+
1179
+ .ranker-no-results h3 {
1180
+ color: #495057;
1181
+ margin-bottom: 15px;
1182
+ }
1183
+
1184
+ .no-results-hint {
1185
+ font-style: italic;
1186
+ margin-top: 10px;
1187
+ opacity: 0.8;
1188
+ }
1189
+
1190
+ .ranker-header {
1191
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1192
+ color: white;
1193
+ padding: 20px;
1194
+ border-radius: 15px;
1195
+ margin-bottom: 25px;
1196
+ box-shadow: 0 4px 15px rgba(0,0,0,0.1);
1197
+ }
1198
+
1199
+ .ranker-title h3 {
1200
+ margin: 0 0 10px 0;
1201
+ font-size: 1.4em;
1202
+ font-weight: 600;
1203
+ }
1204
+
1205
+ .query-display {
1206
+ font-size: 1.1em;
1207
+ opacity: 0.9;
1208
+ font-style: italic;
1209
+ margin-bottom: 15px;
1210
+ }
1211
+
1212
+ .ranker-meta {
1213
+ display: flex;
1214
+ gap: 15px;
1215
+ align-items: center;
1216
+ flex-wrap: wrap;
1217
+ }
1218
+
1219
+ .method-badge {
1220
+ background: rgba(255, 255, 255, 0.2);
1221
+ padding: 6px 12px;
1222
+ border-radius: 20px;
1223
+ font-weight: 500;
1224
+ font-size: 0.9em;
1225
+ }
1226
+
1227
+ .result-count {
1228
+ background: rgba(255, 255, 255, 0.15);
1229
+ padding: 6px 12px;
1230
+ border-radius: 20px;
1231
+ font-weight: 500;
1232
+ font-size: 0.9em;
1233
+ }
1234
+
1235
+ .result-card {
1236
+ background: #ffffff;
1237
+ border: 1px solid #e1e5e9;
1238
+ border-radius: 12px;
1239
+ margin-bottom: 20px;
1240
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
1241
+ transition: all 0.3s ease;
1242
+ overflow: hidden;
1243
+ }
1244
+
1245
+ .result-card:hover {
1246
+ box-shadow: 0 4px 20px rgba(0,0,0,0.1);
1247
+ transform: translateY(-2px);
1248
+ }
1249
+
1250
+ .result-header {
1251
+ display: flex;
1252
+ justify-content: space-between;
1253
+ align-items: center;
1254
+ padding: 15px 20px;
1255
+ background: #f8f9fa;
1256
+ border-bottom: 1px solid #e9ecef;
1257
+ }
1258
+
1259
+ .rank-info {
1260
+ display: flex;
1261
+ gap: 10px;
1262
+ align-items: center;
1263
+ flex-wrap: wrap;
1264
+ }
1265
+
1266
+ .rank-badge {
1267
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1268
+ color: white;
1269
+ padding: 4px 10px;
1270
+ border-radius: 15px;
1271
+ font-weight: 600;
1272
+ font-size: 0.85em;
1273
+ }
1274
+
1275
+ .source-info {
1276
+ background: #e9ecef;
1277
+ color: #495057;
1278
+ padding: 4px 8px;
1279
+ border-radius: 10px;
1280
+ font-size: 0.85em;
1281
+ font-weight: 500;
1282
+ }
1283
+
1284
+ .page-info {
1285
+ background: #d1ecf1;
1286
+ color: #0c5460;
1287
+ padding: 4px 8px;
1288
+ border-radius: 10px;
1289
+ font-size: 0.85em;
1290
+ }
1291
+
1292
+ .length-info {
1293
+ background: #f8f9fa;
1294
+ color: #6c757d;
1295
+ padding: 4px 8px;
1296
+ border-radius: 10px;
1297
+ font-size: 0.85em;
1298
+ }
1299
+
1300
+ .score-info {
1301
+ display: flex;
1302
+ gap: 10px;
1303
+ align-items: center;
1304
+ }
1305
+
1306
+ .confidence-badge {
1307
+ padding: 4px 8px;
1308
+ border-radius: 10px;
1309
+ font-weight: 600;
1310
+ font-size: 0.85em;
1311
+ }
1312
+
1313
+ .score-value {
1314
+ background: #2c3e50;
1315
+ color: white;
1316
+ padding: 6px 12px;
1317
+ border-radius: 15px;
1318
+ font-weight: 600;
1319
+ font-size: 0.9em;
1320
+ }
1321
+
1322
+ .result-content {
1323
+ padding: 20px;
1324
+ }
1325
+
1326
+ .content-text {
1327
+ line-height: 1.6;
1328
+ color: #2c3e50;
1329
+ border-left: 3px solid #667eea;
1330
+ padding-left: 15px;
1331
+ background: #f8f9fa;
1332
+ padding: 15px;
1333
+ border-radius: 0 8px 8px 0;
1334
+ max-height: 300px;
1335
+ overflow-y: auto;
1336
+ }
1337
+
1338
+ .result-actions {
1339
+ display: flex;
1340
+ gap: 10px;
1341
+ padding: 15px 20px;
1342
+ background: #f8f9fa;
1343
+ border-top: 1px solid #e9ecef;
1344
+ }
1345
+
1346
+ .action-btn {
1347
+ padding: 8px 16px;
1348
+ border: none;
1349
+ border-radius: 8px;
1350
+ font-weight: 500;
1351
+ cursor: pointer;
1352
+ transition: all 0.3s ease;
1353
+ font-size: 0.9em;
1354
+ display: flex;
1355
+ align-items: center;
1356
+ gap: 5px;
1357
+ }
1358
+
1359
+ .copy-btn {
1360
+ background: #17a2b8;
1361
+ color: white;
1362
+ }
1363
+
1364
+ .copy-btn:hover {
1365
+ background: #138496;
1366
+ transform: translateY(-1px);
1367
+ }
1368
+
1369
+ .info-btn {
1370
+ background: #6c757d;
1371
+ color: white;
1372
+ }
1373
+
1374
+ .info-btn:hover {
1375
+ background: #5a6268;
1376
+ transform: translateY(-1px);
1377
+ }
1378
+
1379
+ .ranker-methods {
1380
+ margin-top: 20px;
1381
+ padding-top: 15px;
1382
+ border-top: 1px solid #e9ecef;
1383
+ }
1384
+
1385
+ .methods-label {
1386
+ font-weight: 600;
1387
+ color: #495057;
1388
+ margin-bottom: 10px;
1389
+ font-size: 0.9em;
1390
+ }
1391
+
1392
+ .methods-list {
1393
+ display: flex;
1394
+ gap: 8px;
1395
+ flex-wrap: wrap;
1396
+ }
1397
+
1398
+ .method-tag {
1399
+ background: #e9ecef;
1400
+ color: #495057;
1401
+ padding: 4px 10px;
1402
+ border-radius: 12px;
1403
+ font-size: 0.8em;
1404
+ font-weight: 500;
1405
+ }
1406
+
1407
+ /* Ranker controls styling */
1408
+ .ranker-controls {
1409
+ background: #ffffff;
1410
+ border: 1px solid #e1e5e9;
1411
+ border-radius: 12px;
1412
+ padding: 20px;
1413
+ margin-bottom: 25px;
1414
+ box-shadow: 0 2px 10px rgba(0,0,0,0.05);
1415
+ }
1416
+
1417
+ .ranker-input-row {
1418
+ display: flex;
1419
+ gap: 15px;
1420
+ align-items: end;
1421
+ margin-bottom: 15px;
1422
+ }
1423
+
1424
+ .ranker-query-input {
1425
+ flex: 1;
1426
+ border: 2px solid #e1e5e9;
1427
+ border-radius: 25px;
1428
+ padding: 12px 20px;
1429
+ font-size: 1em;
1430
+ transition: all 0.3s ease;
1431
+ }
1432
+
1433
+ .ranker-query-input:focus {
1434
+ border-color: #667eea;
1435
+ box-shadow: 0 0 0 3px rgba(102, 126, 234, 0.1);
1436
+ outline: none;
1437
+ }
1438
+
1439
+ .ranker-search-btn {
1440
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
1441
+ color: white;
1442
+ border: none;
1443
+ border-radius: 12px;
1444
+ padding: 12px 24px;
1445
+ min-width: 100px;
1446
+ cursor: pointer;
1447
+ transition: all 0.3s ease;
1448
+ font-weight: 600;
1449
+ font-size: 1em;
1450
+ }
1451
+
1452
+ .ranker-search-btn:hover {
1453
+ transform: scale(1.05);
1454
+ box-shadow: 0 4px 15px rgba(102, 126, 234, 0.3);
1455
+ }
1456
+
1457
+ .ranker-options-row {
1458
+ display: flex;
1459
+ gap: 15px;
1460
+ align-items: center;
1461
+ }
1462
+
1463
+ /* Responsive design for ranker */
1464
+ @media (max-width: 768px) {
1465
+ .ranker-container {
1466
+ padding: 10px;
1467
+ }
1468
+
1469
+ .ranker-input-row {
1470
+ flex-direction: column;
1471
+ gap: 10px;
1472
+ }
1473
+
1474
+ .ranker-options-row {
1475
+ flex-direction: column;
1476
+ gap: 10px;
1477
+ align-items: stretch;
1478
+ }
1479
+
1480
+ .ranker-meta {
1481
+ justify-content: center;
1482
+ }
1483
+
1484
+ .rank-info {
1485
+ flex-direction: column;
1486
+ gap: 5px;
1487
+ align-items: flex-start;
1488
+ }
1489
+
1490
+ .result-header {
1491
+ flex-direction: column;
1492
+ gap: 10px;
1493
+ align-items: flex-start;
1494
+ }
1495
+
1496
+ .score-info {
1497
+ align-self: flex-end;
1498
+ }
1499
+
1500
+ .result-actions {
1501
+ flex-direction: column;
1502
+ gap: 8px;
1503
+ }
1504
+ }
1505
  """) as demo:
1506
  # Modern title with better styling
1507
  gr.Markdown("""
 
1512
  with gr.Tabs():
1513
  # Document Converter Tab
1514
  with gr.TabItem("📄 Document Converter"):
1515
+ with gr.Column(elem_classes=["chat-tab-container"]):
1516
+ # Modern header matching other tabs
1517
+ gr.HTML("""
1518
+ <div class="chat-header">
1519
+ <h2>📄 Document Converter</h2>
1520
+ <p>Convert documents to Markdown format with advanced OCR and AI processing</p>
1521
+ </div>
1522
+ """)
1523
+
1524
+ # State to track if cancellation is requested
1525
+ cancel_requested = gr.State(False)
1526
+ # State to store the conversion thread
1527
+ conversion_thread = gr.State(None)
1528
+ # State to store the output format (fixed to Markdown)
1529
+ output_format_state = gr.State("Markdown")
1530
 
+                    # Multi-file input (supports single and multiple files)
+                    files_input = gr.Files(
+                        label="Upload Document(s) - Single file or up to 5 files (20MB max combined)",
+                        file_count="multiple",
+                        file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls", ".txt", ".md", ".html", ".htm"]
+                    )
+
+                    # Processing type selector (visible only for multiple files)
+                    processing_type_selector = gr.Radio(
+                        choices=["combined", "individual", "summary", "comparison"],
+                        value="combined",
+                        label="Multi-Document Processing Type",
+                        info="How to process multiple documents together",
+                        visible=False
+                    )
+
+                    # Status text to show file count and processing mode
+                    file_status_text = gr.HTML(
+                        value="<div style='color: #666; font-style: italic;'>Upload documents to begin</div>",
+                        label=""
+                    )
+
+                    # Provider and OCR options below the file input
+                    with gr.Row(elem_classes=["provider-options-row"]):
+                        with gr.Column(scale=1):
+                            parser_names = ParserRegistry.get_parser_names()
+
+                            # Make MarkItDown the default parser if available
+                            default_parser = next((p for p in parser_names if p == "MarkItDown"), parser_names[0] if parser_names else "PyPdfium")
+
+                            provider_dropdown = gr.Dropdown(
+                                label="Provider",
+                                choices=parser_names,
+                                value=default_parser,
+                                interactive=True
+                            )
+                        with gr.Column(scale=1):
+                            default_ocr_options = ParserRegistry.get_ocr_options(default_parser)
+                            default_ocr = default_ocr_options[0] if default_ocr_options else "No OCR"
+
+                            ocr_dropdown = gr.Dropdown(
+                                label="OCR Options",
+                                choices=default_ocr_options,
+                                value=default_ocr,
+                                interactive=True
+                            )
+
+                    # Processing controls row with consistent styling
+                    with gr.Row(elem_classes=["control-buttons"]):
+                        convert_button = gr.Button("🚀 Convert", elem_classes=["control-btn", "btn-primary"])
+                        cancel_button = gr.Button("⏹️ Cancel", elem_classes=["control-btn", "btn-clear-data"], visible=False)
+
+                    # Simple output container with just one scrollbar
+                    file_display = gr.HTML(
+                        value="<div class='output-container'></div>",
+                        label="Converted Content"
+                    )
+
+                    file_download = gr.File(label="Download File")
 
                 # Event handlers for document converter
 
 
                         outputs=[chatbot, session_info, status_display]
                     )
 
+            # Query Ranker Tab
+            with gr.TabItem("🔍 Query Ranker"):
+                with gr.Column(elem_classes=["ranker-container"]):
+                    # Modern header
+                    gr.HTML("""
+                    <div class="chat-header">
+                        <h2>🔍 Query Ranker</h2>
+                        <p>Search and rank document chunks with similarity scores</p>
+                    </div>
+                    """)
+
+                    # Status section
+                    ranker_status_display = gr.HTML(value=get_ranker_status())
+
+                    # Control buttons
+                    with gr.Row(elem_classes=["control-buttons"]):
+                        refresh_ranker_status_btn = gr.Button("🔄 Refresh Status", elem_classes=["control-btn", "btn-refresh"])
+                        clear_results_btn = gr.Button("🗑️ Clear Results", elem_classes=["control-btn", "btn-clear-data"])
+
+                    # Search controls
+                    with gr.Column(elem_classes=["ranker-controls"]):
+                        with gr.Row(elem_classes=["ranker-input-row"]):
+                            query_input = gr.Textbox(
+                                placeholder="Enter your search query...",
+                                show_label=False,
+                                elem_classes=["ranker-query-input"],
+                                scale=4
+                            )
+                            search_btn = gr.Button("🔍 Search", elem_classes=["ranker-search-btn"], scale=0)
+
+                        with gr.Row(elem_classes=["ranker-options-row"]):
+                            method_dropdown = gr.Dropdown(
+                                choices=[
+                                    ("🎯 Similarity Search", "similarity"),
+                                    ("🔀 MMR (Diverse)", "mmr"),
+                                    ("🔍 BM25 (Keywords)", "bm25"),
+                                    ("🔗 Hybrid (Recommended)", "hybrid")
+                                ],
+                                value="hybrid",
+                                label="Retrieval Method",
+                                scale=2
+                            )
+                            k_slider = gr.Slider(
+                                minimum=1,
+                                maximum=10,
+                                value=5,
+                                step=1,
+                                label="Number of Results",
+                                scale=1
+                            )
+
+                    # Results display
+                    results_display = gr.HTML(
+                        value=handle_query_search("", "hybrid", 5),  # Initial placeholder
+                        elem_classes=["ranker-results-container"]
+                    )
+
+                # Event handlers for Query Ranker
+                def clear_ranker_results():
+                    """Clear the search results and reset to placeholder."""
+                    return handle_query_search("", "hybrid", 5), ""
+
+                def refresh_ranker_status():
+                    """Refresh the ranker status display."""
+                    return get_ranker_status()
+
+                # Search functionality
+                query_input.submit(
+                    fn=handle_query_search,
+                    inputs=[query_input, method_dropdown, k_slider],
+                    outputs=[results_display]
+                )
+
+                search_btn.click(
+                    fn=handle_query_search,
+                    inputs=[query_input, method_dropdown, k_slider],
+                    outputs=[results_display]
+                )
+
+                # Control button handlers
+                refresh_ranker_status_btn.click(
+                    fn=refresh_ranker_status,
+                    inputs=[],
+                    outputs=[ranker_status_display]
+                )
+
+                clear_results_btn.click(
+                    fn=clear_ranker_results,
+                    inputs=[],
+                    outputs=[results_display, query_input]
+                )
+
+                # Update results when method or k changes
+                method_dropdown.change(
+                    fn=handle_query_search,
+                    inputs=[query_input, method_dropdown, k_slider],
+                    outputs=[results_display]
+                )
+
+                k_slider.change(
+                    fn=handle_query_search,
+                    inputs=[query_input, method_dropdown, k_slider],
+                    outputs=[results_display]
+                )
+
     return demo
 