Spaces:
Running
Running
Update .gitignore and enhance README with data management instructions
Browse files- Added log files to .gitignore to prevent unnecessary tracking.
- Expanded README to include new commands for clearing data and running the application with fresh data.
- Clarified what data gets cleared during testing, improving usability for developers.
- .gitignore +3 -0
- README.md +33 -3
- run_app.py +88 -10
- src/core/config.py +1 -1
- src/parsers/gemini_flash_parser.py +2 -2
- src/parsers/mistral_ocr_parser.py +4 -0
- src/rag/chat_service.py +11 -10
- src/ui/ui.py +31 -9
.gitignore
CHANGED
@@ -100,3 +100,6 @@ app_backup.py
|
|
100 |
|
101 |
# Ignore data folder
|
102 |
/data/
|
|
|
|
|
|
|
|
100 |
|
101 |
# Ignore data folder
|
102 |
/data/
|
103 |
+
|
104 |
+
# Ignore logs
|
105 |
+
*.log
|
README.md
CHANGED
@@ -159,8 +159,38 @@ The application uses centralized configuration management. You can enhance funct
|
|
159 |
|
160 |
# For local development (faster startup)
|
161 |
python run_app.py
|
|
|
|
|
|
|
|
|
|
|
|
|
162 |
```
|
163 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
164 |
### 🧪 **Development Features:**
|
165 |
- **Automatic Environment Setup**: Dependencies are checked and installed automatically
|
166 |
- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
|
@@ -175,13 +205,13 @@ The application uses centralized configuration management. You can enhance funct
|
|
175 |
|
176 |
# Markit: Document to Markdown Converter
|
177 |
|
178 |
-
[](https://huggingface.co/spaces/Ansemin101/
|
179 |
|
180 |
**Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
|
181 |
|
182 |
## Project Links
|
183 |
-
- **GitHub Repository**: [github.com/ansemin/
|
184 |
-
- **Hugging Face Space**: [huggingface.co/spaces/Ansemin101/
|
185 |
|
186 |
## Overview
|
187 |
Markit is a powerful tool that converts various document formats (PDF, DOCX, images, etc.) to Markdown format. It uses different parsing engines and OCR methods to extract text from documents and convert them to clean, readable Markdown formats.
|
|
|
159 |
|
160 |
# For local development (faster startup)
|
161 |
python run_app.py
|
162 |
+
|
163 |
+
# For testing with clean data (clears chat history and vector store)
|
164 |
+
python run_app.py --clear-data-and-run
|
165 |
+
|
166 |
+
# To only clear data without running the app
|
167 |
+
python run_app.py --clear-data
|
168 |
```
|
169 |
|
170 |
+
### 🧹 **Data Management for Testing:**
|
171 |
+
For local development and testing, you can easily clear all stored data:
|
172 |
+
|
173 |
+
```bash
|
174 |
+
# Clear all data and exit (useful for quick cleanup)
|
175 |
+
python run_app.py --clear-data
|
176 |
+
|
177 |
+
# Clear all data then run the app (useful for fresh testing)
|
178 |
+
python run_app.py --clear-data-and-run
|
179 |
+
|
180 |
+
# Show all available options
|
181 |
+
python run_app.py --help
|
182 |
+
```
|
183 |
+
|
184 |
+
**What gets cleared:**
|
185 |
+
- `data/chat_history/*` - All saved chat sessions
|
186 |
+
- `data/vector_store/*` - All document embeddings and vector database
|
187 |
+
|
188 |
+
This is particularly useful when:
|
189 |
+
- Testing new RAG features with fresh data
|
190 |
+
- Clearing old chat sessions and documents
|
191 |
+
- Resetting the system to a clean state
|
192 |
+
- Debugging document ingestion issues
|
193 |
+
|
194 |
### 🧪 **Development Features:**
|
195 |
- **Automatic Environment Setup**: Dependencies are checked and installed automatically
|
196 |
- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
|
|
|
205 |
|
206 |
# Markit: Document to Markdown Converter
|
207 |
|
208 |
+
[](https://huggingface.co/spaces/Ansemin101/Markit_v2)
|
209 |
|
210 |
**Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
|
211 |
|
212 |
## Project Links
|
213 |
+
- **GitHub Repository**: [github.com/ansemin/Markit_v2](https://github.com/ansemin/Markit_v2)
|
214 |
+
- **Hugging Face Space**: [huggingface.co/spaces/Ansemin101/Markit_v2](https://huggingface.co/spaces/Ansemin101/Markit_v2)
|
215 |
|
216 |
## Overview
|
217 |
Markit is a powerful tool that converts various document formats (PDF, DOCX, images, etc.) to Markdown format. It uses different parsing engines and OCR methods to extract text from documents and convert them to clean, readable Markdown formats.
|
run_app.py
CHANGED
@@ -5,21 +5,99 @@ Use this for local development when dependencies are already installed.
|
|
5 |
"""
|
6 |
import sys
|
7 |
import os
|
|
|
|
|
|
|
8 |
|
9 |
# Get the current directory and setup Python path
|
10 |
current_dir = os.path.dirname(os.path.abspath(__file__))
|
11 |
sys.path.append(current_dir)
|
12 |
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
17 |
-
|
18 |
-
|
19 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
20 |
|
21 |
-
|
22 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
23 |
|
24 |
if __name__ == "__main__":
|
25 |
-
|
|
|
5 |
"""
|
6 |
import sys
|
7 |
import os
|
8 |
+
import argparse
|
9 |
+
import shutil
|
10 |
+
from pathlib import Path
|
11 |
|
12 |
# Get the current directory and setup Python path
|
13 |
current_dir = os.path.dirname(os.path.abspath(__file__))
|
14 |
sys.path.append(current_dir)
|
15 |
|
16 |
+
def clear_data_directories(base_dir=None):
    """Delete everything inside ``data/chat_history`` and ``data/vector_store``.

    Args:
        base_dir: Root directory that contains the ``data`` folder. Defaults
            to the directory this script lives in (module-level ``current_dir``).

    Prints each removed entry and a final summary. A failure while clearing one
    directory is reported but does not abort cleanup of the other.
    """
    root = Path(base_dir) if base_dir is not None else Path(current_dir)
    data_dir = root / "data"

    directories_to_clear = [
        data_dir / "chat_history",
        data_dir / "vector_store",
    ]

    cleared_count = 0
    for directory in directories_to_clear:
        if not directory.exists():
            print(f"ℹ️ Directory doesn't exist: {directory}")
            continue
        try:
            # Count each entry as it is removed. (Previously the count was
            # taken with directory.glob("*") AFTER the directory had been
            # emptied, so the total was always 0 and the summary wrongly
            # reported "No data found to clear.")
            for item in directory.iterdir():
                if item.is_file():
                    item.unlink()
                    print(f"🗑️ Removed file: {item}")
                elif item.is_dir():
                    shutil.rmtree(item)
                    print(f"🗑️ Removed directory: {item}")
                cleared_count += 1
            print(f"✅ Cleared directory: {directory}")
        except Exception as e:
            print(f"❌ Error clearing {directory}: {e}")

    if cleared_count == 0:
        print("ℹ️ No data found to clear.")
    else:
        print(f"🎉 Successfully cleared {cleared_count} items from data directories!")
48 |
|
49 |
+
def main_with_args():
    """Parse command-line options, optionally wipe stored data, then launch the app."""
    parser = argparse.ArgumentParser(
        description="Markit v2 - Document to Markdown Converter with RAG Chat",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python run_app.py                      # Run the app normally
  python run_app.py --clear-data         # Clear all data and exit
  python run_app.py --clear-data-and-run # Clear data then run the app
        """
    )
    parser.add_argument(
        "--clear-data",
        action="store_true",
        help="Clear all data directories (chat_history, vector_store) and exit"
    )
    parser.add_argument(
        "--clear-data-and-run",
        action="store_true",
        help="Clear all data directories then run the app"
    )
    opts = parser.parse_args()

    # Either flag triggers a wipe of chat_history / vector_store first.
    if opts.clear_data or opts.clear_data_and_run:
        divider = "=" * 50
        print("🧹 Clearing data directories...")
        print(divider)
        clear_data_directories()
        print(divider)

        if opts.clear_data:
            # --clear-data is cleanup-only: stop before starting the UI.
            print("✅ Data clearing completed. Exiting.")
            return
        # Only --clear-data-and-run can reach here: continue into app startup.
        print("✅ Data clearing completed. Starting app...")
        print()

    # Best-effort .env loading; a missing python-dotenv package is not fatal.
    try:
        from dotenv import load_dotenv
        load_dotenv()
        print("Loaded environment variables from .env file")
    except ImportError:
        print("python-dotenv not installed, skipping .env file loading")

    # Deferred import so argument parsing / data clearing stay fast even when
    # the app's heavy dependencies are slow to import.
    from src.main import main
    main()
101 |
|
102 |
# Entry point: route through main_with_args so the --clear-data /
# --clear-data-and-run CLI flags are honored before the app starts.
if __name__ == "__main__":
    main_with_args()
|
src/core/config.py
CHANGED
@@ -104,7 +104,7 @@ class RAGConfig:
|
|
104 |
# LLM settings for RAG
|
105 |
rag_model: str = "gemini-2.5-flash"
|
106 |
rag_temperature: float = 0.1
|
107 |
-
rag_max_tokens: int =
|
108 |
|
109 |
def __post_init__(self):
|
110 |
"""Load RAG configuration from environment variables."""
|
|
|
104 |
# LLM settings for RAG
|
105 |
rag_model: str = "gemini-2.5-flash"
|
106 |
rag_temperature: float = 0.1
|
107 |
+
rag_max_tokens: int = 32768
|
108 |
|
109 |
def __post_init__(self):
|
110 |
"""Load RAG configuration from environment variables."""
|
src/parsers/gemini_flash_parser.py
CHANGED
@@ -93,10 +93,10 @@ class GeminiFlashParser(DocumentParser):
|
|
93 |
)
|
94 |
],
|
95 |
config={
|
96 |
-
"temperature":
|
97 |
"top_p": 0.95,
|
98 |
"top_k": 40,
|
99 |
-
"max_output_tokens":
|
100 |
}
|
101 |
)
|
102 |
|
|
|
93 |
)
|
94 |
],
|
95 |
config={
|
96 |
+
"temperature": config.model.temperature,
|
97 |
"top_p": 0.95,
|
98 |
"top_k": 40,
|
99 |
+
"max_output_tokens": config.model.max_tokens,
|
100 |
}
|
101 |
)
|
102 |
|
src/parsers/mistral_ocr_parser.py
CHANGED
@@ -256,6 +256,8 @@ class MistralOcrParser(DocumentParser):
|
|
256 |
# Send to chat completion API with document understanding prompt
|
257 |
chat_response = client.chat.complete(
|
258 |
model="mistral-large-latest",
|
|
|
|
|
259 |
messages=[
|
260 |
{
|
261 |
"role": "user",
|
@@ -290,6 +292,8 @@ class MistralOcrParser(DocumentParser):
|
|
290 |
# Use the chat API with the image for document understanding
|
291 |
chat_response = client.chat.complete(
|
292 |
model="mistral-large-latest",
|
|
|
|
|
293 |
messages=[
|
294 |
{
|
295 |
"role": "user",
|
|
|
256 |
# Send to chat completion API with document understanding prompt
|
257 |
chat_response = client.chat.complete(
|
258 |
model="mistral-large-latest",
|
259 |
+
max_tokens=config.model.max_tokens,
|
260 |
+
temperature=config.model.temperature,
|
261 |
messages=[
|
262 |
{
|
263 |
"role": "user",
|
|
|
292 |
# Use the chat API with the image for document understanding
|
293 |
chat_response = client.chat.complete(
|
294 |
model="mistral-large-latest",
|
295 |
+
max_tokens=config.model.max_tokens,
|
296 |
+
temperature=config.model.temperature,
|
297 |
messages=[
|
298 |
{
|
299 |
"role": "user",
|
src/rag/chat_service.py
CHANGED
@@ -119,8 +119,8 @@ class RAGChatService:
|
|
119 |
self._llm = ChatGoogleGenerativeAI(
|
120 |
model="gemini-2.5-flash", # Latest Gemini model
|
121 |
google_api_key=google_api_key,
|
122 |
-
temperature=
|
123 |
-
max_tokens=
|
124 |
disable_streaming=False # Enable streaming (new parameter name)
|
125 |
)
|
126 |
|
@@ -144,15 +144,16 @@ class RAGChatService:
|
|
144 |
|
145 |
# Create a prompt template for RAG
|
146 |
prompt_template = ChatPromptTemplate.from_template("""
|
147 |
-
You are a helpful assistant that
|
148 |
|
149 |
Instructions:
|
150 |
-
1. Use the context
|
151 |
-
2.
|
152 |
-
3.
|
153 |
-
4.
|
154 |
-
5.
|
155 |
-
6.
|
|
|
156 |
|
157 |
Context from documents:
|
158 |
{context}
|
@@ -160,7 +161,7 @@ Context from documents:
|
|
160 |
Chat History:
|
161 |
{chat_history}
|
162 |
|
163 |
-
User
|
164 |
""")
|
165 |
|
166 |
def format_docs(docs: List[Document]) -> str:
|
|
|
119 |
self._llm = ChatGoogleGenerativeAI(
|
120 |
model="gemini-2.5-flash", # Latest Gemini model
|
121 |
google_api_key=google_api_key,
|
122 |
+
temperature=config.rag.rag_temperature,
|
123 |
+
max_tokens=config.rag.rag_max_tokens,
|
124 |
disable_streaming=False # Enable streaming (new parameter name)
|
125 |
)
|
126 |
|
|
|
144 |
|
145 |
# Create a prompt template for RAG
|
146 |
prompt_template = ChatPromptTemplate.from_template("""
|
147 |
+
You are a helpful assistant that can chat naturally while specializing in answering questions about uploaded documents.
|
148 |
|
149 |
Instructions:
|
150 |
+
1. For document-related questions: Use the provided context to give comprehensive answers and always cite your sources
|
151 |
+
2. For conversational interactions (greetings, introductions, clarifications, follow-ups): Respond naturally and helpfully
|
152 |
+
3. For questions about topics not covered in the documents: Politely explain that you specialize in the uploaded documents but can still have a conversation
|
153 |
+
4. When using document information, always cite which parts of the documents you referenced
|
154 |
+
5. Include relevant tables and code blocks when they help answer the question
|
155 |
+
6. Be conversational, friendly, and helpful
|
156 |
+
7. Remember information shared in our conversation (like names, preferences, etc.)
|
157 |
|
158 |
Context from documents:
|
159 |
{context}
|
|
|
161 |
Chat History:
|
162 |
{chat_history}
|
163 |
|
164 |
+
User Message: {question}
|
165 |
""")
|
166 |
|
167 |
def format_docs(docs: List[Document]) -> str:
|
src/ui/ui.py
CHANGED
@@ -191,7 +191,7 @@ def handle_convert(file_path, parser_name, ocr_method_name, output_format, is_ca
|
|
191 |
def handle_chat_message(message, history):
|
192 |
"""Handle a new chat message with streaming response."""
|
193 |
if not message or not message.strip():
|
194 |
-
return "", history
|
195 |
|
196 |
try:
|
197 |
# Add user message to history
|
@@ -207,10 +207,16 @@ def handle_chat_message(message, history):
|
|
207 |
response_text += chunk
|
208 |
# Update the last message in history with the current response
|
209 |
history[-1]["content"] = response_text
|
210 |
-
|
|
|
|
|
211 |
|
212 |
logger.info(f"Chat response completed for message: {message[:50]}...")
|
213 |
|
|
|
|
|
|
|
|
|
214 |
except Exception as e:
|
215 |
error_msg = f"Error generating response: {str(e)}"
|
216 |
logger.error(error_msg)
|
@@ -221,7 +227,9 @@ def handle_chat_message(message, history):
|
|
221 |
{"role": "user", "content": message},
|
222 |
{"role": "assistant", "content": f"❌ {error_msg}"}
|
223 |
]
|
224 |
-
|
|
|
|
|
225 |
|
226 |
def start_new_chat_session():
|
227 |
"""Start a new chat session."""
|
@@ -455,20 +463,33 @@ def create_ui():
|
|
455 |
font-weight: 500;
|
456 |
flex: 1;
|
457 |
min-width: 200px;
|
|
|
|
|
|
|
|
|
|
|
458 |
}
|
459 |
|
460 |
.service-ready {
|
461 |
background: #d4edda;
|
462 |
-
color: #
|
463 |
border: 1px solid #c3e6cb;
|
464 |
}
|
465 |
|
|
|
|
|
|
|
|
|
466 |
.service-error {
|
467 |
background: #f8d7da;
|
468 |
-
color: #
|
469 |
border: 1px solid #f5c6cb;
|
470 |
}
|
471 |
|
|
|
|
|
|
|
|
|
472 |
.service-icon {
|
473 |
font-size: 1.2em;
|
474 |
}
|
@@ -826,25 +847,26 @@ def create_ui():
|
|
826 |
msg_input.submit(
|
827 |
fn=handle_chat_message,
|
828 |
inputs=[msg_input, chatbot],
|
829 |
-
outputs=[msg_input, chatbot]
|
830 |
)
|
831 |
|
832 |
send_btn.click(
|
833 |
fn=handle_chat_message,
|
834 |
inputs=[msg_input, chatbot],
|
835 |
-
outputs=[msg_input, chatbot]
|
836 |
)
|
837 |
|
838 |
# New session handler with improved feedback
|
839 |
def enhanced_new_session():
|
840 |
history, info = start_new_chat_session()
|
841 |
session_html = f'<div class="session-info">{info}</div>'
|
842 |
-
|
|
|
843 |
|
844 |
new_session_btn.click(
|
845 |
fn=enhanced_new_session,
|
846 |
inputs=[],
|
847 |
-
outputs=[chatbot, session_info]
|
848 |
)
|
849 |
|
850 |
# Refresh status handler
|
|
|
191 |
def handle_chat_message(message, history):
|
192 |
"""Handle a new chat message with streaming response."""
|
193 |
if not message or not message.strip():
|
194 |
+
return "", history, gr.update()
|
195 |
|
196 |
try:
|
197 |
# Add user message to history
|
|
|
207 |
response_text += chunk
|
208 |
# Update the last message in history with the current response
|
209 |
history[-1]["content"] = response_text
|
210 |
+
# Update status in real-time during streaming
|
211 |
+
updated_status = get_chat_status()
|
212 |
+
yield "", history, updated_status
|
213 |
|
214 |
logger.info(f"Chat response completed for message: {message[:50]}...")
|
215 |
|
216 |
+
# Final status update after message completion
|
217 |
+
final_status = get_chat_status()
|
218 |
+
yield "", history, final_status
|
219 |
+
|
220 |
except Exception as e:
|
221 |
error_msg = f"Error generating response: {str(e)}"
|
222 |
logger.error(error_msg)
|
|
|
227 |
{"role": "user", "content": message},
|
228 |
{"role": "assistant", "content": f"❌ {error_msg}"}
|
229 |
]
|
230 |
+
# Update status even on error
|
231 |
+
error_status = get_chat_status()
|
232 |
+
yield "", history, error_status
|
233 |
|
234 |
def start_new_chat_session():
|
235 |
"""Start a new chat session."""
|
|
|
463 |
font-weight: 500;
|
464 |
flex: 1;
|
465 |
min-width: 200px;
|
466 |
+
color: #2c3e50 !important;
|
467 |
+
}
|
468 |
+
|
469 |
+
.service-status span {
|
470 |
+
color: #2c3e50 !important;
|
471 |
}
|
472 |
|
473 |
.service-ready {
|
474 |
background: #d4edda;
|
475 |
+
color: #2c3e50 !important;
|
476 |
border: 1px solid #c3e6cb;
|
477 |
}
|
478 |
|
479 |
+
.service-ready span {
|
480 |
+
color: #2c3e50 !important;
|
481 |
+
}
|
482 |
+
|
483 |
.service-error {
|
484 |
background: #f8d7da;
|
485 |
+
color: #2c3e50 !important;
|
486 |
border: 1px solid #f5c6cb;
|
487 |
}
|
488 |
|
489 |
+
.service-error span {
|
490 |
+
color: #2c3e50 !important;
|
491 |
+
}
|
492 |
+
|
493 |
.service-icon {
|
494 |
font-size: 1.2em;
|
495 |
}
|
|
|
847 |
msg_input.submit(
|
848 |
fn=handle_chat_message,
|
849 |
inputs=[msg_input, chatbot],
|
850 |
+
outputs=[msg_input, chatbot, status_display]
|
851 |
)
|
852 |
|
853 |
send_btn.click(
|
854 |
fn=handle_chat_message,
|
855 |
inputs=[msg_input, chatbot],
|
856 |
+
outputs=[msg_input, chatbot, status_display]
|
857 |
)
|
858 |
|
859 |
# New session handler with improved feedback
|
860 |
def enhanced_new_session():
    # Reset the chat session, then refresh both the session banner and the
    # service-status panel so the UI reflects the clean state immediately.
    fresh_history, session_note = start_new_chat_session()
    status_markup = get_chat_status()
    return (
        fresh_history,
        f'<div class="session-info">{session_note}</div>',
        status_markup,
    )
865 |
|
866 |
new_session_btn.click(
|
867 |
fn=enhanced_new_session,
|
868 |
inputs=[],
|
869 |
+
outputs=[chatbot, session_info, status_display]
|
870 |
)
|
871 |
|
872 |
# Refresh status handler
|