Deploy RAG-Scraper application to HuggingFace Space
Files changed:
- Dockerfile   +37  -0
- README.md    +58  -88
- app.py       +245 -143
Dockerfile — ADDED (37 lines)

```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Set the working directory in the container
WORKDIR /app

# Install system dependencies for Node.js installation
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Add Node.js LTS repository and install Node.js and npm
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
    && apt-get install -y nodejs

# Install repomix globally using npm
RUN npm install -g repomix

# Copy the requirements file into the container
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY . .

# Make port 7860 available to the world outside this container
EXPOSE 7860

# Define environment variable for Gradio server
ENV GRADIO_SERVER_NAME="0.0.0.0"
ENV GRADIO_SERVER_PORT="7860"

# Run app.py when the container launches
CMD ["python", "app.py"]
```
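The two `ENV` lines work because Gradio reads `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` at startup, so `iface.launch()` in `app.py` binds to `0.0.0.0:7860`, the address and port HuggingFace Spaces expects, without extra arguments. A minimal sketch of the equivalent explicit configuration, shown for illustration only and not part of the repository:

```python
import os
import gradio as gr

# Passing server_name/server_port explicitly is equivalent to relying on the
# GRADIO_SERVER_NAME / GRADIO_SERVER_PORT environment variables set in the Dockerfile.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(
    server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
)
```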
README.md — CHANGED (115 → 85 lines)
Removed from the previous README: the old front matter (`title: RAG-Scraper`, `sdk_version: 5.29.1`, `short_description: Scrape webpages`), the old feature, requirements, and examples lists, the previous "Using the Interface" walkthrough (URL or GitHub repository input, search depth 0–3, and an auto/website/github input-type selector), and a limitations section explaining that GitHub repository processing was unavailable on HuggingFace Spaces because Node.js and npm/npx could not be executed there, so Repomix only worked in local installs. The Docker-based deployment introduced by this commit makes that limitations section obsolete. The rewritten README follows:
---
title: RAG-Ready Content Scraper
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: MIT
short_description: Scrape webpages or GitHub repos to generate RAG-ready datasets.
---

# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool, enhanced with a Gradio interface and Docker support, designed for efficiently scraping web content and GitHub repositories. It's tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats like Markdown, JSON, and CSV.

This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.

## Features

- **Dual Scraping Modes**:
  - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: A "How it Works" section provides guidance.

## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for Repomix GitHub repository processing)
- Repomix (can be installed globally with `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`

## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.

1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space builds the Docker image and launches the application. All features, including GitHub repository processing with RepoMix, will be available.

## Using the Interface

1. **Enter URL or GitHub Repository ID**:
   * For websites: enter a complete URL (e.g., `https://example.com`).
   * For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2. **Select Source Type**:
   * Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (for webpages only):
   * 0: only scrape the main page.
   * 1–3: follow internal links recursively to the specified depth (ignored for GitHub repos).
4. **Select Output Format**:
   * Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: monitor progress and see a preview of the extracted content.
7. **Download File**: download the generated dataset in your chosen format.
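Beyond the UI, the Space can also be driven programmatically with the `gradio_client` library. The sketch below is an illustration under stated assumptions: the endpoint name `/process_input_updated` is a guess based on Gradio's default of naming API endpoints after the handler function, so check the Space's "Use via API" page for the actual name before relying on it.

```python
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")
# Arguments mirror the UI inputs: URL or repo ID, source type, depth, output format.
# api_name is an assumption (Gradio's default is the handler function's name).
status, preview, file_path = client.predict(
    "https://example.com", "Webpage", 0, "Markdown",
    api_name="/process_input_updated",
)
print(status)
print(file_path)
```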
## How It Works

### Webpage Scraping

1. Fetches HTML content from the provided URL.
2. Converts the HTML to clean Markdown.
3. If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps it as Markdown).
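In `app.py` this loop reduces to a fetch-then-convert pair on each page. The sketch below uses the same `rag_scraper` calls the app makes (`Scraper.fetch_html`, `Converter.html_to_markdown`, `LinkExtractor.scrape_url`); it is a simplified single-page illustration, not the full recursive implementation:

```python
from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter
from rag_scraper.link_extractor import LinkExtractor, LinkType

url = "https://example.com"
html = Scraper.fetch_html(url)                        # 1. fetch the HTML
markdown = Converter.html_to_markdown(                # 2. convert it to Markdown
    html=html,
    base_url=url,
    parser_features="html.parser",
    ignore_links=True,
)
links = LinkExtractor.scrape_url(url, link_type=LinkType.INTERNAL)  # 3. links to follow if depth > 0
print(markdown[:200])
print(list(links)[:5])
```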
### GitHub Repository Processing

1. Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
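Under the hood this is a single RepoMix invocation. The sketch below mirrors the command `app.py` builds (`--remote`, `--output`, `--style markdown`, `--compress`) and assumes `repomix` is on the PATH, as it is inside the Docker image:

```python
import os
import subprocess
import tempfile

repo_url = "https://github.com/yamadashy/repomix"
out_path = os.path.join(tempfile.mkdtemp(), "repomix-output.md")

# Same flags app.py uses; check=False so stderr can be inspected on failure.
result = subprocess.run(
    ["repomix", "--remote", repo_url, "--output", out_path, "--style", "markdown", "--compress"],
    capture_output=True, text=True, check=False,
)
if result.returncode == 0 and os.path.exists(out_path):
    with open(out_path, encoding="utf-8") as f:
        print(f.read()[:300])
else:
    print("Repomix failed:", result.stderr)
```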
## Source Code

The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)

## License

This project is licensed under the MIT License.
app.py — CHANGED (204 → 306 lines)
Removed from the previous app.py: the `extract_repo_info()` and `is_running_on_huggingface()` helpers (the latter checked the `SPACE_ID` environment variable), the guard that made `check_repomix_installed()` always return False when running on HuggingFace Spaces, the old `process_input()` handler, and the old `gr.Interface`-based UI (a URL/repository textbox, a 0–3 search-depth slider, an auto/website/github input-type radio, and examples such as `https://example.com` and `yamadashy/repomix`). The rewritten module keeps `is_github_repo()` and `check_repomix_installed()`, adds `json` and `csv` imports, extends `run_repomix()` and the web-scraping logic with progress reporting, adds JSON/CSV conversion helpers, and rebuilds the UI with `gr.Blocks`. The new file follows; the leading `import gradio as gr` and `import subprocess` lines sit outside the diff hunks (unchanged) and are included so the listing reads as a complete file:
```python
import gradio as gr
import subprocess
import os
import re
import tempfile
import json
import csv
from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter
from rag_scraper.link_extractor import LinkExtractor, LinkType
from rag_scraper.utils import URLUtils

def is_github_repo(url_or_id):
    """Check if the input is a GitHub repository URL or ID."""
    if "github.com" in url_or_id:
        return True
    if re.match(r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$', url_or_id):
        return True
    return False

def check_repomix_installed():
    """Check if Repomix is installed."""
    try:
        result = subprocess.run(["repomix", "--version"],
                                capture_output=True, text=True, check=False)
        return result.returncode == 0
    except Exception:
        return False

def run_repomix(repo_url_or_id, progress=gr.Progress(track_tqdm=True)):
    """Run Repomix on the GitHub repository and return the content."""
    progress(0, desc="Starting Repomix processing...")
    try:
        with tempfile.TemporaryDirectory() as temp_dir:
            # RepoMix typically outputs a zip file if not specifying a single output style,
            # or a specific file if --style is used.
            # For simplicity, let's assume we want markdown and it outputs to a known file or stdout.
            # The current repomix command in the original script uses --style markdown and --output.
            output_file_name = "repomix-output.md"  # Assuming markdown output
            output_file_path = os.path.join(temp_dir, output_file_name)

            if '/' in repo_url_or_id and not repo_url_or_id.startswith('http'):
                repo_url = f"https://github.com/{repo_url_or_id}"
            else:
                repo_url = repo_url_or_id

            progress(0.2, desc=f"Running Repomix on {repo_url}...")
            cmd = [
                "repomix",
                "--remote", repo_url,
                "--output", output_file_path,  # Direct output to a file
                "--style", "markdown",         # Explicitly request markdown
                "--compress"
            ]

            process = subprocess.run(cmd, capture_output=True, text=True, check=False)
            progress(0.8, desc="Repomix command executed.")

            if process.returncode != 0:
                return f"Error running Repomix: {process.stderr}", None

            if os.path.exists(output_file_path):
                with open(output_file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                progress(1, desc="Repomix output processed.")
                return content, output_file_path  # Return content and path for potential download
            else:
                return "Error: Repomix did not generate an output file.", None

    except Exception as e:
        progress(1, desc="Error during Repomix processing.")
        return f"Error processing GitHub repository: {str(e)}", None

def scrape_and_convert_website(url, depth, progress=gr.Progress(track_tqdm=True)):
    """Fetch HTML, extract links, convert to Markdown."""
    progress(0, desc=f"Starting web scrape for {url}...")
    visited_urls = set()
    all_markdown_content = ""

    def recursive_scrape(current_url, current_depth, total_links_estimate=1, link_index=0):
        if current_url in visited_urls or current_depth < 0:
            return ""

        visited_urls.add(current_url)

        try:
            progress_val = link_index / total_links_estimate if total_links_estimate > 0 else 0
            progress(progress_val, desc=f"Scraping: {current_url} (Depth: {depth - current_depth})")
            html_content = Scraper.fetch_html(current_url)
        except Exception as e:
            return f"Error fetching {current_url}: {str(e)}\n"

        markdown_content = f"## Extracted from: {current_url}\n\n"
        markdown_content += Converter.html_to_markdown(
            html=html_content,
            base_url=current_url,
            parser_features='html.parser',
            ignore_links=True
        )

        page_content = markdown_content + "\n\n"

        if current_depth > 0:
            try:
                links = LinkExtractor.scrape_url(current_url, link_type=LinkType.INTERNAL)
                # Filter out already visited links and external links more carefully
                valid_links = [
                    link for link in links
                    if URLUtils.is_internal(link, current_url) and link not in visited_urls
                ]

                num_links = len(valid_links)
                for i, link_url in enumerate(valid_links):
                    page_content += recursive_scrape(link_url, current_depth - 1, num_links, i)
            except Exception as e:
                page_content += f"Error extracting links from {current_url}: {str(e)}\n"
        return page_content

    all_markdown_content = recursive_scrape(url, depth)
    progress(1, desc="Web scraping complete.")

    # For web scraping, we create a temporary file with the content for download
    with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=".md", encoding="utf-8") as tmp_file:
        tmp_file.write(all_markdown_content)
        return all_markdown_content, tmp_file.name


# --- Data Conversion Functions (Stubs for now) ---
def convert_to_json(markdown_content, source_url_or_id):
    """Converts markdown content to a JSON string."""
    # Basic implementation: create a JSON object with source and content
    # More sophisticated parsing can be added later
    data = {"source": source_url_or_id, "content": markdown_content}
    return json.dumps(data, indent=2)

def convert_to_csv(markdown_content, source_url_or_id):
    """Converts markdown content to a CSV string."""
    # Basic implementation: create a CSV with source and content
    # This is a simplified CSV; real CSVs might need more structure
    output = tempfile.NamedTemporaryFile(mode='w+', delete=False, newline='', suffix=".csv", encoding="utf-8")
    writer = csv.writer(output)
    writer.writerow(["source", "content"])  # Header

    # Split content into manageable chunks or lines if necessary for CSV
    # For now, putting all content in one cell.
    writer.writerow([source_url_or_id, markdown_content])
    output.close()
    return output.name  # Return path to the CSV file

def save_output_to_file(content, output_format, source_url_or_id):
    """Saves content to a temporary file based on format and returns its path."""
    suffix = f".{output_format.lower()}"
    if output_format == "JSON":
        processed_content = convert_to_json(content, source_url_or_id)
    elif output_format == "CSV":
        # convert_to_csv now returns a path directly
        return convert_to_csv(content, source_url_or_id)
    else:  # Markdown/Text
        processed_content = content
        suffix = ".md"

    with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=suffix, encoding="utf-8") as tmp_file:
        tmp_file.write(processed_content)
        return tmp_file.name

# --- Main Processing Function ---
def process_input_updated(url_or_id, source_type, depth, output_format_selection, progress=gr.Progress(track_tqdm=True)):
    """Main function to process URL or GitHub repo based on selected type and format."""
    progress(0, desc="Initializing...")
    raw_content = ""
    error_message = ""
    output_file_path = None

    if source_type == "GitHub Repository":
        if not check_repomix_installed():
            error_message = "Repomix is not installed or not accessible. Please ensure it's installed globally in your Docker environment."
            return error_message, None, None  # Text output, Preview, File output

        raw_content, _ = run_repomix(url_or_id, progress=progress)  # Repomix returns content and its original path
        if "Error" in raw_content:  # Simple error check
            error_message = raw_content
            raw_content = ""

    elif source_type == "Webpage":
        raw_content, _ = scrape_and_convert_website(url_or_id, depth, progress=progress)
        if "Error" in raw_content:  # Simple error check
            error_message = raw_content
            raw_content = ""
    else:
        error_message = "Invalid source type selected."
        return error_message, None, None

    if error_message:
        return error_message, None, None  # Error text, no preview, no file

    # Save raw_content (which is markdown) to a file of the chosen output_format
    # This will handle conversion if necessary
    try:
        progress(0.9, desc=f"Converting to {output_format_selection}...")
        output_file_path = save_output_to_file(raw_content, output_format_selection, url_or_id)

        # For preview, we'll show the raw markdown, or a snippet of JSON/CSV
        preview_content = raw_content  # Default to markdown
        if output_format_selection == "JSON":
            preview_content = convert_to_json(raw_content, url_or_id)
        elif output_format_selection == "CSV":
            # For CSV preview, maybe just show a message or first few lines
            preview_content = f"CSV file generated. Path: {output_file_path}\nFirst few lines might be shown here in a real app."
            # Or read a bit of the CSV for preview:
            # with open(output_file_path, 'r', encoding='utf-8') as f_csv:
            #     preview_content = "".join(f_csv.readlines()[:5])

        progress(1, desc="Processing complete.")
        return f"Successfully processed: {url_or_id}", preview_content, output_file_path
    except Exception as e:
        return f"Error during file conversion/saving: {str(e)}", raw_content, None


# --- Gradio Interface Definition ---
with gr.Blocks(theme=gr.themes.Soft()) as iface:
    gr.Markdown("# RAG-Ready Content Scraper")
    gr.Markdown(
        "Scrape webpage content (using RAG-scraper) or GitHub repositories (using RepoMix) "
        "to generate RAG-ready datasets. Uses Docker for full functionality on HuggingFace Spaces."
    )

    with gr.Row():
        with gr.Column(scale=2):
            url_input = gr.Textbox(
                label="Enter URL or GitHub Repository ID",
                placeholder="e.g., https://example.com OR username/repo"
            )
            source_type_input = gr.Radio(
                choices=["Webpage", "GitHub Repository"],
                value="Webpage",
                label="Select Source Type"
            )
            depth_input = gr.Slider(
                minimum=0, maximum=3, step=1, value=0,
                label="Scraping Depth (for Webpages)",
                info="0: Only main page. Ignored for GitHub repos."
            )
            output_format_input = gr.Dropdown(
                choices=["Markdown", "JSON", "CSV"],  # Markdown is like text file
                value="Markdown",
                label="Select Output Format"
            )
            submit_button = gr.Button("Process Content", variant="primary")

        with gr.Column(scale=3):
            status_output = gr.Textbox(label="Status", interactive=False)
            preview_output = gr.Code(label="Preview Content", language="markdown", interactive=False)  # Default to markdown, can show JSON too
            file_download_output = gr.File(label="Download Processed File", interactive=False)

    # Progress updates are driven by the gr.Progress(track_tqdm=True) default
    # argument on the handler functions; no separate progress component is needed.

    # --- Examples ---
    gr.Examples(
        examples=[
            ["https://gradio.app/docs/js", "Webpage", 1, "Markdown"],
            ["gradio-app/gradio", "GitHub Repository", 0, "Markdown"],
            ["https://en.wikipedia.org/wiki/Retrieval-augmented_generation", "Webpage", 0, "JSON"],
        ],
        inputs=[url_input, source_type_input, depth_input, output_format_input],
        outputs=[status_output, preview_output, file_download_output],  # Function needs to match this
        fn=process_input_updated,  # Make sure the function signature matches
        cache_examples=False  # For development, disable caching
    )

    # --- How it Works & GitHub Link ---
    with gr.Accordion("How it Works & More Info", open=False):
        gr.Markdown(
            """
            **Webpage Scraping:**
            1. Enter a full URL (e.g., `https://example.com`).
            2. Select "Webpage" as the source type.
            3. Set the desired scraping depth (how many levels of internal links to follow).
            4. Choose your output format.
            5. The tool fetches HTML, converts it to Markdown, and follows internal links up to the specified depth.

            **GitHub Repository Processing:**
            1. Enter a GitHub repository URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
            2. Select "GitHub Repository" as the source type. (Scraping depth is ignored).
            3. Choose your output format.
            4. The tool uses **RepoMix** to fetch and process the repository into a structured Markdown format.

            **Output Formats:**
            - **Markdown:** Plain text Markdown file, suitable for direct reading or further processing.
            - **JSON:** Structured JSON output, typically with fields like `source` and `content`.
            - **CSV:** Comma-Separated Values file, useful for tabular data or importing into spreadsheets.

            **Note on HuggingFace Spaces:** This application is designed to run in a Docker-based HuggingFace Space,
            which allows the use of `RepoMix` for GitHub repositories.

            [View Source Code on HuggingFace Spaces](https://huggingface.co/spaces/CultriX/RAG-Scraper)
            """
        )

    submit_button.click(
        fn=process_input_updated,
        # progress is injected through the handler's gr.Progress default argument,
        # so only the four UI components are passed as inputs.
        inputs=[url_input, source_type_input, depth_input, output_format_input],
        outputs=[status_output, preview_output, file_download_output]
    )

if __name__ == "__main__":
    iface.launch()
```
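As a quick sanity check of the conversion helpers above, they can be exercised directly from a Python shell in the same environment. This snippet is illustrative only and not part of the commit; it assumes `app.py` and its dependencies (`gradio`, `rag_scraper`) are importable from the working directory:

```python
# Importing app builds the Blocks UI but does not launch it (no __main__ guard trigger).
from app import convert_to_json, save_output_to_file

md = "## Extracted from: https://example.com\n\nHello world."
print(convert_to_json(md, "https://example.com"))              # JSON with "source" / "content" fields
print(save_output_to_file(md, "JSON", "https://example.com"))  # path to a temporary .json file
```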