CultriX committed
Commit 2d6afaa · 1 Parent(s): 726d91f

Deploy RAG-Scraper application to HuggingFace Space

Files changed (3)
  1. Dockerfile +37 -0
  2. README.md +58 -88
  3. app.py +245 -143
Dockerfile ADDED
@@ -0,0 +1,37 @@
+ # Use an official Python runtime as a parent image
+ FROM python:3.10-slim
+
+ # Set the working directory in the container
+ WORKDIR /app
+
+ # Install system dependencies for Node.js installation
+ RUN apt-get update && apt-get install -y \
+     curl \
+     gnupg \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Add Node.js LTS repository and install Node.js and npm
+ RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
+     && apt-get install -y nodejs
+
+ # Install repomix globally using npm
+ RUN npm install -g repomix
+
+ # Copy the requirements file into the container
+ COPY requirements.txt .
+
+ # Install any needed packages specified in requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the application code into the container
+ COPY . .
+
+ # Make port 7860 available to the world outside this container
+ EXPOSE 7860
+
+ # Define environment variables for the Gradio server
+ ENV GRADIO_SERVER_NAME="0.0.0.0"
+ ENV GRADIO_SERVER_PORT="7860"
+
+ # Run app.py when the container launches
+ CMD ["python", "app.py"]
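The two `ENV` lines matter because `app.py` ends with a bare `iface.launch()`. A minimal sketch (not part of this commit) of how those variables map onto an explicit `launch()` call, assuming Gradio's behaviour of reading `GRADIO_SERVER_NAME` / `GRADIO_SERVER_PORT` from the environment:

```python
# Hypothetical standalone check: bind a trivial Gradio app the same way the
# Dockerfile's ENV settings configure the real application.
import os
import gradio as gr

def echo(text: str) -> str:
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")

# launch() also picks these env vars up on its own; passing them explicitly
# just makes the container binding visible.
demo.launch(
    server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
)
```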
README.md CHANGED
@@ -1,115 +1,85 @@
  ---
- title: RAG-Scraper
- emoji: 🥳
  colorFrom: blue
- colorTo: gray
- sdk: gradio
- sdk_version: 5.29.1
  app_file: app.py
  pinned: false
- license: creativeml-openrail-m
- short_description: Scrape webpages for RAG purposes
  ---

- # RAG-Scraper

- RAG-Scraper is a Python tool designed for efficient and intelligent scraping of web documentation and content. It's tailored for Retrieval-Augmented Generation systems, extracting and preprocessing text into structured, machine-learning-ready formats.

  ## Features

- - **Web Scraping**: Scrape web content and convert it to Markdown format
- - **Recursive Depth**: Control how deep the scraper should follow links
- - **GitHub Repository Support**: Process GitHub repositories using Repomix to create AI-friendly outputs (when run locally)
- - **Gradio Interface**: Easy-to-use web interface for all functionality
- - **HuggingFace Spaces Compatible**: Can be deployed as a HuggingFace Space (with limited functionality)

- ## Requirements

  - Python 3.10+
- - Node.js (for Repomix GitHub repository processing)
- - Repomix (installed via npm or used with npx)
-
- ## Installation
-
- 1. Clone the repository:
- ```bash
- git clone https://github.com/yourusername/RAG-Scraper.git
- cd RAG-Scraper
- ```
-
- 2. Install Python dependencies:
- ```bash
- pip install -r requirements.txt
- ```
-
- 3. For GitHub repository processing, ensure Node.js is installed and either:
-    - Install Repomix globally: `npm install -g repomix`
-    - Or use npx to run it without installation (the app supports this)
-
- ## Usage
-
- ### Running the Gradio Interface
-
- ```bash
- python app.py
- ```
-
- This will start the Gradio web interface, accessible at http://localhost:7860 by default.
-
- ### Using the Interface
-
- 1. **Enter a URL or GitHub Repository**:
-    - For websites: Enter a complete URL (e.g., `https://example.com`)
-    - For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand notation (e.g., `username/repo`)
-
- 2. **Set Search Depth** (for websites only):
-    - 0: Only scrape the main page
-    - 1-3: Follow links recursively to the specified depth
-
- 3. **Select Input Type**:
-    - Auto: Automatically detect if the input is a website or GitHub repository
-    - Website: Force processing as a website
-    - GitHub: Force processing as a GitHub repository
-
- 4. **Click Submit** to process the input and view the results

  ## How It Works

- ### Web Scraping

- For websites, RAG-Scraper:
- 1. Fetches the HTML content from the URL
- 2. Converts the HTML to Markdown
- 3. If depth > 0, extracts internal links and repeats the process for each link

  ### GitHub Repository Processing

- For GitHub repositories, RAG-Scraper:
- 1. Detects if the input is a GitHub repository URL or ID
- 2. Uses Repomix to fetch and process the repository
- 3. Returns the repository content in a structured, AI-friendly format
-
- ## Examples
-
- The interface includes example inputs to demonstrate both web scraping and GitHub repository processing:
- - `https://example.com` - Basic website example
- - `yamadashy/repomix` - GitHub repository using shorthand notation
- - `https://github.com/yamadashy/repomix` - GitHub repository using full URL

- ## HuggingFace Spaces Deployment

- This application can be deployed as a HuggingFace Space, but with some limitations:
-
- - **Web Scraping**: Fully functional for scraping websites and converting to Markdown
- - **GitHub Repository Processing**: Not available on HuggingFace Spaces due to the lack of Node.js and npm/npx command execution capabilities
- - **User Experience**: The interface will provide clear messages about feature availability
-
- When deployed on HuggingFace Spaces, the application will automatically detect the environment and provide appropriate messages to users attempting to use the GitHub repository processing feature.
-
- To use the full functionality including GitHub repository processing with Repomix, run the application locally following the installation instructions above.

  ## License

  This project is licensed under the MIT License.
-
  ---
+ title: RAG-Ready Content Scraper
+ emoji: 🚀
  colorFrom: blue
+ colorTo: green
+ sdk: docker
  app_file: app.py
  pinned: false
+ license: MIT
+ short_description: Scrape webpages or GitHub repos to generate RAG-ready datasets.
  ---

+ # RAG-Ready Content Scraper

+ RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support, designed to efficiently scrape web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.

+ This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.

  ## Features

+ - **Dual Scraping Modes**:
+   - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
+   - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
+ - **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
+ - **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
+ - **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
+ - **Pre-configured Examples**: Includes example inputs for quick testing.
+ - **In-UI Documentation**: A "How it Works" section provides guidance.

+ ## Requirements for Local Development (Optional)

  - Python 3.10+
+ - Node.js and npm (for Repomix GitHub repository processing)
+ - Repomix (can be installed globally with `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
+ - Project dependencies: `pip install -r requirements.txt`
+
+ ## HuggingFace Space Deployment
+
+ This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.
+
+ 1. **Create a new HuggingFace Space.**
+ 2. Choose **"Docker"** as the Space SDK.
+ 3. Select **"Use an existing Dockerfile"**.
+ 4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
+ 5. The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.
+
+ ## Using the Interface
+
+ 1. **Enter URL or GitHub Repository ID**:
+    * For websites: enter a complete URL (e.g., `https://example.com`).
+    * For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
+ 2. **Select Source Type**:
+    * Choose "Webpage" or "GitHub Repository".
+ 3. **Set Scraping Depth** (for webpages only):
+    * 0: only scrape the main page.
+    * 1-3: follow internal links recursively to the specified depth (ignored for GitHub repos).
+ 4. **Select Output Format**:
+    * Choose "Markdown", "JSON", or "CSV".
+ 5. **Click "Process Content"**.
+ 6. **View Status and Preview**: Monitor progress and see a preview of the extracted content.
+ 7. **Download File**: Download the generated dataset in your chosen format.

  ## How It Works

+ ### Webpage Scraping

+ 1. Fetches HTML content from the provided URL.
+ 2. Converts HTML to clean Markdown.
+ 3. If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
+ 4. Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps it as Markdown).

  ### GitHub Repository Processing

+ 1. Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
+ 2. RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
+ 3. This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).

+ ## Source Code

+ The source code for this project is available on HuggingFace Spaces:
+ [https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)

  ## License

  This project is licensed under the MIT License.
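Since the README promises RAG-ready JSON output, here is a minimal downstream sketch (not part of this commit) of loading that output and chunking it for an index. The `{"source", "content"}` layout matches `convert_to_json()` in the new `app.py` below; the file name and chunk size are arbitrary assumptions.

```python
# Load the app's JSON output and split the markdown into fixed-size chunks.
import json

def load_chunks(path: str, chunk_size: int = 1000) -> list:
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)  # {"source": ..., "content": ...}
    text = record["content"]
    return [
        {"source": record["source"], "text": text[i:i + chunk_size]}
        for i in range(0, len(text), chunk_size)
    ]

if __name__ == "__main__":
    chunks = load_chunks("rag_scraper_output.json")  # hypothetical file name
    print(len(chunks), "chunks" if chunks else "empty file")
```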
 
app.py CHANGED
@@ -3,6 +3,8 @@ import subprocess
  import os
  import re
  import tempfile
  from rag_scraper.scraper import Scraper
  from rag_scraper.converter import Converter
  from rag_scraper.link_extractor import LinkExtractor, LinkType
@@ -10,195 +12,295 @@ from rag_scraper.utils import URLUtils

  def is_github_repo(url_or_id):
      """Check if the input is a GitHub repository URL or ID."""
-     # Check for GitHub URL
      if "github.com" in url_or_id:
          return True
-
-     # Check for shorthand notation (username/repo)
      if re.match(r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$', url_or_id):
          return True
-
      return False

- def extract_repo_info(url_or_id):
-     """Extract repository owner and name from URL or ID."""
-     # Handle GitHub URLs
-     github_url_pattern = r'github\.com/([a-zA-Z0-9_.-]+)/([a-zA-Z0-9_.-]+)'
-     match = re.search(github_url_pattern, url_or_id)
-     if match:
-         return match.group(1), match.group(2)
-
-     # Handle shorthand notation (username/repo)
-     if '/' in url_or_id and not url_or_id.startswith('http'):
-         parts = url_or_id.split('/')
-         if len(parts) == 2:
-             return parts[0], parts[1]
-
-     return None, None
-
- def is_running_on_huggingface():
-     """Check if the app is running on HuggingFace Spaces."""
-     return os.environ.get('SPACE_ID') is not None
-
  def check_repomix_installed():
      """Check if Repomix is installed."""
-     # If running on HuggingFace Spaces, Repomix is likely not available
-     if is_running_on_huggingface():
-         return False
-
      try:
-         result = subprocess.run(["npx", "repomix", "--version"],
                                  capture_output=True, text=True, check=False)
          return result.returncode == 0
      except Exception:
          return False

- def run_repomix(repo_url_or_id, output_format="markdown"):
      """Run Repomix on the GitHub repository and return the content."""
      try:
-         # Create a temporary directory for the output
          with tempfile.TemporaryDirectory() as temp_dir:
-             output_file = os.path.join(temp_dir, f"repomix-output.{output_format}")

-             # Prepare the command
              if '/' in repo_url_or_id and not repo_url_or_id.startswith('http'):
-                 # Handle shorthand notation
                  repo_url = f"https://github.com/{repo_url_or_id}"
              else:
                  repo_url = repo_url_or_id

-             # Run Repomix
              cmd = [
-                 "npx", "repomix",
                  "--remote", repo_url,
-                 "--output", output_file,
-                 "--style", output_format,
-                 "--compress"  # Use compression for better token efficiency
              ]

              process = subprocess.run(cmd, capture_output=True, text=True, check=False)
-
              if process.returncode != 0:
-                 return f"Error running Repomix: {process.stderr}"

-             # Read the output file
-             if os.path.exists(output_file):
-                 with open(output_file, 'r', encoding='utf-8') as f:
-                     return f.read()
              else:
-                 return f"Error: Repomix did not generate an output file."

      except Exception as e:
-         return f"Error processing GitHub repository: {str(e)}"

- def process_input(url_or_id, depth, input_type="auto"):
-     """Process the input based on its type."""
-     try:
-         # Determine if this is a GitHub repository
-         is_github = is_github_repo(url_or_id) if input_type == "auto" else (input_type == "github")

-         if is_github:
-             # Check if running on HuggingFace Spaces
-             if is_running_on_huggingface():
-                 return (
-                     "GitHub repository processing with Repomix is not available on HuggingFace Spaces. "
-                     "This feature requires Node.js and the ability to run npm/npx commands, "
-                     "which are typically not available in the HuggingFace Spaces environment.\n\n"
-                     "You can still use the web scraping functionality for regular websites, "
-                     "or run this application locally to use the Repomix feature."
-                 )
-
-             # Check if Repomix is installed
-             if not check_repomix_installed():
-                 return (
-                     "Repomix is not installed or not accessible. "
-                     "Please install it using: npm install -g repomix\n"
-                     "Or you can run it without installation using: npx repomix"
-                 )
-
-             # Process GitHub repository with Repomix
-             return run_repomix(url_or_id, output_format="markdown")
-         else:
-             # Process regular URL with web scraping
-             return scrape_and_convert(url_or_id, depth)

-     except Exception as e:
-         return f"Error: {str(e)}"

- def scrape_and_convert(url, depth):
-     """Fetch HTML content, extract links recursively (up to given depth), and convert to Markdown."""
      try:
-         visited_urls = set()

-         def recursive_scrape(url, current_depth):
-             """Recursively scrape and convert pages up to the given depth."""
-             if url in visited_urls or current_depth < 0:
-                 return ""
-
-             visited_urls.add(url)

-             # Fetch HTML content
-             try:
-                 html_content = Scraper.fetch_html(url)
-             except Exception as e:
-                 return f"Error fetching {url}: {str(e)}\n"
-
-             # Convert to Markdown
-             markdown_content = f"## Extracted from: {url}\n\n"
-             markdown_content += Converter.html_to_markdown(
-                 html=html_content,
-                 base_url=url,
-                 parser_features='html.parser',
-                 ignore_links=True
              )

-             # If depth > 0, extract links and process them
-             if current_depth > 0:
-                 links = LinkExtractor.scrape_url(url, link_type=LinkType.INTERNAL)

-                 for link in links:
-                     if link not in visited_urls:
-                         markdown_content += f"\n\n### Extracted from: {link}\n"
-                         markdown_content += recursive_scrape(link, current_depth - 1)

-             return markdown_content

-         # Start the recursive scraping process
-         result = recursive_scrape(url, depth)
-         return result

-     except Exception as e:
-         return f"Error: {str(e)}"
-
- # Define Gradio interface
- iface = gr.Interface(
-     fn=process_input,
-     inputs=[
-         gr.Textbox(label="Enter URL or GitHub Repository",
-                    placeholder="https://example.com or username/repo"),
-         gr.Slider(minimum=0, maximum=3, step=1, value=0,
-                   label="Search Depth (0 = Only main page, ignored for GitHub repos)"),
-         gr.Radio(
-             choices=["auto", "website", "github"],
-             value="auto",
-             label="Input Type",
-             info="Auto will detect GitHub repos automatically"
          )
-     ],
-     outputs=gr.Code(label="Output", language="markdown"),
-     title="RAGScraper with GitHub Repository Support",
-     description=(
-         "Enter a URL to scrape a website, or a GitHub repository URL/ID (e.g., 'username/repo') "
-         "to use Repomix for repository processing. "
-         "For websites, you can specify the search depth for recursive scraping."
-     ),
-     examples=[
-         ["https://example.com", 0, "auto"],
-         ["yamadashy/repomix", 0, "auto"],
-         ["https://github.com/yamadashy/repomix", 0, "auto"]
-     ]
- )
-
- # Launch the Gradio app
  if __name__ == "__main__":
      iface.launch()
  import os
  import re
  import tempfile
+ import json
+ import csv
  from rag_scraper.scraper import Scraper
  from rag_scraper.converter import Converter
  from rag_scraper.link_extractor import LinkExtractor, LinkType


  def is_github_repo(url_or_id):
      """Check if the input is a GitHub repository URL or ID."""
      if "github.com" in url_or_id:
          return True
      if re.match(r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$', url_or_id):
          return True
      return False
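A quick illustration (not part of `app.py`) of what this detection accepts: full GitHub URLs and `username/repo` shorthand count as repositories, everything else falls through to webpage scraping.

```python
import re

REPO_ID_PATTERN = r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$'  # same shorthand pattern app.py uses

def looks_like_github(url_or_id: str) -> bool:
    return "github.com" in url_or_id or bool(re.match(REPO_ID_PATTERN, url_or_id))

for candidate in ["https://github.com/yamadashy/repomix", "yamadashy/repomix", "https://example.com"]:
    print(candidate, "->", looks_like_github(candidate))  # True, True, False
```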
  def check_repomix_installed():
      """Check if Repomix is installed."""
      try:
+         result = subprocess.run(["repomix", "--version"],
                                  capture_output=True, text=True, check=False)
          return result.returncode == 0
      except Exception:
          return False

+ def run_repomix(repo_url_or_id, progress=gr.Progress(track_tqdm=True)):
      """Run Repomix on the GitHub repository and return the content."""
+     progress(0, desc="Starting Repomix processing...")
      try:
          with tempfile.TemporaryDirectory() as temp_dir:
+             # RepoMix typically outputs a zip file if no single output style is specified,
+             # or a specific file if --style is used.
+             # Here we request markdown and direct it to a known file.
+             output_file_name = "repomix-output.md"  # Assuming markdown output
+             output_file_path = os.path.join(temp_dir, output_file_name)

              if '/' in repo_url_or_id and not repo_url_or_id.startswith('http'):
                  repo_url = f"https://github.com/{repo_url_or_id}"
              else:
                  repo_url = repo_url_or_id

+             progress(0.2, desc=f"Running Repomix on {repo_url}...")
              cmd = [
+                 "repomix",
                  "--remote", repo_url,
+                 "--output", output_file_path,  # Direct output to a file
+                 "--style", "markdown",  # Explicitly request markdown
+                 "--compress"
              ]

              process = subprocess.run(cmd, capture_output=True, text=True, check=False)
+             progress(0.8, desc="Repomix command executed.")
+
              if process.returncode != 0:
+                 return f"Error running Repomix: {process.stderr}", None

+             if os.path.exists(output_file_path):
+                 with open(output_file_path, 'r', encoding='utf-8') as f:
+                     content = f.read()
+                 progress(1, desc="Repomix output processed.")
+                 return content, output_file_path  # Return content and path for potential download
              else:
+                 return "Error: Repomix did not generate an output file.", None

      except Exception as e:
+         progress(1, desc="Error during Repomix processing.")
+         return f"Error processing GitHub repository: {str(e)}", None
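For reference, a standalone sketch (not part of the commit) of the same Repomix invocation that `run_repomix` builds, stripped of the Gradio progress plumbing. It assumes `repomix` is on the `PATH`, as the Dockerfile arranges via `npm install -g repomix`.

```python
import os
import subprocess
import tempfile

def fetch_repo_markdown(repo: str) -> str:
    """Return a repository's RepoMix markdown, raising on failure."""
    repo_url = repo if repo.startswith("http") else f"https://github.com/{repo}"
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "repomix-output.md")
        cmd = ["repomix", "--remote", repo_url,
               "--output", out_path, "--style", "markdown", "--compress"]
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr)
        with open(out_path, "r", encoding="utf-8") as f:
            return f.read()

if __name__ == "__main__":
    print(fetch_repo_markdown("yamadashy/repomix")[:300])
```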
+ def scrape_and_convert_website(url, depth, progress=gr.Progress(track_tqdm=True)):
+     """Fetch HTML, extract links, convert to Markdown."""
+     progress(0, desc=f"Starting web scrape for {url}...")
+     visited_urls = set()
+     all_markdown_content = ""
+
+     def recursive_scrape(current_url, current_depth, total_links_estimate=1, link_index=0):
+         if current_url in visited_urls or current_depth < 0:
+             return ""
+
+         visited_urls.add(current_url)
+
+         try:
+             progress_val = link_index / total_links_estimate if total_links_estimate > 0 else 0
+             progress(progress_val, desc=f"Scraping: {current_url} (Depth: {depth - current_depth})")
+             html_content = Scraper.fetch_html(current_url)
+         except Exception as e:
+             return f"Error fetching {current_url}: {str(e)}\n"
+
+         markdown_content = f"## Extracted from: {current_url}\n\n"
+         markdown_content += Converter.html_to_markdown(
+             html=html_content,
+             base_url=current_url,
+             parser_features='html.parser',
+             ignore_links=True
+         )
+
+         page_content = markdown_content + "\n\n"
+
+         if current_depth > 0:
+             try:
+                 links = LinkExtractor.scrape_url(current_url, link_type=LinkType.INTERNAL)
+                 # Filter out already visited links and external links more carefully
+                 valid_links = [
+                     link for link in links
+                     if URLUtils.is_internal(link, current_url) and link not in visited_urls
+                 ]
+
+                 num_links = len(valid_links)
+                 for i, link_url in enumerate(valid_links):
+                     page_content += recursive_scrape(link_url, current_depth - 1, num_links, i)
+             except Exception as e:
+                 page_content += f"Error extracting links from {current_url}: {str(e)}\n"
+         return page_content
+
+     all_markdown_content = recursive_scrape(url, depth)
+     progress(1, desc="Web scraping complete.")
+
+     # For web scraping, we create a temporary file with the content for download
+     with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=".md", encoding="utf-8") as tmp_file:
+         tmp_file.write(all_markdown_content)
+     return all_markdown_content, tmp_file.name
+
+
+ # --- Data Conversion Functions (Stubs for now) ---
+ def convert_to_json(markdown_content, source_url_or_id):
+     """Converts markdown content to a JSON string."""
+     # Basic implementation: create a JSON object with source and content
+     # More sophisticated parsing can be added later
+     data = {"source": source_url_or_id, "content": markdown_content}
+     return json.dumps(data, indent=2)

+ def convert_to_csv(markdown_content, source_url_or_id):
+     """Converts markdown content to a CSV file and returns its path."""
+     # Basic implementation: create a CSV with source and content
+     # This is a simplified CSV; real datasets might need more structure
+     output = tempfile.NamedTemporaryFile(mode='w+', delete=False, newline='', suffix=".csv", encoding="utf-8")
+     writer = csv.writer(output)
+     writer.writerow(["source", "content"])  # Header
+
+     # Content could be split into chunks or lines if necessary for CSV;
+     # for now, all content goes into one cell.
+     writer.writerow([source_url_or_id, markdown_content])
+     output.close()
+     return output.name  # Return path to the CSV file
+
+ def save_output_to_file(content, output_format, source_url_or_id):
+     """Saves content to a temporary file based on format and returns its path."""
+     suffix = f".{output_format.lower()}"
+     if output_format == "JSON":
+         processed_content = convert_to_json(content, source_url_or_id)
+     elif output_format == "CSV":
+         # convert_to_csv returns a path directly
+         return convert_to_csv(content, source_url_or_id)
+     else:  # Markdown/Text
+         processed_content = content
+         suffix = ".md"
+
+     with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=suffix, encoding="utf-8") as tmp_file:
+         tmp_file.write(processed_content)
+     return tmp_file.name
+
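`convert_to_csv` keeps the entire markdown document in a single cell. A self-contained sketch (not part of the commit) of the resulting layout:

```python
# Illustrates the two-row CSV shape produced by convert_to_csv():
# a ("source", "content") header and one data row with all markdown in one cell.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["source", "content"])
writer.writerow(["https://example.com", "## Extracted from: https://example.com\n\nSome markdown..."])
print(buf.getvalue())
```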
+ # --- Main Processing Function ---
+ def process_input_updated(url_or_id, source_type, depth, output_format_selection, progress=gr.Progress(track_tqdm=True)):
+     """Main function to process URL or GitHub repo based on selected type and format."""
+     progress(0, desc="Initializing...")
+     raw_content = ""
+     error_message = ""
+     output_file_path = None
+
+     if source_type == "GitHub Repository":
+         if not check_repomix_installed():
+             error_message = "Repomix is not installed or not accessible. Please ensure it's installed globally in your Docker environment."
+             return error_message, None, None  # Text output, Preview, File output
+
+         raw_content, _ = run_repomix(url_or_id, progress=progress)  # Repomix returns content and its original path
+         if "Error" in raw_content:  # Simple error check
+             error_message = raw_content
+             raw_content = ""
+
+     elif source_type == "Webpage":
+         raw_content, _ = scrape_and_convert_website(url_or_id, depth, progress=progress)
+         if "Error" in raw_content:  # Simple error check
+             error_message = raw_content
+             raw_content = ""
+     else:
+         error_message = "Invalid source type selected."
+         return error_message, None, None
+
+     if error_message:
+         return error_message, None, None  # Error text, no preview, no file
+
+     # Save raw_content (which is markdown) to a file of the chosen output_format.
+     # This handles conversion if necessary.
      try:
+         progress(0.9, desc=f"Converting to {output_format_selection}...")
+         output_file_path = save_output_to_file(raw_content, output_format_selection, url_or_id)
+
+         # For preview, show the raw markdown, or a snippet of JSON/CSV
+         preview_content = raw_content  # Default to markdown
+         if output_format_selection == "JSON":
+             preview_content = convert_to_json(raw_content, url_or_id)
+         elif output_format_selection == "CSV":
+             # For CSV, just point at the generated file; the download holds the full data
+             preview_content = f"CSV file generated. Path: {output_file_path}"
+             # Or read a bit of the CSV for preview:
+             # with open(output_file_path, 'r', encoding='utf-8') as f_csv:
+             #     preview_content = "".join(f_csv.readlines()[:5])

+         progress(1, desc="Processing complete.")
+         return f"Successfully processed: {url_or_id}", preview_content, output_file_path
+     except Exception as e:
+         return f"Error during file conversion/saving: {str(e)}", raw_content, None

+
+ # --- Gradio Interface Definition ---
+ with gr.Blocks(theme=gr.themes.Soft()) as iface:
+     gr.Markdown("# RAG-Ready Content Scraper")
+     gr.Markdown(
+         "Scrape webpage content (using RAG-Scraper) or GitHub repositories (using RepoMix) "
+         "to generate RAG-ready datasets. Uses Docker for full functionality on HuggingFace Spaces."
+     )
+
+     with gr.Row():
+         with gr.Column(scale=2):
+             url_input = gr.Textbox(
+                 label="Enter URL or GitHub Repository ID",
+                 placeholder="e.g., https://example.com OR username/repo"
+             )
+             source_type_input = gr.Radio(
+                 choices=["Webpage", "GitHub Repository"],
+                 value="Webpage",
+                 label="Select Source Type"
+             )
+             depth_input = gr.Slider(
+                 minimum=0, maximum=3, step=1, value=0,
+                 label="Scraping Depth (for Webpages)",
+                 info="0: Only main page. Ignored for GitHub repos."
+             )
+             output_format_input = gr.Dropdown(
+                 choices=["Markdown", "JSON", "CSV"],  # Markdown behaves like a plain text file
+                 value="Markdown",
+                 label="Select Output Format"
              )
+             submit_button = gr.Button("Process Content", variant="primary")
+
+         with gr.Column(scale=3):
+             status_output = gr.Textbox(label="Status", interactive=False)
+             preview_output = gr.Code(label="Preview Content", language="markdown", interactive=False)  # Default to markdown, can show JSON too
+             file_download_output = gr.File(label="Download Processed File", interactive=False)

+     # --- Examples ---
+     gr.Examples(
+         examples=[
+             ["https://gradio.app/docs/js", "Webpage", 1, "Markdown"],
+             ["gradio-app/gradio", "GitHub Repository", 0, "Markdown"],
+             ["https://en.wikipedia.org/wiki/Retrieval-augmented_generation", "Webpage", 0, "JSON"],
+         ],
+         inputs=[url_input, source_type_input, depth_input, output_format_input],
+         outputs=[status_output, preview_output, file_download_output],  # Must match the function's outputs
+         fn=process_input_updated,
+         cache_examples=False  # For development, disable caching
+     )
+
+     # --- How it Works & GitHub Link ---
+     with gr.Accordion("How it Works & More Info", open=False):
+         gr.Markdown(
+             """
+             **Webpage Scraping:**
+             1. Enter a full URL (e.g., `https://example.com`).
+             2. Select "Webpage" as the source type.
+             3. Set the desired scraping depth (how many levels of internal links to follow).
+             4. Choose your output format.
+             5. The tool fetches HTML, converts it to Markdown, and follows internal links up to the specified depth.

+             **GitHub Repository Processing:**
+             1. Enter a GitHub repository URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
+             2. Select "GitHub Repository" as the source type. (Scraping depth is ignored.)
+             3. Choose your output format.
+             4. The tool uses **RepoMix** to fetch and process the repository into a structured Markdown format.

+             **Output Formats:**
+             - **Markdown:** Plain-text Markdown file, suitable for direct reading or further processing.
+             - **JSON:** Structured JSON output, typically with fields like `source` and `content`.
+             - **CSV:** Comma-Separated Values file, useful for tabular data or importing into spreadsheets.

+             **Note on HuggingFace Spaces:** This application is designed to run in a Docker-based HuggingFace Space,
+             which allows the use of `RepoMix` for GitHub repositories.
+
+             [View Source Code on HuggingFace Spaces](https://huggingface.co/spaces/CultriX/RAG-Scraper)
+             """
          )
+
+     # gr.Progress is injected automatically via the handler's default argument,
+     # so it must not be passed as an input component here.
+     submit_button.click(
+         fn=process_input_updated,
+         inputs=[url_input, source_type_input, depth_input, output_format_input],
+         outputs=[status_output, preview_output, file_download_output]
+     )
+
  if __name__ == "__main__":
      iface.launch()
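Once deployed, the Space can also be called programmatically. A hedged sketch (not part of the commit) using the `gradio_client` package; the endpoint name is an assumption based on Gradio's default of naming Blocks events after the handler function, so check the Space's "Use via API" page before relying on it.

```python
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")
status, preview, file_path = client.predict(
    "https://example.com",   # url_or_id
    "Webpage",               # source_type
    0,                       # depth
    "Markdown",              # output_format_selection
    api_name="/process_input_updated",  # assumed default endpoint name
)
print(status)
print(preview[:500])
```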