Deploy RAG-Scraper application to HuggingFace Space
Files changed:
- Dockerfile   +37  -0
- README.md    +58  -88
- app.py       +245 -143
Dockerfile — ADDED (37 lines)

```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Set the working directory in the container
WORKDIR /app

# Install system dependencies for Node.js installation
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Add Node.js LTS repository and install Node.js and npm
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
    && apt-get install -y nodejs

# Install repomix globally using npm
RUN npm install -g repomix

# Copy the requirements file into the container
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY . .

# Make port 7860 available to the world outside this container
EXPOSE 7860

# Define environment variable for Gradio server
ENV GRADIO_SERVER_NAME="0.0.0.0"
ENV GRADIO_SERVER_PORT="7860"

# Run app.py when the container launches
CMD ["python", "app.py"]
```
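The two `ENV` lines work because Gradio reads `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` at startup, so `iface.launch()` in `app.py` binds to `0.0.0.0:7860`, the address and port HuggingFace Spaces expects, without extra arguments. A minimal sketch of the equivalent explicit configuration, shown for illustration only and not part of the repository:

```python
import os
import gradio as gr

# Passing server_name/server_port explicitly is equivalent to relying on the
# GRADIO_SERVER_NAME / GRADIO_SERVER_PORT environment variables set in the Dockerfile.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(
    server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
)
```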
README.md — CHANGED (115 → 85 lines)
Removed from the previous README: the old front matter (`title: RAG-Scraper`, `sdk_version: 5.29.1`, `short_description: Scrape webpages`), the old feature, requirements, and examples lists, the previous "Using the Interface" walkthrough (URL or GitHub repository input, search depth 0–3, and an auto/website/github input-type selector), and a limitations section explaining that GitHub repository processing was unavailable on HuggingFace Spaces because Node.js and npm/npx could not be executed there, so Repomix only worked in local installs. The Docker-based deployment introduced by this commit makes that limitations section obsolete. The rewritten README follows:
---
title: RAG-Ready Content Scraper
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: MIT
short_description: Scrape webpages or GitHub repos to generate RAG-ready datasets.
---

# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool, enhanced with a Gradio interface and Docker support, designed for efficiently scraping web content and GitHub repositories. It's tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats like Markdown, JSON, and CSV.

This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.

## Features

- **Dual Scraping Modes**:
  - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: A "How it Works" section provides guidance.

## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for Repomix GitHub repository processing)
- Repomix (can be installed globally with `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`

## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.

1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space builds the Docker image and launches the application. All features, including GitHub repository processing with RepoMix, will be available.

## Using the Interface

1. **Enter URL or GitHub Repository ID**:
   * For websites: enter a complete URL (e.g., `https://example.com`).
   * For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2. **Select Source Type**:
   * Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (for webpages only):
   * 0: only scrape the main page.
   * 1–3: follow internal links recursively to the specified depth (ignored for GitHub repos).
4. **Select Output Format**:
   * Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: monitor progress and see a preview of the extracted content.
7. **Download File**: download the generated dataset in your chosen format.
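Beyond the UI, the Space can also be driven programmatically with the `gradio_client` library. The sketch below is an illustration under stated assumptions: the endpoint name `/process_input_updated` is a guess based on Gradio's default of naming API endpoints after the handler function, so check the Space's "Use via API" page for the actual name before relying on it.

```python
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")
# Arguments mirror the UI inputs: URL or repo ID, source type, depth, output format.
# api_name is an assumption (Gradio's default is the handler function's name).
status, preview, file_path = client.predict(
    "https://example.com", "Webpage", 0, "Markdown",
    api_name="/process_input_updated",
)
print(status)
print(file_path)
```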
## How It Works

### Webpage Scraping

1. Fetches HTML content from the provided URL.
2. Converts the HTML to clean Markdown.
3. If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps it as Markdown).
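In `app.py` this loop reduces to a fetch-then-convert pair on each page. The sketch below uses the same `rag_scraper` calls the app makes (`Scraper.fetch_html`, `Converter.html_to_markdown`, `LinkExtractor.scrape_url`); it is a simplified single-page illustration, not the full recursive implementation:

```python
from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter
from rag_scraper.link_extractor import LinkExtractor, LinkType

url = "https://example.com"
html = Scraper.fetch_html(url)                        # 1. fetch the HTML
markdown = Converter.html_to_markdown(                # 2. convert it to Markdown
    html=html,
    base_url=url,
    parser_features="html.parser",
    ignore_links=True,
)
links = LinkExtractor.scrape_url(url, link_type=LinkType.INTERNAL)  # 3. links to follow if depth > 0
print(markdown[:200])
print(list(links)[:5])
```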
### GitHub Repository Processing

1. Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
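Under the hood this is a single RepoMix invocation. The sketch below mirrors the command `app.py` builds (`--remote`, `--output`, `--style markdown`, `--compress`) and assumes `repomix` is on the PATH, as it is inside the Docker image:

```python
import os
import subprocess
import tempfile

repo_url = "https://github.com/yamadashy/repomix"
out_path = os.path.join(tempfile.mkdtemp(), "repomix-output.md")

# Same flags app.py uses; check=False so stderr can be inspected on failure.
result = subprocess.run(
    ["repomix", "--remote", repo_url, "--output", out_path, "--style", "markdown", "--compress"],
    capture_output=True, text=True, check=False,
)
if result.returncode == 0 and os.path.exists(out_path):
    with open(out_path, encoding="utf-8") as f:
        print(f.read()[:300])
else:
    print("Repomix failed:", result.stderr)
```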
## Source Code

The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)

## License

This project is licensed under the MIT License.
app.py — CHANGED (204 → 306 lines)
Removed from the previous app.py: the `extract_repo_info()` and `is_running_on_huggingface()` helpers (the latter checked the `SPACE_ID` environment variable), the guard that made `check_repomix_installed()` always return False when running on HuggingFace Spaces, the old `process_input()` handler, and the old `gr.Interface`-based UI (a URL/repository textbox, a 0–3 search-depth slider, an auto/website/github input-type radio, and examples such as `https://example.com` and `yamadashy/repomix`). The rewritten module keeps `is_github_repo()` and `check_repomix_installed()`, adds `json` and `csv` imports, extends `run_repomix()` and the web-scraping logic with progress reporting, adds JSON/CSV conversion helpers, and rebuilds the UI with `gr.Blocks`. The new file follows; the leading `import gradio as gr` and `import subprocess` lines sit outside the diff hunks (unchanged) and are included so the listing reads as a complete file:
```python
import gradio as gr
import subprocess
import os
import re
import tempfile
import json
import csv
from rag_scraper.scraper import Scraper
from rag_scraper.converter import Converter
from rag_scraper.link_extractor import LinkExtractor, LinkType
from rag_scraper.utils import URLUtils

def is_github_repo(url_or_id):
    """Check if the input is a GitHub repository URL or ID."""
    if "github.com" in url_or_id:
        return True
    if re.match(r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$', url_or_id):
        return True
    return False

def check_repomix_installed():
    """Check if Repomix is installed."""
    try:
        result = subprocess.run(["repomix", "--version"],
                                capture_output=True, text=True, check=False)
        return result.returncode == 0
    except Exception:
        return False

def run_repomix(repo_url_or_id, progress=gr.Progress(track_tqdm=True)):
    """Run Repomix on the GitHub repository and return the content."""
    progress(0, desc="Starting Repomix processing...")
    try:
        with tempfile.TemporaryDirectory() as temp_dir:
            # RepoMix typically outputs a zip file if not specifying a single output style,
            # or a specific file if --style is used.
            # For simplicity, let's assume we want markdown and it outputs to a known file or stdout.
            # The current repomix command in the original script uses --style markdown and --output.
            output_file_name = "repomix-output.md"  # Assuming markdown output
            output_file_path = os.path.join(temp_dir, output_file_name)

            if '/' in repo_url_or_id and not repo_url_or_id.startswith('http'):
                repo_url = f"https://github.com/{repo_url_or_id}"
            else:
                repo_url = repo_url_or_id

            progress(0.2, desc=f"Running Repomix on {repo_url}...")
            cmd = [
                "repomix",
                "--remote", repo_url,
                "--output", output_file_path,  # Direct output to a file
                "--style", "markdown",         # Explicitly request markdown
                "--compress"
            ]

            process = subprocess.run(cmd, capture_output=True, text=True, check=False)
            progress(0.8, desc="Repomix command executed.")

            if process.returncode != 0:
                return f"Error running Repomix: {process.stderr}", None

            if os.path.exists(output_file_path):
                with open(output_file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                progress(1, desc="Repomix output processed.")
                return content, output_file_path  # Return content and path for potential download
            else:
                return "Error: Repomix did not generate an output file.", None

    except Exception as e:
        progress(1, desc="Error during Repomix processing.")
        return f"Error processing GitHub repository: {str(e)}", None

def scrape_and_convert_website(url, depth, progress=gr.Progress(track_tqdm=True)):
    """Fetch HTML, extract links, convert to Markdown."""
    progress(0, desc=f"Starting web scrape for {url}...")
    visited_urls = set()
    all_markdown_content = ""

    def recursive_scrape(current_url, current_depth, total_links_estimate=1, link_index=0):
        if current_url in visited_urls or current_depth < 0:
            return ""

        visited_urls.add(current_url)

        try:
            progress_val = link_index / total_links_estimate if total_links_estimate > 0 else 0
            progress(progress_val, desc=f"Scraping: {current_url} (Depth: {depth - current_depth})")
            html_content = Scraper.fetch_html(current_url)
        except Exception as e:
            return f"Error fetching {current_url}: {str(e)}\n"

        markdown_content = f"## Extracted from: {current_url}\n\n"
        markdown_content += Converter.html_to_markdown(
            html=html_content,
            base_url=current_url,
            parser_features='html.parser',
            ignore_links=True
        )

        page_content = markdown_content + "\n\n"

        if current_depth > 0:
            try:
                links = LinkExtractor.scrape_url(current_url, link_type=LinkType.INTERNAL)
                # Filter out already visited links and external links more carefully
                valid_links = [
                    link for link in links
                    if URLUtils.is_internal(link, current_url) and link not in visited_urls
                ]

                num_links = len(valid_links)
                for i, link_url in enumerate(valid_links):
                    page_content += recursive_scrape(link_url, current_depth - 1, num_links, i)
            except Exception as e:
                page_content += f"Error extracting links from {current_url}: {str(e)}\n"
        return page_content

    all_markdown_content = recursive_scrape(url, depth)
    progress(1, desc="Web scraping complete.")

    # For web scraping, we create a temporary file with the content for download
    with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=".md", encoding="utf-8") as tmp_file:
        tmp_file.write(all_markdown_content)
        return all_markdown_content, tmp_file.name


# --- Data Conversion Functions (Stubs for now) ---
def convert_to_json(markdown_content, source_url_or_id):
    """Converts markdown content to a JSON string."""
    # Basic implementation: create a JSON object with source and content
    # More sophisticated parsing can be added later
    data = {"source": source_url_or_id, "content": markdown_content}
    return json.dumps(data, indent=2)

def convert_to_csv(markdown_content, source_url_or_id):
    """Converts markdown content to a CSV string."""
    # Basic implementation: create a CSV with source and content
    # This is a simplified CSV; real CSVs might need more structure
    output = tempfile.NamedTemporaryFile(mode='w+', delete=False, newline='', suffix=".csv", encoding="utf-8")
    writer = csv.writer(output)
    writer.writerow(["source", "content"])  # Header

    # Split content into manageable chunks or lines if necessary for CSV
    # For now, putting all content in one cell.
    writer.writerow([source_url_or_id, markdown_content])
    output.close()
    return output.name  # Return path to the CSV file

def save_output_to_file(content, output_format, source_url_or_id):
    """Saves content to a temporary file based on format and returns its path."""
    suffix = f".{output_format.lower()}"
    if output_format == "JSON":
        processed_content = convert_to_json(content, source_url_or_id)
    elif output_format == "CSV":
        # convert_to_csv now returns a path directly
        return convert_to_csv(content, source_url_or_id)
    else:  # Markdown/Text
        processed_content = content
        suffix = ".md"

    with tempfile.NamedTemporaryFile(mode="w+", delete=False, suffix=suffix, encoding="utf-8") as tmp_file:
        tmp_file.write(processed_content)
        return tmp_file.name

# --- Main Processing Function ---
def process_input_updated(url_or_id, source_type, depth, output_format_selection, progress=gr.Progress(track_tqdm=True)):
    """Main function to process URL or GitHub repo based on selected type and format."""
    progress(0, desc="Initializing...")
    raw_content = ""
    error_message = ""
    output_file_path = None

    if source_type == "GitHub Repository":
        if not check_repomix_installed():
            error_message = "Repomix is not installed or not accessible. Please ensure it's installed globally in your Docker environment."
            return error_message, None, None  # Text output, Preview, File output

        raw_content, _ = run_repomix(url_or_id, progress=progress)  # Repomix returns content and its original path
        if "Error" in raw_content:  # Simple error check
            error_message = raw_content
            raw_content = ""

    elif source_type == "Webpage":
        raw_content, _ = scrape_and_convert_website(url_or_id, depth, progress=progress)
        if "Error" in raw_content:  # Simple error check
            error_message = raw_content
            raw_content = ""
    else:
        error_message = "Invalid source type selected."
        return error_message, None, None

    if error_message:
        return error_message, None, None  # Error text, no preview, no file

    # Save raw_content (which is markdown) to a file of the chosen output_format
    # This will handle conversion if necessary
    try:
        progress(0.9, desc=f"Converting to {output_format_selection}...")
        output_file_path = save_output_to_file(raw_content, output_format_selection, url_or_id)

        # For preview, we'll show the raw markdown, or a snippet of JSON/CSV
        preview_content = raw_content  # Default to markdown
        if output_format_selection == "JSON":
            preview_content = convert_to_json(raw_content, url_or_id)
        elif output_format_selection == "CSV":
            # For CSV preview, maybe just show a message or first few lines
            preview_content = f"CSV file generated. Path: {output_file_path}\nFirst few lines might be shown here in a real app."
            # Or read a bit of the CSV for preview:
            # with open(output_file_path, 'r', encoding='utf-8') as f_csv:
            #     preview_content = "".join(f_csv.readlines()[:5])

        progress(1, desc="Processing complete.")
        return f"Successfully processed: {url_or_id}", preview_content, output_file_path
    except Exception as e:
        return f"Error during file conversion/saving: {str(e)}", raw_content, None


# --- Gradio Interface Definition ---
with gr.Blocks(theme=gr.themes.Soft()) as iface:
    gr.Markdown("# RAG-Ready Content Scraper")
    gr.Markdown(
        "Scrape webpage content (using RAG-scraper) or GitHub repositories (using RepoMix) "
        "to generate RAG-ready datasets. Uses Docker for full functionality on HuggingFace Spaces."
    )

    with gr.Row():
        with gr.Column(scale=2):
            url_input = gr.Textbox(
                label="Enter URL or GitHub Repository ID",
                placeholder="e.g., https://example.com OR username/repo"
            )
            source_type_input = gr.Radio(
                choices=["Webpage", "GitHub Repository"],
                value="Webpage",
                label="Select Source Type"
            )
            depth_input = gr.Slider(
                minimum=0, maximum=3, step=1, value=0,
                label="Scraping Depth (for Webpages)",
                info="0: Only main page. Ignored for GitHub repos."
            )
            output_format_input = gr.Dropdown(
                choices=["Markdown", "JSON", "CSV"],  # Markdown is like text file
                value="Markdown",
                label="Select Output Format"
            )
            submit_button = gr.Button("Process Content", variant="primary")

        with gr.Column(scale=3):
            status_output = gr.Textbox(label="Status", interactive=False)
            preview_output = gr.Code(label="Preview Content", language="markdown", interactive=False)  # Default to markdown, can show JSON too
            file_download_output = gr.File(label="Download Processed File", interactive=False)

    # Progress updates are driven by the gr.Progress(track_tqdm=True) default
    # argument on the handler functions; no separate progress component is needed.

    # --- Examples ---
    gr.Examples(
        examples=[
            ["https://gradio.app/docs/js", "Webpage", 1, "Markdown"],
            ["gradio-app/gradio", "GitHub Repository", 0, "Markdown"],
            ["https://en.wikipedia.org/wiki/Retrieval-augmented_generation", "Webpage", 0, "JSON"],
        ],
        inputs=[url_input, source_type_input, depth_input, output_format_input],
        outputs=[status_output, preview_output, file_download_output],  # Function needs to match this
        fn=process_input_updated,  # Make sure the function signature matches
        cache_examples=False  # For development, disable caching
    )

    # --- How it Works & GitHub Link ---
    with gr.Accordion("How it Works & More Info", open=False):
        gr.Markdown(
            """
            **Webpage Scraping:**
            1. Enter a full URL (e.g., `https://example.com`).
            2. Select "Webpage" as the source type.
            3. Set the desired scraping depth (how many levels of internal links to follow).
            4. Choose your output format.
            5. The tool fetches HTML, converts it to Markdown, and follows internal links up to the specified depth.

            **GitHub Repository Processing:**
            1. Enter a GitHub repository URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
            2. Select "GitHub Repository" as the source type. (Scraping depth is ignored).
            3. Choose your output format.
            4. The tool uses **RepoMix** to fetch and process the repository into a structured Markdown format.

            **Output Formats:**
            - **Markdown:** Plain text Markdown file, suitable for direct reading or further processing.
            - **JSON:** Structured JSON output, typically with fields like `source` and `content`.
            - **CSV:** Comma-Separated Values file, useful for tabular data or importing into spreadsheets.

            **Note on HuggingFace Spaces:** This application is designed to run in a Docker-based HuggingFace Space,
            which allows the use of `RepoMix` for GitHub repositories.

            [View Source Code on HuggingFace Spaces](https://huggingface.co/spaces/CultriX/RAG-Scraper)
            """
        )

    submit_button.click(
        fn=process_input_updated,
        # progress is injected through the handler's gr.Progress default argument,
        # so only the four UI components are passed as inputs.
        inputs=[url_input, source_type_input, depth_input, output_format_input],
        outputs=[status_output, preview_output, file_download_output]
    )

if __name__ == "__main__":
    iface.launch()
```
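As a quick sanity check of the conversion helpers above, they can be exercised directly from a Python shell in the same environment. This snippet is illustrative only and not part of the commit; it assumes `app.py` and its dependencies (`gradio`, `rag_scraper`) are importable from the working directory:

```python
# Importing app builds the Blocks UI but does not launch it (no __main__ guard trigger).
from app import convert_to_json, save_output_to_file

md = "## Extracted from: https://example.com\n\nHello world."
print(convert_to_json(md, "https://example.com"))              # JSON with "source" / "content" fields
print(save_output_to_file(md, "JSON", "https://example.com"))  # path to a temporary .json file
```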