---
title: RAG-Ready Content Scraper
emoji: πŸš€
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---

# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support for efficiently scraping web content and GitHub repositories. Built for Retrieval-Augmented Generation (RAG) pipelines, it extracts and preprocesses text into structured, RAG-ready formats: Markdown, JSON, and CSV.

This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.

## Features

- **Dual Scraping Modes**:
    - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
    - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: "How it Works" section provides guidance.
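To illustrate the "Multiple Output Formats" feature, scraped Markdown can be flattened into JSON or CSV records. The sketch below is illustrative only, not the app's actual conversion logic; the `markdown_to_records` helper and its heading-based splitting are assumptions.

```python
import csv
import io
import json
import re

def markdown_to_records(markdown_text, source_url):
    """Split Markdown into one record per top-level heading (illustrative)."""
    sections = re.split(r"^# ", markdown_text, flags=re.MULTILINE)
    records = []
    for section in sections:
        if not section.strip():
            continue
        title, _, body = section.partition("\n")
        records.append({
            "source": source_url,
            "title": title.strip(),
            "content": body.strip(),
        })
    return records

def records_to_json(records):
    return json.dumps(records, indent=2)

def records_to_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "title", "content"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

md = "# Intro\nHello.\n# Usage\nRun it."
recs = markdown_to_records(md, "https://example.com")
print(len(recs))  # 2 records, one per top-level heading
```

A real converter would also need to handle pages without headings, nested heading levels, and CSV escaping of embedded newlines.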

## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for Repomix GitHub repository processing)
- Repomix (install globally with `npm install -g repomix`, or invoke it via `npx repomix` if `app.py` is adjusted accordingly)
- Project dependencies: `pip install -r requirements.txt`

## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.

1.  **Create a new HuggingFace Space.**
2.  Choose **"Docker"** as the Space SDK.
3.  Select **"Use an existing Dockerfile"**.
4.  Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5.  The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.

## Using the Interface

1.  **Enter URL or GitHub Repository ID**:
    *   For websites: Enter a complete URL (e.g., `https://example.com`).
    *   For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2.  **Select Source Type**:
    *   Choose "Webpage" or "GitHub Repository".
3.  **Set Scraping Depth** (for Webpages only):
    *   0: Only scrape the main page.
    *   1-3: Follow internal links recursively to the specified depth. (Ignored for GitHub repos).
4.  **Select Output Format**:
    *   Choose "Markdown", "JSON", or "CSV".
5.  **Click "Process Content"**.
6.  **View Status and Preview**: Monitor progress and see a preview of the extracted content.
7.  **Download File**: Download the generated dataset in your chosen format.

## How It Works

### Webpage Scraping

1.  Fetches HTML content from the provided URL.
2.  Converts HTML to clean Markdown.
3.  If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
4.  Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps as Markdown).
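The link-following in step 3 can be sketched with the standard library alone: parse anchors out of the HTML, resolve them against the page URL, and keep only same-host links. This is a minimal sketch of one recursion step, not the project's actual implementation; the `extract_internal_links` name is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_internal_links(html, base_url):
    """Return absolute URLs on the same host as base_url (one recursion step)."""
    parser = LinkExtractor()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == base_host:  # internal links only
            links.add(absolute)
    return sorted(links)

html = '<a href="/docs">Docs</a> <a href="https://other.example/x">Out</a>'
print(extract_internal_links(html, "https://example.com/"))
# ['https://example.com/docs']
```

At depth > 1, each returned URL would be fetched and fed back through the same function until the configured depth is reached, with a visited set to avoid cycles.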

### GitHub Repository Processing

1.  Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
2.  RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
3.  This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
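Invoking RepoMix from Python might look like the following subprocess sketch. The exact flags `app.py` uses are not shown here; `--remote`, `--style`, and `--output` reflect RepoMix's documented CLI but should be treated as assumptions.

```python
import subprocess

def build_repomix_command(repo, output_path="repomix-output.md"):
    """Build a RepoMix invocation for a remote repository (flags assumed)."""
    return [
        "npx", "repomix",
        "--remote", repo,       # GitHub URL or username/repo shorthand
        "--style", "markdown",  # consolidated Markdown output
        "--output", output_path,
    ]

def run_repomix(repo):
    """Run RepoMix; requires Node.js/npx on PATH."""
    # check=True raises CalledProcessError if RepoMix exits nonzero
    subprocess.run(build_repomix_command(repo), check=True)

print(build_repomix_command("username/repo"))
```

Running via `npx` avoids a global install at the cost of a slower first invocation while the package is fetched.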

## Source Code

The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)

## License

This project is licensed under the MIT License.