---
title: RAG-Ready Content Scraper
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support for scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems: it extracts and preprocesses text into structured, RAG-ready formats (Markdown, JSON, and CSV).

This version is designed to be deployed as a HuggingFace Space using Docker, which enables full functionality, including GitHub repository processing via Repomix.
## Features
- **Dual Scraping Modes**:
    - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
    - **GitHub Repository Processing**: Processes GitHub repositories using **Repomix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: "How it Works" section provides guidance.
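
To make the output formats listed above concrete, here is a minimal sketch of how aggregated Markdown could be turned into JSON or CSV records. The record schema (`heading`, `content`) and the function names `split_sections` and `convert` are assumptions made for illustration; the project's actual converter may structure its output differently.

```python
import csv
import io
import json
import re


def split_sections(markdown: str) -> list[dict]:
    """Split Markdown into one record per heading (hypothetical schema)."""
    records = []
    current = {"heading": "", "content": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section if it holds any text.
            if current["heading"] or any(s.strip() for s in current["content"]):
                records.append(current)
            current = {"heading": match.group(2).strip(), "content": []}
        else:
            current["content"].append(line)
    if current["heading"] or any(s.strip() for s in current["content"]):
        records.append(current)
    return [
        {"heading": r["heading"], "content": "\n".join(r["content"]).strip()}
        for r in records
    ]


def convert(markdown: str, output_format: str) -> str:
    """Return the Markdown unchanged, or serialized as JSON or CSV."""
    fmt = output_format.lower()
    if fmt == "markdown":
        return markdown
    records = split_sections(markdown)
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["heading", "content"])
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"Unsupported output format: {output_format}")
```

Splitting on headings keeps each record small enough to embed as a single chunk. Note that this naive splitter would also treat `#` comment lines inside fenced code blocks as headings.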
## Requirements for Local Development (Optional)
- Python 3.10+
- Node.js and npm (required by Repomix for GitHub repository processing)
- Repomix (install globally with `npm install -g repomix`, or run via `npx repomix` if `app.py` is adjusted accordingly)
- Project dependencies: `pip install -r requirements.txt`
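
Before running the app locally, it can help to confirm that the requirements above are actually on your `PATH`. The following is a small sketch using only the standard library; the function name `check_prerequisites` is hypothetical and not part of this project.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("node", "npm", "repomix")):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If you rely on `npx repomix` instead of a global install, `repomix` will be reported as missing even though the tool is usable, so treat that entry as informational.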
## HuggingFace Space Deployment
This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.
1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space will build the Docker image and launch the application. All features, including GitHub repository processing with Repomix, will be available.
## Using the Interface
1. **Enter URL or GitHub Repository ID**:
    * For websites: Enter a complete URL (e.g., `https://example.com`).
    * For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2. **Select Source Type**:
    * Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (for Webpages only; ignored for GitHub repositories):
    * 0: Only scrape the main page.
    * 1-3: Follow internal links recursively to the specified depth.
4. **Select Output Format**:
    * Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: Monitor progress and see a preview of the extracted content.
7. **Download File**: Download the generated dataset in your chosen format.
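
The input rules in steps 1 to 3 can be sketched as a small validation helper. This is an illustration only: `normalize_source` and `validate_depth` are hypothetical names, and the checks in `app.py` may be stricter or more lenient.

```python
import re
from urllib.parse import urlparse

# Shorthand GitHub ID such as "username/repo" (assumed character set).
SHORTHAND = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def normalize_source(value: str, source_type: str) -> str:
    """Return a full URL for the given input, or raise ValueError."""
    value = value.strip()
    parsed = urlparse(value)
    is_http = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    if source_type == "GitHub Repository":
        if SHORTHAND.match(value):
            return f"https://github.com/{value}"
        if is_http and parsed.netloc.lower() in ("github.com", "www.github.com"):
            return value
        raise ValueError("Enter a GitHub URL or a shorthand ID like username/repo")
    if source_type == "Webpage":
        if is_http:
            return value
        raise ValueError("Enter a complete URL, e.g. https://example.com")
    raise ValueError(f"Unknown source type: {source_type}")


def validate_depth(depth: int, source_type: str) -> int:
    """Depth applies to webpages only and must be between 0 and 3."""
    if source_type != "Webpage":
        return 0  # ignored for GitHub repositories
    if not 0 <= depth <= 3:
        raise ValueError("Scraping depth must be between 0 and 3")
    return depth
```

For example, `normalize_source("username/repo", "GitHub Repository")` expands the shorthand to `https://github.com/username/repo`, while `normalize_source("example.com", "Webpage")` raises because the scheme is missing.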
## How It Works
### Webpage Scraping
1. Fetches HTML content from the provided URL.
2. Converts HTML to clean Markdown.
3. If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON or CSV), or keeps it as Markdown.
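
The recursive part of the steps above can be sketched with the standard library alone. This is not the project's implementation: `crawl`, `internal_links`, and `fetch_html` are hypothetical names, "internal" is assumed to mean "same host", and the HTML-to-Markdown step is passed in as a `to_markdown` callable because the converter the project uses is not stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_html(url: str) -> str:
    """Download a page and decode it using the declared charset."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def internal_links(html: str, base_url: str) -> list[str]:
    """Return absolute, fragment-free links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href in collector.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in found:
                found.append(absolute)
    return found


def crawl(url, depth, to_markdown, fetch=fetch_html, visited=None):
    """Return one Markdown string per page, following links up to depth."""
    visited = set() if visited is None else visited
    if url in visited:
        return []
    visited.add(url)
    html = fetch(url)
    pages = [to_markdown(html)]
    if depth > 0:
        for link in internal_links(html, url):
            pages.extend(crawl(link, depth - 1, to_markdown, fetch, visited))
    return pages
```

The shared `visited` set prevents fetching the same page twice when pages link back to each other. Passing `fetch` as a parameter makes the crawler easy to exercise against an in-memory dictionary of pages instead of the network.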
### GitHub Repository Processing
1. Uses **Repomix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. Repomix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON or CSV), or kept as Markdown.
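
As a rough illustration of steps 1 and 2, Repomix can be driven from Python through `subprocess`. The helper names below are hypothetical, and the flags (`--remote`, `--style markdown`, `--output`) are the ones documented by the Repomix CLI as far as I know; check `repomix --help` for your installed version, since the invocation in `app.py` may differ.

```python
import subprocess
from pathlib import Path


def build_repomix_command(repo: str, output_path: str, use_npx: bool = False) -> list[str]:
    """Build the argument list for packing a remote repository as Markdown."""
    # Use "npx repomix" when Repomix is not installed globally.
    base = ["npx", "repomix"] if use_npx else ["repomix"]
    return base + ["--remote", repo, "--style", "markdown", "--output", output_path]


def run_repomix(repo: str, output_path: str = "repomix-output.md",
                use_npx: bool = False, timeout: int = 300) -> str:
    """Run Repomix and return the consolidated Markdown it wrote."""
    command = build_repomix_command(repo, output_path, use_npx)
    # check=True raises CalledProcessError if Repomix exits with an error.
    subprocess.run(command, check=True, capture_output=True, text=True, timeout=timeout)
    return Path(output_path).read_text(encoding="utf-8")
```

Passing the command as a list rather than a shell string avoids quoting problems and shell injection when the repository ID comes from user input.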
## Source Code
The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)
## License
This project is licensed under the MIT License.