---
title: RAG-Ready Content Scraper
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper
RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support, designed for efficiently scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.
This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.
## Features
- Dual Scraping Modes:
  - Webpage Scraping: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - GitHub Repository Processing: Processes GitHub repositories using RepoMix to create AI-friendly outputs.
- Multiple Output Formats: Generate datasets in Markdown, JSON, or CSV.
- Interactive Gradio Interface: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- HuggingFace Spaces Ready (Docker): Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- Pre-configured Examples: Includes example inputs for quick testing.
- In-UI Documentation: "How it Works" section provides guidance.
## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for RepoMix GitHub repository processing)
- RepoMix (can be installed globally with `npm install -g repomix`, or invoked via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`
## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the Docker SDK.

- Create a new HuggingFace Space.
- Choose "Docker" as the Space SDK.
- Select "Use an existing Dockerfile".
- Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
- The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.
## Using the Interface

- Enter URL or GitHub Repository ID:
  - For websites: enter a complete URL (e.g., `https://example.com`).
  - For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
- Select Source Type: choose "Webpage" or "GitHub Repository".
- Set Scraping Depth (for webpages only):
  - 0: only scrape the main page.
  - 1-3: follow internal links recursively to the specified depth (ignored for GitHub repos).
- Select Output Format: choose "Markdown", "JSON", or "CSV".
- Click "Process Content".
- View Status and Preview: monitor progress and see a preview of the extracted content.
- Download File: download the generated dataset in your chosen format.
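As an illustration of the input rules above, a small helper could distinguish the two source types. This is a hypothetical sketch (`classify_source` is not part of the app; the actual `app.py` may use different logic):

```python
import re


def classify_source(source: str) -> str:
    """Guess whether an input is a webpage URL or a GitHub repository.

    Hypothetical helper mirroring the UI's input rules; returns
    "github" for GitHub URLs or user/repo shorthand, else "webpage".
    """
    github_url = re.match(r"https?://github\.com/[\w.-]+/[\w.-]+/?$", source)
    shorthand = re.match(r"^[\w.-]+/[\w.-]+$", source)
    if github_url or shorthand:
        return "github"
    if re.match(r"https?://", source):
        return "webpage"
    raise ValueError(f"unrecognized source: {source!r}")
```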
## How It Works
### Webpage Scraping
- Fetches HTML content from the provided URL.
- Converts HTML to clean Markdown.
- If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
- Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps as Markdown).
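The fetch-convert-format pipeline above can be sketched in pure Python. This is a simplified, stdlib-only stand-in, not the app's actual implementation (which handles link extraction, recursion, and richer HTML-to-Markdown conversion):

```python
import csv
import io
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_markdown(html: str) -> str:
    """Reduce HTML to plain Markdown-ish text (stdlib-only stand-in)."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n\n".join(parser.chunks)


def to_output_format(markdown: str, source_url: str, fmt: str) -> str:
    """Wrap the scraped Markdown in the selected dataset format."""
    if fmt == "markdown":
        return markdown
    if fmt == "json":
        return json.dumps({"url": source_url, "content": markdown}, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["url", "content"])
        writer.writerow([source_url, markdown])
        return buf.getvalue()
    raise ValueError(f"unknown format: {fmt}")
```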
### GitHub Repository Processing
- Uses RepoMix (a Node.js tool) to fetch and process the specified GitHub repository.
- RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
- This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
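Since RepoMix is a Node.js CLI, invoking it from Python typically goes through `subprocess`. A minimal sketch, assuming RepoMix's `--remote` and `-o` flags (the app's actual invocation may differ):

```python
import subprocess
import tempfile
from pathlib import Path


def build_repomix_command(repo: str, output_path: str) -> list[str]:
    # Assumed flags: RepoMix's --remote accepts a GitHub URL or user/repo
    # shorthand, and -o sets the output file. Adjust if your version differs.
    return ["npx", "repomix", "--remote", repo, "-o", output_path]


def process_repository(repo: str) -> str:
    """Run RepoMix and return the consolidated Markdown it produces."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "repo.md"
        subprocess.run(build_repomix_command(repo, str(out_file)), check=True)
        return out_file.read_text(encoding="utf-8")
```

The returned Markdown can then be passed to the same format-conversion step used for webpage output.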
## Source Code
The source code for this project is available on HuggingFace Spaces: https://huggingface.co/spaces/CultriX/RAG-Scraper
## License
This project is licensed under the MIT License.