---
title: RAG-Ready Content Scraper
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper
RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support, designed for efficiently scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.
This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality including GitHub repository processing via RepoMix.
## Features
- Dual Scraping Modes:
  - Webpage Scraping: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - GitHub Repository Processing: Processes GitHub repositories using RepoMix to create AI-friendly outputs.
- Multiple Output Formats: Generate datasets in Markdown, JSON, or CSV.
- Interactive Gradio Interface: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- HuggingFace Spaces Ready (Docker): Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- Pre-configured Examples: Includes example inputs for quick testing.
- In-UI Documentation: "How it Works" section provides guidance.
## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for RepoMix GitHub repository processing)
- RepoMix (can be installed globally with `npm install -g repomix`, or invoked via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`
## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the Docker SDK.

- Create a new HuggingFace Space.
- Choose "Docker" as the Space SDK.
- Select "Use an existing Dockerfile".
- Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
- The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.
## Using the Interface

- Enter URL or GitHub Repository ID:
  - For websites: enter a complete URL (e.g., `https://example.com`).
  - For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
- Select Source Type: choose "Webpage" or "GitHub Repository".
- Set Scraping Depth (for webpages only):
  - 0: only scrape the main page.
  - 1-3: follow internal links recursively to the specified depth (ignored for GitHub repos).
- Select Output Format: choose "Markdown", "JSON", or "CSV".
- Click "Process Content".
- View Status and Preview: monitor progress and see a preview of the extracted content.
- Download File: download the generated dataset in your chosen format.
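As an illustration of the input rules above, a small helper could distinguish the two source types. This is a hypothetical sketch (`classify_source` is not part of the app; the actual `app.py` may use different logic):

```python
import re


def classify_source(source: str) -> str:
    """Guess whether an input is a webpage URL or a GitHub repository.

    Hypothetical helper mirroring the UI's input rules; returns
    "github" for GitHub URLs or user/repo shorthand, else "webpage".
    """
    github_url = re.match(r"https?://github\.com/[\w.-]+/[\w.-]+/?$", source)
    shorthand = re.match(r"^[\w.-]+/[\w.-]+$", source)
    if github_url or shorthand:
        return "github"
    if re.match(r"https?://", source):
        return "webpage"
    raise ValueError(f"unrecognized source: {source!r}")
```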
## How It Works
### Webpage Scraping
- Fetches HTML content from the provided URL.
- Converts HTML to clean Markdown.
- If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
- Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps as Markdown).
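The fetch-convert-format pipeline above can be sketched in pure Python. This is a simplified, stdlib-only stand-in, not the app's actual implementation (which handles link extraction, recursion, and richer HTML-to-Markdown conversion):

```python
import csv
import io
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_markdown(html: str) -> str:
    """Reduce HTML to plain Markdown-ish text (stdlib-only stand-in)."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n\n".join(parser.chunks)


def to_output_format(markdown: str, source_url: str, fmt: str) -> str:
    """Wrap the scraped Markdown in the selected dataset format."""
    if fmt == "markdown":
        return markdown
    if fmt == "json":
        return json.dumps({"url": source_url, "content": markdown}, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["url", "content"])
        writer.writerow([source_url, markdown])
        return buf.getvalue()
    raise ValueError(f"unknown format: {fmt}")
```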
### GitHub Repository Processing
- Uses RepoMix (a Node.js tool) to fetch and process the specified GitHub repository.
- RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
- This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
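Since RepoMix is a Node.js CLI, invoking it from Python typically goes through `subprocess`. A minimal sketch, assuming RepoMix's `--remote` and `-o` flags (the app's actual invocation may differ):

```python
import subprocess
import tempfile
from pathlib import Path


def build_repomix_command(repo: str, output_path: str) -> list[str]:
    # Assumed flags: RepoMix's --remote accepts a GitHub URL or user/repo
    # shorthand, and -o sets the output file. Adjust if your version differs.
    return ["npx", "repomix", "--remote", repo, "-o", output_path]


def process_repository(repo: str) -> str:
    """Run RepoMix and return the consolidated Markdown it produces."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "repo.md"
        subprocess.run(build_repomix_command(repo, str(out_file)), check=True)
        return out_file.read_text(encoding="utf-8")
```

The returned Markdown can then be passed to the same format-conversion step used for webpage output.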
## Source Code
The source code for this project is available on HuggingFace Spaces: https://huggingface.co/spaces/CultriX/RAG-Scraper
## License
This project is licensed under the MIT License.