---
title: RAG-Ready Content Scraper
emoji: πŸš€
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper
RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support for efficiently scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.
This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality, including GitHub repository processing via Repomix.
## Features
- **Dual Scraping Modes**:
- **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - **GitHub Repository Processing**: Runs **Repomix** over a GitHub repository to produce a consolidated, AI-friendly output.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV (a conversion sketch follows this list).
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: "How it Works" section provides guidance.
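As referenced above, here is a minimal sketch of what the Markdown-to-JSON/CSV conversion step could look like. The field names (`source`, `content`) and helper names are illustrative assumptions, not this tool's documented schema:

```python
# Hypothetical conversion helpers; the "source"/"content" record layout
# is an assumption for illustration, not the tool's actual schema.
import csv
import io
import json


def to_json(url: str, markdown: str) -> str:
    """Wrap scraped Markdown in a single JSON record."""
    return json.dumps({"source": url, "content": markdown}, ensure_ascii=False, indent=2)


def to_csv(url: str, markdown: str) -> str:
    """Emit a one-row CSV with the same fields."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "content"])
    writer.writerow([url, markdown])
    return buf.getvalue()
```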
## Requirements for Local Development (Optional)
- Python 3.10+
- Node.js and npm (required for GitHub repository processing with Repomix)
- Repomix (can be installed globally: `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`
## HuggingFace Space Deployment
This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.
1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space will build the Docker image and launch the application. All features, including GitHub repository processing with Repomix, will be available.
## Using the Interface
1. **Enter URL or GitHub Repository ID**:
* For websites: Enter a complete URL (e.g., `https://example.com`).
   * For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`; see the normalization sketch after this list).
2. **Select Source Type**:
* Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (webpages only; ignored for GitHub repositories):
   * 0: Scrape only the main page.
   * 1-3: Follow internal links recursively to the specified depth.
4. **Select Output Format**:
* Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: Monitor progress and see a preview of the extracted content.
7. **Download File**: Download the generated dataset in your chosen format.
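As noted in step 1, both full URLs and `username/repo` shorthand are accepted. A minimal sketch of how that input might be normalized (`normalize_repo` is a hypothetical helper; the actual parsing in `app.py` may differ):

```python
def normalize_repo(value: str) -> str:
    """Expand `username/repo` shorthand into a full GitHub URL.

    Hypothetical helper; app.py's real input handling may differ.
    """
    if value.startswith(("http://", "https://")):
        return value
    return f"https://github.com/{value}"
```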
## How It Works
### Webpage Scraping
1. Fetches HTML content from the provided URL.
2. Converts HTML to clean Markdown.
3. If scraping depth > 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON or CSV), or keeps it as Markdown.
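A minimal sketch of this fetch-convert-recurse loop, assuming `requests`, `beautifulsoup4`, and `html2text` as stand-ins for whatever the `rag_scraper` package actually uses:

```python
# Illustrative depth-limited scraper; the real logic lives in the
# `rag_scraper` package, and the libraries here are assumptions.
from urllib.parse import urljoin, urlparse

import requests
import html2text
from bs4 import BeautifulSoup


def scrape_to_markdown(url: str, depth: int, visited: set[str] | None = None) -> str:
    """Fetch a page, convert it to Markdown, and recurse into internal links."""
    visited = visited if visited is not None else set()
    if url in visited:
        return ""
    visited.add(url)

    html = requests.get(url, timeout=10).text
    markdown = html2text.HTML2Text().handle(html)

    if depth > 0:
        soup = BeautifulSoup(html, "html.parser")
        base_host = urlparse(url).netloc
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == base_host:  # follow internal links only
                markdown += "\n\n" + scrape_to_markdown(link, depth - 1, visited)
    return markdown
```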
### GitHub Repository Processing
1. Uses **Repomix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. Repomix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON or CSV), or kept as Markdown.
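A rough sketch of driving Repomix from Python via `subprocess`. The exact command `app.py` runs is an assumption here; `--remote`, `--style`, and `-o` are documented Repomix options, but the invocation below is illustrative only:

```python
# Hypothetical Repomix invocation; app.py's actual flags may differ.
import subprocess
import tempfile
from pathlib import Path


def repomix_to_markdown(repo: str) -> str:
    """Run Repomix against a remote repo (URL or `user/repo`) and return Markdown."""
    out_file = Path(tempfile.mkdtemp()) / "repomix-output.md"
    subprocess.run(
        ["npx", "repomix", "--remote", repo, "--style", "markdown", "-o", str(out_file)],
        check=True,
    )
    return out_file.read_text(encoding="utf-8")
```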
## Source Code
The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)
## License
This project is licensed under the MIT License.