Spaces: CultriX/RAG-Scraper

CultriX committed on May 17

Commit c09533d (verified), 1 parent: ad147d8

Update README.md

Files changed (1)

README.md +105 -4

@@ -4,11 +4,112 @@ emoji: 🥳
 colorFrom: blue
 colorTo: gray
 sdk: gradio
-sdk_version: 4.44.1
+sdk_version: 5.29.1
 app_file: app.py
 pinned: false
 license: creativeml-openrail-m
-short_description: 'Scrape webpages for RAG purposes'
-#thumbnail: >-
-#  https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/YQdpDtR9myOBCOzUDLaAE.png
+short_description: Scrape webpages for RAG purposes
 ---
+# RAG-Scraper
+RAG-Scraper is a Python tool for scraping web documentation and other web content. It is tailored for Retrieval-Augmented Generation (RAG) systems: it extracts page text and converts it to Markdown, a structured format that is ready for machine-learning pipelines.
+## Features
+- **Web Scraping**: Scrape web content and convert it to Markdown format
+- **Recursive Depth**: Control how deep the scraper should follow links
+- **GitHub Repository Support**: Process GitHub repositories using Repomix to create AI-friendly outputs (when run locally)
+- **Gradio Interface**: Easy-to-use web interface for all functionality
+- **Hugging Face Spaces Compatible**: Can be deployed as a Hugging Face Space (with limited functionality)
+## Requirements
+- Python 3.10+
+- Node.js (for Repomix GitHub repository processing)
+- Repomix (installed via npm or used with npx)
+## Installation
+1. Clone the repository:
+```bash
+git clone https://github.com/yourusername/RAG-Scraper.git
+cd RAG-Scraper
+```
+2. Install Python dependencies:
+```bash
+pip install -r requirements.txt
+```
+3. For GitHub repository processing, ensure Node.js is installed and either:
+   - Install Repomix globally: `npm install -g repomix`
+   - Or use npx to run it without installation (the app supports this)
+## Usage
+### Running the Gradio Interface
+```bash
+python app.py
+```
+This will start the Gradio web interface, accessible at http://localhost:7860 by default.
+### Using the Interface
+1. **Enter a URL or GitHub Repository**:
+   - For websites: Enter a complete URL (e.g., `https://example.com`)
+   - For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand notation (e.g., `username/repo`)
+2. **Set Search Depth** (for websites only):
+   - 0: Only scrape the main page
+   - 1-3: Follow links recursively to the specified depth
+3. **Select Input Type**:
+   - Auto: Automatically detect if the input is a website or GitHub repository
+   - Website: Force processing as a website
+   - GitHub: Force processing as a GitHub repository
+4. **Click Submit** to process the input and view the results
+## How It Works
+### Web Scraping
+For websites, RAG-Scraper:
+1. Fetches the HTML content from the URL
+2. Converts the HTML to Markdown
+3. If depth > 0, extracts internal links and repeats the process for each link
+### GitHub Repository Processing
+For GitHub repositories, RAG-Scraper:
+1. Detects whether the input is a GitHub repository URL or `username/repo` shorthand
+2. Uses Repomix to fetch and process the repository
+3. Returns the repository content in a structured, AI-friendly format
+## Examples
+The interface includes example inputs to demonstrate both web scraping and GitHub repository processing:
+- `https://example.com` - Basic website example
+- `yamadashy/repomix` - GitHub repository using shorthand notation
+- `https://github.com/yamadashy/repomix` - GitHub repository using full URL
+## Hugging Face Spaces Deployment
+This application can be deployed as a Hugging Face Space, with some limitations:
+- **Web Scraping**: Fully functional for scraping websites and converting them to Markdown
+- **GitHub Repository Processing**: Not available on Hugging Face Spaces, because Node.js and the npm/npx commands cannot be run there
+- **User Experience**: The interface tells users which features are available
+When deployed on Hugging Face Spaces, the application detects the environment automatically and shows a message to users who try the GitHub repository processing feature.
+To use the full functionality, including GitHub repository processing with Repomix, run the application locally by following the installation instructions above.
+## License
+This project is licensed under the MIT License.
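
The web scraping steps listed under "How It Works" in the README (fetch the HTML, convert it to Markdown, then follow internal links up to the chosen depth) can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the code in `app.py`. The names `scrape`, `internal_links`, `html_to_markdown`, and `fetch_html` are hypothetical, and the small heading-and-text converter is only a stand-in for whatever HTML-to-Markdown conversion the project actually uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextExtractor(HTMLParser):
    """Crude stand-in for a real HTML-to-Markdown converter (assumption)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Render <h1>..<h6> as Markdown headings; all other tags are dropped.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.parts.append("\n" + "#" * int(tag[1]) + " ")

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip() + "\n")


def html_to_markdown(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()


def fetch_html(url):
    """Default fetcher; any callable mapping a URL to HTML can replace it."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def internal_links(base_url, html):
    """Return absolute, de-duplicated links on the same host as base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(base_url).netloc
    result = []
    for href in parser.links:
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in result:
            result.append(absolute)
    return result


def scrape(url, depth, fetch=fetch_html):
    """Return {url: markdown} for url and pages within `depth` link hops."""
    pages = {}
    frontier = [url]
    for level in range(depth + 1):
        next_frontier = []
        for current in frontier:
            if current in pages:
                continue  # already scraped at a shallower level
            html = fetch(current)
            pages[current] = html_to_markdown(html)
            if level < depth:
                next_frontier.extend(internal_links(current, html))
        frontier = next_frontier
    return pages
```

With depth 0 only the starting page is returned, matching the "Only scrape the main page" setting. Passing a custom `fetch` callable (for example, a dictionary lookup over canned HTML) allows the traversal to be exercised without network access.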