CultriX committed on
Commit
c09533d
·
verified ·
1 Parent(s): ad147d8

Update README.md

Files changed (1)
  1. README.md +105 -4
README.md CHANGED
@@ -4,11 +4,112 @@ emoji: 🥳
4
  colorFrom: blue
5
  colorTo: gray
6
  sdk: gradio
7
- sdk_version: 4.44.1
8
  app_file: app.py
9
  pinned: false
10
  license: creativeml-openrail-m
11
- short_description: 'Scrape webpages for RAG purposes'
12
- #thumbnail: >-
13
- # https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/YQdpDtR9myOBCOzUDLaAE.png
14
  ---
4
  colorFrom: blue
5
  colorTo: gray
6
  sdk: gradio
7
+ sdk_version: 5.29.1
8
  app_file: app.py
9
  pinned: false
10
  license: creativeml-openrail-m
11
+ short_description: Scrape webpages for RAG purposes
 
 
12
  ---
13
+
14
+
15
+ # RAG-Scraper
16
+
17
+ RAG-Scraper is a Python tool for scraping web documentation and other web content. It is tailored for Retrieval-Augmented Generation (RAG) systems: it extracts text and preprocesses it into structured Markdown suitable for machine-learning pipelines.
18
+
19
+ ## Features
20
+
21
+ - **Web Scraping**: Scrape web content and convert it to Markdown format
22
+ - **Recursive Depth**: Control how deep the scraper should follow links
23
+ - **GitHub Repository Support**: Process GitHub repositories using Repomix to create AI-friendly outputs (when run locally)
24
+ - **Gradio Interface**: Easy-to-use web interface for all functionality
25
+ - **Hugging Face Spaces Compatible**: Can be deployed as a Hugging Face Space (GitHub repository processing is unavailable there)
26
+
27
+ ## Requirements
28
+
29
+ - Python 3.10+
30
+ - Node.js (for Repomix GitHub repository processing)
31
+ - Repomix (installed via npm or used with npx)
32
+
33
+ ## Installation
34
+
35
+ 1. Clone the repository:
36
+ ```bash
37
+ git clone https://github.com/yourusername/RAG-Scraper.git
38
+ cd RAG-Scraper
39
+ ```
40
+
41
+ 2. Install Python dependencies:
42
+ ```bash
43
+ pip install -r requirements.txt
44
+ ```
45
+
46
+ 3. For GitHub repository processing, ensure Node.js is installed and either:
47
+ - Install Repomix globally: `npm install -g repomix`
48
+ - Or use npx to run it without installation (the app supports this)
49
+
50
+ ## Usage
51
+
52
+ ### Running the Gradio Interface
53
+
54
+ ```bash
55
+ python app.py
56
+ ```
57
+
58
+ This starts the Gradio web interface, which is served at http://localhost:7860 by default.
59
+
60
+ ### Using the Interface
61
+
62
+ 1. **Enter a URL or GitHub Repository**:
63
+ - For websites: Enter a complete URL (e.g., `https://example.com`)
64
+ - For GitHub repositories: Enter a URL (e.g., `https://github.com/username/repo`) or shorthand notation (e.g., `username/repo`)
65
+
66
+ 2. **Set Search Depth** (for websites only):
67
+ - 0: Only scrape the main page
68
+ - 1-3: Follow links recursively to the specified depth
69
+
70
+ 3. **Select Input Type**:
71
+ - Auto: Automatically detect if the input is a website or GitHub repository
72
+ - Website: Force processing as a website
73
+ - GitHub: Force processing as a GitHub repository
74
+
75
+ 4. **Click Submit** to process the input and view the results
76
+
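The "Auto" input type described above could be approximated as follows. This is only a sketch under assumptions: the function name `detect_input_type`, the shorthand pattern, and the `"unknown"` fallback are hypothetical and are not taken from `app.py`.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for "username/repo" shorthand (an assumption, not the app's rule).
# Caveat: a scheme-less input such as "example.com/docs" also matches this pattern.
_SHORTHAND = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+$")


def detect_input_type(text):
    """Classify user input as 'github', 'website', or 'unknown'."""
    text = text.strip()
    if _SHORTHAND.match(text):
        return "github"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        parts = [p for p in parsed.path.split("/") if p]
        # A GitHub repository URL needs at least an owner and a repository name.
        if host in ("github.com", "www.github.com") and len(parts) >= 2:
            return "github"
        return "website"
    return "unknown"
```

Selecting "Website" or "GitHub" explicitly would bypass a check like this one, which is useful when the heuristic guesses wrong.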
77
+ ## How It Works
78
+
79
+ ### Web Scraping
80
+
81
+ For websites, RAG-Scraper:
82
+ 1. Fetches the HTML content from the URL
83
+ 2. Converts the HTML to Markdown
84
+ 3. If depth > 0, extracts internal links and repeats the process for each link
85
+
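The three steps above can be sketched as a depth-limited recursive crawl. This is an illustration under assumptions, not the project's implementation: `crawl`, `internal_links`, and `LinkCollector` are hypothetical names, and the `fetch` and `to_markdown` callables are injected so that any HTTP client and any HTML-to-Markdown converter can be plugged in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def internal_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for href in collector.links:
        absolute = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == base_host and absolute not in found:
            found.append(absolute)
    return found


def crawl(url, depth, fetch, to_markdown, visited=None):
    """Map each visited URL to its converted content, following links up to depth."""
    visited = set() if visited is None else visited
    if url in visited:
        return {}
    visited.add(url)
    html = fetch(url)                    # step 1: fetch the HTML
    pages = {url: to_markdown(html)}     # step 2: convert it to Markdown
    if depth > 0:                        # step 3: recurse into internal links
        for link in internal_links(html, url):
            pages.update(crawl(link, depth - 1, fetch, to_markdown, visited))
    return pages
```

The `visited` set keeps pages that link back to each other from being fetched twice, which matters as soon as the depth is greater than 1.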
86
+ ### GitHub Repository Processing
87
+
88
+ For GitHub repositories, RAG-Scraper:
89
+ 1. Detects whether the input is a GitHub repository URL or shorthand notation (`username/repo`)
90
+ 2. Uses Repomix to fetch and process the repository
91
+ 3. Returns the repository content in a structured, AI-friendly format
92
+
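As a rough illustration of step 2, the Repomix command line could be assembled as below, preferring a global install and falling back to npx. The helper name `build_repomix_command` and the injectable `which` parameter are hypothetical; the sketch assumes Repomix's `--remote` and `--output` options and does not reflect how `app.py` actually calls Repomix.

```python
import shutil


def build_repomix_command(repo, output_path, which=shutil.which):
    """Return an argv list for Repomix, or None if neither repomix nor npx is found."""
    if which("repomix"):
        base = ["repomix"]                   # globally installed via npm
    elif which("npx"):
        base = ["npx", "--yes", "repomix"]   # run without a prior install
    else:
        return None                          # e.g. no Node.js available
    # Assumed Repomix options: --remote for the repository, --output for the file.
    return base + ["--remote", repo, "--output", output_path]
```

The resulting list can be passed to `subprocess.run`; returning `None` gives the caller a clean way to report that Node.js tooling is missing.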
93
+ ## Examples
94
+
95
+ The interface includes example inputs to demonstrate both web scraping and GitHub repository processing:
96
+ - `https://example.com` - Basic website example
97
+ - `yamadashy/repomix` - GitHub repository using shorthand notation
98
+ - `https://github.com/yamadashy/repomix` - GitHub repository using full URL
99
+
100
+ ## HuggingFace Spaces Deployment
101
+
102
+ This application can be deployed as a HuggingFace Space, but with some limitations:
103
+
104
+ - **Web Scraping**: Fully functional for scraping websites and converting to Markdown
105
+ - **GitHub Repository Processing**: Not available on Hugging Face Spaces, because the environment lacks Node.js and cannot run npm/npx commands
106
+ - **User Experience**: The interface will provide clear messages about feature availability
107
+
108
+ When deployed on Hugging Face Spaces, the application detects the environment automatically and shows an explanatory message to users who try the GitHub repository processing feature.
109
+
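One way such environment detection might work is to check the `SPACE_ID` variable that Hugging Face sets inside a running Space. This is a sketch under that assumption; the function names and the message text are hypothetical, and the application may use a different check.

```python
import os


def running_on_spaces(environ=None):
    """Return True when the SPACE_ID variable is present and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("SPACE_ID"))


def github_feature_message(environ=None):
    """Return a notice for Spaces users, or None when running locally."""
    if running_on_spaces(environ):
        return (
            "GitHub repository processing is not available on Hugging Face "
            "Spaces. Run the application locally to use it."
        )
    return None
```

Passing a plain dictionary as `environ` makes the check easy to exercise without changing the real process environment.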
110
+ For full functionality, including GitHub repository processing with Repomix, run the application locally by following the installation instructions above.
111
+
112
+ ## License
113
+
114
+ This project is licensed under the MIT License.
115
+