spagestic committed on commit 89b22f4 (verified; 1 parent: 0c66d86)

Update README.md

Files changed (1): README.md (+144 −0)
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts their content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown
- 🗺️ **Sitemap Generation**: Create organized sitemaps from all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools to AI assistants and LLMs
## Installation

1. Install the Python dependencies:

```bash
pip install -r requirements.txt
```
## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`

3. Enter a URL in the input field and click "Scrape Website"

4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page
### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`

3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`
#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your client configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```
## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML-to-markdown conversion
- `lxml`: XML and HTML parser
## Project Structure

```
web-scraper/
├── app.py                # Main web interface application
├── mcp_server.py         # MCP server with exposed tools
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json        # VS Code tasks
```
## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds the protocol if it is missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas
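The protocol-normalization and tag-stripping steps above can be sketched with the standard library alone. This is an illustrative sketch, not the code in `app.py` (which uses `requests` and BeautifulSoup); `normalize_url` and `TextExtractor` are hypothetical names:

```python
from html.parser import HTMLParser

def normalize_url(url: str) -> str:
    """Prepend https:// when the URL has no scheme (hypothetical helper)."""
    if not url.startswith(("http://", "https://")):
        return "https://" + url
    return url

class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script>, <style>, and <nav> subtrees."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.chunks = []    # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<html><body><nav>menu</nav><p>Hello</p><script>x()</script></body></html>"
p = TextExtractor()
p.feed(html)
print(normalize_url("example.com"))  # https://example.com
print(" ".join(p.chunks))            # Hello
```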
### Markdown Conversion

- Converts HTML to clean markdown
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading
### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits the displayed links to keep output manageable
- Filters out unwanted links (anchors, `javascript:` URLs, etc.)
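The resolve-and-classify logic above can be sketched with `urllib.parse`; `build_sitemap` is a hypothetical helper for illustration, not the tool exported by `mcp_server.py`:

```python
from urllib.parse import urljoin, urlparse

def build_sitemap(base_url: str, hrefs: list) -> dict:
    """Resolve relative links and split them into internal vs. external buckets."""
    base_host = urlparse(base_url).netloc
    sitemap = {"internal": [], "external": []}
    for href in hrefs:
        # Skip anchors and javascript: pseudo-links.
        if href.startswith(("#", "javascript:")):
            continue
        absolute = urljoin(base_url, href)  # relative -> absolute
        bucket = "internal" if urlparse(absolute).netloc == base_host else "external"
        if absolute not in sitemap[bucket]:
            sitemap[bucket].append(absolute)
    return sitemap

links = ["/about", "#top", "https://example.org/x", "javascript:void(0)"]
print(build_sitemap("https://example.com", links))
# {'internal': ['https://example.com/about'], 'external': ['https://example.org/x']}
```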
## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - Python official website
## Error Handling

The application includes error handling for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
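As a rough sketch of how such failures might be mapped to status messages (using stdlib `urllib` exceptions for illustration; the app itself works with `requests`, and `describe_error` is a hypothetical name):

```python
import socket
import urllib.error

def describe_error(exc: Exception) -> str:
    """Map common fetch failures to user-facing status messages."""
    # Check HTTPError first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "Network timeout"
    if isinstance(exc, (urllib.error.URLError, ValueError)):
        return "Invalid URL or unreachable host"
    return "Content parsing issue"

print(describe_error(urllib.error.HTTPError("https://example.com", 404, "Not Found", None, None)))
# HTTP error 404
```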
## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class
- The content extraction selectors
- The markdown formatting rules
- The link filtering criteria
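For example, a User-Agent or link-filter override might look like the following. The `WebScraper` stub below is illustrative only, not the class defined in `app.py`:

```python
# Hedged sketch: the real WebScraper lives in app.py; this stub only
# illustrates the customization points listed above.
class WebScraper:
    HEADERS = {"User-Agent": "Mozilla/5.0 (web-scraper)"}  # default UA (assumed)

    def keep_link(self, href: str) -> bool:
        # Default filter: drop anchors and javascript: pseudo-links.
        return not href.startswith(("#", "javascript:"))

class CustomScraper(WebScraper):
    HEADERS = {"User-Agent": "my-custom-bot/1.0"}  # swapped-in User-Agent

    def keep_link(self, href: str) -> bool:
        # Tighter filter: also drop mailto: links.
        return super().keep_link(href) and not href.startswith("mailto:")

print(CustomScraper().keep_link("/docs"))       # True
print(CustomScraper().keep_link("mailto:a@b"))  # False
```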