---
title: Web Scraper
emoji: 🕷️
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- 🗺️ **Sitemap Generation**: Create organized sitemaps based on all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools for AI assistants and LLMs

## Installation

1. Install Python dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`

3. Enter a URL in the input field and click "Scrape Website"

4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page

### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`

3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
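
Gradio's MCP support exposes the app's API endpoints as MCP tools when the app is launched with `mcp_server=True`. Below is a minimal sketch of that pattern; the function body and interface layout are placeholders, not the actual `mcp_server.py` code.

```python
# Minimal sketch only; the real mcp_server.py defines three tools and differs in detail.
import gradio as gr

def scrape_content(url: str) -> str:
    """Extract and format website content as markdown.

    Args:
        url: Address of the page to scrape.
    """
    # Placeholder body: fetch the page, clean the HTML, convert to markdown.
    return f"# Content scraped from {url}"

demo = gr.Interface(
    fn=scrape_content,
    inputs=gr.Textbox(label="URL"),
    outputs=gr.Textbox(label="Markdown"),
)

if __name__ == "__main__":
    # mcp_server=True also serves the MCP endpoint at /gradio_api/mcp/sse.
    demo.launch(server_port=7862, mcp_server=True)
```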

#### MCP Client Configuration

To use with Claude Desktop or other MCP clients, add this to your configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```

## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML to markdown conversion
- `lxml`: XML and HTML parser
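
For reference, a `requirements.txt` matching this list would look roughly as follows (version pins, if any, are up to the actual file):

```
gradio[mcp]
requests
beautifulsoup4
markdownify
lxml
```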

## Project Structure

```
web-scraper/
├── app.py                   # Main web interface application
├── mcp_server.py            # MCP server with exposed tools
├── requirements.txt         # Python dependencies
├── README.md                # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json           # VS Code tasks
```

## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds the protocol if it is missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas (see the sketch below)
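
A rough sketch of that flow using `requests` and BeautifulSoup; the tag list and content selectors here are illustrative, not necessarily the ones used in `app.py`:

```python
# Illustrative sketch of the fetch-and-clean step, not the exact app.py code.
import requests
from bs4 import BeautifulSoup

def fetch_clean_html(url: str):
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # add a protocol if it is missing

    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    # Drop elements that rarely contain useful text.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Prefer an explicit main-content area when the page provides one.
    return soup.find("main") or soup.find("article") or soup.body or soup
```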

### Markdown Conversion

- Converts HTML to clean markdown format
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading (see the sketch below)
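
A sketch of how that conversion might look with the `markdownify` package from the dependency list; the exact formatting rules in `app.py` may differ:

```python
# Illustrative sketch of the HTML-to-markdown step, not the exact app.py code.
import re
from markdownify import markdownify as md

def html_to_markdown(html: str, title: str) -> str:
    markdown = md(html, heading_style="ATX")               # "#"-style headings
    markdown = re.sub(r"\[\s*\]\([^)]*\)", "", markdown)   # drop empty links
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)          # collapse blank runs
    return f"# {title}\n\n{markdown.strip()}"
```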

### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits the number of links displayed to prevent overwhelming output
- Filters out unwanted links (anchors, `javascript:` links, etc.), as sketched below
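
A minimal sketch of that link handling with the standard library's `urllib.parse` and BeautifulSoup; the display limit and filter list are illustrative:

```python
# Illustrative sketch of link collection, not the exact app.py code.
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def collect_links(soup: BeautifulSoup, base_url: str, limit: int = 50):
    base_domain = urlparse(base_url).netloc
    internal, external = [], []

    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        if href.startswith(("#", "javascript:", "mailto:")):
            continue  # skip anchors and non-navigational links
        absolute = urljoin(base_url, href)  # resolve relative URLs
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)

    return internal[:limit], external[:limit]
```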

## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - Python official website

## Error Handling

The application includes error handling for the following cases (a typical pattern is sketched below):

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
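
For illustration, a typical way to cover these cases with `requests`; the actual error messages and structure in `app.py` may differ:

```python
# Illustrative error-handling sketch, not the exact app.py code.
import requests

def safe_fetch(url: str) -> tuple[bool, str]:
    """Return (success, page text or error message)."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
        return True, response.text
    except requests.exceptions.MissingSchema:
        return False, f"Invalid URL: {url}"
    except requests.exceptions.Timeout:
        return False, "The request timed out."
    except requests.exceptions.HTTPError as err:
        return False, f"HTTP error: {err}"
    except requests.exceptions.RequestException as err:
        return False, f"Request failed: {err}"
```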

## Customization

You can customize the scraper by modifying:

- User-Agent string in the `WebScraper` class (see the example below)
- Content extraction selectors
- Markdown formatting rules
- Link filtering criteria
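
For example, a custom User-Agent can be set on a `requests` session; where exactly the `WebScraper` class keeps its headers is an implementation detail, so this standalone snippet only illustrates the idea:

```python
# Standalone illustration: apply the same header change wherever WebScraper makes requests.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; MyCustomScraper/1.0)"
})
response = session.get("https://example.com", timeout=10)
```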