---
title: Web Scraper
emoji: 🕷️
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Web Scraper & Sitemap Generator
A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.
## Features
- Web Scraping: Extract text content from any website
- Markdown Conversion: Convert scraped HTML content to clean markdown format
- Sitemap Generation: Create organized sitemaps based on all links found on the page
- User-Friendly Interface: Easy-to-use Gradio web interface
- Link Organization: Separate internal and external links for better navigation
- MCP Server: Expose scraping tools for AI assistants and LLMs
## Installation
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage
### Web Interface

1. Run the web application:

   ```bash
   python app.py
   ```

2. Open your browser and navigate to http://localhost:7861

3. Enter a URL in the input field and click "Scrape Website"

4. View the results:
   - Status: Shows success/error messages
   - Scraped Content: Website content converted to markdown
   - Sitemap: Organized list of all links found on the page
### MCP Server

1. Run the MCP server:

   ```bash
   python mcp_server.py
   ```

2. The server will be available at http://localhost:7862

MCP Endpoint: `http://localhost:7862/gradio_api/mcp/sse`
## Available MCP Tools

- `scrape_content`: Extract and format website content as markdown
- `generate_sitemap`: Generate a sitemap of all links found on a webpage
- `analyze_website`: Complete website analysis with both content and sitemap
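
The snippet below is a minimal sketch of how `mcp_server.py` could expose these three tools through Gradio's MCP support; the tool names come from this README, but the function bodies and the interface layout are assumptions rather than the actual implementation.

```python
import gradio as gr

# Stub signatures only; the real scraping logic lives in the project code.
def scrape_content(url: str) -> str:
    """Extract and format website content as markdown."""
    return "..."

def generate_sitemap(url: str) -> str:
    """Generate a sitemap of all links found on a webpage."""
    return "..."

def analyze_website(url: str) -> str:
    """Complete website analysis with both content and sitemap."""
    return "..."

demo = gr.TabbedInterface(
    [gr.Interface(fn, inputs="text", outputs="text")
     for fn in (scrape_content, generate_sitemap, analyze_website)],
    tab_names=["Scrape Content", "Generate Sitemap", "Analyze Website"],
)

if __name__ == "__main__":
    # mcp_server=True exposes the wrapped functions as MCP tools
    # under /gradio_api/mcp/sse (requires the gradio[mcp] extra).
    demo.launch(server_port=7862, mcp_server=True)
```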
## MCP Client Configuration

To use with Claude Desktop or other MCP clients, add this to your configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```
## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML to markdown conversion
- `lxml`: XML and HTML parser
## Project Structure

```text
web-scraper/
├── app.py                      # Main web interface application
├── mcp_server.py               # MCP server with exposed tools
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json              # VS Code tasks
```
## Feature Details
### Web Scraping
- Handles both HTTP and HTTPS URLs
- Automatically adds protocol if missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas
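
A rough sketch of that extraction flow, using `requests` and BeautifulSoup from the dependency list (the function name and the selectors are illustrative, not the actual `app.py` code):

```python
import requests
from bs4 import BeautifulSoup

def fetch_main_content(url: str):
    # Add a protocol if the user omitted one
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # Drop elements that are not part of the main content
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Prefer a dedicated content container when the page has one
    return soup.find("main") or soup.find("article") or soup.body
```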
### Markdown Conversion
- Converts HTML to clean markdown format
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds page title as main heading
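
A small sketch of this step with `markdownify` (`heading_style="ATX"` is a real markdownify option; the cleanup regexes and function name are illustrative):

```python
import re
from markdownify import markdownify as md

def html_to_markdown(html: str, title: str) -> str:
    markdown = md(html, heading_style="ATX")                # '#'-style headings
    markdown = re.sub(r"\[\s*\]\([^)]*\)", "", markdown)    # remove empty links
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)          # collapse extra blank lines
    return f"# {title}\n\n{markdown.strip()}"
```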
### Sitemap Generation
- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs external)
- Limits display to prevent overwhelming output
- Filters out unwanted links (anchors, javascript, etc.)
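
The link handling described above could look roughly like this (the function name and the display limit are assumptions):

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def build_sitemap(html: str, base_url: str, limit: int = 50):
    soup = BeautifulSoup(html, "lxml")
    base_domain = urlparse(base_url).netloc
    internal, external = [], []

    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        # Filter out anchors, javascript: and mailto: links
        if href.startswith(("#", "javascript:", "mailto:")):
            continue
        absolute = urljoin(base_url, href)  # make relative URLs absolute
        bucket = internal if urlparse(absolute).netloc == base_domain else external
        if absolute not in bucket:
            bucket.append(absolute)

    # Cap the lists so very link-heavy pages stay readable
    return internal[:limit], external[:limit]
```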
## Example URLs to Try
- https://httpbin.org/html - Simple test page
- https://example.com - Basic example site
- https://python.org - Python official website
## Error Handling
The application includes comprehensive error handling for:
- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
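
As an illustration, the request layer might map those failure modes to user-facing messages along these lines (the exact wording and function name in `app.py` may differ):

```python
import requests

def safe_fetch(url: str) -> tuple[bool, str]:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return True, response.text
    except requests.exceptions.MissingSchema:
        return False, f"Invalid URL: {url}"
    except requests.exceptions.Timeout:
        return False, "Request timed out"
    except requests.exceptions.HTTPError as err:
        return False, f"HTTP error {err.response.status_code}"
    except requests.exceptions.RequestException as err:
        return False, f"Network error: {err}"
```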
## Customization
You can customize the scraper by modifying:
- User-Agent string in the `WebScraper` class
- Content extraction selectors
- Markdown formatting rules
- Link filtering criteria
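
For example, swapping in a different User-Agent only means changing the headers sent with each request (the header value below is illustrative):

```python
import requests

CUSTOM_HEADERS = {"User-Agent": "WebScraperBot/1.0 (+https://example.com/contact)"}
response = requests.get("https://example.com", headers=CUSTOM_HEADERS, timeout=10)
```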