---
title: Web Scraper
emoji: πŸš€
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

Features

  • πŸ•·οΈ Web Scraping: Extract text content from any website
  • πŸ“ Markdown Conversion: Convert scraped HTML content to clean markdown format
  • πŸ—ΊοΈ Sitemap Generation: Create organized sitemaps based on all links found on the page
  • 🌐 User-Friendly Interface: Easy-to-use Gradio web interface
  • πŸ”— Link Organization: Separate internal and external links for better navigation
  • πŸ€– MCP Server: Expose scraping tools for AI assistants and LLMs

Installation

  1. Install Python dependencies:
pip install -r requirements.txt

Usage

Web Interface

  1. Run the web application:
python app.py
  2. Open your browser and navigate to http://localhost:7861

  3. Enter a URL in the input field and click "Scrape Website"

  4. View the results:

    • Status: Shows success/error messages
    • Scraped Content: Website content converted to markdown
    • Sitemap: Organized list of all links found on the page

MCP Server

  1. Run the MCP server:
python mcp_server.py
  2. The server will be available at http://localhost:7862

  3. MCP Endpoint: http://localhost:7862/gradio_api/mcp/sse

Available MCP Tools

  • scrape_content: Extract and format website content as markdown
  • generate_sitemap: Generate a sitemap of all links found on a webpage
  • analyze_website: Complete website analysis with both content and sitemap

MCP Client Configuration

To use with Claude Desktop or other MCP clients, add this to your configuration:

{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}

Dependencies

  • gradio[mcp]: Web interface framework with MCP support
  • requests: HTTP library for making web requests
  • beautifulsoup4: HTML parsing library
  • markdownify: HTML to markdown conversion
  • lxml: XML and HTML parser

Project Structure

web-scraper/
β”œβ”€β”€ app.py                 # Main web interface application
β”œβ”€β”€ mcp_server.py         # MCP server with exposed tools
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ README.md             # Project documentation
β”œβ”€β”€ .github/
β”‚   └── copilot-instructions.md
└── .vscode/
    └── tasks.json        # VS Code tasks

Feature Details

Web Scraping

  • Handles both HTTP and HTTPS URLs
  • Automatically adds protocol if missing
  • Removes unwanted elements (scripts, styles, navigation)
  • Focuses on main content areas
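
The element-stripping step can be sketched with BeautifulSoup; the tag list and the `extract_main_content` name here are illustrative, not necessarily what `app.py` uses:

```python
from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    """Parse HTML and drop elements that rarely carry useful content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts, styles, and navigation chrome before extraction.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    # Prefer a <main> or <article> region if one exists, else the whole body.
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator="\n", strip=True)

html = "<html><body><nav>menu</nav><main><h1>Hello</h1><p>World</p></main></body></html>"
print(extract_main_content(html))
```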

Markdown Conversion

  • Converts HTML to clean markdown format
  • Preserves heading structure
  • Removes empty links and excessive whitespace
  • Adds page title as main heading

Sitemap Generation

  • Extracts all links from the page
  • Converts relative URLs to absolute URLs
  • Organizes links by domain (internal vs external)
  • Limits display to prevent overwhelming output
  • Filters out unwanted links (anchors, javascript, etc.)
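
The resolve-and-organize steps can be sketched with the standard library; the filter prefixes and the `organize_links` name are illustrative assumptions:

```python
from urllib.parse import urljoin, urlparse

def organize_links(base_url: str, hrefs: list) -> dict:
    """Resolve links against the base URL and split internal vs. external."""
    base_domain = urlparse(base_url).netloc
    internal, external = [], []
    for href in hrefs:
        # Skip anchors, javascript: pseudo-links, and mailto: addresses.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        absolute = urljoin(base_url, href)  # relative -> absolute
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)
    return {"internal": internal, "external": external}

links = organize_links("https://example.com/docs/",
                       ["page.html", "#top", "https://python.org"])
```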

Example URLs to Try

  • https://httpbin.org/html - Simple test page
  • https://example.com - Basic example site
  • https://python.org - Python official website

Error Handling

The application includes comprehensive error handling for:

  • Invalid URLs
  • Network timeouts
  • HTTP errors
  • Content parsing issues
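
A fetch wrapper covering these cases might look like this; the `fetch` name and `(success, message)` return shape are assumptions, not the exact API of the `WebScraper` class:

```python
import requests

def fetch(url: str, timeout: float = 10.0):
    """Fetch a URL, returning (success, body_or_error_message)."""
    # Add a protocol if missing.
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()           # turn 4xx/5xx into HTTPError
        return True, resp.text
    except requests.exceptions.Timeout:
        return False, "Error: the request timed out"
    except requests.exceptions.HTTPError as exc:
        return False, f"HTTP error: {exc}"
    except requests.exceptions.RequestException as exc:
        # Catch-all for invalid URLs, connection failures, etc.
        return False, f"Request failed: {exc}"
```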

Customization

You can customize the scraper by modifying:

  • User-Agent string in the WebScraper class
  • Content extraction selectors
  • Markdown formatting rules
  • Link filtering criteria
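
For example, the User-Agent can be changed by updating the request headers; the session setup and User-Agent string below are a sketch and may not match the attribute names in `app.py`:

```python
import requests

# A shared session reuses connections and carries default headers.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; WebScraperBot/1.0)"
})
```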