---
title: Web Scraper
emoji: πŸš€
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

Features

  • πŸ•·οΈ Web Scraping: Extract text content from any website
  • πŸ“ Markdown Conversion: Convert scraped HTML content to clean markdown format
  • πŸ—ΊοΈ Sitemap Generation: Create organized sitemaps based on all links found on the page
  • 🌐 User-Friendly Interface: Easy-to-use Gradio web interface
  • πŸ”— Link Organization: Separate internal and external links for better navigation
  • πŸ€– MCP Server: Expose scraping tools for AI assistants and LLMs

Installation

  1. Install Python dependencies:
pip install -r requirements.txt

Usage

Web Interface

  1. Run the web application:
python app.py
  2. Open your browser and navigate to http://localhost:7861

  3. Enter a URL in the input field and click "Scrape Website"

  4. View the results:

    • Status: Shows success/error messages
    • Scraped Content: Website content converted to markdown
    • Sitemap: Organized list of all links found on the page

MCP Server

  1. Run the MCP server:
python mcp_server.py
  2. The server will be available at http://localhost:7862

  3. MCP Endpoint: http://localhost:7862/gradio_api/mcp/sse

Available MCP Tools

  • scrape_content: Extract and format website content as markdown
  • generate_sitemap: Generate a sitemap of all links found on a webpage
  • analyze_website: Complete website analysis with both content and sitemap

MCP Client Configuration

To use with Claude Desktop or other MCP clients, add this to your configuration:

{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}

Dependencies

  • gradio[mcp]: Web interface framework with MCP support
  • requests: HTTP library for making web requests
  • beautifulsoup4: HTML parsing library
  • markdownify: HTML to markdown conversion
  • lxml: XML and HTML parser

Project Structure

web-scraper/
β”œβ”€β”€ app.py                 # Main web interface application
β”œβ”€β”€ mcp_server.py         # MCP server with exposed tools
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ README.md             # Project documentation
β”œβ”€β”€ .github/
β”‚   └── copilot-instructions.md
└── .vscode/
    └── tasks.json        # VS Code tasks

Feature Details

Web Scraping

  • Handles both HTTP and HTTPS URLs
  • Automatically adds protocol if missing
  • Removes unwanted elements (scripts, styles, navigation)
  • Focuses on main content areas
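
The element-stripping step can be sketched with BeautifulSoup; the tag list and the `extract_main_content` name here are illustrative, not necessarily what `app.py` uses:

```python
from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    """Parse HTML and drop elements that rarely carry useful content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts, styles, and navigation chrome before extraction.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    # Prefer a <main> or <article> region if one exists, else the whole body.
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator="\n", strip=True)

html = "<html><body><nav>menu</nav><main><h1>Hello</h1><p>World</p></main></body></html>"
print(extract_main_content(html))
```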

Markdown Conversion

  • Converts HTML to clean markdown format
  • Preserves heading structure
  • Removes empty links and excessive whitespace
  • Adds page title as main heading

Sitemap Generation

  • Extracts all links from the page
  • Converts relative URLs to absolute URLs
  • Organizes links by domain (internal vs external)
  • Limits display to prevent overwhelming output
  • Filters out unwanted links (anchors, javascript, etc.)
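
The resolve-and-organize steps can be sketched with the standard library; the filter prefixes and the `organize_links` name are illustrative assumptions:

```python
from urllib.parse import urljoin, urlparse

def organize_links(base_url: str, hrefs: list) -> dict:
    """Resolve links against the base URL and split internal vs. external."""
    base_domain = urlparse(base_url).netloc
    internal, external = [], []
    for href in hrefs:
        # Skip anchors, javascript: pseudo-links, and mailto: addresses.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        absolute = urljoin(base_url, href)  # relative -> absolute
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)
    return {"internal": internal, "external": external}

links = organize_links("https://example.com/docs/",
                       ["page.html", "#top", "https://python.org"])
```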

Example URLs to Try

  • https://httpbin.org/html - Simple test page
  • https://example.com - Basic example site
  • https://python.org - Python official website

Error Handling

The application includes comprehensive error handling for:

  • Invalid URLs
  • Network timeouts
  • HTTP errors
  • Content parsing issues
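
A fetch wrapper covering these cases might look like this; the `fetch` name and `(success, message)` return shape are assumptions, not the exact API of the `WebScraper` class:

```python
import requests

def fetch(url: str, timeout: float = 10.0):
    """Fetch a URL, returning (success, body_or_error_message)."""
    # Add a protocol if missing.
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()           # turn 4xx/5xx into HTTPError
        return True, resp.text
    except requests.exceptions.Timeout:
        return False, "Error: the request timed out"
    except requests.exceptions.HTTPError as exc:
        return False, f"HTTP error: {exc}"
    except requests.exceptions.RequestException as exc:
        # Catch-all for invalid URLs, connection failures, etc.
        return False, f"Request failed: {exc}"
```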

Customization

You can customize the scraper by modifying:

  • User-Agent string in the WebScraper class
  • Content extraction selectors
  • Markdown formatting rules
  • Link filtering criteria
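
For example, the User-Agent can be changed by updating the request headers; the session setup and User-Agent string below are a sketch and may not match the attribute names in `app.py`:

```python
import requests

# A shared session reuses connections and carries default headers.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; WebScraperBot/1.0)"
})
```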