Update README.md
README.md CHANGED
@@ -10,3 +10,147 @@ pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- 🗺️ **Sitemap Generation**: Create organized sitemaps based on all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools for AI assistants and LLMs

## Installation

1. Install Python dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`

3. Enter a URL in the input field and click "Scrape Website"

4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page

### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`

3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
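
With `gradio[mcp]`, tools like these are typically plain Python functions whose docstrings become the tool descriptions. A minimal sketch of how one tool might be wired up (an illustration under assumed names, not this project's exact code):

```python
# Hypothetical sketch: expose one scraping function as an MCP tool.
import gradio as gr
import requests
from markdownify import markdownify as md

def scrape_content(url: str) -> str:
    """Extract and format website content as markdown."""
    html = requests.get(url, timeout=10).text
    return md(html)

demo = gr.Interface(fn=scrape_content, inputs="text", outputs="text")

# mcp_server=True also serves the function as an MCP tool at
# /gradio_api/mcp/sse, matching the endpoint listed above.
demo.launch(server_port=7862, mcp_server=True)
```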

#### MCP Client Configuration

To use with Claude Desktop or other MCP clients, add this to your configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```

## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML to markdown conversion
- `lxml`: XML and HTML parser
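
A `requirements.txt` matching this list would look roughly like the following (unpinned here; pin versions for reproducible installs):

```text
gradio[mcp]
requests
beautifulsoup4
markdownify
lxml
```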

## Project Structure

```
web-scraper/
├── app.py                   # Main web interface application
├── mcp_server.py            # MCP server with exposed tools
├── requirements.txt         # Python dependencies
├── README.md                # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json           # VS Code tasks
```

## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds the protocol if it is missing
- Removes unwanted elements such as scripts, styles, and navigation (see the sketch below)
- Focuses on main content areas
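
This kind of cleanup can be sketched with BeautifulSoup (illustrative only; the helper name and tag list are assumptions, not the project's exact code):

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Strip non-content elements so only readable markup remains."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove the element and its children in place
    # Prefer an explicit main-content container when the page has one.
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return str(main)
```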

### Markdown Conversion

- Converts HTML to clean markdown format
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading
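
A minimal conversion step along these lines, using `markdownify` from the dependency list (the function name and exact cleanup rules are assumptions):

```python
import re
from markdownify import markdownify as md

def to_markdown(title: str, html: str) -> str:
    """Convert cleaned HTML to markdown, prefixed with the page title."""
    body = md(html, heading_style="ATX")       # "#"-style headings
    body = re.sub(r"\[\]\([^)]*\)", "", body)  # drop empty links like [](...)
    body = re.sub(r"\n{3,}", "\n\n", body)     # squeeze excessive blank lines
    return f"# {title}\n\n{body.strip()}"
```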

### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits display to prevent overwhelming output
- Filters out unwanted links (anchors, javascript, etc.)
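
The link handling reduces to standard-library URL utilities plus BeautifulSoup; a sketch under assumed names and limits:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def collect_links(base_url: str, html: str, limit: int = 50):
    """Return (internal, external) absolute links found in `html`."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for anchor in BeautifulSoup(html, "lxml").find_all("a", href=True):
        href = anchor["href"].strip()
        if href.startswith(("#", "javascript:", "mailto:")):
            continue  # skip in-page anchors and non-navigational links
        absolute = urljoin(base_url, href)  # resolve relative URLs
        bucket = internal if urlparse(absolute).netloc == base_host else external
        bucket.append(absolute)
    return internal[:limit], external[:limit]  # cap output size
```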

## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - Python official website

## Error Handling

The application includes error handling for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
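
With `requests`, that handling typically takes this shape (illustrative; the project's actual messages and return types may differ):

```python
import requests

def fetch(url: str) -> tuple[bool, str]:
    """Fetch a URL, returning (ok, body_or_error_message)."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # add a protocol if missing
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # surface 4xx/5xx as exceptions
        return True, response.text
    except requests.exceptions.Timeout:
        return False, "Error: request timed out"
    except requests.exceptions.RequestException as exc:
        return False, f"Error: {exc}"  # invalid URL, HTTP error, connection failure
```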

## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class (see the sketch below)
- Content extraction selectors
- Markdown formatting rules
- Link filtering criteria
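
For example, the User-Agent could be set on a shared session (the class and attribute names here are assumptions based on the list above, not the project's actual code):

```python
import requests

class WebScraper:
    """Hypothetical sketch; the real class lives in app.py / mcp_server.py."""

    def __init__(self, user_agent: str = "Mozilla/5.0 (compatible; WebScraperBot/1.0)"):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = user_agent  # customize here

    def get(self, url: str) -> str:
        return self.session.get(url, timeout=10).text
```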