# WebCrawler/src/webui/components/documentation_tab.py
# Commit b1f90a5 (Carlos Gonzalez): "Add application file"

import gradio as gr
from gradio.components import Component
from src.webui.webui_manager import WebuiManager


def create_documentation_tab(webui_manager: WebuiManager):
    """
    Creates a documentation tab with detailed project analysis.
    """
    tab_components = {}

    with gr.Group():
        gr.Markdown(
	"""
	# Browser Use WebUI Documentation

	This documentation provides a comprehensive overview of the Browser Use WebUI project.
	""",
	elem_classes=["tab-header-text"],
	)

    with gr.Tabs() as doc_tabs:
        with gr.TabItem("Project Overview"):
            gr.Markdown(
	"""
	## Project Overview

	Browser Use WebUI is a Gradio-based interface for controlling and interacting with web browsers using AI assistance.
	It provides a user-friendly way to automate browser tasks and research using large language models.

	### Key Features

	- AI-Controlled Browser: Control Chrome or other browsers with AI assistance
	- OpenAI LLM Support: Compatible with OpenAI models including GPT-4 and GPT-3.5
	- Custom Browser Support: Use your own browser with persistent sessions
	- Deep Research Agent: Specialized agent for conducting in-depth web research

	### Recent Updates

	As of the latest version, the system has been streamlined to support only OpenAI as the LLM provider. This change:

	- Simplifies the codebase and reduces dependencies
	- Focuses development efforts on optimizing the OpenAI integration
	- Ensures consistent behavior across all agent interactions
	- Improves reliability and reduces potential configuration issues

	If you were using other LLM providers with previous versions, please update your configurations to use OpenAI.
	"""
            )

        with gr.TabItem("Submit Task Flow"):
            gr.Markdown(
	"""
	## BrowserUse Agent: Submit Task Flow Documentation

	This documentation provides a detailed overview of what happens when you click the "Submit Task" button in the BrowserUse agent tab.

	### Files Involved

	- browser_use_agent_tab.py: Creates the UI for the BrowserUse agent tab and handles the submit task workflow.
	- webui_manager.py: Maintains the state of the web UI and stores components and agent instances.
	- browser_use_agent.py: Implements the core BrowserUse agent functionality for running tasks.
	- custom_controller.py: Handles the execution of browser actions requested by the agent.
	- custom_browser.py: Custom browser implementation for the BrowserUse agent.
	- custom_context.py: Manages browser contexts for the BrowserUse agent.

	### Step-by-Step Process

	#### Step 1: User Submits a Task

	The process begins when a user enters a task in the text input field and clicks the "Submit Task" button, triggering the `handle_submit` function.

	#### Step 2: Task Initialization

	The `run_agent_task` function retrieves the user's task from UI components, updates the chat history, and initializes UI components for the task execution.

	#### Step 3: Browser and Context Setup

	The system initializes or reuses an existing browser instance and browser context, which provide the environment for the agent to interact with web pages.

	#### Step 4: Agent Initialization

	The system creates a new BrowserUseAgent instance or updates an existing one with the new task. It also registers callbacks for step updates and task completion.

	#### Step 5: Task Execution

	The system executes the agent's `run` method in a new task and waits for its completion, updating the UI with progress.

	#### Step 6: BrowserUseAgent Run Method

	The agent's `run` method is the core execution logic that performs the task through a series of steps, each interacting with the browser to accomplish the given task.

	#### Step 7: Step Processing Callback

	The `_handle_new_step` callback is called after each agent step, updating the UI with the latest screenshot and agent output.

	#### Step 8: Task Completion Callback

	The `_handle_done` callback is triggered when the agent completes the task (success or failure), updating the UI with the final results and metrics.

	### System Flow Diagram

	```
	User submits task → Task initialization → Browser setup → Agent initialization
	                                                                    ↓
	Task completion ← Step processing callback ← Agent run method ← Task execution
	```
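
The callback wiring described in Steps 4 to 8 can be sketched roughly as follows. This is a minimal, self-contained illustration and not the project's code: `ToyAgent`, `on_step`, `on_done`, and `submit_task` are hypothetical stand-ins for `BrowserUseAgent`, `_handle_new_step`, `_handle_done`, and the submit handler.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical callback types; the real agent passes richer state objects.
StepCallback = Callable[[int, str], Awaitable[None]]
DoneCallback = Callable[[list], Awaitable[None]]


class ToyAgent:
    # Stand-in for BrowserUseAgent: runs a fixed number of fake steps.
    def __init__(self, task: str, on_step: StepCallback, on_done: DoneCallback):
        self.task = task
        self.on_step = on_step
        self.on_done = on_done

    async def run(self, max_steps: int = 3) -> list:
        history = []
        for step in range(1, max_steps + 1):
            output = f"step {step} of {self.task!r}"
            history.append(output)
            await self.on_step(step, output)  # UI update after each step
        await self.on_done(history)  # final UI update
        return history


async def submit_task(task: str) -> list:
    ui_log = []  # stands in for the chat history shown in the UI

    async def on_step(step: int, output: str) -> None:
        ui_log.append(f"[step {step}] {output}")

    async def on_done(history: list) -> None:
        ui_log.append(f"[done] {len(history)} steps")

    agent = ToyAgent(task, on_step, on_done)
    # Run the agent in its own asyncio task and wait for it to finish.
    await asyncio.create_task(agent.run())
    return ui_log
```

Calling `asyncio.run(submit_task("demo"))` returns one log entry per step followed by a final completion entry, mirroring the order in which the UI receives updates.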
	"""
            )

        with gr.TabItem("Architecture"):
            gr.Markdown(
	"""
	## System Architecture

	The project follows a modular architecture with clear separation of concerns:

	### Core Components

	1. WebUI Module (`src/webui/`):
	   - Interface management using Gradio
	   - Tab components for different functionalities
	   - User input/output handling

	2. Browser Module (`src/browser/`):
	   - Custom browser implementation extending the browser-use library
	   - Browser context management
	   - Screenshot and session handling

	3. Agent Module (`src/agent/`):
	   - Browser Use Agent: General-purpose browser automation
	   - Deep Research Agent: Specialized for research tasks
	   - Agent state and history management

	4. Controller Module (`src/controller/`):
	   - Action registry for browser control
	   - MCP client integration
	   - Custom action implementations

	5. Utils Module (`src/utils/`):
	   - OpenAI LLM integration
	   - Configuration helpers
	   - MCP client setup

	### Data Flow

	1. User inputs task via WebUI
	2. WebUI Manager initializes components
	3. Agent receives task and configures OpenAI LLM
	4. Browser is launched or connected
	5. Agent iteratively performs actions via controller
	6. Results display in WebUI with screenshots
	"""
            )

        with gr.TabItem("Browser Control"):
            gr.Markdown(
	"""
	## Browser Control System

	The browser control functionality is built on the browser-use library, with custom extensions:

	### Browser Features

	- Custom Browser Integration: Connect to existing browser instances
	- Browser Context Management: Create and manage browser contexts
	- Session Persistence: Keep browser open between tasks
	- Screenshot Capture: Take and display screenshots of browser state
	- DOM Interaction: Interact with web page elements
	- Action Registry: Comprehensive set of browser actions

	### Actions Supported

	- Navigate to URLs
	- Click elements
	- Input text
	- Extract content
	- Scroll pages
	- Search Google
	- Wait for page load
	- Handle alerts and dialogs
	- Upload files
	- And more through the registry system
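
A registry of this kind can be pictured as a mapping from action names to functions, filled in by a decorator. The sketch below is a generic illustration under that assumption, not browser-use's implementation; `ActionRegistry`, `go_to_url`, and `scroll_down` are hypothetical names.

```python
from typing import Callable


class ActionRegistry:
    # Maps action names to plain functions; a toy stand-in for a real registry.
    def __init__(self) -> None:
        self._actions: dict = {}

    def action(self, description: str) -> Callable:
        # Decorator that registers the function under its own name.
        def decorator(func: Callable) -> Callable:
            func.description = description
            self._actions[func.__name__] = func
            return func

        return decorator

    def execute(self, name: str, **params) -> str:
        if name not in self._actions:
            raise KeyError(f"Unknown action: {name}")
        return self._actions[name](**params)

    def names(self) -> list:
        return sorted(self._actions)


registry = ActionRegistry()


@registry.action("Navigate to a URL")
def go_to_url(url: str) -> str:
    return f"navigated to {url}"


@registry.action("Scroll the page down by a number of pixels")
def scroll_down(pixels: int = 500) -> str:
    return f"scrolled {pixels}px"
```

An agent that emits `{"name": "go_to_url", "params": {"url": "..."}}` could then be served by `registry.execute(name, **params)`, and new actions only need a decorated function.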
	"""
            )

        with gr.TabItem("Agent System"):
            gr.Markdown(
	"""
	## Agent System

	The application provides two main agent types:

	### Browser Use Agent

	Extends the base Agent class from browser-use library to provide:

	- Task execution with dynamic tool selection
	- LLM integration with OpenAI models
	- Browser control through registered actions
	- Error handling and recovery
	- Execution history tracking

	### Deep Research Agent

	Specialized agent using LangGraph for:

	- Research planning through LLM
	- Web search and content extraction
	- Information synthesis
	- Structured research report generation
	- Multi-browser parallel processing

	### Agent Components

	- State Management: Track agent state during execution
	- History Recording: Record steps and results
	- Output Formatting: Format results for display
	- Tool Calling: Different methods based on LLM capabilities
	"""
            )

        with gr.TabItem("LLM Integration"):
            gr.Markdown(
	"""
	## LLM Integration

	The system supports OpenAI as its only LLM provider:

	### Supported Provider

	- OpenAI: GPT-4o, GPT-4, GPT-3.5

	### Integration Features

	- Vision Support: Vision capabilities with compatible models
	- Temperature Control: Adjust randomness in model outputs
	- Context Length Management: Handle different model context limits
	- API Key Management: Secure handling of API credentials
	- Tool Calling Methods: Different methods based on model capabilities (function_calling, json_mode, raw)
	"""
            )

        with gr.TabItem("Web UI Components"):
            gr.Markdown(
	"""
	## Web UI Components

	The interface is built with Gradio and organized into tabs:

	### Main Tabs

	1. Agent Settings: Configure OpenAI models and parameters
	2. Browser Settings: Set up browser preferences and options
	3. Run Agent: Execute browser tasks and view results
	4. Agent Marketplace: Access specialized agents like Deep Research
	5. Documentation: Comprehensive project documentation (you are here)
	6. Load & Save Config: Save and load UI configurations

	### Interface Features

	- Chatbot Interface: View agent interactions and results
	- Task Input: Submit tasks to the agent
	- Control Buttons: Start, stop, pause, and clear agent execution
	- Configuration Forms: Set up OpenAI and browser parameters
	- Results Display: View agent output including screenshots
	"""
            )

        with gr.TabItem("API & Libraries"):
            gr.Markdown(
	"""
	## Core Libraries & Dependencies

	The project relies on several key libraries:

	### Primary Dependencies

	- browser-use: Core browser automation library
	- gradio: Web UI framework
	- langchain: LLM integration framework
	- langgraph: Graph-based workflows for agents
	- playwright: Browser automation and control
	- pyperclip: Clipboard interaction
	- dotenv: Environment variable management

	### API Integration

	- LLM API: OpenAI (the only supported provider)
	- MCP (Model Context Protocol): Tool integration protocol
	- MainContentExtractor: Web content extraction

	### Browser APIs

	- CDP (Chrome DevTools Protocol): Browser communication
	- WSS: WebSocket connections for browser control
	"""
            )

        with gr.TabItem("File Structure"):
            gr.Markdown(
	"""
	## Project File Structure

	```
	web-ui/
	├── src/
	│   ├── agent/
	│   │   ├── browser_use/
	│   │   │   └── browser_use_agent.py
	│   │   └── deep_research/
	│   │       └── deep_research_agent.py
	│   ├── browser/
	│   │   ├── custom_browser.py
	│   │   └── custom_context.py
	│   ├── controller/
	│   │   └── custom_controller.py
	│   ├── utils/
	│   │   ├── config.py
	│   │   ├── llm_provider.py
	│   │   └── mcp_client.py
	│   ├── webui/
	│   │   ├── components/
	│   │   │   ├── agent_settings_tab.py
	│   │   │   ├── browser_settings_tab.py
	│   │   │   ├── browser_use_agent_tab.py
	│   │   │   ├── deep_research_agent_tab.py
	│   │   │   ├── documentation_tab.py
	│   │   │   └── load_save_config_tab.py
	│   │   ├── interface.py
	│   │   └── webui_manager.py
	│   └── __init__.py
	├── assets/
	├── tmp/
	├── tests/
	├── .venv/
	├── webui.py
	├── Dockerfile
	├── docker-compose.yml
	├── requirements.txt
	├── setup.py
	└── README.md
	```
	"""
            )

        with gr.TabItem("Setup & Usage"):
            gr.Markdown(
	"""
	## Setup & Usage Guide

	### Installation

	#### Local Installation

	1. Clone the repository
	```bash
	git clone https://github.com/browser-use/web-ui.git
	cd web-ui
	```

	2. Set up Python environment
	```bash
	uv venv --python 3.11
	source .venv/bin/activate # Linux/Mac
	.venv\\Scripts\\activate # Windows
	```

	3. Install dependencies
	```bash
	uv pip install -r requirements.txt
	playwright install --with-deps
	```

	4. Configure environment
	```bash
	cp .env.example .env
	# Edit .env to add your API keys
	```

	5. Run the application
	```bash
	python webui.py --ip 127.0.0.1 --port 7788
	```

	#### Docker Installation

	```bash
	docker compose up --build
	```

	### Usage Examples

	1. Simple Web Search
	   - Configure LLM in Agent Settings
	   - Configure browser in Browser Settings
	   - In Run Agent tab, enter: "Search for the latest news about AI"
	   - Click Submit Task

	2. Deep Research
	   - Configure LLM in Agent Settings
	   - Go to Agent Marketplace > Deep Research
	   - Enter research topic: "Advances in renewable energy in 2023"
	   - Click Run

	3. Using Custom Browser
	   - In Browser Settings, check "Use Own Browser"
	   - Configure paths to browser and user data
	   - Submit tasks as normal
	"""
            )

        with gr.TabItem("Source Code Analysis"):
            gr.Markdown(
	"""
	## Detailed Source Code Analysis

	This section provides a deep dive into the code structure and implementation details of key components.

	### WebUI Manager Class

	The `WebuiManager` class in `src/webui/webui_manager.py` serves as the central component managing UI elements and application state:

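	```python
	import os

	from gradio.components import Component


	class WebuiManager:
	    def __init__(self, settings_save_dir: str = "./tmp/webui_settings"):
	        # Two-way mapping between component IDs and Gradio components
	        self.id_to_component: dict[str, Component] = {}
	        self.component_to_id: dict[Component, str] = {}
	        # Directory where UI settings are saved as JSON
	        self.settings_save_dir = settings_save_dir
	        os.makedirs(self.settings_save_dir, exist_ok=True)
	```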

	Key functions:
	- `add_components()`: Registers UI components with unique IDs
	- `get_component_by_id()`: Retrieves components using their ID
	- `save_config()`: Serializes UI settings to JSON
	- `load_config()`: Loads settings from JSON
	- `init_browser_use_agent()`: Creates browser agent instances
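
The save/load round trip can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the actual `WebuiManager` code: `ToySettingsStore` and `set_value` are hypothetical, and plain values are stored in place of live Gradio components. Only the `tab.component` ID scheme and the JSON format come from the description above.

```python
import json
import os
import tempfile


class ToySettingsStore:
    # Hypothetical stand-in that keeps plain values keyed by "tab.component" IDs.
    def __init__(self, settings_save_dir: str):
        self.settings_save_dir = settings_save_dir
        os.makedirs(self.settings_save_dir, exist_ok=True)
        self.values: dict = {}

    def set_value(self, tab_name: str, comp_name: str, value) -> None:
        self.values[f"{tab_name}.{comp_name}"] = value

    def save_config(self, name: str) -> str:
        # Serialize all current values to a JSON file and return its path.
        path = os.path.join(self.settings_save_dir, f"{name}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.values, fh, indent=2)
        return path

    def load_config(self, path: str) -> None:
        # Merge previously saved values back into the store.
        with open(path, encoding="utf-8") as fh:
            self.values.update(json.load(fh))


# Usage: round-trip one setting through a temporary directory.
save_dir = tempfile.mkdtemp()
store = ToySettingsStore(save_dir)
store.set_value("agent_settings", "temperature", 0.6)
saved_path = store.save_config("demo")

restored = ToySettingsStore(save_dir)
restored.load_config(saved_path)
```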

	### Custom Browser Implementation

	The `CustomBrowser` class in `src/browser/custom_browser.py` extends the base `Browser` class from the browser-use library:

	```python
	class CustomBrowser(Browser):
	    async def new_context(self, config: BrowserContextConfig | None = None) -> CustomBrowserContext:
	        # Context-level settings override browser-level settings on key conflicts
	        browser_config = self.config.model_dump() if self.config else {}
	        context_config = config.model_dump() if config else {}
	        merged_config = {**browser_config, **context_config}
	        return CustomBrowserContext(config=BrowserContextConfig(**merged_config), browser=self)
	```

	Key features:
	- Extends the browser-use Browser class
	- Creates custom browser contexts
	- Configures Chrome arguments for different environments
	- Handles screen resolution and window dimensions
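
When a browser-level and a context-level configuration are merged into one dictionary, the order of unpacking decides which value wins. The sketch below shows that behaviour in isolation; `merge_configs` and the setting names are hypothetical and chosen only for illustration.

```python
from typing import Any, Optional


def merge_configs(
    browser_config: Optional[dict],
    context_config: Optional[dict],
) -> dict:
    # Later unpacking wins, so context-level values override browser-level ones.
    return {**(browser_config or {}), **(context_config or {})}


# Hypothetical settings, chosen only to show the override behaviour.
browser_level: dict[str, Any] = {"headless": False, "window_width": 1280}
context_level: dict[str, Any] = {"window_width": 1920, "window_height": 1080}
merged = merge_configs(browser_level, context_level)
```

Here `merged["window_width"]` is 1920 because the context-level value is unpacked last, while `headless` is carried over unchanged from the browser level.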

	### Browser Use Agent

	The `BrowserUseAgent` class in `src/agent/browser_use/browser_use_agent.py` extends the Agent class:

	```python
	class BrowserUseAgent(Agent):
	    def _set_tool_calling_method(self) -> ToolCallingMethod | None:
	        tool_calling_method = self.settings.tool_calling_method
	        if tool_calling_method == 'auto':
	            if is_model_without_tool_support(self.model_name):
	                return 'raw'
	            elif self.chat_model_library == 'ChatGoogleGenerativeAI':
	                return None
	            elif self.chat_model_library == 'ChatOpenAI':
	                return 'function_calling'
	            # Additional models...
	```

	Key capabilities:
	- Automatically selects tool calling method based on LLM
	- Handles agent execution with configurable steps
	- Provides pause/resume functionality
	- Manages execution history and state
	- Implements error handling and recovery

	### Deep Research Agent

	The `DeepResearchAgent` class in `src/agent/deep_research/deep_research_agent.py` implements a specialized research agent:

	```python
	from typing import Any, Dict, Optional


	class DeepResearchAgent:
	    def __init__(
	        self,
	        llm: Any,
	        browser_config: Dict[str, Any],
	        mcp_server_config: Optional[Dict[str, Any]] = None,
	    ):
	        # Initialize agent with LLM and browser config
	        ...
	```

	Key components:
	- Uses LangGraph for structured research workflows
	- Implements planning, research, and synthesis nodes
	- Manages parallel browser instances for efficiency
	- Generates structured research reports
	- Handles task state persistence

	### Custom Controller

	The `CustomController` class in `src/controller/custom_controller.py` extends the Controller class:

	```python
	class CustomController(Controller):
	    def __init__(self, exclude_actions: list[str] = [],
	                 output_model: Optional[Type[BaseModel]] = None,
	                 ask_assistant_callback: Optional[...] = None):
	        super().__init__(exclude_actions=exclude_actions, output_model=output_model)
	        self._register_custom_actions()
	        self.ask_assistant_callback = ask_assistant_callback
	        self.mcp_client = None
	        self.mcp_server_config = None
	```

	Key features:
	- Registers custom browser actions
	- Integrates with MCP (Model Context Protocol)
	- Provides file upload capabilities
	- Implements human assistance features
	- Handles action execution with error management

	### UI Components

	The UI is built using Gradio components:

	```python
	def create_ui(theme_name="Ocean"):
	    with gr.Blocks(title="Browser Use WebUI", theme=theme_map[theme_name], css=css, js=js_func) as demo:
	        with gr.Tabs() as tabs:
	            with gr.TabItem("⚙️ Agent Settings"):
	                create_agent_settings_tab(ui_manager)
	            # Additional tabs...
	```

	Key UI features:
	- Modular tab-based interface
	- Customizable themes
	- Responsive layout
	- Dark mode support
	- Configuration persistence
	"""
            )

        with gr.TabItem("Technical Challenges"):
            gr.Markdown(
	"""
	## Technical Challenges & Solutions

	This section covers key technical challenges faced during development and the solutions implemented.

	### Browser Integration Challenges

	Challenge: Connecting to existing browser instances with proper user profiles.

	Solution: Custom implementation using CDP (Chrome DevTools Protocol) and WebSocket connections:

	```python
	# Implementation in custom_browser.py
	chrome_args = {
	    f'--remote-debugging-port={self.config.chrome_remote_debugging_port}',
	    *(CHROME_DOCKER_ARGS if IN_DOCKER else []),
	    *(CHROME_HEADLESS_ARGS if self.config.headless else []),
	    # Additional args...
	}

	# Check existing port conflicts
	with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
	    if s.connect_ex(('localhost', self.config.chrome_remote_debugging_port)) == 0:
	        chrome_args.remove(f'--remote-debugging-port={self.config.chrome_remote_debugging_port}')
	```
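
The port check relies on `socket.connect_ex`, which returns 0 when something is already listening on the target port. A standalone version of that check might look like the sketch below; `is_port_in_use` is a hypothetical helper name, and the timeout is an added assumption.

```python
import socket


def is_port_in_use(port: int, host: str = "localhost") -> bool:
    # connect_ex returns 0 on a successful connection, an error code otherwise.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)  # assumption: avoid hanging on unresponsive hosts
        return s.connect_ex((host, port)) == 0
```

For example, `is_port_in_use(9222)` could be used to decide whether a Chrome instance is already exposing a debugging port before adding the `--remote-debugging-port` flag; 9222 is only a commonly used value, not one taken from this project.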

	### OpenAI LLM Integration

	Challenge: Configuring and optimizing OpenAI models for browser automation.

	Solution: Provider abstraction and method detection:

	```python
	# In browser_use_agent.py
	def _set_tool_calling_method(self) -> ToolCallingMethod | None:
	    tool_calling_method = self.settings.tool_calling_method
	    if tool_calling_method == 'auto':
	        if is_model_without_tool_support(self.model_name):
	            return 'raw'
	        else:
	            return 'function_calling'
	```

	### Execution State Management

	Challenge: Maintaining agent state across steps and allowing pause/resume.

	Solution: Custom execution loop with state management:

	```python
	# In browser_use_agent.py
	async def run(self, max_steps: int = 100, on_step_start: AgentHookFunc | None = None,
	              on_step_end: AgentHookFunc | None = None) -> AgentHistoryList:

	    # Execution loop with state management
	    for step in range(max_steps):
	        # Check pause state
	        if self.state.paused:
	            signal_handler.wait_for_resume()
	            signal_handler.reset()

	        # Check for stop
	        if self.state.stopped:
	            logger.info('Agent stopped')
	            break

	        # Execute step with callbacks
	        if on_step_start is not None:
	            await on_step_start(self)

	        step_info = AgentStepInfo(step_number=step, max_steps=max_steps)
	        await self.step(step_info)

	        if on_step_end is not None:
	            await on_step_end(self)
	```
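
The same pause, resume, and stop behaviour can be reproduced in isolation with an `asyncio.Event`. This is a different mechanism from the signal handler used in the excerpt above and is shown only as a minimal sketch; `PausableLoop` and `demo` are hypothetical names.

```python
import asyncio


class PausableLoop:
    # Toy step loop that can be paused, resumed, and stopped from outside.
    def __init__(self) -> None:
        self._resume = asyncio.Event()
        self._resume.set()  # start in the running state
        self.stopped = False
        self.completed_steps: list = []

    def pause(self) -> None:
        self._resume.clear()

    def resume(self) -> None:
        self._resume.set()

    def stop(self) -> None:
        self.stopped = True
        self._resume.set()  # wake the loop so it can observe the stop flag

    async def run(self, max_steps: int = 100) -> list:
        for step in range(max_steps):
            await self._resume.wait()  # blocks here while paused
            if self.stopped:
                break
            self.completed_steps.append(step)
            await asyncio.sleep(0)  # yield control, standing in for real work
        return self.completed_steps


async def demo() -> tuple:
    loop = PausableLoop()
    loop.pause()
    runner = asyncio.create_task(loop.run(max_steps=5))
    await asyncio.sleep(0.01)
    snapshot = list(loop.completed_steps)  # nothing has run while paused
    loop.resume()
    result = await runner
    return snapshot, result
```

Because the loop awaits the event instead of polling a flag, a paused agent consumes no CPU and resumes as soon as `resume()` is called.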

	### Multi-Browser Research Orchestration

	Challenge: Managing multiple parallel browser instances for research tasks.

	Solution: LangGraph-based workflow with parallel task execution:

	```python
	# In deep_research_agent.py
	async def _run_browser_search_tool(
	    queries: List[str],
	    task_id: str,
	    llm: Any,
	    browser_config: Dict[str, Any],
	    stop_event: threading.Event,
	    max_parallel_browsers: int = 1,
	) -> List[Dict[str, Any]]:

	    # Execute tasks in parallel with limit
	    tasks = []
	    results = []

	    semaphore = asyncio.Semaphore(max_parallel_browsers)

	    async def task_wrapper(query):
	        async with semaphore:
	            return await run_single_browser_task(
	                query, task_id, llm, browser_config, stop_event
	            )

	    # Create and gather tasks
	    for query in queries:
	        tasks.append(asyncio.create_task(task_wrapper(query)))

	    results = await asyncio.gather(*tasks)
	    return results
	```
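
To see the effect of the semaphore limit, the pattern can be run without any browser at all. In the sketch below, `run_bounded` is a hypothetical function and `asyncio.sleep` stands in for a browser search; it also records the highest number of workers that were active at the same time.

```python
import asyncio


async def run_bounded(queries: list, max_parallel: int = 2) -> tuple:
    semaphore = asyncio.Semaphore(max_parallel)
    active = 0  # workers currently inside the semaphore
    peak = 0  # highest value of `active` observed

    async def worker(query: str) -> str:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for a browser search
            active -= 1
            return f"result for {query}"

    # gather preserves the order of the input queries in its results.
    results = await asyncio.gather(*(worker(q) for q in queries))
    return list(results), peak
```

With five queries and `max_parallel=2`, all five results come back in input order while no more than two workers ever run concurrently.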

	### UI State Synchronization

	Challenge: Keeping UI state synchronized with backend operations.

	Solution: Component tracking and event-based updates:

	```python
	# In webui_manager.py
	def add_components(self, tab_name: str, components_dict: dict[str, "Component"]) -> None:
	    for comp_name, component in components_dict.items():
	        comp_id = f"{tab_name}.{comp_name}"
	        self.id_to_component[comp_id] = component
	        self.component_to_id[component] = comp_id

	# In browser_use_agent_tab.py
	async def handle_submit(webui_manager: WebuiManager, components: Dict[gr.components.Component, Any]):
	    # Get component values and update UI state
	    task_input = _get_config_value(webui_manager, components, "user_input", "")
	    webui_manager.bu_chat_history.append({"role": "user", "content": task_input})
	    # Additional UI updates...
	```

	### Docker Environment Challenges

	Challenge: Running browser automation in Docker containers.

	Solution: Special Docker configuration for browser support:

	```python
	# In custom_browser.py
	CHROME_DOCKER_ARGS = [
	    "--no-sandbox",
	    "--disable-dev-shm-usage",
	    # Additional docker-specific args...
	]
	```

	```yaml
	# In docker-compose.yml
	services:
	  web-ui:
	    build:
	      context: .
	    volumes:
	      - ./tmp:/app/tmp
	    ports:
	      - "7788:7788"
	      - "6080:6080" # VNC for browser viewing
	    environment:
	      - DISPLAY=:1
	      # Additional environment variables...
	```
	"""
            )

    tab_components.update(dict(
        doc_tabs=doc_tabs,
    ))

    webui_manager.add_components("documentation", tab_components)

    return tab_components