import gradio as gr
from gradio.components import Component

from src.webui.webui_manager import WebuiManager


def create_documentation_tab(webui_manager: WebuiManager):
    """
    Creates a documentation tab with detailed project analysis.
    """
    tab_components = {}

    with gr.Group():
        gr.Markdown(
            """
# Browser Use WebUI Documentation
This documentation provides a comprehensive overview of the Browser Use WebUI project.
            """,
            elem_classes=["tab-header-text"],
        )

    with gr.Tabs() as doc_tabs:
        with gr.TabItem("Project Overview"):
            gr.Markdown(
                """
## Project Overview
Browser Use WebUI is a Gradio-based interface for controlling and interacting with web browsers using AI assistance.
It provides a user-friendly way to automate browser tasks and research using large language models.

### Key Features
- **AI-Controlled Browser**: Control Chrome or other browsers with AI assistance
- **OpenAI LLM Support**: Compatible with OpenAI models including GPT-4 and GPT-3.5
- **Custom Browser Support**: Use your own browser with persistent sessions
- **Deep Research Agent**: Specialized agent for conducting in-depth web research

### Recent Updates
As of the latest version, the system has been streamlined to support only OpenAI as the LLM provider. This change:
- Simplifies the codebase and reduces dependencies
- Focuses development efforts on optimizing the OpenAI integration
- Ensures consistent behavior across all agent interactions
- Improves reliability and reduces potential configuration issues

If you were using other LLM providers with previous versions, please update your configurations to use OpenAI.
                """
            )

        with gr.TabItem("Submit Task Flow"):
            gr.Markdown(
                """
## BrowserUse Agent: Submit Task Flow Documentation
This documentation provides a detailed overview of what happens when you click the "Submit Task" button in the BrowserUse agent tab.

### Files Involved
- **browser_use_agent_tab.py**: Creates the UI for the BrowserUse agent tab and handles the submit task workflow.
- **webui_manager.py**: Maintains the state of the web UI and stores components and agent instances.
- **browser_use_agent.py**: Implements the core BrowserUse agent functionality for running tasks.
- **custom_controller.py**: Handles the execution of browser actions requested by the agent.
- **custom_browser.py**: Custom browser implementation for the BrowserUse agent.
- **custom_context.py**: Manages browser contexts for the BrowserUse agent.

### Step-by-Step Process

#### Step 1: User Submits a Task
The process begins when a user enters a task in the text input field and clicks the "Submit Task" button, triggering the `handle_submit` function.

#### Step 2: Task Initialization
The `run_agent_task` function retrieves the user's task from UI components, updates the chat history, and initializes UI components for the task execution.

#### Step 3: Browser and Context Setup
The system initializes or reuses an existing browser instance and browser context, which provide the environment for the agent to interact with web pages.

#### Step 4: Agent Initialization
The system creates a new BrowserUseAgent instance or updates an existing one with the new task. It also registers callbacks for step updates and task completion.

#### Step 5: Task Execution
The system executes the agent's `run` method in a new task and waits for its completion, updating the UI with progress.

#### Step 6: BrowserUseAgent Run Method
The agent's `run` method is the core execution logic that performs the task through a series of steps, each interacting with the browser to accomplish the given task.

#### Step 7: Step Processing Callback
The `_handle_new_step` callback is called after each agent step, updating the UI with the latest screenshot and agent output.

#### Step 8: Task Completion Callback
The `_handle_done` callback is triggered when the agent completes the task (success or failure), updating the UI with the final results and metrics.

### System Flow Diagram
```
User submits task → Task initialization → Browser setup → Agent initialization
                                                                    ↓
Task completion ← Agent run method ← Step processing callback ← Task execution
```
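
The flow above can be sketched in a few lines of asyncio. This is a minimal illustration under stated assumptions, not the project's implementation: `SketchAgent`, `submit_task`, and the callback signatures are hypothetical stand-ins for `BrowserUseAgent`, `handle_submit`, `_handle_new_step`, and `_handle_done`, and the `events` list stands in for UI updates.

```python
import asyncio
from typing import Awaitable, Callable

StepCallback = Callable[[int, str], Awaitable[None]]
DoneCallback = Callable[[list[str]], Awaitable[None]]


class SketchAgent:
    # Hypothetical stand-in for the real agent: it only simulates steps.
    def __init__(self, task: str, on_step: StepCallback, on_done: DoneCallback):
        self.task = task
        self.on_step = on_step
        self.on_done = on_done

    async def run(self, max_steps: int = 3) -> list[str]:
        history: list[str] = []
        for step in range(1, max_steps + 1):
            output = f"step {step} of {self.task!r}"
            history.append(output)
            await self.on_step(step, output)  # step callback after each step
        await self.on_done(history)  # completion callback, once at the end
        return history


async def submit_task(task: str) -> list[str]:
    events: list[str] = []  # stands in for UI updates

    async def handle_new_step(step: int, output: str) -> None:
        events.append(f"ui-update:{step}")

    async def handle_done(history: list[str]) -> None:
        events.append(f"done:{len(history)}")

    agent = SketchAgent(task, handle_new_step, handle_done)
    # Run the agent in its own asyncio task and wait for it to finish.
    await asyncio.create_task(agent.run())
    return events
```

Calling `asyncio.run(submit_task("demo"))` returns the UI events in the order the callbacks fired: one update per step, then a single completion event.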
                """
            )

        with gr.TabItem("Architecture"):
            gr.Markdown(
                """
## System Architecture
The project follows a modular architecture with clear separation of concerns:

### Core Components
1. **WebUI Module (`src/webui/`)**:
   - Interface management using Gradio
   - Tab components for different functionalities
   - User input/output handling
2. **Browser Module (`src/browser/`)**:
   - Custom browser implementation extending the browser-use library
   - Browser context management
   - Screenshot and session handling
3. **Agent Module (`src/agent/`)**:
   - Browser Use Agent: General-purpose browser automation
   - Deep Research Agent: Specialized for research tasks
   - Agent state and history management
4. **Controller Module (`src/controller/`)**:
   - Action registry for browser control
   - MCP client integration
   - Custom action implementations
5. **Utils Module (`src/utils/`)**:
   - OpenAI LLM integration
   - Configuration helpers
   - MCP client setup

### Data Flow
1. User inputs task via WebUI
2. WebUI Manager initializes components
3. Agent receives task and configures OpenAI LLM
4. Browser is launched or connected
5. Agent iteratively performs actions via controller
6. Results display in WebUI with screenshots
                """
            )

        with gr.TabItem("Browser Control"):
            gr.Markdown(
                """
## Browser Control System
The browser control functionality is built on the browser-use library, with custom extensions:

### Browser Features
- **Custom Browser Integration**: Connect to existing browser instances
- **Browser Context Management**: Create and manage browser contexts
- **Session Persistence**: Keep browser open between tasks
- **Screenshot Capture**: Take and display screenshots of browser state
- **DOM Interaction**: Interact with web page elements
- **Action Registry**: Comprehensive set of browser actions

### Actions Supported
- Navigate to URLs
- Click elements
- Input text
- Extract content
- Scroll pages
- Search Google
- Wait for page load
- Handle alerts and dialogs
- Upload files
- And more through the registry system
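
To illustrate the registry idea, here is a minimal sketch of a decorator-based action registry. It is not the browser-use API: `ActionRegistry`, its `action` decorator, `execute`, and the two example actions are hypothetical names chosen for this example, and the actions return strings instead of driving a browser.

```python
from typing import Callable, Dict


class ActionRegistry:
    # Hypothetical, simplified registry: maps action names to callables.
    def __init__(self) -> None:
        self._actions: Dict[str, Callable[..., str]] = {}

    def action(self, name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
        # Decorator that registers a function under the given action name.
        def decorator(func: Callable[..., str]) -> Callable[..., str]:
            self._actions[name] = func
            return func
        return decorator

    def execute(self, name: str, **params) -> str:
        # Look up the action by name and call it with keyword parameters.
        if name not in self._actions:
            raise KeyError(f"Unknown action: {name}")
        return self._actions[name](**params)


registry = ActionRegistry()


@registry.action("go_to_url")
def go_to_url(url: str) -> str:
    return f"navigated to {url}"


@registry.action("input_text")
def input_text(index: int, text: str) -> str:
    return f"typed {text!r} into element {index}"
```

An agent that emits an action name plus parameters can then be dispatched with `registry.execute("go_to_url", url="https://example.com")`; adding a new action only requires another decorated function.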
                """
            )

        with gr.TabItem("Agent System"):
            gr.Markdown(
                """
## Agent System
The application provides two main agent types:

### Browser Use Agent
Extends the base Agent class from the browser-use library to provide:
- Task execution with dynamic tool selection
- LLM integration through OpenAI models
- Browser control through registered actions
- Error handling and recovery
- Execution history tracking

### Deep Research Agent
Specialized agent using LangGraph for:
- Research planning through LLM
- Web search and content extraction
- Information synthesis
- Structured research report generation
- Multi-browser parallel processing

### Agent Components
- **State Management**: Track agent state during execution
- **History Recording**: Record steps and results
- **Output Formatting**: Format results for display
- **Tool Calling**: Different methods based on LLM capabilities
                """
            )

        with gr.TabItem("LLM Integration"):
            gr.Markdown(
                """
## LLM Integration
The system supports OpenAI as its only LLM provider:

### Supported Provider
- **OpenAI**: GPT-4o, GPT-4, GPT-3.5

### Integration Features
- **Vision Support**: Vision capabilities with compatible models
- **Temperature Control**: Adjust randomness in model outputs
- **Context Length Management**: Handle different model context limits
- **API Key Management**: Secure handling of API credentials
- **Tool Calling Methods**: Different methods based on model capabilities (`function_calling`, `json_mode`, `raw`)
                """
            )

        with gr.TabItem("Web UI Components"):
            gr.Markdown(
                """
## Web UI Components
The interface is built with Gradio and organized into tabs:

### Main Tabs
1. **Agent Settings**: Configure OpenAI models and parameters
2. **Browser Settings**: Set up browser preferences and options
3. **Run Agent**: Execute browser tasks and view results
4. **Agent Marketplace**: Access specialized agents like Deep Research
5. **Documentation**: Comprehensive project documentation (you are here)
6. **Load & Save Config**: Save and load UI configurations

### Interface Features
- **Chatbot Interface**: View agent interactions and results
- **Task Input**: Submit tasks to the agent
- **Control Buttons**: Start, stop, pause, and clear agent execution
- **Configuration Forms**: Set up OpenAI and browser parameters
- **Results Display**: View agent output including screenshots
                """
            )

        with gr.TabItem("API & Libraries"):
            gr.Markdown(
                """
## Core Libraries & Dependencies
The project relies on several key libraries:

### Primary Dependencies
- **browser-use**: Core browser automation library
- **gradio**: Web UI framework
- **langchain**: LLM integration framework
- **langgraph**: Graph-based workflows for agents
- **playwright**: Browser automation and control
- **pyperclip**: Clipboard interaction
- **python-dotenv**: Environment variable management

### API Integration
- **LLM API**: OpenAI (the only supported provider)
- **MCP (Model Context Protocol)**: Tool integration protocol
- **MainContentExtractor**: Web content extraction

### Browser APIs
- **CDP (Chrome DevTools Protocol)**: Browser communication
- **WSS**: WebSocket connections for browser control
                """
            )

        with gr.TabItem("File Structure"):
            gr.Markdown(
                """
## Project File Structure
```
web-ui/
├── src/
│   ├── agent/
│   │   ├── browser_use/
│   │   │   └── browser_use_agent.py
│   │   └── deep_research/
│   │       └── deep_research_agent.py
│   ├── browser/
│   │   ├── custom_browser.py
│   │   └── custom_context.py
│   ├── controller/
│   │   └── custom_controller.py
│   ├── utils/
│   │   ├── config.py
│   │   ├── llm_provider.py
│   │   └── mcp_client.py
│   ├── webui/
│   │   ├── components/
│   │   │   ├── agent_settings_tab.py
│   │   │   ├── browser_settings_tab.py
│   │   │   ├── browser_use_agent_tab.py
│   │   │   ├── deep_research_agent_tab.py
│   │   │   ├── documentation_tab.py
│   │   │   └── load_save_config_tab.py
│   │   ├── interface.py
│   │   └── webui_manager.py
│   └── __init__.py
├── assets/
├── tmp/
├── tests/
├── .venv/
├── webui.py
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── setup.py
└── README.md
```
                """
            )

        with gr.TabItem("Setup & Usage"):
            gr.Markdown(
                """
## Setup & Usage Guide

### Installation

#### Local Installation
1. Clone the repository
   ```bash
   git clone https://github.com/browser-use/web-ui.git
   cd web-ui
   ```
2. Set up Python environment
   ```bash
   uv venv --python 3.11
   source .venv/bin/activate  # Linux/Mac
   .venv\\Scripts\\activate  # Windows
   ```
3. Install dependencies
   ```bash
   uv pip install -r requirements.txt
   playwright install --with-deps
   ```
4. Configure environment
   ```bash
   cp .env.example .env
   # Edit .env to add your API keys
   ```
5. Run the application
   ```bash
   python webui.py --ip 127.0.0.1 --port 7788
   ```

#### Docker Installation
```bash
docker compose up --build
```

### Usage Examples
1. **Simple Web Search**
   - Configure LLM in Agent Settings
   - Configure browser in Browser Settings
   - In Run Agent tab, enter: "Search for the latest news about AI"
   - Click Submit Task
2. **Deep Research**
   - Configure LLM in Agent Settings
   - Go to Agent Marketplace > Deep Research
   - Enter research topic: "Advances in renewable energy in 2023"
   - Click Run
3. **Using Custom Browser**
   - In Browser Settings, check "Use Own Browser"
   - Configure paths to browser and user data
   - Submit tasks as normal
                """
            )

        with gr.TabItem("Source Code Analysis"):
            gr.Markdown(
                """
## Detailed Source Code Analysis
This section provides a deep dive into the code structure and implementation details of key components.

### WebUI Manager Class
The `WebuiManager` class in `src/webui/webui_manager.py` serves as the central component managing UI elements and application state:
```python
class WebuiManager:
    def __init__(self, settings_save_dir: str = "./tmp/webui_settings"):
        self.id_to_component: dict[str, Component] = {}
        self.component_to_id: dict[Component, str] = {}

        self.settings_save_dir = settings_save_dir
        os.makedirs(self.settings_save_dir, exist_ok=True)
```
Key functions:
- `add_components()`: Registers UI components with unique IDs
- `get_component_by_id()`: Retrieves components using their ID
- `save_config()`: Serializes UI settings to JSON
- `load_config()`: Loads settings from JSON
- `init_browser_use_agent()`: Creates browser agent instances
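
The registration and JSON round trip can be sketched as follows. This is a simplified illustration, not the actual `WebuiManager` code: `FakeComponent` and `ComponentRegistry` are hypothetical stand-ins, and the real `save_config()` and `load_config()` signatures may differ.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances can be dict keys
class FakeComponent:
    # Hypothetical stand-in for a Gradio component holding a value.
    value: object = None


class ComponentRegistry:
    def __init__(self, settings_save_dir: str):
        self.id_to_component: dict[str, FakeComponent] = {}
        self.component_to_id: dict[FakeComponent, str] = {}
        self.settings_save_dir = settings_save_dir
        os.makedirs(self.settings_save_dir, exist_ok=True)

    def add_components(self, tab_name: str, components: dict[str, FakeComponent]) -> None:
        # IDs follow the "<tab>.<name>" pattern, so they are unique across tabs.
        for name, component in components.items():
            comp_id = f"{tab_name}.{name}"
            self.id_to_component[comp_id] = component
            self.component_to_id[component] = comp_id

    def save_config(self, file_name: str) -> str:
        path = os.path.join(self.settings_save_dir, file_name)
        settings = {cid: comp.value for cid, comp in self.id_to_component.items()}
        with open(path, "w", encoding="utf-8") as fw:
            json.dump(settings, fw, indent=2)
        return path

    def load_config(self, path: str) -> None:
        with open(path, encoding="utf-8") as fr:
            settings = json.load(fr)
        for comp_id, value in settings.items():
            if comp_id in self.id_to_component:  # ignore IDs that no longer exist
                self.id_to_component[comp_id].value = value


# Usage: round-trip one setting through a temporary directory.
registry = ComponentRegistry(tempfile.mkdtemp())
temperature = FakeComponent(value=0.6)
registry.add_components("agent_settings", {"llm_temperature": temperature})
saved_path = registry.save_config("settings.json")
temperature.value = 1.0
registry.load_config(saved_path)  # restores the saved value
```

Keeping both `id_to_component` and `component_to_id` allows lookups in either direction: by ID when loading a saved file, and by component when an event handler receives component objects.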

### Custom Browser Implementation
The `CustomBrowser` class in `src/browser/custom_browser.py` extends the base `Browser` class from the browser-use library:
```python
class CustomBrowser(Browser):
    async def new_context(self, config: BrowserContextConfig | None = None) -> CustomBrowserContext:
        browser_config = self.config.model_dump() if self.config else {}
        context_config = config.model_dump() if config else {}
        merged_config = {**browser_config, **context_config}
        return CustomBrowserContext(config=BrowserContextConfig(**merged_config), browser=self)
```
Key features:
- Extends the browser-use Browser class
- Creates custom browser contexts
- Configures Chrome arguments for different environments
- Handles screen resolution and window dimensions
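
The `{**browser_config, **context_config}` merge in `new_context` gives context-level settings precedence over browser-level ones. A small standalone sketch of that precedence rule, where `merge_configs` and the example keys are hypothetical and chosen only for illustration:

```python
from typing import Any, Optional


def merge_configs(
    browser_config: Optional[dict[str, Any]],
    context_config: Optional[dict[str, Any]],
) -> dict[str, Any]:
    # Later unpacking wins, so context-level keys override browser-level keys.
    return {**(browser_config or {}), **(context_config or {})}


# Example keys are illustrative, not the library's actual config fields.
merged = merge_configs(
    {"headless": False, "window_width": 1280},
    {"window_width": 1920, "save_downloads_path": "./tmp/downloads"},
)
```

Here `merged["window_width"]` is `1920` because the context value overrides the browser value, while `headless` is carried over unchanged.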

### Browser Use Agent
The `BrowserUseAgent` class in `src/agent/browser_use/browser_use_agent.py` extends the Agent class:
```python
class BrowserUseAgent(Agent):
    def _set_tool_calling_method(self) -> ToolCallingMethod | None:
        tool_calling_method = self.settings.tool_calling_method
        if tool_calling_method == 'auto':
            if is_model_without_tool_support(self.model_name):
                return 'raw'
            elif self.chat_model_library == 'ChatGoogleGenerativeAI':
                return None
            elif self.chat_model_library == 'ChatOpenAI':
                return 'function_calling'
            # Additional models...
```
Key capabilities:
- Automatically selects tool calling method based on LLM
- Handles agent execution with configurable steps
- Provides pause/resume functionality
- Manages execution history and state
- Implements error handling and recovery

### Deep Research Agent
The `DeepResearchAgent` class in `src/agent/deep_research/deep_research_agent.py` implements a specialized research agent:
```python
class DeepResearchAgent:
    def __init__(
        self,
        llm: Any,
        browser_config: Dict[str, Any],
        mcp_server_config: Optional[Dict[str, Any]] = None,
    ):
        # Initialize agent with LLM and browser config
        ...
```
Key components:
- Uses LangGraph for structured research workflows
- Implements planning, research, and synthesis nodes
- Manages parallel browser instances for efficiency
- Generates structured research reports
- Handles task state persistence

### Custom Controller
The `CustomController` class in `src/controller/custom_controller.py` extends the Controller class:
```python
class CustomController(Controller):
    def __init__(self, exclude_actions: list[str] = [],
                 output_model: Optional[Type[BaseModel]] = None,
                 ask_assistant_callback: Optional[...] = None):
        super().__init__(exclude_actions=exclude_actions, output_model=output_model)
        self._register_custom_actions()
        self.ask_assistant_callback = ask_assistant_callback
        self.mcp_client = None
        self.mcp_server_config = None
```
Key features:
- Registers custom browser actions
- Integrates with MCP (Model Context Protocol)
- Provides file upload capabilities
- Implements human assistance features
- Handles action execution with error management

### UI Components
The UI is built using Gradio components:
```python
def create_ui(theme_name="Ocean"):
    with gr.Blocks(title="Browser Use WebUI", theme=theme_map[theme_name], css=css, js=js_func) as demo:
        with gr.Tabs() as tabs:
            with gr.TabItem("⚙️ Agent Settings"):
                create_agent_settings_tab(ui_manager)
            # Additional tabs...
```
Key UI features:
- Modular tab-based interface
- Customizable themes
- Responsive layout
- Dark mode support
- Configuration persistence
                """
            )

        with gr.TabItem("Technical Challenges"):
            gr.Markdown(
                """
## Technical Challenges & Solutions
This section covers key technical challenges faced during development and the solutions implemented.

### Browser Integration Challenges
**Challenge**: Connecting to existing browser instances with proper user profiles.

**Solution**: Custom implementation using CDP (Chrome DevTools Protocol) and WebSocket connections:
```python
# Implementation in custom_browser.py
chrome_args = {
    f'--remote-debugging-port={self.config.chrome_remote_debugging_port}',
    *(CHROME_DOCKER_ARGS if IN_DOCKER else []),
    *(CHROME_HEADLESS_ARGS if self.config.headless else []),
    # Additional args...
}
# Check existing port conflicts
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    if s.connect_ex(('localhost', self.config.chrome_remote_debugging_port)) == 0:
        chrome_args.remove(f'--remote-debugging-port={self.config.chrome_remote_debugging_port}')
```
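
The port check in the excerpt above can be isolated into a small, standalone helper. This is a sketch under assumptions: `is_port_in_use` and `build_chrome_args` are hypothetical names, the host defaults to `127.0.0.1` rather than `localhost`, and only two Chrome flags are modeled.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when something accepts the connection on that port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0


def build_chrome_args(port: int, headless: bool) -> set[str]:
    args = {f"--remote-debugging-port={port}"}
    if headless:
        args.add("--headless=new")
    if is_port_in_use(port):
        # Something is already listening on the port, so do not request it again.
        args.discard(f"--remote-debugging-port={port}")
    return args
```

Using a set for the arguments makes removal by value straightforward and avoids duplicate flags when several argument groups are combined.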

### OpenAI LLM Integration
**Challenge**: Configuring and optimizing OpenAI models for browser automation.

**Solution**: Provider abstraction and method detection:
```python
# In browser_use_agent.py
def _set_tool_calling_method(self) -> ToolCallingMethod | None:
    tool_calling_method = self.settings.tool_calling_method
    if tool_calling_method == 'auto':
        if is_model_without_tool_support(self.model_name):
            return 'raw'
        else:
            return 'function_calling'
```

### Execution State Management
**Challenge**: Maintaining agent state across steps and allowing pause/resume.

**Solution**: Custom execution loop with state management:
```python
# In browser_use_agent.py
async def run(self, max_steps: int = 100, on_step_start: AgentHookFunc | None = None,
              on_step_end: AgentHookFunc | None = None) -> AgentHistoryList:
    # Execution loop with state management
    for step in range(max_steps):
        # Check pause state
        if self.state.paused:
            signal_handler.wait_for_resume()
            signal_handler.reset()
        # Check for stop
        if self.state.stopped:
            logger.info('Agent stopped')
            break
        # Execute step with callbacks
        if on_step_start is not None:
            await on_step_start(self)
        step_info = AgentStepInfo(step_number=step, max_steps=max_steps)
        await self.step(step_info)
        if on_step_end is not None:
            await on_step_end(self)
```
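
The pause, resume, and stop checks can be tried out in isolation. The sketch below swaps the signal handler used in the excerpt for an `asyncio.Event`, which is a different mechanism chosen to keep the example self-contained; `RunState` and `run_steps` are hypothetical names.

```python
import asyncio


class RunState:
    # Hypothetical state holder: an Event that is set means "not paused".
    def __init__(self) -> None:
        self.stopped = False
        self._resume = asyncio.Event()
        self._resume.set()

    def pause(self) -> None:
        self._resume.clear()

    def resume(self) -> None:
        self._resume.set()

    async def wait_if_paused(self) -> None:
        await self._resume.wait()


async def run_steps(state: RunState, max_steps: int) -> list[int]:
    done: list[int] = []
    for step in range(max_steps):
        await state.wait_if_paused()  # blocks here while paused
        if state.stopped:
            break
        done.append(step)  # placeholder for executing one agent step
        await asyncio.sleep(0)  # yield control, as a real step would during I/O
    return done
```

A caller running `run_steps` as an asyncio task can call `state.pause()` and `state.resume()` from UI handlers, or set `state.stopped = True` to end the loop before the next step begins.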

### Multi-Browser Research Orchestration
**Challenge**: Managing multiple parallel browser instances for research tasks.

**Solution**: LangGraph-based workflow with parallel task execution:
```python
# In deep_research_agent.py
async def _run_browser_search_tool(
    queries: List[str],
    task_id: str,
    llm: Any,
    browser_config: Dict[str, Any],
    stop_event: threading.Event,
    max_parallel_browsers: int = 1,
) -> List[Dict[str, Any]]:
    # Execute tasks in parallel with limit
    tasks = []
    results = []
    semaphore = asyncio.Semaphore(max_parallel_browsers)

    async def task_wrapper(query):
        async with semaphore:
            return await run_single_browser_task(
                query, task_id, llm, browser_config, stop_event
            )

    # Create and gather tasks
    for query in queries:
        tasks.append(asyncio.create_task(task_wrapper(query)))
    results = await asyncio.gather(*tasks)
    return results
```
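
To see that the semaphore really caps concurrency, the pattern can be run without any browser. In this sketch `run_limited` is a hypothetical helper and `asyncio.sleep` stands in for a browser search; it records the peak number of workers that were active at the same time.

```python
import asyncio


async def run_limited(queries: list[str], max_parallel: int) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def worker(query: str) -> str:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for a browser search
            running -= 1
            return f"result for {query}"

    # gather preserves the order of the input queries.
    results = await asyncio.gather(*(worker(q) for q in queries))
    return list(results), peak
```

Running four queries with `max_parallel=2` yields four results in input order, with the peak never exceeding two concurrent workers.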

### UI State Synchronization
**Challenge**: Keeping UI state synchronized with backend operations.

**Solution**: Component tracking and event-based updates:
```python
# In webui_manager.py
def add_components(self, tab_name: str, components_dict: dict[str, "Component"]) -> None:
    for comp_name, component in components_dict.items():
        comp_id = f"{tab_name}.{comp_name}"
        self.id_to_component[comp_id] = component
        self.component_to_id[component] = comp_id

# In browser_use_agent_tab.py
async def handle_submit(webui_manager: WebuiManager, components: Dict[gr.components.Component, Any]):
    # Get component values and update UI state
    task_input = _get_config_value(webui_manager, components, "user_input", "")
    webui_manager.bu_chat_history.append({"role": "user", "content": task_input})
    # Additional UI updates...
```

### Docker Environment Challenges
**Challenge**: Running browser automation in Docker containers.

**Solution**: Special Docker configuration for browser support:
```python
# In custom_browser.py
CHROME_DOCKER_ARGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    # Additional docker-specific args...
]
```
```yaml
# In docker-compose.yml
services:
  web-ui:
    build:
      context: .
    volumes:
      - ./tmp:/app/tmp
    ports:
      - "7788:7788"
      - "6080:6080"  # VNC for browser viewing
    environment:
      - DISPLAY=:1
      # Additional environment variables...
```
                """
            )

    tab_components.update(dict(
        doc_tabs=doc_tabs,
    ))
    webui_manager.add_components("documentation", tab_components)

    return tab_components