a-zamfir committed
Commit 924e633 · 1 Parent(s): e42f725

Added application files

.gitignore ADDED
@@ -0,0 +1,5 @@
+ .env
+ *.env
+
+ __pycache__
+ *.pyc
README.md CHANGED
@@ -1,13 +1,108 @@
1
  ---
2
  title: IRIS
3
- emoji: 🏃
4
- colorFrom: green
5
- colorTo: gray
6
  sdk: gradio
7
  sdk_version: 5.33.1
8
  app_file: app.py
9
  pinned: false
10
- short_description: 'IRIS is an agentic chatbot '
 
 
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
  title: IRIS
3
+ emoji: 💬
4
+ colorFrom: yellow
5
+ colorTo: purple
6
  sdk: gradio
7
  sdk_version: 5.33.1
8
  app_file: app.py
9
  pinned: false
10
+ short_description: IRIS - HuggingFace Hackathon
11
+ tags:
12
+ - agent-demo-track
13
  ---
14
 
15
+ # IRIS
16
+
17
+ ## Important
18
+
19
+ 1. **Watch** IRIS' video overview here: https://www.youtube.com/watch?v=dieWyZZez6o
20
+ 2. **IRIS does not work on Spaces! It requires a virtualization environment on AWS or Azure (or a local machine), because its MCP server targets Hyper-V virtual machines.**
21
+
22
+ ## Overview
23
+
24
+ IRIS is an agentic chatbot proof-of-concept built for the HuggingFace Hackathon. It demonstrates how a multimodal AI assistant can:
25
+
26
+ - **Listen** to voice commands (STT)
27
+ - **Speak** AI responses (TTS)
28
+ - **See** user screens and analyze them with a vision model
29
+ - **Act** on infrastructure via an MCP integration
30
+
31
+ The goal is to showcase how modern LLMs, audio models, vision models and operator toolchains can be combined into a seamless, voice-driven infrastructure management assistant.
32
+
33
+ ## Key Goals
34
+
35
+ 1. **Multimodal Interaction**
36
+ - Voice: real-time speech-to-text (STT) and text-to-speech (TTS)
37
+ - Vision: live screen capture + AI analysis
38
+ - Text: conversational UI backed by an LLM
39
+
40
+ 2. **Agentic Control**
41
+ - Automatically detect when to call management tools
42
+ - Execute Hyper-V VM operations through a RESTful MCP server
43
+
44
+ 3. **Proof-of-Concept (POC)**
45
+ - Focus on clarity and modularity
46
+ - Demonstrate core concepts rather than production-grade polish
47
+
48
+ ## Functionalities & Offerings
49
+
50
+ ### 1. Audio Service
51
+ - **STT**: Uses HuggingFace’s fal-ai inference provider (running an OpenAI Whisper model) to transcribe user speech; see the sketch below.
+ - **TTS**: Leverages a HuggingFace TTS model (default: `canopylabs/orpheus-3b-0.1-ft`) to speak back responses.
53
+
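+ A minimal sketch of the two calls behind this service, mirroring `services/audio_service.py` (the audio file names are placeholders):
+
+ ```python
+ import os
+ from huggingface_hub import InferenceClient
+
+ token = os.environ["HF_TOKEN"]
+
+ # Speech-to-text through the fal-ai provider (Whisper large v3)
+ asr = InferenceClient(provider="fal-ai", api_key=token)
+ result = asr.automatic_speech_recognition("question.wav", model="openai/whisper-large-v3")
+ print(result.text)
+
+ # Text-to-speech through the default HF provider
+ tts = InferenceClient(token=token)
+ audio_bytes = tts.text_to_speech("All virtual machines are running.", model="canopylabs/orpheus-3b-0.1-ft")
+ with open("reply.wav", "wb") as f:
+     f.write(audio_bytes)
+ ```
+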
54
+ ### 2. Text (LLM) Service
55
+ - Built on HuggingFace’s InferenceClient, with an OpenAI fallback (see the sketch below).
56
+ - Default model: `Qwen/Qwen2.5-7B-Instruct` (configurable).
57
+ - Handles chat prompt orchestration, reasoning-before-action, and tool-call formatting.
58
+
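+ A sketch of a single chat completion as issued by `services/llm_service.py` (the defaults below come from `config/settings.py`; the messages are illustrative):
+
+ ```python
+ import os
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient(token=os.environ["HF_TOKEN"])
+ response = client.chat_completion(
+     messages=[
+         {"role": "system", "content": "You are a Hyper-V management assistant."},
+         {"role": "user", "content": "List my virtual machines."},
+     ],
+     model="Qwen/Qwen2.5-7B-Instruct",
+     max_tokens=512,
+     temperature=0.01,
+ )
+ print(response.choices[0].message.content)
+ ```
+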
59
+ ### 3. Vision & Screen Service
60
+ - Captures your monitor at configurable FPS and resolution.
61
+ - Sends images to a Nebius vision model (`google/gemma-3-27b-it`) with a guided prompt.
62
+ - Parses vision output into “Issue Found / Description / Recommendation”.
63
+
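+ A sketch of one vision call, mirroring `services/screen_service.py` (the screenshot path is a placeholder; `NEBIUS_API_KEY` comes from your `.env`):
+
+ ```python
+ import base64
+ import os
+ from openai import OpenAI
+
+ client = OpenAI(base_url="https://api.studio.nebius.com/v1/", api_key=os.environ["NEBIUS_API_KEY"])
+
+ with open("screenshot.png", "rb") as f:
+     frame_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ resp = client.chat.completions.create(
+     model="google/gemma-3-27b-it",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "Describe any issue visible on the right side of the screen."},
+             {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
+         ],
+     }],
+ )
+ print(resp.choices[0].message.content)
+ ```
+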
64
+ ### 4. MCP Integration
65
+ - **Hyper-V MCP Server**: FastAPI service exposing tools to list, query, start, stop, and restart VMs.
66
+ - Agent parses LLM tool calls and invokes them via HTTP.
67
+ - Enables fully automated infrastructure actions in response to user voice commands.
68
+
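+ A sketch of how the agent invokes a tool over the server’s REST API (same request shape as `MCPRestClient` in `app.py`; assumes the server is running locally on port 8000, and the VM name is only an example):
+
+ ```python
+ import requests
+
+ base = "http://localhost:8000"
+
+ # Discover the available tools
+ tools = requests.get(f"{base}/tools", timeout=5).json()["tools"]
+ print([t["name"] for t in tools])
+
+ # Call one of them, e.g. start a VM
+ resp = requests.post(
+     f"{base}/tools/call",
+     json={"name": "start_vm", "arguments": {"vm_name": "Accounting"}},
+     timeout=30,
+ )
+ print(resp.json())
+ ```
+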
69
+ ## Providers & Configuration
70
+
71
+ | Service | Provider / Model |
72
+ |--------------------|--------------------------------------------------|
73
+ | LLM | HuggingFace Inference (fallback: OpenAI) |
74
+ | STT | HF fal-ai provider (with HF token) or OpenAI Whisper |
75
+ | TTS | HF TTS (`canopylabs/orpheus-3b-0.1-ft`) |
76
+ | Vision | Nebius (`google/gemma-3-27b-it`) |
77
+ | MCP (VM control) | Custom Hyper-V FastAPI server |
78
+ | UI Framework | Gradio |
79
+
80
+ All credentials and endpoints are managed via environment variables in `config/settings.py`.
81
+
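+ As a reference, a minimal sketch of loading those settings (the variable names are the ones read in `config/settings.py`):
+
+ ```python
+ from dotenv import load_dotenv
+ from config.settings import Settings
+
+ load_dotenv()  # reads HF_TOKEN, NEBIUS_API_KEY, NEBIUS_MODEL, STT_MODEL, TTS_MODEL, ...
+ settings = Settings()
+ print(settings.effective_llm_provider, settings.effective_model_name)
+ ```
+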
82
+ ## Quickstart
83
+
84
+ 1. **Configure** `.env` with your HF and Nebius tokens (and, optionally, an OpenAI token).
85
+ 2. **Run** the Hyper-V MCP server:
86
+
87
+ ```bash
88
+ python mcp_servers/hyperv_mcp.py
89
+ ```
90
+
91
+ 3. **Launch** the Gradio app:
92
+
93
+ ```bash
94
+ python app.py
95
+ ```
96
+
97
+ 4. **Interact** by typing or speaking.
98
+
99
+ Click “Start sharing screen” to begin vision analysis.
100
+
101
+ Ask IRIS to list VMs, check status, or start a VM by voice.
102
+
103
+ IRIS will confirm actions and execute them through the MCP.
104
+
105
+ ## Contact
106
+
107
+ <a.zamfir@hotmail.com>
108
+ LinkedIn: Andrei Zamfir <https://www.linkedin.com/in/andrei-d-zamfir/>
app.py ADDED
@@ -0,0 +1,405 @@
1
+ import gradio as gr
2
+ import asyncio
3
+ import logging
4
+ import tempfile
5
+ import json
6
+ import re
7
+ import requests
8
+ from typing import Optional, Dict, Any, List, Tuple
9
+
10
+ from services.audio_service import AudioService
11
+ from services.llm_service import LLMService
12
+ from services.screen_service import ScreenService
13
+ from config.settings import Settings
14
+ from config.prompts import get_generic_prompt, get_vision_prompt
15
+
16
+ # Configure root logger
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__name__)
19
+
20
+ class MCPRestClient:
21
+ def __init__(self, base_url: str = "http://localhost:8000"):
22
+ self.base_url = base_url.rstrip('/')
23
+
24
+ async def initialize(self):
25
+ """Test connection to MCP server"""
26
+ try:
27
+ response = requests.get(f"{self.base_url}/", timeout=5)
28
+ if response.status_code == 200:
29
+ logger.info("Successfully connected to MCP server")
30
+ else:
31
+ raise ConnectionError(f"MCP server returned status {response.status_code}")
32
+ except Exception as e:
33
+ logger.error(f"Failed to connect to MCP server at {self.base_url}: {e}")
34
+ logger.info("IRIS did not detect any MCP server. If you're running this in a HuggingFace Space, please refer to the README.md documentation.")
35
+ raise
36
+
37
+ async def get_available_tools(self) -> Dict[str, Dict]:
38
+ """Get list of available tools from MCP server"""
39
+ try:
40
+ response = requests.get(f"{self.base_url}/tools", timeout=5)
41
+ if response.status_code == 200:
42
+ data = response.json()
43
+ tools = {}
44
+ for tool in data.get("tools", []):
45
+ tools[tool["name"]] = {
46
+ "description": tool.get("description", ""),
47
+ "inputSchema": tool.get("inputSchema", {})
48
+ }
49
+ return tools
50
+ else:
51
+ logger.error(f"Failed to get tools: HTTP {response.status_code}")
52
+ return {}
53
+ except Exception as e:
54
+ logger.error(f"Failed to get tools: {e}")
55
+ return {}
56
+
57
+ async def call_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Any:
58
+ """Call a tool on the MCP server"""
59
+ try:
60
+ payload = {
61
+ "name": tool_name,
62
+ "arguments": arguments
63
+ }
64
+
65
+ response = requests.post(
66
+ f"{self.base_url}/tools/call",
67
+ json=payload,
68
+ timeout=30
69
+ )
70
+
71
+ if response.status_code == 200:
72
+ data = response.json()
73
+ if data.get("success"):
74
+ return data.get("result")
75
+ else:
76
+ return {"error": data.get("error", "Unknown error")}
77
+ else:
78
+ return {"error": f"HTTP {response.status_code}: {response.text}"}
79
+
80
+ except Exception as e:
81
+ return {"error": str(e)}
82
+
83
+ async def close(self):
84
+ """Nothing to close with requests"""
85
+ pass
86
+
87
+ class AgenticChatbot:
88
+ def __init__(self):
89
+ self.settings = Settings()
90
+
91
+ # AudioService
92
+ audio_api_key = (
93
+ self.settings.hf_token
94
+ if self.settings.effective_audio_provider == "huggingface"
95
+ else self.settings.openai_api_key
96
+ )
97
+ self.audio_service = AudioService(
98
+ api_key=audio_api_key,
99
+ stt_provider="fal-ai",
100
+ stt_model=self.settings.stt_model,
101
+ tts_model=self.settings.tts_model,
102
+ )
103
+
104
+ # LLMService
105
+ self.llm_service = LLMService(
106
+ api_key=self.settings.llm_api_key,
107
+ model_name=self.settings.effective_model_name,
108
+ )
109
+
110
+ # MCPService - Now using REST client
111
+ mcp_server_url = getattr(self.settings, 'mcp_server_url', 'http://localhost:8000')
112
+ self.mcp_service = MCPRestClient(mcp_server_url)
113
+
114
+ # ScreenService
115
+ self.screen_service = ScreenService(
116
+ prompt=get_vision_prompt(),
117
+ model=self.settings.NEBIUS_MODEL,
118
+ fps=0.05,
119
+ queue_size=2,
120
+ monitor=1,
121
+ compression_quality=self.settings.screen_compression_quality,
122
+ max_width=self.settings.max_width,
123
+ max_height=self.settings.max_height,
124
+ )
125
+ self.latest_screen_context: str = ""
126
+ self.conversation_history: List[Dict[str, Any]] = []
127
+
128
+ async def initialize(self):
129
+ try:
130
+ await self.mcp_service.initialize()
131
+ tools = await self.mcp_service.get_available_tools()
132
+ logger.info(f"Initialized with {len(tools)} MCP tools")
133
+ except Exception as e:
134
+ logger.error(f"MCP init failed: {e}")
135
+
136
+ # Screen callbacks
137
+ def _on_screen_result(self, resp: dict, latency: float, frame_b64: str):
138
+ try:
139
+ content = resp.choices[0].message.content
140
+ except Exception:
141
+ content = str(resp)
142
+ self.latest_screen_context = content
143
+ logger.info(f"[Screen] {latency*1000:.0f}ms → {content}")
144
+
145
+ def _get_conversation_history(self) -> List[Dict[str, str]]:
146
+ """Return the current conversation history for the screen service"""
147
+ return self.conversation_history.copy()
148
+
149
+ def start_screen_sharing(self) -> str:
150
+ self.latest_screen_context = ""
151
+ # Pass the history getter method to screen service
152
+ self.screen_service.start(
153
+ self._on_screen_result,
154
+ history_getter=self._get_conversation_history # Use the method reference
155
+ )
156
+ return "✅ Screen sharing started."
157
+
158
+ async def stop_screen_sharing(
159
+ self,
160
+ history: Optional[List[Dict[str, str]]]
161
+ ) -> Tuple[List[Dict[str, str]], str, Optional[str]]:
162
+ """Stop screen sharing and append an LLM-generated summary to the chat."""
163
+ # Stop capture
164
+ self.screen_service.stop()
165
+
166
+ # Get the latest vision context
167
+ vision_ctx = self.latest_screen_context
168
+
169
+ if vision_ctx and history is not None:
170
+ # Call process_message with the vision context as user input
171
+ updated_history, audio_path = await self.process_message(
172
+ text_input=f"VISION MODEL OUTPUT: {vision_ctx}",
173
+ audio_input=None,
174
+ history=history
175
+ )
176
+ return updated_history, "🛑 Screen sharing stopped.", audio_path
177
+
178
+ # If no vision context or history, just return
179
+ return history or [], "🛑 Screen sharing stopped.", None
180
+
181
+ async def execute_tool_calls(self, response_text: str) -> str:
182
+ """Parse and execute function calls from LLM response using robust regex parsing"""
183
+
184
+ # Clean the response text - remove code blocks and extra formatting
185
+ cleaned_text = re.sub(r'```[a-zA-Z]*\n?', '', response_text) # Remove code block markers
186
+ cleaned_text = re.sub(r'\n```', '', cleaned_text) # Remove closing code blocks
187
+
188
+ # Pattern for function calls: function_name(arg1="value1", arg2=value2, arg3=true)
189
+ function_pattern = r'(\w+)\s*\(\s*([^)]*)\s*\)'
190
+
191
+ results = []
192
+
193
+ # Find all function calls in the cleaned response
194
+ for match in re.finditer(function_pattern, cleaned_text):
195
+ tool_name = match.group(1)
196
+ args_str = match.group(2).strip()
197
+
198
+ # Skip if this isn't actually a tool (check against available tools)
199
+ available_tools = await self.mcp_service.get_available_tools()
200
+ if tool_name not in available_tools:
201
+ continue
202
+
203
+ try:
204
+ # Parse arguments using regex for key=value pairs
205
+ args = {}
206
+ if args_str:
207
+ # Pattern for key=value pairs, handling quoted strings, numbers, booleans
208
+ arg_pattern = r'(\w+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+))'
209
+
210
+ for arg_match in re.finditer(arg_pattern, args_str):
211
+ key = arg_match.group(1)
212
+ # Get the value from whichever group matched (quoted or unquoted)
213
+ value = (arg_match.group(2) or
214
+ arg_match.group(3) or
215
+ arg_match.group(4))
216
+
217
+ # Type conversion for common types
218
+ if value.lower() == 'true':
219
+ args[key] = True
220
+ elif value.lower() == 'false':
221
+ args[key] = False
222
+ elif value.isdigit():
223
+ args[key] = int(value)
224
+ elif value.replace('.', '').isdigit():
225
+ args[key] = float(value)
226
+ else:
227
+ args[key] = value
228
+
229
+ # Execute the tool
230
+ logger.info(f"Executing tool: {tool_name} with args: {args}")
231
+ result = await self.mcp_service.call_tool(tool_name, args)
232
+ results.append({
233
+ 'tool': tool_name,
234
+ 'args': args,
235
+ 'result': result
236
+ })
237
+
238
+ except Exception as e:
239
+ results.append({
240
+ 'tool': tool_name,
241
+ 'args': args if 'args' in locals() else {},
242
+ 'error': str(e)
243
+ })
244
+
245
+ # Format results for LLM
246
+ if not results:
247
+ return ""
248
+
249
+ formatted_results = []
250
+ for result in results:
251
+ if 'error' in result:
252
+ formatted_results.append(
253
+ f"Tool {result['tool']} failed: {result['error']}"
254
+ )
255
+ else:
256
+ formatted_results.append(
257
+ f"Tool {result['tool']} executed successfully:\n{json.dumps(result['result'], indent=2)}"
258
+ )
259
+
260
+ return "\n\n".join(formatted_results)
261
+
262
+ # Chat / tool integration
263
+ async def generate_response(
264
+ self,
265
+ user_input: str,
266
+ screen_context: str = "",
267
+ tool_result: str = ""
268
+ ) -> str:
269
+ # Retrieve available tools metadata
270
+ tools = await self.mcp_service.get_available_tools()
271
+ # Format tool list for prompt
272
+ tool_desc = "\n".join(f"- {name}: {info.get('description','')}" for name, info in tools.items())
273
+
274
+ # Build messages
275
+ messages: List[Dict[str, str]] = [
276
+ {"role": "system", "content": get_generic_prompt()},
277
+ ]
278
+ # Inform LLM about tools
279
+ if tool_desc:
280
+ messages.append({"role": "system", "content": f"Available tools:\n{tool_desc}"})
281
+ messages.append({"role": "user", "content": user_input})
282
+ if tool_result:
283
+ messages.append({"role": "assistant", "content": tool_result})
284
+
285
+ return await self.llm_service.get_chat_completion(messages)
286
+
287
+ async def process_message(
288
+ self,
289
+ text_input: str,
290
+ audio_input: Optional[str],
291
+ history: List[Dict[str, str]]
292
+ ) -> Tuple[List[Dict[str, str]], Optional[str]]:
293
+ # Debug: Log the incoming state
294
+ logger.info("=== PROCESS_MESSAGE START ===")
+ # Log the last few messages with their true indices (guard against short histories)
+ start_idx = max(0, len(history) - 3)
+ for i, msg in enumerate(history[-3:]):
+ logger.info(f" {start_idx + i}: {msg.get('role')} - {msg.get('content', '')[:100]}...")
297
+
298
+ # Update the internal conversation history to match the UI history
299
+ self.conversation_history = history.copy()
300
+
301
+ # STT
302
+ transcript = ""
303
+ if audio_input:
304
+ transcript = await self.audio_service.speech_to_text(audio_input)
305
+ user_input = (text_input + " " + transcript).strip()
306
+
307
+ # If no input, return unchanged
308
+ if not user_input:
309
+ return history, None
310
+
311
+ # Check if this is a vision model output being processed
312
+ is_vision_output = user_input.startswith("VISION MODEL OUTPUT:")
313
+
314
+ # Add user message to both histories (ALWAYS add the user input)
315
+ user_message = {"role": "user", "content": user_input}
316
+ history.append(user_message)
317
+ self.conversation_history.append(user_message)
318
+
319
+ # Handle screen context - only for regular user inputs, not vision outputs
320
+ screen_ctx = ""
321
+ if not is_vision_output and self.latest_screen_context:
322
+ screen_ctx = self.latest_screen_context
323
+ # Clear the screen context after using it to prevent reuse
324
+ self.latest_screen_context = ""
325
+
326
+ # Get initial LLM response (may include tool calls)
327
+ assistant_reply = await self.generate_response(user_input, screen_ctx)
328
+
329
+ # Check if response contains function calls and execute them
330
+ tool_results = await self.execute_tool_calls(assistant_reply)
331
+ if tool_results:
332
+ tool_message = {"role": "assistant", "content": tool_results}
333
+ history.append(tool_message)
334
+ self.conversation_history.append(tool_message)
335
+ # Get final response after tool execution
336
+ assistant_reply = await self.generate_response(user_input, screen_ctx, tool_results)
337
+
338
+ # ALWAYS add the final assistant response to both histories
339
+ assistant_message = {"role": "assistant", "content": assistant_reply}
340
+ history.append(assistant_message)
341
+ self.conversation_history.append(assistant_message)
342
+
343
+ # TTS - only speak the assistant reply for regular inputs
344
+ audio_path = None
345
+ audio_bytes = await self.audio_service.text_to_speech(assistant_reply)
346
+ if audio_bytes:
347
+ tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
348
+ tmp.write(audio_bytes)
349
+ tmp.close()
350
+ audio_path = tmp.name
351
+
352
+ logger.info("=== PROCESS_MESSAGE END ===")
353
+
354
+ return history, audio_path
355
+
356
+ async def cleanup(self):
357
+ """Cleanup resources"""
358
+ await self.mcp_service.close()
359
+
360
+ # ——————————————————————————————————
361
+ # Gradio interface setup
362
+ # ——————————————————————————————————
363
+
364
+ chatbot = AgenticChatbot()
365
+
366
+ async def setup_gradio_interface() -> gr.Blocks:
367
+ await chatbot.initialize()
368
+
369
+ with gr.Blocks(title="Agentic Chatbot", theme=gr.themes.Soft()) as demo:
370
+ chat = gr.Chatbot(type="messages", label="Conversation")
371
+ text_input = gr.Textbox(lines=2, placeholder="Type your message…", label="Text")
372
+ audio_input = gr.Audio(sources=["microphone"], type="filepath", label="Voice")
373
+
374
+ # Screen-sharing controls
375
+ screen_status = gr.Textbox(label="Screen Sharing Status", interactive=False)
376
+ start_btn = gr.Button("Start sharing screen")
377
+ stop_btn = gr.Button("Stop sharing screen")
378
+
379
+ # AI response audio player (including vision TTS)
380
+ audio_output = gr.Audio(label="AI Response", autoplay=True)
381
+
382
+ # Message send
383
+ send_btn = gr.Button("Send", variant="primary")
384
+
385
+ # Wire up buttons
387
+ start_btn.click(fn=chatbot.start_screen_sharing, inputs=None, outputs=screen_status)
388
+ stop_btn.click(fn=chatbot.stop_screen_sharing, inputs=[chat], outputs=[chat, screen_status, audio_output])
389
+
390
+ send_btn.click(
391
+ chatbot.process_message,
392
+ inputs=[text_input, audio_input, chat],
393
+ outputs=[chat, audio_output]
394
+ )
395
+ text_input.submit(
396
+ chatbot.process_message,
397
+ inputs=[text_input, audio_input, chat],
398
+ outputs=[chat, audio_output]
399
+ )
400
+
401
+ return demo
402
+
403
+ if __name__ == "__main__":
404
+ demo = asyncio.run(setup_gradio_interface())
405
+ demo.launch(server_name="0.0.0.0", server_port=7860)
config/prompts.py ADDED
@@ -0,0 +1,155 @@
1
+ '''
2
+ Prompts module for text and vision in the Agentic Chatbot.
3
+ '''
4
+
5
+ # Prompt used for all generic text-based interactions
6
+ GENERIC_PROMPT = '''
7
+ You are a Hyper‑V virtual machine management assistant: help users manage Hyper‑V VMs by providing clear guidance and executing management commands when explicitly requested.
8
+ You are able to receive automated image analysis through the usage of a Vision-compatible model. The user is able to share this data to you as text. Accept screen share requests.
9
+
10
+ When a user shares screen data, they will provide it as text prefixed with VISION MODEL OUTPUT:—you must treat that as their input and respond accordingly.
11
+
12
+ Provide conversational answers for general queries. When users request VM management actions, follow a strict reasoning‑before‑action structure and execute the appropriate tool functions directly. If you receive input beginning with VISION MODEL OUTPUT:, parse only its Recommendation section and return a concise remediation step based solely on that recommendation.
13
+
14
+ Tools
15
+ list_vms(): List all virtual machines and their current status
16
+
17
+ get_vm_status(vm_name="[VMName]"): Get detailed status for a specific VM
18
+
19
+ start_vm(vm_name="[VMName]"): Start a virtual machine
20
+
21
+ stop_vm(vm_name="[VMName]", force=[true|false]): Stop a VM (force=true for hard shutdown)
22
+
23
+ restart_vm(vm_name="[VMName]", force=[true|false]): Restart a VM
24
+
25
+ Steps
26
+ 1 Detect Vision Input
27
+
28
+ If user input starts with VISION MODEL OUTPUT:, skip normal steps and go to Vision Response.
29
+
30
+ 2 Understand the user’s request:
31
+
32
+ General guidance → respond conversationally.
33
+
34
+ VM management action → proceed to step 3.
35
+
36
+ 3 Plan: Identify which tool(s) to call.
37
+
38
+ 4 Action: State the action, then place the function call on its own line.
39
+
40
+ 5 Analysis: After output returns, interpret the results.
41
+
42
+ 6 Follow‑up: Suggest next steps or ask clarifying questions if needed.
43
+
44
+ Vision Response
45
+ Input: Text prefixed with VISION MODEL OUTPUT: containing a “Recommendation:” line.
46
+
47
+ Output: A single concise instruction telling the user what to do next, based only on that Recommendation.
48
+
49
+ Output Format
50
+
51
+ Description of the action, then the exact function call.
52
+
53
+ Interpretation of results.
54
+
55
+ Follow‑up: optional question or suggestion.
56
+
57
+ Vision Response: a short sentence or two reflecting only the vision Recommendation.
58
+
59
+ Examples
60
+ Example 1 – Listing VMs
61
+ User: “Show me all my virtual machines.”
62
+ Assistant:
63
+ list_vms()
64
+
65
+ Example 2 – Starting a VM
66
+ User: “Please start the Accounting VM.”
67
+ Assistant: start_vm(vm_name="Accounting")
68
+
69
+ Example 3 – Vision Input
70
+ User:
71
+ VISION MODEL OUTPUT:
72
+ Issue Found: Yes
73
+ Location: “Accounting” row
74
+ Recommendation: Select the “Accounting” virtual machine and initiate the “Start” action to bring it online.
75
+ Assistant (Vision Response):
76
+ Select the “Accounting” VM and run the Start action to power it on.
77
+
78
+ Example 4 – Screen Share Initiation and Vision Flow
79
+ User: “Hello. I have an issue with one of my virtual machines. I’ll share screen so you can see.”
80
+ Assistant:
81
+ Sure — please provide the screen data so I can analyze it and guide you.
82
+
83
+ Notes
84
+ Default to graceful shutdown (force=false) unless specified.
85
+
86
+ Only execute tool calls when explicitly requested.
87
+
88
+ Reasoning must always precede Action; conclusions must appear last.
89
+ '''
90
+
91
+ # Prompt used when analyzing visual or screen content
92
+ VISION_PROMPT = '''
93
+ Analyze screen-sharing images to identify and describe issues mentioned in conversation history, focusing on the right side of the screen.
94
+ You are an AI assistant with vision capabilities specialized in analyzing screen-sharing images. Your role is to examine images and identify issues or elements discussed in the conversation history, with particular attention to the right side of the screen where your target area is located.
95
+ Steps
96
+
97
+ Review Conversation History: Carefully read through the provided conversation history to understand:
98
+
99
+ What issue or problem the user is experiencing
100
+ What specific elements, errors, or concerns they've mentioned
101
+ Their goals and what they're trying to accomplish
102
+
103
+
104
+ Analyze the Image: Examine the provided screen-sharing image with focus on:
105
+
106
+ The right side of the screen (primary target area)
107
+ Visual elements that relate to the user's described issue
108
+ Any error messages, UI problems, or anomalies
109
+ Relevant text, buttons, or interface elements
110
+
111
+
112
+ Identify the Issue: Based on your analysis:
113
+
114
+ Locate the specific issue mentioned by the user
115
+ Note its exact position and visual characteristics
116
+ Gather relevant details about the problem
117
+
118
+
119
+ Report Findings: Provide clear information about:
120
+
121
+ Whether you found the issue
122
+ Exact location and description of the problem
123
+ Any relevant surrounding context or related elements
124
+
125
+ Output Format
126
+ Provide a structured response containing:
127
+
128
+ Issue Found: Yes/No
129
+ Description: Detailed explanation of what you observe
130
+ Recommendation: Brief suggestion if applicable
131
+
132
+ If the issue cannot be located, clearly state this and explain what you were able to observe instead.
133
+ Examples
134
+ Example 1:
135
+ Input: [Conversation history shows user reporting an unreachable virtual machine]
136
+ Output:
137
+ Issue Found: Yes
138
+ Description: The screen share shows a HyperV environment. The referenced VM seems to be powered off.
139
+ Recommendation: The user should click on the "Start" button on the lower side of the right column.
140
+
141
+
142
+ Always prioritize the right side of the screen as specified, but don't ignore relevant information elsewhere if it relates to the issue
143
+ Be specific about visual elements - colors, text, positioning, and states (enabled/disabled, selected/unselected)
144
+ If multiple potential issues are visible, focus on the one most relevant to the conversation history
145
+ Consider common UI issues: missing elements, misalignment, error states, loading problems, or unexpected behavior
146
+ '''
147
+
148
+ def get_generic_prompt() -> str:
149
+ """Return the generic text prompt."""
150
+ return GENERIC_PROMPT
151
+
152
+
153
+ def get_vision_prompt() -> str:
154
+ """Return the vision analysis prompt."""
155
+ return VISION_PROMPT
config/settings.py ADDED
@@ -0,0 +1,84 @@
1
+ import os
2
+ from dataclasses import dataclass
3
+ from typing import Optional
4
+ from pathlib import Path
5
+ from dotenv import load_dotenv
6
+ load_dotenv()
7
+
8
+ @dataclass
9
+ class Settings:
10
+ """Application-wide configuration settings."""
11
+
12
+ # LLM Provider settings
13
+ llm_provider: str = os.getenv("LLM_PROVIDER", "auto")
14
+
15
+ # Hugging Face settings
16
+ hf_token: str = os.getenv("HF_TOKEN", "")
17
+ hf_chat_model: str = os.getenv("HF_CHAT_MODEL", "Qwen/Qwen2.5-7B-Instruct")
18
+ hf_temperature: float = 0.001
19
+ hf_max_new_tokens: int = 512
20
+
21
+ # Model settings
22
+ model_name: str = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")
23
+
24
+ # Audio provider settings
25
+ audio_provider: str = os.getenv("AUDIO_PROVIDER", "auto")
26
+ tts_model: str = os.getenv("TTS_MODEL", "canopylabs/orpheus-3b-0.1-ft")
27
+ stt_model: str = os.getenv("STT_MODEL", "openai/whisper-large-v3")
+
+ # OpenAI fallback settings (used by llm_api_key / llm_endpoint below; env var names are assumptions)
+ openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
+ openai_endpoint: str = os.getenv("OPENAI_ENDPOINT", "https://api.openai.com/v1")
+
30
+ # Screen sharing settings
31
+ screen_capture_interval: float = float(os.getenv("SCREEN_CAPTURE_INTERVAL", "1.0"))
32
+ screen_compression_quality: int = int(os.getenv("SCREEN_COMPRESSION_QUALITY", "50"))
33
+ max_width: int = int(os.getenv("SCREEN_MAX_WIDTH", "3440"))
34
+ max_height: int = int(os.getenv("SCREEN_MAX_HEIGHT", "1440"))
35
+ NEBIUS_MODEL: str = os.getenv("NEBIUS_MODEL", "google/gemma-3-27b-it")
36
+ NEBIUS_API_KEY: str = os.getenv("NEBIUS_API_KEY", "Not found")
37
+ NEBIUS_BASE_URL: str = os.getenv("NEBIUS_BASE_URL", "https://api.studio.nebius.com/v1/")
38
+
39
+ # Hyper-V settings
40
+ hyperv_enabled: bool = os.getenv("HYPERV_ENABLED", "false").lower() == "true"
41
+ hyperv_host: str = os.getenv("HYPERV_HOST", "localhost")
42
+ hyperv_username: Optional[str] = os.getenv("HYPERV_USERNAME")
43
+ hyperv_password: Optional[str] = os.getenv("HYPERV_PASSWORD")
44
+
45
+ # Application settings
46
+ max_conversation_history: int = int(os.getenv("MAX_CONVERSATION_HISTORY", "50"))
47
+ temp_dir: str = os.getenv("TEMP_DIR", "./temp")
48
+ log_level: str = os.getenv("LOG_LEVEL", "INFO")
49
+
50
+ def __post_init__(self):
51
+ # Ensure necessary directories exist
52
+ Path(self.temp_dir).mkdir(exist_ok=True, parents=True)
53
+ Path("./config").mkdir(exist_ok=True, parents=True)
54
+ Path("./logs").mkdir(exist_ok=True, parents=True)
55
+
56
+ def is_hf_token_valid(self) -> bool:
57
+ return bool(self.hf_token and len(self.hf_token) > 10)
58
+
59
+ @property
60
+ def effective_llm_provider(self) -> str:
61
+ if self.llm_provider == "auto":
62
+ return "huggingface" if self.is_hf_token_valid() else "openai"
63
+ return self.llm_provider
64
+
65
+ @property
66
+ def effective_audio_provider(self) -> str:
67
+ if self.audio_provider == "auto":
68
+ return "huggingface" if self.is_hf_token_valid() else "openai"
69
+ return self.audio_provider
70
+
71
+ @property
72
+ def llm_endpoint(self) -> str:
73
+ if self.effective_llm_provider == "huggingface":
74
+ return f"https://api-inference.huggingface.co/models/{self.hf_chat_model}"
75
+ return self.openai_endpoint
76
+
77
+ @property
78
+ def llm_api_key(self) -> str:
79
+ return self.hf_token if self.effective_llm_provider == "huggingface" else self.openai_api_key
80
+
81
+ @property
82
+ def effective_model_name(self) -> str:
83
+ return self.hf_chat_model if self.effective_llm_provider == "huggingface" else self.model_name
84
+
mcp_servers/hyperv_mcp.py ADDED
@@ -0,0 +1,319 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ Standalone FastAPI MCP server for Hyper-V management.
4
+ Server will be available at http://localhost:8000
5
+ """
6
+ import asyncio
7
+ import json
8
+ import logging
9
+ import subprocess
10
+ import sys
11
+ import platform
12
+ from typing import Dict, Any, List, Optional
13
+ from dataclasses import dataclass, asdict
14
+ from fastapi import FastAPI, HTTPException
15
+ from pydantic import BaseModel
16
+ import uvicorn
17
+
18
+ # Ensure SelectorEventLoopPolicy on Windows
19
+ if platform.system() == "Windows":
20
+ try:
21
+ asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
22
+ except AttributeError:
23
+ pass
24
+
25
+ # Setup logging
26
+ logging.basicConfig(level=logging.INFO)
27
+ logger = logging.getLogger("hyperv_mcp_server")
28
+
29
+ # Pydantic models for API
30
+ class ToolCallRequest(BaseModel):
31
+ name: str
32
+ arguments: Dict[str, Any] = {}
33
+
34
+ class ToolInfo(BaseModel):
35
+ name: str
36
+ description: str
37
+ inputSchema: Dict[str, Any]
38
+
39
+ class ToolsListResponse(BaseModel):
40
+ tools: List[ToolInfo]
41
+
42
+ class ToolCallResponse(BaseModel):
43
+ success: bool
44
+ result: Any = None
45
+ error: Optional[str] = None
46
+
47
+ @dataclass
48
+ class VirtualMachine:
49
+ name: str
50
+ state: str
51
+ status: str
52
+
53
+ class HyperVManager:
54
+ def __init__(self, host: str = "localhost", username: Optional[str] = None, password: Optional[str] = None):
55
+ self.host = host
56
+ self.username = username
57
+ self.password = password
58
+
59
+ def _run_powershell(self, command: str) -> str:
60
+ """Execute PowerShell command and return output"""
61
+ try:
62
+ if self.host == "localhost":
63
+ proc = subprocess.run(
64
+ ["powershell", "-Command", command],
65
+ capture_output=True, text=True, shell=True, timeout=30
66
+ )
67
+ if proc.returncode != 0:
68
+ raise RuntimeError(f"PowerShell error: {proc.stderr}")
69
+ return proc.stdout.strip()
70
+ else:
71
+ raise NotImplementedError("Remote host not supported in this server")
72
+ except subprocess.TimeoutExpired:
73
+ raise RuntimeError("PowerShell command timed out")
74
+ except Exception as e:
75
+ logger.error(f"PowerShell execution error: {e}")
76
+ raise
77
+
78
+ async def list_vms(self) -> List[Dict[str, Any]]:
79
+ """List all virtual machines"""
80
+ try:
81
+ cmd = (
82
+ "Get-VM | Select-Object Name,State,Status | "
83
+ "ConvertTo-Json -Depth 2"
84
+ )
85
+ output = await asyncio.get_event_loop().run_in_executor(
86
+ None, self._run_powershell, cmd
87
+ )
88
+
89
+ if not output:
90
+ return []
91
+
92
+ data = json.loads(output)
93
+ if isinstance(data, dict):
94
+ data = [data]
95
+
96
+ vms = []
97
+ for item in data:
98
+ vm = VirtualMachine(
99
+ name=item.get('Name', ''),
100
+ state=item.get('State', ''),
101
+ status=item.get('Status', ''),
102
+ )
103
+ vms.append(asdict(vm))
104
+
105
+ return vms
106
+ except Exception as e:
107
+ logger.error(f"Failed to list VMs: {e}")
108
+ raise
109
+
110
+ async def get_vm_status(self, vm_name: str) -> Dict[str, Any]:
111
+ """Get status of a specific virtual machine"""
112
+ try:
113
+ cmd = (
114
+ f"$vm = Get-VM -Name '{vm_name}' -ErrorAction Stop; "
115
+ "$vm | Select-Object Name,State,Status | "
116
+ "ConvertTo-Json -Depth 2"
117
+ )
118
+ output = await asyncio.get_event_loop().run_in_executor(
119
+ None, self._run_powershell, cmd
120
+ )
121
+
122
+ if not output:
123
+ return {}
124
+
125
+ return json.loads(output)
126
+ except Exception as e:
127
+ logger.error(f"Failed to get VM status for {vm_name}: {e}")
128
+ raise
129
+
130
+ async def start_vm(self, vm_name: str) -> Dict[str, Any]:
131
+ """Start a virtual machine"""
132
+ try:
133
+ cmd = f"Start-VM -Name '{vm_name}' -ErrorAction Stop"
134
+ await asyncio.get_event_loop().run_in_executor(
135
+ None, self._run_powershell, cmd
136
+ )
137
+ return {"success": True, "message": f"VM '{vm_name}' started successfully"}
138
+ except Exception as e:
139
+ logger.error(f"Failed to start VM {vm_name}: {e}")
140
+ raise
141
+
142
+ async def stop_vm(self, vm_name: str, force: bool = False) -> Dict[str, Any]:
143
+ """Stop a virtual machine"""
144
+ try:
145
+ force_flag = "-Force" if force else ""
146
+ cmd = f"Stop-VM -Name '{vm_name}' {force_flag} -ErrorAction Stop"
147
+ await asyncio.get_event_loop().run_in_executor(
148
+ None, self._run_powershell, cmd
149
+ )
150
+ return {"success": True, "message": f"VM '{vm_name}' stopped successfully"}
151
+ except Exception as e:
152
+ logger.error(f"Failed to stop VM {vm_name}: {e}")
153
+ raise
154
+
155
+ async def restart_vm(self, vm_name: str, force: bool = False) -> Dict[str, Any]:
156
+ """Restart a virtual machine"""
157
+ try:
158
+ force_flag = "-Force" if force else ""
159
+ cmd = f"Restart-VM -Name '{vm_name}' {force_flag} -ErrorAction Stop"
160
+ await asyncio.get_event_loop().run_in_executor(
161
+ None, self._run_powershell, cmd
162
+ )
163
+ return {"success": True, "message": f"VM '{vm_name}' restarted successfully"}
164
+ except Exception as e:
165
+ logger.error(f"Failed to restart VM {vm_name}: {e}")
166
+ raise
167
+
168
+
169
+ # Initialize FastAPI app and Hyper-V manager
170
+ app = FastAPI(title="Hyper-V MCP Server", version="1.0.0")
171
+ hyperv_manager = HyperVManager()
172
+
173
+ # Tool definitions
174
+ TOOLS = {
175
+ "list_vms": {
176
+ "name": "list_vms",
177
+ "description": "List all virtual machines on the Hyper-V host",
178
+ "inputSchema": {
179
+ "type": "object",
180
+ "properties": {},
181
+ "required": []
182
+ }
183
+ },
184
+ "get_vm_status": {
185
+ "name": "get_vm_status",
186
+ "description": "Get detailed status information for a specific virtual machine",
187
+ "inputSchema": {
188
+ "type": "object",
189
+ "properties": {
190
+ "vm_name": {"type": "string", "description": "Name of the virtual machine"}
191
+ },
192
+ "required": ["vm_name"]
193
+ }
194
+ },
195
+ "start_vm": {
196
+ "name": "start_vm",
197
+ "description": "Start a virtual machine",
198
+ "inputSchema": {
199
+ "type": "object",
200
+ "properties": {
201
+ "vm_name": {"type": "string", "description": "Name of the virtual machine to start"}
202
+ },
203
+ "required": ["vm_name"]
204
+ }
205
+ },
206
+ "stop_vm": {
207
+ "name": "stop_vm",
208
+ "description": "Stop a virtual machine",
209
+ "inputSchema": {
210
+ "type": "object",
211
+ "properties": {
212
+ "vm_name": {"type": "string", "description": "Name of the virtual machine to stop"},
213
+ "force": {"type": "boolean", "description": "Force stop the VM", "default": False}
214
+ },
215
+ "required": ["vm_name"]
216
+ }
217
+ },
218
+ "restart_vm": {
219
+ "name": "restart_vm",
220
+ "description": "Restart a virtual machine",
221
+ "inputSchema": {
222
+ "type": "object",
223
+ "properties": {
224
+ "vm_name": {"type": "string", "description": "Name of the virtual machine to restart"},
225
+ "force": {"type": "boolean", "description": "Force restart the VM", "default": False}
226
+ },
227
+ "required": ["vm_name"]
228
+ }
229
+ },
230
+ }
231
+
232
+ # API Endpoints
233
+ @app.get("/")
234
+ async def root():
235
+ """Health check endpoint"""
236
+ return {"status": "Hyper-V MCP Server is running", "version": "1.0.0"}
237
+
238
+ @app.get("/tools", response_model=ToolsListResponse)
239
+ async def list_tools():
240
+ """List all available tools"""
241
+ tools = [ToolInfo(**tool_info) for tool_info in TOOLS.values()]
242
+ return ToolsListResponse(tools=tools)
243
+
244
+ @app.post("/tools/call", response_model=ToolCallResponse)
245
+ async def call_tool(request: ToolCallRequest):
246
+ """Execute a tool with given arguments"""
247
+ try:
248
+ tool_name = request.name
249
+ arguments = request.arguments
250
+
251
+ if tool_name not in TOOLS:
252
+ raise HTTPException(status_code=404, detail=f"Tool '{tool_name}' not found")
253
+
254
+ # Get the corresponding method from HyperVManager
255
+ if not hasattr(hyperv_manager, tool_name):
256
+ raise HTTPException(status_code=500, detail=f"Method '{tool_name}' not implemented")
257
+
258
+ method = getattr(hyperv_manager, tool_name)
259
+
260
+ # Call the method with arguments
261
+ if arguments:
262
+ result = await method(**arguments)
263
+ else:
264
+ result = await method()
265
+
266
+ return ToolCallResponse(success=True, result=result)
267
+
268
+ except Exception as e:
269
+ logger.error(f"Tool execution error: {e}")
270
+ return ToolCallResponse(success=False, error=str(e))
271
+
272
+ # Additional convenience endpoints
273
+ @app.get("/vms")
274
+ async def get_vms():
275
+ """Convenience endpoint to list VMs"""
276
+ try:
277
+ result = await hyperv_manager.list_vms()
278
+ return {"success": True, "vms": result}
279
+ except Exception as e:
280
+ raise HTTPException(status_code=500, detail=str(e))
281
+
282
+ @app.get("/vms/{vm_name}")
283
+ async def get_vm(vm_name: str):
284
+ """Convenience endpoint to get VM status"""
285
+ try:
286
+ result = await hyperv_manager.get_vm_status(vm_name)
287
+ return {"success": True, "vm": result}
288
+ except Exception as e:
289
+ raise HTTPException(status_code=500, detail=str(e))
290
+
291
+ @app.post("/vms/{vm_name}/start")
292
+ async def start_vm_endpoint(vm_name: str):
293
+ """Convenience endpoint to start a VM"""
294
+ try:
295
+ result = await hyperv_manager.start_vm(vm_name)
296
+ return result
297
+ except Exception as e:
298
+ raise HTTPException(status_code=500, detail=str(e))
299
+
300
+ @app.post("/vms/{vm_name}/stop")
301
+ async def stop_vm_endpoint(vm_name: str, force: bool = False):
302
+ """Convenience endpoint to stop a VM"""
303
+ try:
304
+ result = await hyperv_manager.stop_vm(vm_name, force)
305
+ return result
306
+ except Exception as e:
307
+ raise HTTPException(status_code=500, detail=str(e))
308
+
309
+ if __name__ == "__main__":
310
+ print("Starting Hyper-V MCP Server...")
311
+ print("Server will be available at: http://localhost:8000")
312
+ print("API documentation at: http://localhost:8000/docs")
313
+
314
+ uvicorn.run(
315
+ app,
316
+ host="0.0.0.0",
317
+ port=8000,
318
+ log_level="info"
319
+ )
requirements.txt ADDED
Binary file (3.51 kB).
 
services/audio_service.py ADDED
@@ -0,0 +1,113 @@
1
+ import io
2
+ import base64
3
+ import logging
4
+ import tempfile
5
+ import asyncio
6
+ from typing import Optional, Union
7
+ from pathlib import Path
8
+
9
+ from huggingface_hub import InferenceClient
10
+
11
+ from config.settings import Settings
12
+
13
+ # Configure logger for detailed debugging
14
+ logger = logging.getLogger(__name__)
15
+ logger.setLevel(logging.DEBUG)
16
+ ch = logging.StreamHandler()
17
+ ch.setLevel(logging.DEBUG)
18
+ formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
19
+ ch.setFormatter(formatter)
20
+ logger.addHandler(ch)
21
+
22
+ class AudioService:
23
+ def __init__(
24
+ self,
25
+ api_key: str,
26
+ stt_provider: str = "fal-ai",
27
+ stt_model: str = "openai/whisper-large-v3",
28
+ tts_model: str = "canopylabs/orpheus-3b-0.1-ft",
29
+ ):
30
+ """
31
+ AudioService with separate providers for ASR and TTS.
32
+
33
+ :param api_key: Hugging Face API token
34
+ :param stt_provider: Provider for speech-to-text (e.g., "fal-ai")
35
+ :param stt_model: ASR model ID
36
+ :param tts_model: TTS model ID
37
+ """
38
+ self.api_key = api_key
39
+ self.stt_model = stt_model
40
+ self.tts_model = tts_model
41
+
42
+ # Speech-to-Text client
43
+ logger.debug(f"Initializing ASR client with provider={stt_provider}")
44
+ self.asr_client = InferenceClient(
45
+ provider=stt_provider,
46
+ api_key=self.api_key,
47
+ )
48
+
49
+ # Text-to-Speech client (no provider needed, use token parameter)
50
+ logger.debug("Initializing TTS client with default provider")
51
+ self.tts_client = InferenceClient(token=self.api_key)
52
+
53
+ logger.info(f"AudioService configured: ASR model={self.stt_model} via {stt_provider}, TTS model={self.tts_model} via default provider.")
54
+
55
+ async def speech_to_text(self, audio_file: Union[str, bytes, io.BytesIO]) -> str:
56
+ """
57
+ Convert speech to text using the configured ASR provider.
58
+ """
59
+ # Prepare input path
60
+ if isinstance(audio_file, str):
61
+ input_path = audio_file
62
+ logger.debug(f"Using existing file for ASR: {input_path}")
63
+ else:
64
+ data = audio_file.getvalue() if isinstance(audio_file, io.BytesIO) else audio_file
65
+ tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
66
+ tmp.write(data)
67
+ tmp.close()
68
+ input_path = tmp.name
69
+ logger.debug(f"Wrote audio to temp file for ASR: {input_path}")
70
+
71
+ # Call ASR synchronously in executor
72
+ try:
73
+ logger.info(f"Calling ASR model={self.stt_model}")
74
+ result = await asyncio.get_event_loop().run_in_executor(
75
+ None,
76
+ lambda: self.asr_client.automatic_speech_recognition(
77
+ input_path,
78
+ model=self.stt_model,
79
+ )
80
+ )
81
+ # Parse result
82
+ transcript = result.get("text") if isinstance(result, dict) else getattr(result, "text", "")
83
+ logger.info(f"ASR success, transcript length={len(transcript)}")
84
+ logger.debug(f"Transcript preview: {transcript[:100]}")
85
+ return transcript or ""
86
+ except Exception as e:
87
+ logger.error(f"ASR error: {e}", exc_info=True)
88
+ return ""
89
+
90
+ async def text_to_speech(self, text: str) -> Optional[bytes]:
91
+ """
92
+ Convert text to speech using the configured TTS provider.
93
+ """
94
+ if not text.strip():
95
+ logger.debug("Empty text input for TTS. Skipping generation.")
96
+ return None
97
+
98
+ def _call_tts():
99
+ """Wrapper function to handle StopIteration properly."""
100
+ try:
101
+ return self.tts_client.text_to_speech(text, model=self.tts_model)
102
+ except StopIteration as e:
103
+ # Convert StopIteration to RuntimeError to prevent Future issues
104
+ raise RuntimeError(f"StopIteration in TTS call: {e}")
105
+
106
+ try:
107
+ logger.info(f"Calling TTS model={self.tts_model}, text length={len(text)}")
108
+ audio = await asyncio.get_event_loop().run_in_executor(None, _call_tts)
109
+ logger.info(f"TTS success, received {len(audio)} bytes")
110
+ return audio
111
+ except Exception as e:
112
+ logger.error(f"TTS error: {e}", exc_info=True)
113
+ return None
services/llm_service.py ADDED
@@ -0,0 +1,73 @@
1
+ import logging
2
+ from typing import Dict, List, Optional
3
+ from dataclasses import dataclass
4
+ from huggingface_hub import InferenceClient
5
+
6
+ from config.settings import Settings
7
+
8
+ # Configure logger for detailed debugging
9
+ logger = logging.getLogger(__name__)
10
+ logger.setLevel(logging.DEBUG)
11
+ ch = logging.StreamHandler()
12
+ ch.setLevel(logging.DEBUG)
13
+ formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
14
+ ch.setFormatter(formatter)
15
+ logger.addHandler(ch)
16
+
17
+ @dataclass
18
+ class LLMConfig:
19
+ api_key: str
20
+ model_name: str
21
+ temperature: float = 0.01
22
+ max_tokens: int = 512
23
+
24
+ class LLMService:
25
+ def __init__(
26
+ self,
27
+ api_key: Optional[str] = None,
28
+ model_name: Optional[str] = None,
29
+ ):
30
+ """
31
+ LLMService that uses HuggingFace InferenceClient for chat completions.
32
+ """
33
+ settings = Settings()
34
+
35
+ # Use provided values or fall back to settings
36
+ key = api_key or settings.hf_token
37
+ name = model_name or settings.effective_model_name
38
+
39
+ self.config = LLMConfig(
40
+ api_key=key,
41
+ model_name=name,
42
+ temperature=settings.hf_temperature,
43
+ max_tokens=settings.hf_max_new_tokens,
44
+ )
45
+
46
+ # Initialize the InferenceClient
47
+ self.client = InferenceClient(token=self.config.api_key)
48
+
49
+ async def get_chat_completion(self, messages: List[Dict[str, str]]) -> str:
50
+ """
51
+ Return the assistant response for a chat-style messages array.
52
+ """
53
+ logger.debug(f"Chat completion request with model: {self.config.model_name}")
54
+
55
+ try:
56
+ # Use chat_completion method
57
+ response = self.client.chat_completion(
58
+ messages=messages,
59
+ model=self.config.model_name,
60
+ max_tokens=self.config.max_tokens,
61
+ temperature=self.config.temperature
62
+ )
63
+
64
+ # Extract the content from the response
65
+ content = response.choices[0].message.content
66
+ logger.debug(f"Chat completion response: {content[:200]}")
67
+
68
+ return content
69
+
70
+ except Exception as e:
71
+ logger.error(f"Chat completion error: {str(e)}")
72
+ raise Exception(f"HF chat completion error: {str(e)}")
73
+
services/screen_service.py ADDED
@@ -0,0 +1,215 @@
1
+ import threading
2
+ import queue
3
+ import time
4
+ import base64
5
+ import io
6
+ import logging
7
+ from typing import Callable, Optional, List, Dict
8
+
9
+ import mss
10
+ import numpy as np
11
+ from PIL import Image
12
+
13
+ from openai import OpenAI
14
+ from config.settings import Settings
15
+
16
+ logger = logging.getLogger(__name__)
17
+
18
+ class ScreenService:
19
+ def __init__(
20
+ self,
21
+ prompt: str,
22
+ model: str,
23
+ fps: float = 0.5,
24
+ queue_size: int = 2,
25
+ monitor: int = 1,
26
+ max_width: int = 3440,
27
+ max_height: int = 1440,
28
+ compression_quality: int = 100,
29
+ image_format: str = "PNG",
30
+ ):
31
+ """
32
+ :param prompt: Vision model instruction
33
+ :param model: Nebius model name
34
+ :param fps: Capture frames per second
35
+ :param queue_size: Internal buffer size
36
+ :param monitor: MSS monitor index
37
+ :param max_width/max_height: Max resolution for resizing
38
+ :param compression_quality: JPEG quality (1-100)
39
+ :param image_format: "JPEG" or "PNG" (PNG is lossless)
40
+ """
41
+ self.prompt = prompt
42
+ self.model = model
43
+ self.fps = fps
44
+ self.queue: queue.Queue = queue.Queue(maxsize=queue_size)
45
+ self.monitor = monitor
46
+ self.max_width = max_width
47
+ self.max_height = max_height
48
+ self.compression_quality = compression_quality
49
+ self.image_format = image_format.upper()
50
+
51
+ self._stop_event = threading.Event()
52
+ self._producer: Optional[threading.Thread] = None
53
+ self._consumer: Optional[threading.Thread] = None
54
+
55
+ # Nebius client
56
+ self.client = OpenAI(
57
+ base_url=Settings.NEBIUS_BASE_URL,
58
+ api_key=Settings.NEBIUS_API_KEY
59
+ )
60
+
61
+ def _process_image(self, img: Image.Image) -> Image.Image:
62
+ # Convert to RGB if needed
63
+ if img.mode != "RGB":
64
+ img = img.convert("RGB")
65
+ w, h = img.size
66
+ ar = w / h
67
+ # Resize maintaining aspect ratio if above max
68
+ if w > self.max_width or h > self.max_height:
69
+ if ar > 1:
70
+ new_w = min(w, self.max_width)
71
+ new_h = int(new_w / ar)
72
+ else:
73
+ new_h = min(h, self.max_height)
74
+ new_w = int(new_h * ar)
75
+ img = img.resize((new_w, new_h), Image.Resampling.LANCZOS)
76
+ return img
77
+
78
+ def _image_to_base64(self, img: Image.Image) -> str:
79
+ buf = io.BytesIO()
80
+ if self.image_format == "PNG":
81
+ img.save(buf, format="PNG")
82
+ else:
83
+ img.save(
84
+ buf,
85
+ format="JPEG",
86
+ quality=self.compression_quality,
87
+ optimize=True
88
+ )
89
+ data = buf.getvalue()
90
+ return base64.b64encode(data).decode("utf-8")
91
+
92
+ def _capture_loop(self):
93
+ with mss.mss() as sct:
94
+ mon = sct.monitors[self.monitor]
95
+ interval = 1.0 / self.fps if self.fps > 0 else 0
96
+ while not self._stop_event.is_set():
97
+ t0 = time.time()
98
+ frame = np.array(sct.grab(mon))
99
+ pil = Image.fromarray(frame)
100
+ pil = self._process_image(pil)
101
+ b64 = self._image_to_base64(pil)
102
+ try:
103
+ self.queue.put_nowait((t0, b64))
104
+ except queue.Full:
105
+ self.queue.get_nowait()
106
+ self.queue.put_nowait((t0, b64))
107
+ if interval:
108
+ time.sleep(interval)
109
+
110
+ def _flatten_conversation_history(self, history: List[Dict[str, str]]) -> str:
111
+ """Flatten conversation history into a readable format for the vision model"""
112
+ if not history:
113
+ return "No previous conversation."
114
+
115
+ # Filter out system messages and vision outputs to avoid confusion
116
+ filtered_history = []
117
+ for msg in history:
118
+ role = msg.get('role', '')
119
+ content = msg.get('content', '')
120
+
121
+ # Skip system messages and previous vision outputs
122
+ if role == 'system':
123
+ continue
124
+ if content.startswith('VISION MODEL OUTPUT:'):
125
+ continue
126
+ if 'screen' in content.lower() and 'sharing' in content.lower():
127
+ continue
128
+
129
+ filtered_history.append(msg)
130
+
131
+ # Take only the last 10 exchanges to keep context manageable
132
+ if len(filtered_history) > 20: # 10 user + 10 assistant messages
133
+ filtered_history = filtered_history[-20:]
134
+
135
+ # Format the conversation
136
+ formatted_lines = []
137
+ for msg in filtered_history:
138
+ role = msg.get('role', 'unknown')
139
+ content = msg.get('content', '')
140
+
141
+ # Truncate very long messages
142
+ if len(content) > 200:
143
+ content = content[:200] + "..."
144
+
145
+ if role == 'user':
146
+ formatted_lines.append(f"User: {content}")
147
+ elif role == 'assistant':
148
+ formatted_lines.append(f"Assistant: {content}")
149
+
150
+ return "\n".join(formatted_lines) if formatted_lines else "No relevant conversation history."
151
+
152
+ def _inference_loop(
153
+ self,
154
+ callback: Callable[[Dict, float, str], None],
155
+ history_getter: Callable[[], List[Dict[str, str]]]
156
+ ):
157
+ while not self._stop_event.is_set():
158
+ try:
159
+ t0, frame_b64 = self.queue.get(timeout=1)
160
+ except queue.Empty:
161
+ continue
162
+
163
+ # Get and flatten the conversation history
164
+ history = history_getter()
165
+ flattened_history = self._flatten_conversation_history(history)
166
+
167
+ # Create the full prompt with system instructions and conversation context
168
+ full_prompt = f"{self.prompt}\n\nCONVERSATION CONTEXT:\n{flattened_history}"
169
+
170
+ # Log a short preview of each history message for debugging
+ for i, msg in enumerate(history):
+ content_preview = msg.get('content', '')[:100] + "..." if len(msg.get('content', '')) > 100 else msg.get('content', '')
+ logger.debug(f"History item {i}: {content_preview}")
+
173
+ user_message = {
174
+ "role": "user",
175
+ "content": [
176
+ {"type": "text", "text": full_prompt},
177
+ {"type": "image_url", "image_url": {"url": f"data:image/{self.image_format.lower()};base64,{frame_b64}"}}
178
+ ]
179
+ }
180
+
181
+ try:
182
+ resp = self.client.chat.completions.create(
183
+ model=self.model,
184
+ messages=[user_message]
185
+ )
186
+ latency = time.time() - t0
187
+ callback(resp, latency, frame_b64)
188
+ except Exception as e:
189
+ logger.error(f"Nebius inference error: {e}")
190
+
191
+ def start(
192
+ self,
193
+ callback: Callable[[Dict, float, str], None],
194
+ history_getter: Callable[[], List[Dict[str, str]]]
195
+ ) -> None:
196
+ if self._producer and self._producer.is_alive():
197
+ return
198
+ self._stop_event.clear()
199
+ self._producer = threading.Thread(target=self._capture_loop, daemon=True)
200
+ self._consumer = threading.Thread(
201
+ target=self._inference_loop,
202
+ args=(callback, history_getter),
203
+ daemon=True
204
+ )
205
+ self._producer.start()
206
+ self._consumer.start()
207
+ logger.info("ScreenService started.")
208
+
209
+ def stop(self) -> None:
210
+ self._stop_event.set()
211
+ if self._producer:
212
+ self._producer.join(timeout=1.0)
213
+ if self._consumer:
214
+ self._consumer.join(timeout=1.0)
215
+ logger.info("ScreenService stopped.")