---
title: IRIS
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: false
short_description: IRIS HuggingFace Hackathon
tags:
  - agent-demo-track
---
# IRIS

**Important**
- Watch IRIS's video overview here: https://www.youtube.com/watch?v=dieWyZZez6o
- IRIS does not work on Spaces! It requires a virtualization environment on Amazon or Azure (or a local machine), because its MCP server targets Hyper-V virtual machines.
## Overview
IRIS is an agentic chatbot proof-of-concept built for the HuggingFace Hackathon. It demonstrates how a multimodal AI assistant can:
- Listen to voice commands (STT)
- Speak AI responses (TTS)
- See user screens and analyze them with a vision model
- Act on infrastructure via an MCP integration
The goal is to showcase how modern LLMs, audio models, vision models and operator toolchains can be combined into a seamless, voice-driven infrastructure management assistant.
## Key Goals

### Multimodal Interaction
- Voice: real-time speech-to-text (STT) and text-to-speech (TTS)
- Vision: live screen capture + AI analysis
- Text: conversational UI backed by an LLM
### Agentic Control
- Automatically detect when to call management tools
- Execute Hyper-V VM operations through a RESTful MCP server
### Proof-of-Concept (POC)
- Focus on clarity and modularity
- Demonstrate core concepts rather than production-grade polish
## Functionalities & Offerings

### 1. Audio Service
- STT: Uses HuggingFace’s Falcon-AI (or OpenAI Whisper) to transcribe user speech.
- TTS: Leverages a HuggingFace TTS model (e.g. `canopylabs/orpheus-3b`) to speak back responses.
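
As a rough illustration of the audio path, the sketch below uses huggingface_hub's `InferenceClient` for both directions. The helper names, the Whisper checkpoint, and the wiring are assumptions for this README, not IRIS's actual implementation; the Orpheus model may also require a dedicated provider rather than the default inference endpoint.

```python
# Hedged sketch of the audio service: STT and TTS via huggingface_hub's InferenceClient.
# Model IDs and function names are illustrative only.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ["HF_TOKEN"])

def transcribe(audio_path: str) -> str:
    """Speech-to-text: send recorded audio to an ASR model and return the transcript."""
    result = client.automatic_speech_recognition(
        audio_path, model="openai/whisper-large-v3"  # assumed Whisper checkpoint
    )
    return result.text

def speak(text: str, out_path: str = "reply.wav") -> str:
    """Text-to-speech: synthesize the assistant's reply and write it to an audio file."""
    audio_bytes = client.text_to_speech(text, model="canopylabs/orpheus-3b-0.1-ft")
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path
```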
### 2. Text (LLM) Service
- Built on HuggingFace's 🧩 InferenceClient, with an OpenAI fallback.
- Default model: `Qwen/Qwen2.5-7B-Instruct` (configurable).
- Handles chat prompt orchestration, reasoning-before-action, and tool-call formatting.
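
A minimal sketch of the chat call, assuming `InferenceClient.chat_completion` with the default Qwen model; the system prompt and message handling shown here are illustrative, not the project's actual prompt or orchestration logic.

```python
# Illustrative LLM service call using huggingface_hub's chat-completion API.
import os
from huggingface_hub import InferenceClient

llm = InferenceClient(token=os.environ["HF_TOKEN"])

# Assumed system prompt; the real reasoning-before-action prompt lives in the app.
SYSTEM_PROMPT = (
    "You are IRIS, an infrastructure assistant. Reason before acting. "
    "When a VM operation is required, emit a tool call instead of prose."
)

def chat(history: list[dict], user_message: str) -> str:
    """Send the running conversation to the default model and return the reply text."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]
    response = llm.chat_completion(
        messages=messages,
        model="Qwen/Qwen2.5-7B-Instruct",  # configurable, per the README
        max_tokens=512,
    )
    return response.choices[0].message.content
```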
### 3. Vision & Screen Service
- Captures your monitor at configurable FPS and resolution.
- Sends images to a Nebius vision model (`google/gemma-3-27b-it`) with a guided prompt.
- Parses vision output into “Issue Found / Description / Recommendation”.
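
One way to sketch this loop, assuming `mss` for screen capture and an `InferenceClient` routed to the Nebius provider; the guided prompt below is a paraphrase, and the FPS throttling and resizing are omitted.

```python
# Hedged sketch of the vision/screen service: capture one frame, send it to the
# Nebius-hosted vision model, and return the structured analysis text.
import base64
import io
import os

from huggingface_hub import InferenceClient
from mss import mss
from PIL import Image

vision = InferenceClient(provider="nebius", api_key=os.environ["NEBIUS_API_KEY"])

# Paraphrased guided prompt; the real one ships with the app.
GUIDED_PROMPT = (
    "Inspect this screenshot and answer in the form: "
    "Issue Found / Description / Recommendation."
)

def capture_screen(monitor_index: int = 1) -> str:
    """Grab one frame of the selected monitor and return it as a base64 data URI."""
    with mss() as grabber:
        shot = grabber.grab(grabber.monitors[monitor_index])
        img = Image.frombytes("RGB", shot.size, shot.rgb)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def analyze_screen() -> str:
    """Send the latest frame plus the guided prompt to the vision model."""
    reply = vision.chat_completion(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": GUIDED_PROMPT},
            {"type": "image_url", "image_url": {"url": capture_screen()}},
        ]}],
    )
    return reply.choices[0].message.content
```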
### 4. MCP Integration
- Hyper-V MCP Server: FastAPI service exposing tools to list, query, start, stop, and restart VMs.
- Agent parses LLM tool calls and invokes them via HTTP.
- Enables fully automated infrastructure actions in response to user voice commands.
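
A hedged sketch of what such a server endpoint could look like: a FastAPI app shelling out to PowerShell's Hyper-V cmdlets. The routes, JSON shapes, and helper names are assumptions, not the actual `hyperv_mcp.py` contract.

```python
# Illustrative Hyper-V MCP server: FastAPI endpoints wrapping PowerShell cmdlets.
import json
import subprocess

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Hyper-V MCP Server (sketch)")

def run_powershell(command: str) -> str:
    """Run a PowerShell command and return its stdout, raising on failure."""
    proc = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        raise HTTPException(status_code=500, detail=proc.stderr.strip())
    return proc.stdout

@app.get("/vms")
def list_vms():
    """List VMs with their current state."""
    out = run_powershell("Get-VM | Select-Object Name, State | ConvertTo-Json")
    return json.loads(out or "[]")

@app.post("/vms/{name}/start")
def start_vm(name: str):
    """Start a VM by name."""
    run_powershell(f"Start-VM -Name '{name}'")
    return {"status": "started", "vm": name}
```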
## Providers & Configuration
| Service | Provider / Model |
|---|---|
| LLM | HuggingFace Inference (fallback: OpenAI) |
| STT | Falcon-AI (with HF token) or OpenAI Whisper |
| TTS | HF TTS (`canopylabs/orpheus-3b-0.1-ft`) |
| Vision | Nebius (`google/gemma-3-27b-it`) |
| MCP (VM control) | Custom Hyper-V FastAPI server |
| UI Framework | Gradio |
All credentials and endpoints are managed via environment variables in `config/settings.py`.
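
For illustration, a settings module of this kind might look like the following; the environment variable names and defaults are assumptions, not the actual contents of `config/settings.py`.

```python
# Hypothetical config/settings.py: centralize credentials and endpoints from env vars.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    hf_token: str = os.getenv("HF_TOKEN", "")
    openai_api_key: str = os.getenv("OPENAI_API_KEY", "")   # optional fallback provider
    nebius_api_key: str = os.getenv("NEBIUS_API_KEY", "")
    llm_model: str = os.getenv("LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct")
    mcp_base_url: str = os.getenv("MCP_BASE_URL", "http://localhost:8000")

settings = Settings()
```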
## Quickstart
- Configure `.env` with your HF and (optionally) OpenAI tokens.
- Run the Hyper-V MCP server: `python hyperv_mcp.py`
- Launch the Gradio app: `python app.py`
- Interact by typing or speaking:
  - Click “Start sharing screen” to begin vision analysis.
  - Ask IRIS to list VMs, check status, or start a VM by voice.
  - IRIS will confirm actions and execute them through the MCP.
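
To illustrate that last step end to end, the sketch below shows how a parsed LLM tool call could be dispatched to the MCP server over HTTP; the tool-call JSON shape and endpoint paths are assumptions matching the server sketch above, not IRIS's actual wire format.

```python
# Illustrative agent-side dispatcher: turn an LLM tool call into an MCP HTTP request.
import json
import requests

MCP_BASE_URL = "http://localhost:8000"  # assumed default for the local Hyper-V server

def dispatch_tool_call(raw_tool_call: str) -> dict:
    """Parse a JSON tool call emitted by the LLM and invoke the matching MCP endpoint."""
    call = json.loads(raw_tool_call)   # e.g. {"tool": "start_vm", "vm_name": "web-01"}
    tool, vm = call.get("tool"), call.get("vm_name")
    if tool == "list_vms":
        resp = requests.get(f"{MCP_BASE_URL}/vms", timeout=30)
    elif tool in ("start_vm", "stop_vm", "restart_vm"):
        action = tool.split("_")[0]    # start / stop / restart
        resp = requests.post(f"{MCP_BASE_URL}/vms/{vm}/{action}", timeout=60)
    else:
        raise ValueError(f"Unknown tool: {tool}")
    resp.raise_for_status()
    return resp.json()
```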
## Contact

- Email: a.zamfir@hotmail.com
- LinkedIn: Andrei Zamfir (https://www.linkedin.com/in/andrei-d-zamfir/)