---
title: IRIS
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: false
short_description: IRIS HuggingFace Hackathon
tags:
  - agent-demo-track
---

# IRIS

> **Important**
>
> 1. Watch IRIS' video overview here: https://www.youtube.com/watch?v=dieWyZZez6o
> 2. IRIS does not run on Spaces. It requires a virtualization environment on Amazon or Azure (or a local machine), because its MCP server targets Hyper-V virtual machines.

## Overview

IRIS is an agentic chatbot proof-of-concept built for the HuggingFace Hackathon. It demonstrates how a multimodal AI assistant can:

- Listen to voice commands (STT)
- Speak AI responses (TTS)
- See the user's screen and analyze it with a vision model
- Act on infrastructure via an MCP integration

The goal is to showcase how modern LLMs, audio models, vision models and operator toolchains can be combined into a seamless, voice-driven infrastructure management assistant.

## Key Goals

1. **Multimodal Interaction**
   - Voice: real-time speech-to-text (STT) and text-to-speech (TTS)
   - Vision: live screen capture + AI analysis
   - Text: conversational UI backed by an LLM
2. **Agentic Control** (see the tool-call sketch after this list)
   - Automatically detect when to call management tools
   - Execute Hyper-V VM operations through a RESTful MCP server
3. **Proof-of-Concept (POC)**
   - Focus on clarity and modularity
   - Demonstrate core concepts rather than production-grade polish
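
A minimal sketch of the tool-call detection idea. The `TOOL_CALL:` convention and the field names below are illustrative assumptions, not necessarily the exact format IRIS uses:

```python
import json

# Assumed convention (illustrative only): the system prompt asks the LLM to emit
#   TOOL_CALL: {"tool": "start_vm", "arguments": {"name": "web-01"}}
# on its own line whenever an infrastructure action is needed.
def extract_tool_call(llm_reply: str):
    """Return (tool_name, arguments) if the reply contains a tool call, else None."""
    for line in llm_reply.splitlines():
        if line.strip().startswith("TOOL_CALL:"):
            payload = json.loads(line.split("TOOL_CALL:", 1)[1])
            return payload["tool"], payload.get("arguments", {})
    return None  # plain conversational answer, nothing to execute
```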

## Functionalities & Offerings

### 1. Audio Service

- **STT**: uses HuggingFace's Falcon-AI (or OpenAI Whisper) to transcribe user speech.
- **TTS**: leverages a HuggingFace TTS model (e.g. canopylabs/orpheus-3b) to speak responses back to the user.
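
A rough sketch of the STT/TTS calls through `huggingface_hub`'s `InferenceClient`; the model IDs and function wiring here are illustrative, not necessarily what `app.py` does:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # in practice, read the token from the environment

def transcribe(audio_path: str) -> str:
    # Speech-to-text; Whisper is used here as a stand-in for the configured STT model.
    result = client.automatic_speech_recognition(audio_path, model="openai/whisper-large-v3")
    return result.text

def speak(text: str) -> bytes:
    # Text-to-speech; returns raw audio bytes that Gradio can play back.
    return client.text_to_speech(text, model="canopylabs/orpheus-3b-0.1-ft")
```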

### 2. Text (LLM) Service

- Built on HuggingFace's 🧩 InferenceClient, with an OpenAI fallback.
- Default model: Qwen/Qwen2.5-7B-Instruct (configurable).
- Handles chat prompt orchestration, reasoning-before-action, and tool-call formatting.
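
A minimal sketch of the chat call, assuming the default Qwen model and `huggingface_hub`'s chat-completion API; the history format is an assumption for illustration:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")

def chat(history: list[dict], user_message: str) -> str:
    # history is a list of {"role": ..., "content": ...} dicts maintained by the UI.
    messages = history + [{"role": "user", "content": user_message}]
    response = client.chat_completion(
        messages=messages,
        model="Qwen/Qwen2.5-7B-Instruct",  # default model; configurable in settings
        max_tokens=512,
    )
    return response.choices[0].message.content
```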

### 3. Vision & Screen Service

- Captures your monitor at a configurable FPS and resolution.
- Sends frames to a Nebius vision model (google/gemma-3-27b-it) with a guided prompt.
- Parses the vision output into "Issue Found / Description / Recommendation".
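
A rough sketch of the capture step and the multimodal request shape; the `mss` library choice and the OpenAI-style message format are assumptions for illustration, not necessarily how the vision service is wired:

```python
import base64
import io

import mss            # screen capture (assumed library choice)
from PIL import Image

def grab_screen_as_data_url() -> str:
    """Capture the primary monitor and return it as a base64 data URL."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])          # monitor 1 = primary display
        img = Image.frombytes("RGB", shot.size, shot.rgb)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=70)      # compress before upload
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

# The frame is then sent to the vision model with a guided prompt that asks for
# the "Issue Found / Description / Recommendation" structure.
vision_messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Analyze this screen. Report: Issue Found, Description, Recommendation."},
        {"type": "image_url", "image_url": {"url": grab_screen_as_data_url()}},
    ],
}]
```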

### 4. MCP Integration

- **Hyper-V MCP server**: a FastAPI service exposing tools to list, query, start, stop, and restart VMs.
- The agent parses LLM tool calls and invokes the corresponding tools via HTTP.
- Enables fully automated infrastructure actions in response to user voice commands.
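
A minimal sketch of what such a server endpoint could look like; the route names and the PowerShell wrapping below are assumptions, not the actual `hyperv_mcp.py` API:

```python
import json
import subprocess

from fastapi import FastAPI

app = FastAPI(title="Hyper-V MCP server (sketch)")

def run_powershell(command: str) -> str:
    # Hyper-V is driven through its PowerShell cmdlets on the Windows host.
    completed = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout

@app.get("/vms")
def list_vms():
    """List VMs with their state (illustrative route name)."""
    out = run_powershell("Get-VM | Select-Object Name, State | ConvertTo-Json")
    return json.loads(out)

@app.post("/vms/{name}/start")
def start_vm(name: str):
    run_powershell(f"Start-VM -Name '{name}'")
    return {"status": "started", "vm": name}
```

On the agent side, a parsed tool call is mapped to one of these HTTP routes and invoked with an HTTP client such as `requests` or `httpx`.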

## Providers & Configuration

| Service          | Provider / Model                            |
|------------------|---------------------------------------------|
| LLM              | HuggingFace Inference (fallback: OpenAI)    |
| STT              | Falcon-AI (with HF token) or OpenAI Whisper |
| TTS              | HF TTS (canopylabs/orpheus-3b-0.1-ft)       |
| Vision           | Nebius (google/gemma-3-27b-it)              |
| MCP (VM control) | Custom Hyper-V FastAPI server               |
| UI framework     | Gradio                                      |

All credentials and endpoints are managed via environment variables in config/settings.py.
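
A hedged sketch of that pattern; the variable names below are illustrative, and the actual names live in `config/settings.py`:

```python
# config/settings.py (illustrative sketch; actual variable names may differ)
import os

from dotenv import load_dotenv

load_dotenv()  # pick up the .env file created in the Quickstart below

HF_TOKEN = os.getenv("HF_TOKEN")                                    # HuggingFace Inference credentials
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")                        # optional fallback provider
NEBIUS_API_KEY = os.getenv("NEBIUS_API_KEY")                        # vision model provider
MCP_BASE_URL = os.getenv("MCP_BASE_URL", "http://localhost:8000")   # Hyper-V MCP server
```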

## Quickstart

1. Configure `.env` with your HF and (optionally) OpenAI tokens.
2. Run the Hyper-V MCP server: `python hyperv_mcp.py`
3. Launch the Gradio app: `python app.py`
4. Interact by typing or speaking:
   - Click "Start sharing screen" to begin vision analysis.
   - Ask IRIS to list VMs, check status, or start a VM by voice.
   - IRIS will confirm actions and execute them through the MCP.

## Contact

- Email: a.zamfir@hotmail.com
- LinkedIn: [Andrei Zamfir](https://www.linkedin.com/in/andrei-d-zamfir/)