---
title: IRIS
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: false
short_description: IRIS HuggingFace Hackathon
tags:
  - agent-demo-track
---

# IRIS

> **Important**
>
> 1. Watch IRIS' video overview here: https://www.youtube.com/watch?v=dieWyZZez6o
> 2. IRIS does not run on Spaces. It requires a virtualization environment on Amazon or Azure (or a local machine), because its MCP server targets Hyper-V virtual machines.

## Overview

IRIS is an agentic chatbot proof-of-concept built for the HuggingFace Hackathon. It demonstrates how a multimodal AI assistant can:

- Listen to voice commands (STT)
- Speak AI responses (TTS)
- See the user's screen and analyze it with a vision model
- Act on infrastructure via an MCP integration

The goal is to showcase how modern LLMs, audio models, vision models and operator toolchains can be combined into a seamless, voice-driven infrastructure management assistant.

## Key Goals

1. **Multimodal Interaction**
   - Voice: real-time speech-to-text (STT) and text-to-speech (TTS)
   - Vision: live screen capture + AI analysis
   - Text: conversational UI backed by an LLM
2. **Agentic Control** (see the tool-call sketch after this list)
   - Automatically detect when to call management tools
   - Execute Hyper-V VM operations through a RESTful MCP server
3. **Proof-of-Concept (POC)**
   - Focus on clarity and modularity
   - Demonstrate core concepts rather than production-grade polish
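
A minimal sketch of the tool-call detection idea. The `TOOL_CALL:` convention and the field names below are illustrative assumptions, not necessarily the exact format IRIS uses:

```python
import json

# Assumed convention (illustrative only): the system prompt asks the LLM to emit
#   TOOL_CALL: {"tool": "start_vm", "arguments": {"name": "web-01"}}
# on its own line whenever an infrastructure action is needed.
def extract_tool_call(llm_reply: str):
    """Return (tool_name, arguments) if the reply contains a tool call, else None."""
    for line in llm_reply.splitlines():
        if line.strip().startswith("TOOL_CALL:"):
            payload = json.loads(line.split("TOOL_CALL:", 1)[1])
            return payload["tool"], payload.get("arguments", {})
    return None  # plain conversational answer, nothing to execute
```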

## Functionalities & Offerings

### 1. Audio Service

- **STT**: uses HuggingFace's Falcon-AI (or OpenAI Whisper) to transcribe user speech.
- **TTS**: leverages a HuggingFace TTS model (e.g. canopylabs/orpheus-3b) to speak responses back to the user.
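
A rough sketch of the STT/TTS calls through `huggingface_hub`'s `InferenceClient`; the model IDs and function wiring here are illustrative, not necessarily what `app.py` does:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # in practice, read the token from the environment

def transcribe(audio_path: str) -> str:
    # Speech-to-text; Whisper is used here as a stand-in for the configured STT model.
    result = client.automatic_speech_recognition(audio_path, model="openai/whisper-large-v3")
    return result.text

def speak(text: str) -> bytes:
    # Text-to-speech; returns raw audio bytes that Gradio can play back.
    return client.text_to_speech(text, model="canopylabs/orpheus-3b-0.1-ft")
```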

### 2. Text (LLM) Service

- Built on HuggingFace's 🧩 InferenceClient, with an OpenAI fallback.
- Default model: Qwen/Qwen2.5-7B-Instruct (configurable).
- Handles chat prompt orchestration, reasoning-before-action, and tool-call formatting.
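
A minimal sketch of the chat call, assuming the default Qwen model and `huggingface_hub`'s chat-completion API; the history format is an assumption for illustration:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")

def chat(history: list[dict], user_message: str) -> str:
    # history is a list of {"role": ..., "content": ...} dicts maintained by the UI.
    messages = history + [{"role": "user", "content": user_message}]
    response = client.chat_completion(
        messages=messages,
        model="Qwen/Qwen2.5-7B-Instruct",  # default model; configurable in settings
        max_tokens=512,
    )
    return response.choices[0].message.content
```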

### 3. Vision & Screen Service

- Captures your monitor at a configurable FPS and resolution.
- Sends frames to a Nebius vision model (google/gemma-3-27b-it) with a guided prompt.
- Parses the vision output into "Issue Found / Description / Recommendation".
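
A rough sketch of the capture step and the multimodal request shape; the `mss` library choice and the OpenAI-style message format are assumptions for illustration, not necessarily how the vision service is wired:

```python
import base64
import io

import mss            # screen capture (assumed library choice)
from PIL import Image

def grab_screen_as_data_url() -> str:
    """Capture the primary monitor and return it as a base64 data URL."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])          # monitor 1 = primary display
        img = Image.frombytes("RGB", shot.size, shot.rgb)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=70)      # compress before upload
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

# The frame is then sent to the vision model with a guided prompt that asks for
# the "Issue Found / Description / Recommendation" structure.
vision_messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Analyze this screen. Report: Issue Found, Description, Recommendation."},
        {"type": "image_url", "image_url": {"url": grab_screen_as_data_url()}},
    ],
}]
```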

### 4. MCP Integration

- **Hyper-V MCP server**: a FastAPI service exposing tools to list, query, start, stop, and restart VMs.
- The agent parses LLM tool calls and invokes the corresponding tools via HTTP.
- Enables fully automated infrastructure actions in response to user voice commands.
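
A minimal sketch of what such a server endpoint could look like; the route names and the PowerShell wrapping below are assumptions, not the actual `hyperv_mcp.py` API:

```python
import json
import subprocess

from fastapi import FastAPI

app = FastAPI(title="Hyper-V MCP server (sketch)")

def run_powershell(command: str) -> str:
    # Hyper-V is driven through its PowerShell cmdlets on the Windows host.
    completed = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout

@app.get("/vms")
def list_vms():
    """List VMs with their state (illustrative route name)."""
    out = run_powershell("Get-VM | Select-Object Name, State | ConvertTo-Json")
    return json.loads(out)

@app.post("/vms/{name}/start")
def start_vm(name: str):
    run_powershell(f"Start-VM -Name '{name}'")
    return {"status": "started", "vm": name}
```

On the agent side, a parsed tool call is mapped to one of these HTTP routes and invoked with an HTTP client such as `requests` or `httpx`.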

## Providers & Configuration

| Service          | Provider / Model                            |
|------------------|---------------------------------------------|
| LLM              | HuggingFace Inference (fallback: OpenAI)    |
| STT              | Falcon-AI (with HF token) or OpenAI Whisper |
| TTS              | HF TTS (canopylabs/orpheus-3b-0.1-ft)       |
| Vision           | Nebius (google/gemma-3-27b-it)              |
| MCP (VM control) | Custom Hyper-V FastAPI server               |
| UI framework     | Gradio                                      |

All credentials and endpoints are managed via environment variables in config/settings.py.
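
A hedged sketch of that pattern; the variable names below are illustrative, and the actual names live in `config/settings.py`:

```python
# config/settings.py (illustrative sketch; actual variable names may differ)
import os

from dotenv import load_dotenv

load_dotenv()  # pick up the .env file created in the Quickstart below

HF_TOKEN = os.getenv("HF_TOKEN")                                    # HuggingFace Inference credentials
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")                        # optional fallback provider
NEBIUS_API_KEY = os.getenv("NEBIUS_API_KEY")                        # vision model provider
MCP_BASE_URL = os.getenv("MCP_BASE_URL", "http://localhost:8000")   # Hyper-V MCP server
```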

## Quickstart

1. Configure `.env` with your HF and (optionally) OpenAI tokens.
2. Run the Hyper-V MCP server: `python hyperv_mcp.py`
3. Launch the Gradio app: `python app.py`
4. Interact by typing or speaking:
   - Click "Start sharing screen" to begin vision analysis.
   - Ask IRIS to list VMs, check status, or start a VM by voice.
   - IRIS will confirm actions and execute them through the MCP.

## Contact

- Email: a.zamfir@hotmail.com
- LinkedIn: [Andrei Zamfir](https://www.linkedin.com/in/andrei-d-zamfir/)