---
title: SmolVLM2 On Llama.cpp
emoji: 💻
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 5.33.2
app_file: app.py
pinned: false
license: mit
short_description: SmolVLM2 on llama.cpp
---

# SmolVLM2 Live Inference Demo

This Hugging Face Spaces demo runs the SmolVLM2 2.2B, 500M, or 256M Instruct GGUF models on CPU using `llama-cpp-python` (v0.3.9), which builds `llama.cpp` under the hood, and Gradio v5.33.2 for the UI. It captures frames from your webcam every N milliseconds and performs live inference, displaying the model's response in real time. (Hedged sketches of the model download, the inference call, and the frame streaming appear at the end of this README.)

## Setup

1. **Clone this repository**

   ```bash
   git clone <repository-url>
   cd <repository-directory>
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Add your GGUF models**

   Create a `models/` directory in the root of the repo and upload your `.gguf` files (a scripted download sketch appears at the end of this README):

   ```bash
   mkdir models
   # then upload:
   # - smolvlm2-2.2B-instruct.gguf
   # - smolvlm2-500M-instruct.gguf
   # - smolvlm2-256M-instruct.gguf
   ```

## Usage

- **Select Model**: Choose one of the `.gguf` files you uploaded.
- **System Prompt**: Customize the system-level instructions for the model.
- **User Prompt**: Provide the user query or instruction.
- **Interval (ms)**: Set how often (in milliseconds) to capture a frame and run inference.
- **Live Camera Feed**: The demo starts your webcam and captures frames at the specified interval.
- **Model Output**: The model’s response appears below the camera feed.

## Notes

- This demo runs entirely on CPU. Inference speed depends on the model size and your machine's CPU performance.
- Make sure your browser has permission to access your webcam.
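
## Appendix: implementation sketches

The snippets below are illustrative sketches, not the contents of `app.py`; names marked as placeholders are assumptions, since this README does not pin them down.

### Downloading the models

If your GGUF files live on the Hugging Face Hub rather than being uploaded by hand, Setup step 3 can be scripted with `huggingface_hub.hf_hub_download`. The repo id below is a placeholder; this README does not name the repository that hosts the quantized models.

```python
from huggingface_hub import hf_hub_download

# "<repo-id>" is a placeholder -- substitute the Hub repository
# that actually hosts your quantized SmolVLM2 GGUF files.
for filename in [
    "smolvlm2-2.2B-instruct.gguf",
    "smolvlm2-500M-instruct.gguf",
    "smolvlm2-256M-instruct.gguf",
]:
    hf_hub_download(repo_id="<repo-id>", filename=filename, local_dir="models")
```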
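
### The inference call

This is the core inference call reduced to a single frame, as a minimal sketch under two assumptions not stated in this README: that a multimodal projector file (`mmproj-*.gguf`, filename hypothetical) sits next to the language model, and that `llama-cpp-python`'s LLaVA-style chat handler accepts SmolVLM2's projector. The actual `app.py` may wire this differently.

```python
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def to_data_uri(path: str) -> str:
    # create_chat_completion accepts images as base64 data URIs
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


# Both model file names below are assumptions for illustration.
llm = Llama(
    model_path="models/smolvlm2-500M-instruct.gguf",
    chat_handler=Llava15ChatHandler(
        clip_model_path="models/mmproj-smolvlm2-500M.gguf"
    ),
    n_ctx=4096,   # room for the image tokens plus the response
    n_threads=4,  # CPU-only: tune to your core count
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise visual assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": to_data_uri("frame.png")}},
                {"type": "text", "text": "Describe what you see."},
            ],
        },
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```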
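
### Webcam streaming at an interval

A sketch of how the **Interval (ms)** setting can map onto Gradio 5's streaming API, assuming the demo uses a streaming `gr.Image` with a `.stream()` event. Note that `stream_every` is given in seconds, so an interval in milliseconds divides by 1000. The `describe` stub stands in for the inference call sketched above.

```python
import gradio as gr


def describe(frame):
    # `frame` arrives as a numpy array; a real handler would encode it
    # and pass it to the model as in the inference sketch above.
    return "model output goes here"


with gr.Blocks() as demo:
    camera = gr.Image(sources=["webcam"], streaming=True, label="Live Camera Feed")
    output = gr.Textbox(label="Model Output")
    # stream_every is in seconds: 0.5 corresponds to a 500 ms interval
    camera.stream(describe, inputs=camera, outputs=output, stream_every=0.5)

demo.launch()
```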