---
title: SmolVLM2 On Llama.cpp
emoji: 💻
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 5.33.2
app_file: app.py
pinned: false
license: mit
short_description: SmolVLM2 on llama.cpp
---
# SmolVLM2 Live Inference Demo
This Hugging Face Spaces demo runs the SmolVLM2 2.2B, 500M, or 256M Instruct GGUF models on CPU using llama-cpp-python (v0.3.9), which builds llama.cpp under the hood, with Gradio v5.33.2 for the UI. It captures frames from your webcam every N milliseconds, runs live inference on each frame, and displays the model's response in real time.
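Under the hood, llama-cpp-python loads a GGUF checkpoint through its `Llama` class. Below is a minimal text-only sketch of that flow; the model path, thread count, and prompts are illustrative, and the actual `app.py` additionally feeds webcam frames to the model:

```python
from llama_cpp import Llama

# Illustrative path -- use whichever .gguf file you placed in models/.
llm = Llama(
    model_path="models/smolvlm2-500M-instruct.gguf",
    n_ctx=4096,     # context window size
    n_threads=4,    # CPU threads; tune to your machine
    verbose=False,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise visual assistant."},
        {"role": "user", "content": "Describe what the camera sees."},
    ],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```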
## Setup
**Clone this repository:**

```bash
git clone <your-space-repo-url>
cd <your-space-repo-name>
```
**Install dependencies:**

```bash
pip install -r requirements.txt
```
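For reference, a minimal `requirements.txt` consistent with the versions stated above would contain the following; the Space's actual file may pin additional packages:

```text
llama-cpp-python==0.3.9
gradio==5.33.2
```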
**Add your GGUF models:** create a `models/` directory in the root of the repo and upload your `.gguf` files:

```bash
mkdir models
# then upload:
# - smolvlm2-2.2B-instruct.gguf
# - smolvlm2-500M-instruct.gguf
# - smolvlm2-256M-instruct.gguf
```
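The app can then discover whatever you uploaded at startup. A small sketch of that lookup follows; the `list_models` helper is hypothetical and not necessarily what `app.py` uses:

```python
from pathlib import Path

def list_models(models_dir: str = "models") -> list[str]:
    """Return the GGUF filenames available for the model dropdown (hypothetical helper)."""
    return sorted(p.name for p in Path(models_dir).glob("*.gguf"))

print(list_models())  # e.g. ['smolvlm2-2.2B-instruct.gguf', ...]
```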
## Usage
- **Select Model**: Choose one of the `.gguf` files you uploaded.
- **System Prompt**: Customize the system-level instructions for the model.
- **User Prompt**: Provide the user query or instruction.
- **Interval (ms)**: Set how often (in milliseconds) to capture a frame and run inference.
- **Live Camera Feed**: The demo will start your webcam and capture frames at the specified interval.
- **Model Output**: See the model's response below the camera feed.
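A minimal sketch of how such a live loop can be wired up in Gradio 5 is shown below; the component layout, handler body, and `stream_every` value are illustrative and may differ from the actual `app.py`:

```python
import gradio as gr

def infer(frame, system_prompt, user_prompt):
    # Placeholder: run llama.cpp inference on `frame` here and return the text.
    if frame is None:
        return ""
    h, w = frame.shape[:2]
    return f"Captured a {w}x{h} frame."

with gr.Blocks() as demo:
    system_box = gr.Textbox(label="System Prompt", value="You are a helpful assistant.")
    user_box = gr.Textbox(label="User Prompt", value="Describe what you see.")
    cam = gr.Image(sources=["webcam"], streaming=True, label="Live Camera Feed")
    out = gr.Textbox(label="Model Output")
    # stream_every is in seconds, so a 500 ms interval is 0.5.
    cam.stream(infer, inputs=[cam, system_box, user_box], outputs=out, stream_every=0.5)

demo.launch()
```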
## Notes
- This demo runs entirely on CPU. Inference speed depends on the model size and your machine's CPU performance.
- Make sure your browser has permission to access your webcam.