switch to gradio implementation as streamlit + webrtc requires turn server
- README.md +24 -27
- app.py +56 -125
- requirements.txt +4 -4
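For context on the commit message: streamlit-webrtc only ships a STUN entry by default, and on Spaces (where clients sit behind NAT) the connection generally also needs TURN relay credentials. A minimal sketch of what keeping the Streamlit path would have required; the TURN URL and credentials below are placeholders, not real infrastructure:

```python
# Hypothetical sketch only: what keeping streamlit-webrtc would have required.
# A STUN server alone (as in the removed code) fails behind symmetric NAT;
# a TURN relay with credentials is needed. All values below are placeholders.
from streamlit_webrtc import webrtc_streamer, RTCConfiguration

RTC_CONFIG = RTCConfiguration({
    "iceServers": [
        {"urls": ["stun:stun.l.google.com:19302"]},   # STUN: address discovery only
        {
            "urls": ["turn:turn.example.com:3478"],   # placeholder TURN relay
            "username": "demo-user",                  # placeholder credential
            "credential": "demo-secret",
        },
    ]
})

ctx = webrtc_streamer(key="captioning", rtc_configuration=RTC_CONFIG)
```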
README.md
CHANGED

````diff
@@ -1,27 +1,28 @@
 ---
+title: SmolVLM2 Real-Time Captioning Demo
 emoji: 💻
 colorFrom: gray
 colorTo: gray
-sdk: streamlit
+sdk: gradio
+app_file: app.py
 pinned: false
 license: mit
+short_description: Real-time webcam captioning with SmolVLM2 on llama.cpp
+sdk_version: 5.0
+---
 
+# SmolVLM2 Real-Time Captioning Demo
 
-This Hugging Face Spaces app uses **Streamlit** + **WebRTC** to capture your webcam feed every *N* milliseconds and run it through the SmolVLM2 model on your CPU, displaying live captions below each frame.
+This Hugging Face Spaces app uses **Gradio v5 Blocks** to capture your webcam feed every *N* milliseconds and run it through the SmolVLM2 model on your CPU, displaying live captions below each frame.
 
 ## Features
 
+* **CPU-only inference** via `llama-cpp-python` wrapping `llama.cpp`.
+* **Gradio live streaming** for low-latency, browser-native video input.
+* **Adjustable interval slider** (100 ms to 10 s) for frame capture frequency.
 * **Automatic GGUF model download** from Hugging Face Hub when missing.
-* **Debug logging** in the terminal for tracing inference.
+* **Debug logging** in the terminal for tracing each inference step.
 
 ## Setup
@@ -38,38 +39,34 @@
    pip install -r requirements.txt
    ```
 
+3. **(Optional) Pre-download model files**
+   These will be automatically downloaded if absent:
 
+   * `SmolVLM2-500M-Video-Instruct.Q8_0.gguf`
    * `mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf`
 
+   To skip downloads, place both GGUF files in the repo root.
 
 ## Usage
 
 1. **Launch the app**:
 
    ```bash
-   streamlit run app.py
+   python app.py
    ```
 
-2. **Open your browser** at the URL shown (e.g. `http://localhost:8501`).
+2. **Open your browser** at the URL shown in the terminal (e.g. `http://127.0.0.1:7860`).
 
-3. **Allow webcam access** when prompted
+3. **Allow webcam access** when prompted.
 
-4. **Adjust the capture interval** using the slider.
+4. **Adjust the capture interval** using the slider in the UI.
 
-6. **View live captions** in the panel below the video.
+5. **Live captions** will appear below each video frame.
 
 ## File Structure
 
+* `app.py` – Main Gradio v5 Blocks application.
 * `requirements.txt` – Python dependencies.
+* `.gguf` model files (auto-downloaded or user-provided).
 
 ## License
-
-Licensed under the MIT License.
````
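The README's optional pre-download step can be scripted with `huggingface_hub`. A minimal sketch; the `repo_id` here is an assumption for illustration, and the authoritative value lives in `app.py`'s `ensure_models()`:

```python
# Sketch of the optional pre-download step; REPO_ID is assumed, not confirmed
# by this commit. ensure_models() in app.py holds the real download source.
from huggingface_hub import hf_hub_download

REPO_ID = "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF"  # assumption
for filename in (
    "SmolVLM2-500M-Video-Instruct.Q8_0.gguf",
    "mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf",
):
    # local_dir="." places the files in the repo root, where app.py looks first
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=".")
```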
app.py
CHANGED

````diff
@@ -1,17 +1,11 @@
-import streamlit as st
-st.set_page_config(layout="wide")
-
-import av
+import gradio as gr
 import cv2
-import time
 import tempfile
 import os
 from pathlib import Path
 from huggingface_hub import hf_hub_download
-from streamlit_webrtc import webrtc_streamer, VideoProcessorBase, RTCConfiguration
 from llama_cpp import Llama
-from llama_cpp.llama_chat_format import Llava15ChatHandler, LlamaChatCompletionHandlerRegistry
+from llama_cpp.llama_chat_format import Llava15ChatHandler
 from termcolor import cprint
 
 # ─────────────────────────────────────────
@@ -39,11 +33,6 @@ class SmolVLM2ChatHandler(Llava15ChatHandler):
     "{% if add_generation_prompt %}Assistant:{% endif %}"
 )
 
-# Overwrite any previous registration
-LlamaChatCompletionHandlerRegistry().register_chat_completion_handler(
-    "smolvlm2", SmolVLM2ChatHandler, overwrite=True
-)
-
 # ─────────────────────────────────────────
 # 2) Model & CLIP files – download if missing
 MODEL_FILE = "SmolVLM2-500M-Video-Instruct.Q8_0.gguf"
@@ -61,7 +50,6 @@ def ensure_models():
 
 ensure_models()
 
-@st.cache_resource
 def load_llm():
     handler = SmolVLM2ChatHandler(clip_model_path=CLIP_FILE, verbose=False)
     return Llama(
@@ -74,122 +62,65 @@ def load_llm():
 llm = load_llm()
 
 # ─────────────────────────────────────────
+# 4) Captioning helper (stateless prompt)
 def caption_frame(frame):
+    # make a writable copy
+    frame = frame.copy()
+    # save frame to temporary file for URI
+    with tempfile.NamedTemporaryFile(suffix='.jpg') as f:
         cv2.imwrite(f.name, frame)
         uri = Path(f.name).absolute().as_uri()
 
+        # build a single prompt string
+        messages = [
+            {
+                "role": "system",
+                "content": (
+                    "Focus only on describing the key dramatic action or notable event occurring "
+                    "in this image. Skip general context or scene-setting details unless they are "
+                    "crucial to understanding the main action."
+                ),
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": uri},
+                    {"type": "text", "text": "What is happening in this image?"},
+                ],
+            },
+        ]
+
+        # stateless completion call
+        llm.chat_handler = SmolVLM2ChatHandler(clip_model_path=CLIP_FILE, verbose=False)
+        llm.reset()                # reset n_tokens back to 0
+        llm._ctx.kv_cache_clear()  # clear any cached key/values
+        resp = llm.create_chat_completion(
+            messages=messages,
+            max_tokens=256,
+            temperature=0.1,
+            stop=["<end_of_utterance>"],
+        )
 
-    return out
+    # extract caption
+    caption = (resp.get("choices", [])[0]["message"].get("content", "") or "").strip()
+    return caption
 
-# ─────────────────────────────────────────
-# 4) Streamlit UI + WebRTC configuration
-st.title("🎥 Real-Time Camera Captioning with SmolVLM2 (CPU)")
-
-interval_ms = st.slider(
-    "Caption every N ms", min_value=100, max_value=10000, value=3000, step=100
-)
-
-RTC_CONFIG = RTCConfiguration({
-    "iceServers": [{"urls": ["stun:stun.l.google.com:19302"]}]
-})
-
-import concurrent.futures
-
-class CaptionProcessor(VideoProcessorBase):
-    def __init__(self):
-        self.interval = 1.0
-        self.last_time = time.time()
-        self.caption = ""
-        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
-        self.future = None
-
-    def recv(self, frame: av.VideoFrame) -> av.VideoFrame:
-        img = frame.to_ndarray(format="bgr24")
-        now = time.time()
-
-        # 1) Schedule a new inference if interval has passed and previous is done
-        if now - self.last_time >= self.interval:
-            self.last_time = now
-            # only submit if there isn't already a running task
-            if self.future is None or self.future.done():
-                # copy the frame so that downstream modifying code can't clash
-                img_copy = img.copy()
-                self.future = self.executor.submit(caption_frame, img_copy)
-
-        # 2) If the background task finished, grab its result
-        if self.future and self.future.done():
-            try:
-                self.caption = self.future.result()
-            except Exception as e:
-                self.caption = f"[error: {e}]"
-            self.future = None
-
-        # 3) Draw the **last** caption onto every frame immediately
-        cv2.putText(
-            img,
-            self.caption or "_…thinking…_",
-            org=(10, img.shape[0] - 20),
-            fontFace=cv2.FONT_HERSHEY_SIMPLEX,
-            fontScale=0.6,
-            color=(255, 255, 255),
-            thickness=2,
-            lineType=cv2.LINE_AA,
-        )
-
-        vp = ctx.video_processor
-        if vp is not None:
-            txt = vp.caption or "_…thinking…_"
-        else:
-            txt = "_…loading…_"
-        placeholder.markdown(f"**Caption:** {txt}")
-        time.sleep(0.1)
-else:
-    st.info("▶️ Click **Start** above to begin streaming")
 
+# ─────────────────────────────────────────
+# 5) Gradio UI (v5 streaming)
+demo = gr.Blocks()
+with demo:
+    gr.Markdown("## 🎥 Real-Time Camera Captioning with SmolVLM2 (CPU)")
+    input_img = gr.Image(sources=["webcam"], streaming=True, label="Webcam Feed")
+    caption_box = gr.Textbox(interactive=False, label="Caption")
+
+    # stream frames and captions
+    input_img.stream(
+        fn=caption_frame,
+        inputs=[input_img],
+        outputs=[caption_box],
+        stream_every=3,
+        time_limit=600,
+    )
 
+if __name__ == "__main__":
+    demo.launch()
````
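Because the new `caption_frame` takes a plain NumPy frame and returns a string, it can be smoke-tested without launching the Gradio UI. A minimal sketch, assuming a local `test.jpg` (a hypothetical file) and that the module-level `ensure_models()` and `load_llm()` run on import:

```python
# Standalone smoke test for caption_frame; test.jpg is a hypothetical input.
import cv2
from app import caption_frame  # importing app runs ensure_models()/load_llm()

frame = cv2.imread("test.jpg")         # ndarray in BGR order, as cv2 produces
if frame is None:
    raise SystemExit("test.jpg not found")
print(caption_frame(frame))            # one-shot caption for this single frame
```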
requirements.txt
CHANGED

```diff
@@ -1,6 +1,6 @@
-streamlit
-streamlit-webrtc
+gradio>=5.0
+opencv-python
+pillow
 llama-cpp-python
 huggingface-hub
-termcolor
-opencv-python
+termcolor
```