ProfVLM V2: Video-Language Model for Sports Proficiency Analysis

ProfVLM is a multimodal model that combines video understanding with language generation for analyzing human performance and proficiency levels in various activities.

Model Version: V2

Model Description

ProfVLM integrates:

  • Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
  • Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
  • Custom Video Adapter: AttentiveProjector with multi-head attention for view integration

Key Features

  • Multi-view support: The architecture processes multiple camera views simultaneously; this checkpoint is configured for 1 view
  • Temporal modeling: Analyzes 32 frames per video
  • Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
  • Sport agnostic: Trained on multiple sports and activities (basketball, cooking, dance, bouldering, soccer, music)

Version 2 Features

  • Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
  • Flexible Frame Count: Supports a variable number of frames per video via time-embedding interpolation
  • Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage (see the sketches after this list)
  • Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
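
The segment-based sampler is not included in this repository; the following is a minimal sketch of the idea, assuming one frame is taken per equal-length segment (training may instead draw a random frame within each segment):

import numpy as np

def sample_frame_indices(total_frames, num_frames=32):
    # Split the clip into num_frames equal segments and take the centre
    # frame of each segment, so the sample spans the whole video.
    boundaries = np.linspace(0, total_frames, num_frames + 1)
    centres = (boundaries[:-1] + boundaries[1:]) / 2
    return np.clip(centres.astype(int), 0, total_frames - 1).tolist()

# Example: pick 32 indices from a 240-frame clip
indices = sample_frame_indices(240, num_frames=32)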

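Time-embedding interpolation lets the TimesFormer encoder accept a frame count different from the one it was configured with. Below is a generic sketch of resampling learned temporal position embeddings along the time axis; the exact attribute path inside TimesformerModel is not shown and may differ:

import torch
import torch.nn.functional as F

def interpolate_time_embeddings(time_emb, new_num_frames):
    # time_emb: (1, T_old, hidden) learned temporal position embeddings.
    # Linearly resample along the time axis to (1, new_num_frames, hidden).
    emb = time_emb.permute(0, 2, 1)                 # (1, hidden, T_old)
    emb = F.interpolate(emb, size=new_num_frames, mode="linear", align_corners=False)
    return emb.permute(0, 2, 1)
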
Model Architecture

Video Input (B, V, T, C, H, W) → TimesFormer(+LoRA) → AttentiveProjector → LLM(+LoRA) → Text Analysis

Where:

  • B: Batch size
  • V: Number of views (1)
  • T: Number of frames (32)
  • C, H, W: Channel, Height, Width
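
The AttentiveProjector itself ships only as weights (video_adapter.pt). The shape-level sketch below illustrates one plausible layout, assuming learned query tokens, standard multi-head attention over the TimesFormer output tokens, and a linear projection into the LLM width; the dimensions (768 for TimesFormer base, 576 for SmolLM2-135M-Instruct) and head count are assumptions:

import torch
import torch.nn as nn

class AttentiveProjector(nn.Module):
    # Illustrative sketch only: fuses TimesFormer tokens from all views with
    # multi-head attention over learned queries, then projects to the LLM width.
    def __init__(self, vision_dim=768, llm_dim=576, num_heads=16, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):                    # (B, V*N, vision_dim)
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, vision_tokens, vision_tokens)
        return self.proj(fused)                          # (B, num_queries, llm_dim)

The projected tokens stand in for the <|video|> placeholder in the prompt shown under Usage.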

Usage

Loading the Model

import torch
from transformers import AutoTokenizer, AutoImageProcessor, AutoModelForCausalLM, TimesformerModel
from peft import PeftModel
import json
import os

def load_profvlm_model(model_path, device="cuda"):
    # Load configuration
    with open(os.path.join(model_path, "config.json"), 'r') as f:
        config = json.load(f)
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_path, "tokenizer"))
    
    # Load base models
    base_llm = AutoModelForCausalLM.from_pretrained(
        config["llm_checkpoint"],
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Load LLM LoRA
    llm_model = PeftModel.from_pretrained(base_llm, os.path.join(model_path, "llm_lora"))
    
    # Load vision encoder
    # For v2, load with 8 frames initially if using vision LoRA
    initial_frames = 8 if config.get("has_vision_lora", False) else config["num_frames"]
    vision_encoder = TimesformerModel.from_pretrained(
        config["vision_encoder"],
        num_frames=initial_frames,
        torch_dtype=torch.float16
    )
    
    # Load Vision LoRA if available (v2)
    if config.get("has_vision_lora", False):
        vision_encoder = PeftModel.from_pretrained(vision_encoder, os.path.join(model_path, "vision_lora"))
    
    # Load the vision processor used to prepare input frames
    image_processor = AutoImageProcessor.from_pretrained(os.path.join(model_path, "vision_processor"))
    
    # Assemble the full ProfVLM model from these components together with the
    # custom video adapter weights in video_adapter.pt
    # (you'll need to implement the full ProfVLM model class)
    
    return tokenizer, image_processor, llm_model, vision_encoder, config
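
For example, assuming the repository has been downloaded to a local folder (adjust the path to wherever you saved it):

tokenizer, image_processor, llm_model, vision_encoder, config = load_profvlm_model("./ProfVLMv2-Exo3-PATS-DualLoRA-16heads")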

Inference Example

# Prepare your video data
# videos should be a list of lists: [[view1_frames, view2_frames, ...]]
# where each view contains 32 RGB frames

messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."}
]

# Build the text prompt from the chat messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate analysis with the assembled ProfVLM model
with torch.no_grad():
    outputs = model.generate(prompt, videos)
    print(model.decode(outputs))
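
A minimal sketch of building the videos argument from a video file, assuming frames are decoded with PyAV (listed under Requirements) and 32 evenly spaced RGB frames per view are used; the file name is a placeholder:

import av
import numpy as np

def read_frames(video_path, num_frames=32):
    # Decode every frame with PyAV, then keep num_frames evenly spaced ones.
    container = av.open(video_path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in indices]

# One sample with a single camera view of 32 RGB frames
videos = [[read_frames("clip.mp4")]]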

Training Details

Dataset

  • Multi-sport dataset with proficiency annotations
  • Sports: Basketball, Cooking, Dance, Bouldering, Soccer, Music
  • Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert

Training Configuration

  • LLM LoRA: r=32, alpha=64, dropout=0.1
  • Vision LoRA: r=48, alpha=96, dropout=0.1 (see the PEFT sketch after this list)
  • Video Processing: 32 frames per video, 1 view
  • Optimization: AdamW with cosine scheduling
  • Mixed Precision: FP16 training
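
The adapters were trained with PEFT. The sketch below reproduces the hyperparameters above as LoraConfig objects; the target_modules values are illustrative assumptions, and the authoritative settings live in each adapter folder's adapter_config.json:

from peft import LoraConfig

# Hyperparameters from the list above; target_modules are illustrative only.
llm_lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

vision_lora_config = LoraConfig(
    r=48, lora_alpha=96, lora_dropout=0.1,
    target_modules=["qkv"],  # attention projection in the TimesFormer blocks
)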

Performance

The model demonstrates strong performance in:

  • Multi-view video understanding
  • Temporal feature integration
  • Cross-sport proficiency assessment
  • Human performance analysis

Files Structure

model/
├── llm_lora/              # LLM LoRA adapter weights
├── vision_lora/           # Vision LoRA adapter weights
├── tokenizer/             # Tokenizer files
├── vision_processor/      # Vision processor config
├── video_adapter.pt       # Custom video adapter weights
├── config.json            # Model configuration
└── README.md             # This file

Requirements

torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0

Model Versions

  • v1: Base version with LoRA on LLM only
  • v2: Enhanced version with LoRA on both LLM and Vision Encoder, plus flexible frame count support

This is the V2 version of the model.

Citation

If you use this model, please cite:

@article{profvlm2024,
  title={ProfVLM: Video-Language Model for Sports Proficiency Analysis},
  author={Your Name},
  year={2024}
}

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
  • Vision Encoder: facebook/timesformer-base-finetuned-k600
  • Built with 🤗 Transformers and PyTorch