ProfVLM V2: Video-Language Model for Sports Proficiency Analysis

ProfVLM is a multimodal model that combines video understanding with language generation for analyzing human performance and proficiency levels in various activities.

Model Version: V2

Model Description

ProfVLM integrates:

  • Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
  • Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
  • Custom Video Adapter: AttentiveProjector with multi-head attention for view integration

Key Features

  • Multi-view support: The architecture processes multiple camera views simultaneously; this checkpoint is configured for 1 view
  • Temporal modeling: Analyzes 32 frames per video
  • Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
  • Sport agnostic: Trained on multiple sports and activities (basketball, cooking, dance, bouldering, soccer, music)

Version 2 Features

  • Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
  • Flexible Frame Count: Supports a variable number of frames per video via time-embedding interpolation
  • Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage (see the sketches after this list)
  • Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
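
The segment-based sampler is not included in this repository; the following is a minimal sketch of the idea, assuming one frame is taken per equal-length segment (training may instead draw a random frame within each segment):

import numpy as np

def sample_frame_indices(total_frames, num_frames=32):
    # Split the clip into num_frames equal segments and take the centre
    # frame of each segment, so the sample spans the whole video.
    boundaries = np.linspace(0, total_frames, num_frames + 1)
    centres = (boundaries[:-1] + boundaries[1:]) / 2
    return np.clip(centres.astype(int), 0, total_frames - 1).tolist()

# Example: pick 32 indices from a 240-frame clip
indices = sample_frame_indices(240, num_frames=32)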

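Time-embedding interpolation lets the TimesFormer encoder accept a frame count different from the one it was configured with. Below is a generic sketch of resampling learned temporal position embeddings along the time axis; the exact attribute path inside TimesformerModel is not shown and may differ:

import torch
import torch.nn.functional as F

def interpolate_time_embeddings(time_emb, new_num_frames):
    # time_emb: (1, T_old, hidden) learned temporal position embeddings.
    # Linearly resample along the time axis to (1, new_num_frames, hidden).
    emb = time_emb.permute(0, 2, 1)                 # (1, hidden, T_old)
    emb = F.interpolate(emb, size=new_num_frames, mode="linear", align_corners=False)
    return emb.permute(0, 2, 1)
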
Model Architecture

Video Input (B, V, T, C, H, W) → TimesFormer(+LoRA) → AttentiveProjector → LLM(+LoRA) → Text Analysis

Where:

  • B: Batch size
  • V: Number of views (1)
  • T: Number of frames (32)
  • C, H, W: Channel, Height, Width
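
The AttentiveProjector itself ships only as weights (video_adapter.pt). The shape-level sketch below illustrates one plausible layout, assuming learned query tokens, standard multi-head attention over the TimesFormer output tokens, and a linear projection into the LLM width; the dimensions (768 for TimesFormer base, 576 for SmolLM2-135M-Instruct) and head count are assumptions:

import torch
import torch.nn as nn

class AttentiveProjector(nn.Module):
    # Illustrative sketch only: fuses TimesFormer tokens from all views with
    # multi-head attention over learned queries, then projects to the LLM width.
    def __init__(self, vision_dim=768, llm_dim=576, num_heads=16, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):                    # (B, V*N, vision_dim)
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, vision_tokens, vision_tokens)
        return self.proj(fused)                          # (B, num_queries, llm_dim)

The projected tokens stand in for the <|video|> placeholder in the prompt shown under Usage.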

Usage

Loading the Model

import torch
from transformers import AutoTokenizer, AutoImageProcessor, AutoModelForCausalLM, TimesformerModel
from peft import PeftModel
import json
import os

def load_profvlm_model(model_path, device="cuda"):
    # Load configuration
    with open(os.path.join(model_path, "config.json"), 'r') as f:
        config = json.load(f)
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_path, "tokenizer"))
    
    # Load base models
    base_llm = AutoModelForCausalLM.from_pretrained(
        config["llm_checkpoint"],
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Load LLM LoRA
    llm_model = PeftModel.from_pretrained(base_llm, os.path.join(model_path, "llm_lora"))
    
    # Load vision encoder
    # For v2, load with 8 frames initially if using vision LoRA
    initial_frames = 8 if config.get("has_vision_lora", False) else config["num_frames"]
    vision_encoder = TimesformerModel.from_pretrained(
        config["vision_encoder"],
        num_frames=initial_frames,
        torch_dtype=torch.float16
    )
    
    # Load Vision LoRA if available (v2)
    if config.get("has_vision_lora", False):
        vision_encoder = PeftModel.from_pretrained(vision_encoder, os.path.join(model_path, "vision_lora"))
    
    # Load the vision processor used to prepare input frames
    image_processor = AutoImageProcessor.from_pretrained(os.path.join(model_path, "vision_processor"))
    
    # Assemble the full ProfVLM model from these components together with the
    # custom video adapter weights in video_adapter.pt
    # (you'll need to implement the full ProfVLM model class)
    
    return tokenizer, image_processor, llm_model, vision_encoder, config
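
For example, assuming the repository has been downloaded to a local folder (adjust the path to wherever you saved it):

tokenizer, image_processor, llm_model, vision_encoder, config = load_profvlm_model("./ProfVLMv2-Exo3-PATS-DualLoRA-16heads")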

Inference Example

# Prepare your video data
# videos should be a list of lists: [[view1_frames, view2_frames, ...]]
# where each view contains 32 RGB frames

messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."}
]

# Build the text prompt from the chat messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate analysis with the assembled ProfVLM model
with torch.no_grad():
    outputs = model.generate(prompt, videos)
    print(model.decode(outputs))
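
A minimal sketch of building the videos argument from a video file, assuming frames are decoded with PyAV (listed under Requirements) and 32 evenly spaced RGB frames per view are used; the file name is a placeholder:

import av
import numpy as np

def read_frames(video_path, num_frames=32):
    # Decode every frame with PyAV, then keep num_frames evenly spaced ones.
    container = av.open(video_path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in indices]

# One sample with a single camera view of 32 RGB frames
videos = [[read_frames("clip.mp4")]]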

Training Details

Dataset

  • Multi-sport dataset with proficiency annotations
  • Sports: Basketball, Cooking, Dance, Bouldering, Soccer, Music
  • Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert

Training Configuration

  • LLM LoRA: r=32, alpha=64, dropout=0.1
  • Vision LoRA: r=48, alpha=96, dropout=0.1 (see the PEFT sketch after this list)
  • Video Processing: 32 frames per video, 1 view
  • Optimization: AdamW with cosine scheduling
  • Mixed Precision: FP16 training
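
The adapters were trained with PEFT. The sketch below reproduces the hyperparameters above as LoraConfig objects; the target_modules values are illustrative assumptions, and the authoritative settings live in each adapter folder's adapter_config.json:

from peft import LoraConfig

# Hyperparameters from the list above; target_modules are illustrative only.
llm_lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

vision_lora_config = LoraConfig(
    r=48, lora_alpha=96, lora_dropout=0.1,
    target_modules=["qkv"],  # attention projection in the TimesFormer blocks
)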

Performance

The model demonstrates strong performance in:

  • Multi-view video understanding
  • Temporal feature integration
  • Cross-sport proficiency assessment
  • Human performance analysis

Files Structure

model/
├── llm_lora/              # LLM LoRA adapter weights
├── vision_lora/           # Vision LoRA adapter weights
├── tokenizer/             # Tokenizer files
├── vision_processor/      # Vision processor config
├── video_adapter.pt       # Custom video adapter weights
├── config.json            # Model configuration
└── README.md             # This file

Requirements

torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0

Model Versions

  • v1: Base version with LoRA on LLM only
  • v2: Enhanced version with LoRA on both LLM and Vision Encoder, plus flexible frame count support

This is the V2 version of the model.

Citation

If you use this model, please cite:

@article{profvlm2024,
  title={ProfVLM: Video-Language Model for Sports Proficiency Analysis},
  author={Your Name},
  year={2024}
}

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
  • Vision Encoder: facebook/timesformer-base-finetuned-k600
  • Built with 🤗 Transformers and PyTorch