ProfVLM V2: Video-Language Model for Sports Proficiency Analysis
ProfVLM is a multimodal model that combines video understanding with language generation for analyzing human performance and proficiency levels in various activities.
Model Version: V2
Model Description
ProfVLM integrates:
- Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
- Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
- Custom Video Adapter: AttentiveProjector with multi-head attention for view integration
Key Features
- Multi-view support: Processes camera views jointly (this checkpoint is configured for 1 view)
- Temporal modeling: Analyzes 32 frames per video
- Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
- Sport agnostic: Trained on multiple sports (basketball, cooking, dance, bouldering, soccer, music)
Version 2 Features
- Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
- Flexible Frame Count: Supports a variable number of frames through time-embedding interpolation
- Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage (see the sketch after this list)
- Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
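The segment-based sampling can be sketched as follows. This is a minimal illustration under assumptions (the function name and jitter behavior are not the exact training code): the clip is split into num_frames equal segments and one frame is taken from each, giving even temporal coverage.

import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 32, train: bool = False) -> np.ndarray:
    """Pick one frame index per temporal segment of the clip."""
    # Segment boundaries over [0, total_frames)
    edges = np.linspace(0, total_frames, num_frames + 1)
    if train:
        # Training: random index inside each segment (temporal jitter)
        indices = [np.random.randint(int(edges[i]), max(int(edges[i]) + 1, int(edges[i + 1])))
                   for i in range(num_frames)]
    else:
        # Evaluation: deterministic segment centers
        indices = [int((edges[i] + edges[i + 1]) / 2) for i in range(num_frames)]
    return np.clip(np.array(indices), 0, total_frames - 1)

# e.g. 32 indices spread over a 300-frame clip
print(sample_frame_indices(300))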
Model Architecture
Video Input (B, V, T, C, H, W) → TimesFormer (+LoRA) → AttentiveProjector → LLM (+LoRA) → Text Analysis
Where:
- B: Batch size
- V: Number of views (1)
- T: Number of frames (32)
- C, H, W: Channel, Height, Width
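The AttentiveProjector is not published as code in this repository. The sketch below shows one plausible shape for such an adapter, assuming learnable query tokens that cross-attend over all view/frame features and a linear projection into the LLM embedding space; the dimensions (768 for TimesFormer, 576 for SmolLM2-135M), the 16 attention heads, and the 32 output tokens are assumptions taken from the base models and the checkpoint name, not the verified implementation.

import torch
import torch.nn as nn

class AttentiveProjector(nn.Module):
    """Sketch: fuse per-view TimesFormer features with multi-head attention
    and project them into the LLM embedding space."""

    def __init__(self, vision_dim=768, llm_dim=576, num_heads=16, num_video_tokens=32):
        super().__init__()
        # Learnable query tokens that attend over all view/frame features
        self.queries = nn.Parameter(torch.randn(1, num_video_tokens, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, view_features):
        # view_features: (B, V, N, vision_dim) -- N patch/frame tokens per view
        B, V, N, D = view_features.shape
        kv = view_features.reshape(B, V * N, D)   # flatten all views into one sequence
        q = self.queries.expand(B, -1, -1)
        fused, _ = self.attn(q, kv, kv)           # cross-attention over all views
        return self.proj(self.norm(fused))        # (B, num_video_tokens, llm_dim)

# Example: one view, 8 frames x 196 patches of TimesFormer tokens
feats = torch.randn(2, 1, 1568, 768)
print(AttentiveProjector()(feats).shape)  # torch.Size([2, 32, 576])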
Usage
Loading the Model
import torch
from transformers import AutoTokenizer, AutoImageProcessor, AutoModelForCausalLM, TimesformerModel
from peft import PeftModel
import json
import os
def load_profvlm_model(model_path, device="cuda"):
    # Load configuration
    with open(os.path.join(model_path, "config.json"), 'r') as f:
        config = json.load(f)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_path, "tokenizer"))

    # Load base LLM
    base_llm = AutoModelForCausalLM.from_pretrained(
        config["llm_checkpoint"],
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Load LLM LoRA
    llm_model = PeftModel.from_pretrained(base_llm, os.path.join(model_path, "llm_lora"))

    # Load vision encoder
    # For v2, load with 8 frames initially if using vision LoRA
    initial_frames = 8 if config.get("has_vision_lora", False) else config["num_frames"]
    vision_encoder = TimesformerModel.from_pretrained(
        config["vision_encoder"],
        num_frames=initial_frames,
        torch_dtype=torch.float16
    )

    # Load Vision LoRA if available (v2)
    if config.get("has_vision_lora", False):
        vision_encoder = PeftModel.from_pretrained(vision_encoder, os.path.join(model_path, "vision_lora"))

    # Assemble the full ProfVLM model from the loaded components
    # (the ProfVLM wrapper class is not shipped here; you need to combine llm_model,
    #  vision_encoder, the tokenizer, and the video adapter weights in video_adapter.pt)
    model = ...  # your ProfVLM wrapper built from the components above
    return model
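Because the vision LoRA checkpoint is created with 8 frames, the temporal position embeddings must be resized before running 32-frame inputs. Below is a hedged sketch of that interpolation, assuming the encoder keeps its time embeddings as a learnable (1, num_frames, hidden_size) parameter; the exact attribute path inside TimesformerModel may differ from what is shown.

import torch
import torch.nn.functional as F

def interpolate_time_embeddings(vision_encoder, target_frames: int):
    """Linearly resize the temporal position embeddings to target_frames."""
    # Unwrap the PEFT wrapper if the vision LoRA was loaded
    base = vision_encoder.get_base_model() if hasattr(vision_encoder, "get_base_model") else vision_encoder
    time_emb = base.embeddings.time_embeddings  # assumed shape: (1, T, hidden_size)
    if time_emb.shape[1] == target_frames:
        return
    # (1, T, D) -> (1, D, T) for 1D interpolation along time, then back
    resized = F.interpolate(time_emb.data.transpose(1, 2), size=target_frames,
                            mode="linear", align_corners=False)
    base.embeddings.time_embeddings = torch.nn.Parameter(resized.transpose(1, 2))

# e.g. interpolate_time_embeddings(vision_encoder, config["num_frames"])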
Inference Example
# Prepare your video data
# videos should be a list of lists: [[view1_frames, view2_frames, ...]]
# where each view contains 32 RGB frames
messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."}
]

# Build the prompt from the chat messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the analysis (generate/decode here are methods of the custom ProfVLM wrapper)
with torch.no_grad():
    outputs = model.generate(prompt, videos)
print(model.decode(outputs))
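Building the videos input is left to the caller. Here is a minimal sketch using PyAV (listed in the requirements) together with the saved image processor; "clip.mp4" and the processor path are placeholders, and the exact list/tensor layout the custom model class expects may differ.

import av
import numpy as np
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("path/to/model/vision_processor")  # the vision_processor/ folder

def load_video_frames(path: str, num_frames: int = 32):
    """Decode a video and return num_frames evenly spaced RGB frames."""
    with av.open(path) as container:
        frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in indices]

view1_frames = load_video_frames("clip.mp4")
pixel_values = processor(view1_frames, return_tensors="pt")["pixel_values"]  # (1, T, C, H, W)
videos = [[pixel_values]]  # one sample with a single view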
Training Details
Dataset
- Multi-sport dataset with proficiency annotations
- Sports: Basketball, Cooking, Dance, Bouldering, Soccer, Music
- Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert
Training Configuration
- LLM LoRA: r=32, alpha=64, dropout=0.1
- Vision LoRA: r=48, alpha=96, dropout=0.1
- Video Processing: 32 frames per video, 1 view
- Optimization: AdamW with cosine scheduling
- Mixed Precision: FP16 training
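For reference, the LoRA hyperparameters above correspond to peft LoraConfig objects along these lines; the target_modules values are typical choices for these two architectures and are assumptions, not the recorded training configuration.

from peft import LoraConfig

# LLM adapter (SmolLM2-135M-Instruct): r=32, alpha=64, dropout=0.1
llm_lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Vision adapter (TimesFormer): r=48, alpha=96, dropout=0.1
vision_lora_config = LoraConfig(
    r=48, lora_alpha=96, lora_dropout=0.1,
    target_modules=["qkv"],  # assumed; TimesFormer attention uses a fused qkv projection
)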
Performance
The model demonstrates strong performance in:
- Multi-view video understanding
- Temporal feature integration
- Cross-sport proficiency assessment
- Human performance analysis
File Structure
model/
├── llm_lora/            # LLM LoRA adapter weights
├── vision_lora/         # Vision LoRA adapter weights
├── tokenizer/           # Tokenizer files
├── vision_processor/    # Vision processor config
├── video_adapter.pt     # Custom video adapter weights
├── config.json          # Model configuration
└── README.md            # This file
Requirements
torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0
Model Versions
- v1: Base version with LoRA on LLM only
- v2: Enhanced version with LoRA on both LLM and Vision Encoder, plus flexible frame count support
This is the V2 version of the model.
Citation
If you use this model, please cite:
@article{profvlm2024,
  title={ProfVLM: Video-Language Model for Sports Proficiency Analysis},
  author={Your Name},
  year={2024}
}
License
This model is released under the Apache 2.0 License.
Acknowledgments
- Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
- Vision Encoder: facebook/timesformer-base-finetuned-k600
- Built with 🤗 Transformers and PyTorch