MC3-18 HMDB51 (Kinetics-400 Init)
Model Description
MC3-18 (Mixed Convolution 3D) finetuned on HMDB51 split 1 for human action recognition. This model was initialized with Kinetics-400 pretrained weights and adapted for HMDB51's shorter video clips.
Validation Accuracy: 56.34%
This is a reference baseline implementation. State-of-the-art on HMDB51 split 1 is approximately 70-75% using ensemble methods, test-time augmentation, and multi-crop evaluation.
Model Details
- Architecture: MC3-18 (11.7M parameters)
- Initialization: Kinetics-400 pretrained weights
- Dataset: HMDB51 split 1
- Train: 3,570 videos across 51 action classes
- Validation: 1,530 videos
- Input: RGB video clips (8 frames, 112x112 spatial resolution)
- Output: 51-class action predictions
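For illustration, the snippet below shows how this initialization maps onto torchvision's API: the Kinetics-400 pretrained MC3-18 with its classification head replaced for 51 classes, plus the expected input/output shapes. It is a minimal sketch of the setup described above, not the finetuned checkpoint itself.

import torch
from torchvision.models.video import mc3_18, MC3_18_Weights

# Kinetics-400 pretrained backbone, head swapped for HMDB51's 51 classes
model = mc3_18(weights=MC3_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 51)

# Input layout: (batch, channels, frames, height, width) = (N, 3, 8, 112, 112)
dummy = torch.randn(1, 3, 8, 112, 112)
print(model(dummy).shape)  # torch.Size([1, 51])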
Training Configuration
- Frames: 8
- Frame Interval: 1
- Spatial Size: 112x112
- Batch Size: 16
- Epochs: 150
- Learning Rate: 0.0003
- Weight Decay: 3e-3
- Optimizer: SGD (momentum=0.9)
- Augmentation:
  - MixUp (alpha=0.6)
  - CutMix (alpha=1.0)
  - Label Smoothing (0.15)
  - Random horizontal flip
  - Color jitter
  - Random grayscale
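The sketch below shows how this configuration translates into PyTorch, assuming the standard MixUp formulation; the mixup helper and variable names are illustrative and do not reproduce the actual training script.

import torch
import torch.nn as nn
from torchvision.models.video import mc3_18

model = mc3_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 51)

# Label smoothing (0.15) and SGD with the hyperparameters listed above
criterion = nn.CrossEntropyLoss(label_smoothing=0.15)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.9, weight_decay=3e-3)

def mixup(clips, targets, alpha=0.6):
    # Blend random pairs of clips; the loss is then a lam-weighted mix over both label sets:
    # loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    return lam * clips + (1 - lam) * clips[perm], targets, targets[perm], lam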
Performance
| Metric | Value |
|---|---|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19% |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |
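For reference, these metrics can be recomputed from per-video predictions with scikit-learn. Macro averaging for F1 and precision is an assumption here, and the arrays below are placeholders for the real validation labels and predictions.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Placeholder arrays; replace with validation-set class indices (0..50)
y_true = np.array([0, 12, 37, 50])
y_pred = np.array([0, 12, 38, 50])

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average='macro'))
print("macro precision:", precision_score(y_true, y_pred, average='macro'))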
Overfitting Analysis
The 19% train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (only 70 training videos per class on average: 3,570 / 51)
- MC3-18's 11.7M parameters relative to that amount of data
- The model's tendency to memorize the training set even with strong augmentation and regularization
This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity
Design Choices
Why num_frames=8 and frame_interval=1?
HMDB51 contains many short videos (some as short as 10-20 frames). Using a small temporal window (8 frames at interval 1, i.e. 8 consecutive frames) avoids:
- Repeating or tiling frames to pad short videos up to the clip length
- Losing genuine temporal information when a longer window must be filled with duplicated frames
This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.
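A hypothetical clip-sampling helper under these settings is sketched below; the function name and the fallback behavior for very short videos are illustrative, not the repository's actual loader.

import random

def sample_clip_indices(num_video_frames, clip_len=8, interval=1):
    # Span covered by the clip: 8 frames at interval 1 -> 8 consecutive frames
    span = (clip_len - 1) * interval + 1
    if num_video_frames >= span:
        start = random.randint(0, num_video_frames - span)
        return [start + i * interval for i in range(clip_len)]
    # Fallback for very short videos: repeat the last available frame
    idx = list(range(0, num_video_frames, interval))[:clip_len]
    return idx + [idx[-1]] * (clip_len - len(idx))

print(sample_clip_indices(40))  # e.g. [17, 18, 19, 20, 21, 22, 23, 24]
print(sample_clip_indices(5))   # [0, 1, 2, 3, 4, 4, 4, 4]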
Usage
import torch
from torchvision.models.video import mc3_18
from torchvision import transforms
import cv2
# Load the finetuned model (Kinetics-400 init, head replaced for 51 classes)
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing: resize, center-crop to 112x112, normalize with Kinetics-400 statistics
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 8 consecutive RGB frames from a video ('video.avi' is a placeholder path)
cap = cv2.VideoCapture('video.avi')
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes frames as BGR
cap.release()

frames = [transform(frame) for frame in frames]
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
Alternative Approach
We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (13% train-val gap vs 19%).
See: mc3-18-hmdb51-ucf-transfer
Kinetics vs UCF-101 initialization:
- Kinetics-400 (this model): larger pretraining dataset; finetuned here on 8-frame clips, which suits HMDB51's short videos
- UCF-101: domain closer to HMDB51 and better generalization, but its pipeline uses 16-frame clips, which forces frame tiling on short videos
Limitations
- Overfits on small datasets (19% train-val gap)
- Single model without ensemble
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)
HMDB51 Classes
The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
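To turn the predicted index from the Usage section into a label, a lookup like the one below can be used. It assumes the class indices follow the alphabetical order listed above (the usual HMDB51 convention); verify this against the label mapping used during training.

HMDB51_CLASSES = [
    'brush_hair', 'cartwheel', 'catch', 'chew', 'clap', 'climb', 'climb_stairs',
    'dive', 'draw_sword', 'dribble', 'drink', 'eat', 'fall_floor', 'fencing',
    'flic_flac', 'golf', 'handstand', 'hit', 'hug', 'jump', 'kick', 'kick_ball',
    'kiss', 'laugh', 'pick', 'pour', 'pullup', 'punch', 'push', 'pushup',
    'ride_bike', 'ride_horse', 'run', 'shake_hands', 'shoot_ball', 'shoot_bow',
    'shoot_gun', 'sit', 'situp', 'smile', 'smoke', 'somersault', 'stand',
    'swing_baseball', 'sword', 'sword_exercise', 'talk', 'throw', 'turn',
    'walk', 'wave',
]
print(HMDB51_CLASSES[pred.item()])  # `pred` comes from the Usage snippet above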
Training Details
- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90
Citation
If you use this model, please cite the original HMDB51 dataset:
@inproceedings{kuehne2011hmdb,
title={HMDB: a large video database for human motion recognition},
author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
booktitle={2011 International Conference on Computer Vision},
pages={2556--2563},
year={2011},
organization={IEEE}
}
And the MC3 architecture:
@inproceedings{tran2018closer,
title={A closer look at spatiotemporal convolutions for action recognition},
author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={6450--6459},
year={2018}
}
License
- Model weights: [Apache]
- Code: [Apache]
- HMDB51 Dataset: [Original dataset license]