MC3-18 HMDB51 (Kinetics-400 Init)
Model Description
MC3-18 (Mixed Convolution 3D) finetuned on HMDB51 split 1 for human action recognition. This model was initialized with Kinetics-400 pretrained weights and adapted for HMDB51's shorter video clips.
Validation Accuracy: 56.34%
This is a reference baseline implementation. State-of-the-art on HMDB51 split 1 is approximately 70-75% using ensemble methods, test-time augmentation, and multi-crop evaluation.
Model Details
- Architecture: MC3-18 (11.7M parameters)
- Initialization: Kinetics-400 pretrained weights
- Dataset: HMDB51 split 1
- Train: 3,570 videos across 51 action classes
- Validation: 1,530 videos
- Input: RGB video clips (8 frames, 112x112 spatial resolution)
- Output: 51-class action predictions
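For illustration, the snippet below shows how this initialization maps onto torchvision's API: the Kinetics-400 pretrained MC3-18 with its classification head replaced for 51 classes, plus the expected input/output shapes. It is a minimal sketch of the setup described above, not the finetuned checkpoint itself.

import torch
from torchvision.models.video import mc3_18, MC3_18_Weights

# Kinetics-400 pretrained backbone, head swapped for HMDB51's 51 classes
model = mc3_18(weights=MC3_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 51)

# Input layout: (batch, channels, frames, height, width) = (N, 3, 8, 112, 112)
dummy = torch.randn(1, 3, 8, 112, 112)
print(model(dummy).shape)  # torch.Size([1, 51])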
Training Configuration
- Frames: 8
- Frame Interval: 1
- Spatial Size: 112x112
- Batch Size: 16
- Epochs: 150
- Learning Rate: 0.0003
- Weight Decay: 3e-3
- Optimizer: SGD (momentum=0.9)
- Augmentation:
  - MixUp (alpha=0.6)
  - CutMix (alpha=1.0)
  - Label Smoothing (0.15)
  - Random horizontal flip
  - Color jitter
  - Random grayscale
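The sketch below shows how this configuration translates into PyTorch, assuming the standard MixUp formulation; the mixup helper and variable names are illustrative and do not reproduce the actual training script.

import torch
import torch.nn as nn
from torchvision.models.video import mc3_18

model = mc3_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 51)

# Label smoothing (0.15) and SGD with the hyperparameters listed above
criterion = nn.CrossEntropyLoss(label_smoothing=0.15)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.9, weight_decay=3e-3)

def mixup(clips, targets, alpha=0.6):
    # Blend random pairs of clips; the loss is then a lam-weighted mix over both label sets:
    # loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    return lam * clips + (1 - lam) * clips[perm], targets, targets[perm], lam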
Performance
| Metric | Value |
|---|---|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19% |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |
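For reference, these metrics can be recomputed from per-video predictions with scikit-learn. Macro averaging for F1 and precision is an assumption here, and the arrays below are placeholders for the real validation labels and predictions.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Placeholder arrays; replace with validation-set class indices (0..50)
y_true = np.array([0, 12, 37, 50])
y_pred = np.array([0, 12, 38, 50])

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average='macro'))
print("macro precision:", precision_score(y_true, y_pred, average='macro'))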
Overfitting Analysis
The 19% train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (only 70 training videos per class on average: 3,570 / 51)
- MC3-18's 11.7M parameters relative to that amount of data
- The model's tendency to memorize the training set even with strong augmentation and regularization
This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity
Design Choices
Why num_frames=8 and frame_interval=1?
HMDB51 contains many short videos (some as short as 10-20 frames). Using a small temporal window (8 frames at interval 1, i.e. 8 consecutive frames) avoids:
- Repeating or tiling frames to pad short videos up to the clip length
- Losing genuine temporal information when a longer window must be filled with duplicated frames
This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.
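A hypothetical clip-sampling helper under these settings is sketched below; the function name and the fallback behavior for very short videos are illustrative, not the repository's actual loader.

import random

def sample_clip_indices(num_video_frames, clip_len=8, interval=1):
    # Span covered by the clip: 8 frames at interval 1 -> 8 consecutive frames
    span = (clip_len - 1) * interval + 1
    if num_video_frames >= span:
        start = random.randint(0, num_video_frames - span)
        return [start + i * interval for i in range(clip_len)]
    # Fallback for very short videos: repeat the last available frame
    idx = list(range(0, num_video_frames, interval))[:clip_len]
    return idx + [idx[-1]] * (clip_len - len(idx))

print(sample_clip_indices(40))  # e.g. [17, 18, 19, 20, 21, 22, 23, 24]
print(sample_clip_indices(5))   # [0, 1, 2, 3, 4, 4, 4, 4]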
Usage
import torch
from torchvision.models.video import mc3_18
from torchvision import transforms
import cv2
# Load the finetuned model (Kinetics-400 init, head replaced for 51 classes)
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing: resize, center-crop to 112x112, normalize with Kinetics-400 statistics
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 8 consecutive RGB frames from a video ('video.avi' is a placeholder path)
cap = cv2.VideoCapture('video.avi')
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes frames as BGR
cap.release()

frames = [transform(frame) for frame in frames]
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
Alternative Approach
We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (13% train-val gap vs 19%).
See: mc3-18-hmdb51-ucf-transfer
Kinetics vs UCF-101 initialization:
- Kinetics-400 (this model): larger pretraining dataset; finetuned here on 8-frame clips, which suits HMDB51's short videos
- UCF-101: domain closer to HMDB51 and better generalization, but its pipeline uses 16-frame clips, which forces frame tiling on short videos
Limitations
- Overfits on small datasets (19% train-val gap)
- Single model without ensemble
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)
HMDB51 Classes
The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
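To turn the predicted index from the Usage section into a label, a lookup like the one below can be used. It assumes the class indices follow the alphabetical order listed above (the usual HMDB51 convention); verify this against the label mapping used during training.

HMDB51_CLASSES = [
    'brush_hair', 'cartwheel', 'catch', 'chew', 'clap', 'climb', 'climb_stairs',
    'dive', 'draw_sword', 'dribble', 'drink', 'eat', 'fall_floor', 'fencing',
    'flic_flac', 'golf', 'handstand', 'hit', 'hug', 'jump', 'kick', 'kick_ball',
    'kiss', 'laugh', 'pick', 'pour', 'pullup', 'punch', 'push', 'pushup',
    'ride_bike', 'ride_horse', 'run', 'shake_hands', 'shoot_ball', 'shoot_bow',
    'shoot_gun', 'sit', 'situp', 'smile', 'smoke', 'somersault', 'stand',
    'swing_baseball', 'sword', 'sword_exercise', 'talk', 'throw', 'turn',
    'walk', 'wave',
]
print(HMDB51_CLASSES[pred.item()])  # `pred` comes from the Usage snippet above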
Training Details
- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90
Citation
If you use this model, please cite the original HMDB51 dataset:
@inproceedings{kuehne2011hmdb,
title={HMDB: a large video database for human motion recognition},
author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
booktitle={2011 International Conference on Computer Vision},
pages={2556--2563},
year={2011},
organization={IEEE}
}
And the MC3 architecture:
@inproceedings{tran2018closer,
title={A closer look at spatiotemporal convolutions for action recognition},
author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={6450--6459},
year={2018}
}
License
- Model weights: [Apache]
- Code: [Apache]
- HMDB51 Dataset: [Original dataset license]