---
tags:
- video-classification
- action-recognition
- mc3
- hmdb51
- pytorch
- computer-vision
- spatiotemporal
- 3dcnn
library_name: pytorch
datasets:
- hmdb51
metrics:
- accuracy
- f1
- precision
pipeline_tag: video-classification
license: apache-2.0
language:
- en
---

# MC3-18 HMDB51 (Kinetics-400 Init)

## Model Description

MC3-18 (Mixed Convolution 3D) fine-tuned on HMDB51 split 1 for human action recognition. This model was initialized with Kinetics-400 pretrained weights and adapted for HMDB51's shorter video clips.

**Validation Accuracy: 56.34%**

This is a reference baseline implementation. State-of-the-art on HMDB51 split 1 is approximately 70-75% using ensemble methods, test-time augmentation, and multi-crop evaluation.

## Model Details

- **Architecture:** MC3-18 (11.7M parameters)
- **Initialization:** Kinetics-400 pretrained weights
- **Dataset:** HMDB51 split 1
  - Train: 3,570 videos across 51 action classes
  - Validation: 1,530 videos
- **Input:** RGB video clips (8 frames, 112x112 spatial resolution)
- **Output:** 51-class action predictions

## Training Configuration

```yaml
Frames: 8
Frame Interval: 1
Spatial Size: 112x112
Batch Size: 16
Epochs: 150
Learning Rate: 0.0003
Weight Decay: 3e-3
Optimizer: SGD (momentum=0.9)
```

**Augmentation:**
- MixUp (alpha=0.6)
- CutMix (alpha=1.0)
- Label Smoothing (0.15)
- Random horizontal flip
- Color jitter
- Random grayscale

## Performance

| Metric | Value |
|--------|-------|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19 percentage points |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |

## Overfitting Analysis

The train-validation gap of about 19 percentage points indicates significant overfitting, which is expected given:
- HMDB51's small size (only 70 training videos per class: 3,570 videos / 51 classes)
- MC3-18's 11.7M parameters
- Even with strong augmentation and regularization, the model memorizes training data

This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity

## Design Choices

**Why num_frames=8 and frame_interval=1?**

HMDB51 contains many short videos (some as short as 10-20 frames). Using a shorter temporal window (8 frames at interval 1, i.e. 8 consecutive frames) avoids:
- Frame repetition/tiling for short videos
- Loss of temporal information

This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.

## Usage

```python
import cv2
import torch
from torchvision import transforms
from torchvision.models.video import mc3_18

# Load model: build the architecture, swap the head for 51 classes, load weights
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing (applied per frame; normalization uses the Kinetics statistics)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])


def load_frames(path, num_frames=8):
    """Read `num_frames` consecutive RGB frames (H, W, 3 uint8 arrays).

    Illustrative helper: it takes the first frames of the file; pick a
    different start offset if that suits your videos better.
    """
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes as BGR
    cap.release()
    if len(frames) < num_frames:
        raise ValueError(f"expected {num_frames} frames, got {len(frames)}")
    return frames


# Load 8 frames from video ('path/to/video.avi' is a placeholder)
frames = load_frames('path/to/video.avi', num_frames=8)
frames = [transform(frame) for frame in frames]             # 8 x (3, 112, 112)
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)   # (1, 51) class logits
    pred = output.argmax(dim=1)    # predicted class index
```

## Alternative Approach

We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (a train-val gap of 13 percentage points vs 19).
See: `mc3-18-hmdb51-ucf-transfer`

**Kinetics vs UCF-101 initialization:**
- Kinetics-400 (this model): larger pretraining dataset; fine-tuned here on short clips (8 frames)
- UCF-101: closer domain to HMDB51 and better generalization, but requires 16 frames (causes frame tiling on short videos)

## Limitations

- Overfits on small datasets (train-val gap of about 19 percentage points)
- Single model without ensemble
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)

## HMDB51 Classes

The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
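
To turn the model's 51 output scores into readable labels, something like the following sketch may help. It assumes that output index `i` corresponds to the `i`-th class in the alphabetical list above; this card does not confirm that ordering, so check it against the label mapping used during training. The names `HMDB51_CLASSES` and `top_k_actions` are illustrative and not part of the released code.

```python
import math

# Class names from this model card, in alphabetical order.
# ASSUMPTION: output index i maps to HMDB51_CLASSES[i]; verify against training.
HMDB51_CLASSES = [
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb",
    "climb_stairs", "dive", "draw_sword", "dribble", "drink", "eat",
    "fall_floor", "fencing", "flic_flac", "golf", "handstand", "hit",
    "hug", "jump", "kick", "kick_ball", "kiss", "laugh", "pick", "pour",
    "pullup", "punch", "push", "pushup", "ride_bike", "ride_horse", "run",
    "shake_hands", "shoot_ball", "shoot_bow", "shoot_gun", "sit", "situp",
    "smile", "smoke", "somersault", "stand", "swing_baseball", "sword",
    "sword_exercise", "talk", "throw", "turn", "walk", "wave",
]


def top_k_actions(logits, k=5):
    """Return the k most likely (class_name, probability) pairs.

    `logits` is a flat sequence of 51 raw scores, e.g. `output[0].tolist()`.
    """
    if len(logits) != len(HMDB51_CLASSES):
        raise ValueError(
            f"expected {len(HMDB51_CLASSES)} scores, got {len(logits)}"
        )
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(HMDB51_CLASSES[i], probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Made-up scores for illustration only: strongest response at index 50.
    demo_scores = [0.0] * 51
    demo_scores[50] = 4.0
    demo_scores[7] = 2.0
    for name, prob in top_k_actions(demo_scores, k=3):
        print(f"{name}: {prob:.3f}")
```

With the inference code from the Usage section, `top_k_actions(output[0].tolist())` would list the five highest-scoring actions for a clip.
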
## Training Details

- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90

## Citation

If you use this model, please cite the original HMDB51 dataset:

```bibtex
@inproceedings{kuehne2011hmdb,
  title={HMDB: a large video database for human motion recognition},
  author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
  booktitle={2011 International Conference on Computer Vision},
  pages={2556--2563},
  year={2011},
  organization={IEEE}
}
```

And the MC3 architecture:

```bibtex
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}
```

## License

- Model weights: Apache-2.0
- Code: Apache-2.0
- HMDB51 Dataset: original dataset license