---
tags:
- video-classification
- action-recognition
- mc3
- hmdb51
- pytorch
- computer-vision
- spatiotemporal
- 3dcnn
library_name: pytorch
datasets:
- hmdb51
metrics:
- accuracy
- f1
- precision
pipeline_tag: video-classification
license: apache-2.0
language:
- en
---
# MC3-18 HMDB51 (Kinetics-400 Init)

## Model Description

MC3-18 (Mixed Convolution 3D) fine-tuned on HMDB51 split 1 for human action recognition. The model was initialized from Kinetics-400 pretrained weights and adapted to HMDB51's shorter video clips.

**Validation Accuracy: 56.34%**

This is a reference baseline implementation. Stronger results on HMDB51 split 1 (approximately 70-75% and above) typically rely on ensemble methods, test-time augmentation, and multi-crop evaluation.

## Model Details

- **Architecture:** MC3-18 (11.7M parameters)
- **Initialization:** Kinetics-400 pretrained weights
- **Dataset:** HMDB51 split 1
  - Train: 3,570 videos across 51 action classes
  - Validation: 1,530 videos
- **Input:** RGB video clips (8 frames, 112x112 spatial resolution)
- **Output:** 51-class action predictions

## Training Configuration
```yaml
Frames: 8
Frame Interval: 1
Spatial Size: 112x112
Batch Size: 16
Epochs: 150
Learning Rate: 0.0003
Weight Decay: 3e-3
Optimizer: SGD (momentum=0.9)
```
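
To put the schedule in perspective, the following back-of-the-envelope sketch counts optimizer updates for this configuration. It assumes the last partial batch of each epoch is kept (no `drop_last`), which the configuration above does not state; the function name is hypothetical.

```python
import math


def total_optimizer_steps(num_videos, batch_size, epochs, drop_last=False):
    """Return (steps per epoch, total steps) for a fixed-size training set."""
    if drop_last:
        steps_per_epoch = num_videos // batch_size
    else:
        # Assumption: the final, smaller batch still triggers an update
        steps_per_epoch = math.ceil(num_videos / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 3,570 training videos, batch size 16, 150 epochs (values from this card)
per_epoch, total = total_optimizer_steps(3570, 16, 150)
print(per_epoch, total)  # 224 33600
```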

**Augmentation:**
- MixUp (alpha=0.6)
- CutMix (alpha=1.0)
- Label Smoothing (0.15)
- Random horizontal flip
- Color jitter
- Random grayscale

## Performance

| Metric | Value |
|--------|-------|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19 percentage points |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |

## Overfitting Analysis

The ~19-point train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (only 70 training videos per class in split 1)
- MC3-18's capacity (11.7M parameters)
- Memorization of the training data, which persists even with strong augmentation and regularization

This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity
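
The numbers behind this analysis can be checked with a few lines of arithmetic. This is only a sanity check on the figures quoted in this card; the training accuracy is the approximate value from the Performance table.

```python
NUM_CLASSES = 51
TRAIN_VIDEOS = 3570
VAL_VIDEOS = 1530
PARAMETERS = 11.7e6   # parameter count quoted in this card
TRAIN_ACC = 75.0      # approximate training accuracy (%)
VAL_ACC = 56.34       # validation accuracy (%)

train_per_class = TRAIN_VIDEOS / NUM_CLASSES       # 70.0
val_per_class = VAL_VIDEOS / NUM_CLASSES           # 30.0
params_per_train_video = PARAMETERS / TRAIN_VIDEOS # roughly 3,300
gap_points = TRAIN_ACC - VAL_ACC                   # about 18.7 percentage points

print(train_per_class, val_per_class)
print(round(params_per_train_video), round(gap_points, 2))
```

With thousands of parameters per training video, some memorization is hard to avoid regardless of augmentation.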

## Design Choices

**Why num_frames=8 and frame_interval=1?**

HMDB51 contains many short videos (some as short as 10-20 frames). A small temporal window (8 frames at interval 1, i.e. 8 consecutive frames) fits inside these videos and prevents:
- Frame repetition/tiling for short videos
- Loss of temporal information

This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.
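
The effect of the window size on short videos can be sketched as follows. The helper names are hypothetical, and wrapping around to the start of the video is only one possible tiling strategy, assumed here for illustration; the actual data loader may handle short videos differently.

```python
def clip_span(num_frames=8, frame_interval=1):
    """Number of source frames covered by one clip."""
    return (num_frames - 1) * frame_interval + 1


def needs_tiling(video_length, num_frames=8, frame_interval=1):
    """True if the clip does not fit and frames would have to be repeated."""
    return clip_span(num_frames, frame_interval) > video_length


def clip_indices(video_length, num_frames=8, frame_interval=1, start=0):
    """Frame indices for one clip, wrapping around when the video is too short."""
    if video_length <= 0:
        raise ValueError("video_length must be positive")
    return [(start + i * frame_interval) % video_length for i in range(num_frames)]


# A 12-frame video fits an 8-frame clip but not a 16-frame clip
print(needs_tiling(12, num_frames=8))   # False
print(needs_tiling(12, num_frames=16))  # True
print(clip_indices(12, num_frames=16))  # indices 0..11, then 0..3 repeated
```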

## Usage
```python
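import cv2
import torch
from torchvision import transforms
from torchvision.models.video import mc3_18

# Build MC3-18 with a 51-class head, then load the fine-tuned weights
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Per-frame preprocessing (normalization statistics of torchvision's video models)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989]),
])

# Load 8 consecutive RGB frames. 'video.avi' is a placeholder path, and
# starting at the first frame is just an example; pick any 8-frame window.
cap = cv2.VideoCapture('video.avi')
frames = []
while len(frames) < 8:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes as BGR
cap.release()
if len(frames) != 8:
    raise ValueError(f'expected 8 frames, got {len(frames)}')

frames = [transform(frame) for frame in frames]
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)  # (1, 51) logits
    pred = output.argmax(dim=1)   # predicted class index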
```

## Alternative Approach

We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (13% train-val gap vs 19%).

See: `mc3-18-hmdb51-ucf-transfer`

**Kinetics vs UCF-101 initialization:**
- Kinetics-400 (this model): larger pretraining dataset; fine-tuned here on short 8-frame clips
- UCF-101: closer domain to HMDB51 and better generalization, but requires 16-frame clips, which causes frame tiling on short videos

## Limitations

- Overfits on small datasets (19% train-val gap)
- Single model without ensemble
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)
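
For readers who want to try temporal test-time augmentation on top of this checkpoint, the sketch below shows one common scheme: score several evenly spaced 8-frame windows and average their softmax probabilities. It is not part of this model's reported evaluation, the helper names are hypothetical, and plain Python lists stand in for the model's per-clip logits.

```python
import math


def clip_starts(video_length, num_clips, num_frames=8):
    """Evenly spaced start indices for num_clips windows of num_frames frames."""
    last_start = max(video_length - num_frames, 0)
    if num_clips == 1:
        return [last_start // 2]  # single centered window
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_predictions(per_clip_logits):
    """Average class probabilities across clips of the same video."""
    probs = [softmax(logits) for logits in per_clip_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


starts = clip_starts(video_length=40, num_clips=3)  # [0, 16, 32]
# Toy 2-class logits for two clips; the real model would return 51 values each
video_probs = average_predictions([[2.0, 0.0], [0.0, 2.0]])
```

Averaging probabilities rather than raw logits keeps each clip's vote on the same scale; whether this improves accuracy for this checkpoint has not been measured here.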

## HMDB51 Classes

The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
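
To turn a predicted index into a readable label, a lookup list like the one below can be used. It assumes the checkpoint's label indices follow the alphabetical order of the class names (a common convention when labels come from sorted folder names); this card does not confirm the ordering, so verify it against your training label map before relying on it.

```python
# Assumption: index i corresponds to the i-th class name in alphabetical order
HMDB51_CLASSES = [
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb",
    "climb_stairs", "dive", "draw_sword", "dribble", "drink", "eat",
    "fall_floor", "fencing", "flic_flac", "golf", "handstand", "hit",
    "hug", "jump", "kick", "kick_ball", "kiss", "laugh", "pick", "pour",
    "pullup", "punch", "push", "pushup", "ride_bike", "ride_horse", "run",
    "shake_hands", "shoot_ball", "shoot_bow", "shoot_gun", "sit", "situp",
    "smile", "smoke", "somersault", "stand", "swing_baseball", "sword",
    "sword_exercise", "talk", "throw", "turn", "walk", "wave",
]


def class_name(index):
    """Map a predicted class index (0-50) to its HMDB51 label."""
    if not 0 <= index < len(HMDB51_CLASSES):
        raise IndexError(f"class index out of range: {index}")
    return HMDB51_CLASSES[index]


print(class_name(0))   # brush_hair
print(class_name(50))  # wave
```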

## Training Details

- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90

## Citation

If you use this model, please cite the original HMDB51 dataset:
```bibtex
@inproceedings{kuehne2011hmdb,
  title={HMDB: a large video database for human motion recognition},
  author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
  booktitle={2011 International Conference on Computer Vision},
  pages={2556--2563},
  year={2011},
  organization={IEEE}
}
```

And the MC3 architecture:
```bibtex
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}
```

## License

- Model weights: Apache 2.0
- Code: Apache 2.0
- HMDB51 dataset: original dataset license