fisheye8k_jozhang97_deta-swin-large

This model is a fine-tuned version of jozhang97/deta-swin-large on the Fisheye8K dataset. It was developed as part of the Mcity Data Engine project, an open-source system designed for iterative model improvement through open-vocabulary data selection.

It achieves the following results on the evaluation set:

Loss: 17.9701

Paper: Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Project Page: Mcity Data Engine Docs
Code: GitHub Repository

Model description

This model was developed as part of the Mcity Data Engine, an open-source system that covers the complete data-driven development cycle of machine learning models. The engine targets challenges in Intelligent Transportation Systems (ITS), where the goal is to detect rare and novel classes in vast amounts of unlabeled data, such as the data generated by vehicle fleets and roadside perception systems.

fisheye8k_jozhang97_deta-swin-large is an object detection model fine-tuned using the Mcity Data Engine's methodologies. It is trained on fisheye camera data and detects object categories relevant to ITS. The engine supports iterative model improvement by selecting and labeling data, especially for long-tail classes.

Intended uses & limitations

Intended Uses: This model is primarily intended for object detection tasks within Intelligent Transportation Systems (ITS). It is designed to identify objects such as Bus, Bike, Car, Pedestrian, and Truck in visual data, particularly from fisheye camera perspectives, as part of the iterative data selection and model training processes facilitated by the Mcity Data Engine. It serves as a practical demonstration and artifact of the engine's capabilities.

Limitations: Because the model is fine-tuned on a single dataset (Fisheye8K), its performance may vary on data with significantly different characteristics, environmental conditions, or object distributions. It is most useful when used within the broader Mcity Data Engine framework, which supports continuous improvement and adaptation to novel classes.

Training and evaluation data

This model was fine-tuned on the Voxel51/fisheye8k dataset, which provides fisheye camera data for Intelligent Transportation Systems applications. The training process uses the open-vocabulary data selection capabilities of the Mcity Data Engine to identify and incorporate relevant samples, including rare and long-tail classes. The model detects the following classes: Bus, Bike, Car, Pedestrian, Truck.

Sample Usage

You can use this model directly with the Hugging Face transformers library for object detection:

import torch
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import requests

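# Load an example image (replace with your own fisheye image if available).
# This example uses a standard COCO image for demonstration purposes; it is not a
# fisheye traffic scene, so few or no detections may pass the 0.9 threshold below.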
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load image processor and model from the Hugging Face Hub
model_name = "mcity-data-engine/fisheye8k_jozhang97_deta-swin-large"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(model_name)

# Process image and get predictions
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Post-process outputs to get bounding boxes, labels, and scores
target_sizes = torch.tensor([image.size[::-1]]) # (height, width) for post-processing
results = image_processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

# Print detected objects
print("Detected objects:")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"  Detected {model.config.id2label[label.item()]} "
        f"with confidence {round(score.item(), 3)} at location {box}"
    )
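
Once post-processing has run, the per-image result can be converted to plain Python values for counting or logging. The following is a minimal sketch and not part of the original card: `summarize_detections` is a hypothetical helper, and the `example_id2label` mapping and the sample scores, labels, and boxes are made up for illustration. In practice the mapping would come from `model.config.id2label`, and the lists from calling `.tolist()` on the tensors in `results`.

```python
from collections import Counter


def summarize_detections(scores, labels, boxes, id2label, min_score=0.5):
    """Count detections per class name, keeping only scores >= min_score.

    `scores`, `labels` and `boxes` are plain Python lists, for example
    results["scores"].tolist(), results["labels"].tolist() and
    results["boxes"].tolist() from the usage example.
    """
    counts = Counter()
    kept = []
    for score, label, box in zip(scores, labels, boxes):
        if score < min_score:
            continue  # drop low-confidence detections
        name = id2label[label]
        counts[name] += 1
        kept.append((name, round(score, 3), [round(v, 2) for v in box]))
    return dict(counts), kept


# Hypothetical values for illustration only; the class order is an assumption.
example_id2label = {0: "Bus", 1: "Bike", 2: "Car", 3: "Pedestrian", 4: "Truck"}
example_counts, example_kept = summarize_detections(
    scores=[0.95, 0.40, 0.81],
    labels=[2, 3, 2],
    boxes=[
        [10.0, 20.0, 110.0, 90.0],
        [5.0, 5.0, 25.0, 60.0],
        [200.0, 40.0, 320.0, 130.0],
    ],
    id2label=example_id2label,
)
print(example_counts)
```

Working on plain lists keeps the helper independent of torch, so it can be reused in logging or evaluation scripts that only receive serialized predictions.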

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 1
eval_batch_size: 8
seed: 0
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
num_epochs: 36
mixed_precision_training: Native AMP
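
To make the `cosine` scheduler setting concrete, the sketch below computes the learning rate such a schedule would produce at a given step. It rests on several assumptions that the card does not confirm: a standard half-cosine decay from the base rate to zero, no warmup (none is listed), and a schedule that spans all 36 configured epochs at 5288 steps per epoch (the step count reported after epoch 1.0 in the training results). `cosine_lr` is a hypothetical helper written for illustration, not the training code.

```python
import math

BASE_LR = 5e-05          # learning_rate from the hyperparameter list
NUM_EPOCHS = 36          # num_epochs from the hyperparameter list
STEPS_PER_EPOCH = 5288   # step count after epoch 1.0 in the training results
TOTAL_STEPS = NUM_EPOCHS * STEPS_PER_EPOCH


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Half-cosine decay from base_lr to 0, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp once the schedule has finished
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, after epoch 7, at the midpoint, and at the end.
for epoch in (0, 7, 18, 36):
    print(epoch, cosine_lr(epoch * STEPS_PER_EPOCH))
```

Under these assumptions the rate starts at 5e-05, reaches half of that at epoch 18, and decays to zero at epoch 36.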

Training results

Training Loss	Epoch	Step	Validation Loss
13.7551	1.0	5288	17.5573
12.6537	2.0	10576	17.4879
12.023	3.0	15864	17.6520
11.4167	4.0	21152	18.5138
10.8161	5.0	26440	17.7264
10.5346	6.0	31728	17.9145
10.1203	7.0	37016	17.9701
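
The table can be checked programmatically, for instance to find the epoch with the lowest validation loss. This is a small sketch with the values copied from the table above; `TRAINING_LOG` and `best_epoch` are illustrative names, not part of the training pipeline.

```python
# (epoch, step, training_loss, validation_loss), copied from the table above
TRAINING_LOG = [
    (1, 5288, 13.7551, 17.5573),
    (2, 10576, 12.6537, 17.4879),
    (3, 15864, 12.023, 17.6520),
    (4, 21152, 11.4167, 18.5138),
    (5, 26440, 10.8161, 17.7264),
    (6, 31728, 10.5346, 17.9145),
    (7, 37016, 10.1203, 17.9701),
]


def best_epoch(log):
    """Return the row with the lowest validation loss."""
    return min(log, key=lambda row: row[3])


best = best_epoch(TRAINING_LOG)
print(f"Lowest validation loss {best[3]} at epoch {best[0]} (step {best[1]})")

# Check whether the training loss decreases from every epoch to the next.
train_decreasing = all(
    prev[2] > curr[2] for prev, curr in zip(TRAINING_LOG, TRAINING_LOG[1:])
)
print("Training loss strictly decreasing:", train_decreasing)
```

In the logged epochs the training loss falls at every step of the table, while the validation loss is lowest at epoch 2 and does not improve afterwards.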

Framework versions

Transformers 4.48.3
Pytorch 2.5.1+cu124
Datasets 3.2.0
Tokenizers 0.21.0

Acknowledgements

Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support!

Citation

If you use the Mcity Data Engine in your research, feel free to cite the project:

@article{bogdoll2025mcitydataengine,
  title={Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection},
  author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  journal={arXiv preprint arXiv:2504.21614},
  year={2025}
}
