MOT-TR: DETR Fine-Tuning for Moved-Object Detection

Overview

MOT-TR is a robust, reproducible pipeline designed to fine-tune the DETR (DEtection TRansformer) model for the task of detecting moved objects in pairs of images. Instead of treating object detection as a static problem, the project focuses on scene-change analysis — identifying objects that have been displaced between a “before” and “after” view.

The pipeline spans dataset construction, preprocessing, model adaptation, staged training, evaluation, and analysis. It combines PyTorch, Hugging Face Transformers, and Torchvision to deliver an end-to-end workflow that is both modular and easy to reproduce. You can build custom datasets, configure training via a simple config file, and produce detailed quantitative and qualitative results.


Dataset & Preprocessing

  • Data organisation: Raw images are stored in data/base/cv_data_hw2/data/, while bounding-box annotations live in data/matched_annotations/. Each annotation file encodes one object per line as <class_id> <x_center> <y_center> <width> <height>, with all coordinates normalised relative to the image dimensions.

  • Image diff strategy: To highlight movement, the pipeline computes the pixel-wise difference between the “before” and “after” images and feeds this diff through a pre-trained image processor. This emphasises scene changes and proved more effective than feature-level diffs.
  • Custom dataset class: The MovedObjectDataset class handles image-annotation pairing, applies the diff preprocessing, and integrates with Hugging Face’s image processor. The dataset is split into training and validation sets with fixed seeds for reproducibility; a minimal sketch follows this list.
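For concreteness, here is a hedged sketch of what such a dataset class can look like. The file-naming convention (`*_before.png` / `*_after.png`), the one-annotation-file-per-pair layout, the split ratio, and the seed are illustrative assumptions, not the project's exact code:

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import DetrImageProcessor


def load_yolo_annotations(path):
    """Parse one '<class_id> <x_center> <y_center> <width> <height>' line per object."""
    boxes, labels = [], []
    for line in Path(path).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        labels.append(int(cls))
        boxes.append([float(xc), float(yc), float(w), float(h)])
    return torch.tensor(boxes), torch.tensor(labels, dtype=torch.long)


class MovedObjectDataset(Dataset):
    def __init__(self, image_dir, annotation_dir, processor):
        self.annotations = sorted(Path(annotation_dir).glob("*.txt"))
        self.image_dir = Path(image_dir)
        self.processor = processor

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        ann_path = self.annotations[idx]
        stem = ann_path.stem  # assumed naming: <stem>_before.png / <stem>_after.png
        before = Image.open(self.image_dir / f"{stem}_before.png").convert("RGB")
        after = Image.open(self.image_dir / f"{stem}_after.png").convert("RGB")

        # Pixel-wise absolute difference: moved objects light up, static scenery cancels out.
        diff = Image.fromarray(
            np.abs(np.asarray(after, dtype=np.int16)
                   - np.asarray(before, dtype=np.int16)).astype(np.uint8)
        )

        boxes, labels = load_yolo_annotations(ann_path)
        encoding = self.processor(images=diff, return_tensors="pt")
        # DETR expects targets as dicts with normalised cxcywh boxes.
        target = {"class_labels": labels, "boxes": boxes}
        return encoding["pixel_values"].squeeze(0), target


processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
ds = MovedObjectDataset("data/base/cv_data_hw2/data/", "data/matched_annotations/", processor)

# Reproducible train/validation split (the 80/20 ratio and seed 42 are assumptions).
n_train = int(0.8 * len(ds))
train_ds, val_ds = torch.utils.data.random_split(
    ds, [n_train, len(ds) - n_train], generator=torch.Generator().manual_seed(42))
```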

Model Architecture & Training

  • Model: Fine-tunes facebook/detr-resnet-50, adapting its classification head to the number of custom classes so that training starts from the model’s pre-trained visual and spatial features (see the first sketch after this list).
  • Training strategies (three stages, matching the evaluation stages below):
    • Stage 1: Train only the classification head (20 epochs).
    • Stage 2: Unfreeze all layers and reduce the learning rate.
    • Stage 3: Continue staged training with progressively smaller learning rates.
  • Techniques (both shown in the training-loop sketch after this list):
    • Gradient accumulation to simulate larger batch sizes on limited GPUs.
    • Cosine learning-rate schedule with restarts for better convergence.
  • Hyperparameters: Configurable via code/config.py. A typical run uses:
    • Raw batch size: 2 (for a Tesla T4 GPU)
    • Accumulation steps: 16 (effective batch size 32)
    • Cosine-with-restarts scheduler
    • ~80 epochs of training
  • Logging & monitoring: Metrics (loss, precision, recall, F1) logged via TensorBoard. The best model is automatically checkpointed.
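As a first sketch, the head adaptation and the stage-1 freeze are only a few lines in Transformers. The class count below is a placeholder, and the parameter-name filter reflects the current Hugging Face DETR implementation:

```python
from transformers import DetrForObjectDetection

NUM_CLASSES = 5  # placeholder: set to the project's number of moved-object classes

model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=NUM_CLASSES,
    ignore_mismatched_sizes=True,  # re-initialise only the classification head
)

# Stage 1: freeze everything except the new classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("class_labels_classifier")

# Stage 2+: unfreeze all layers and drop the learning rate before continuing.
# for param in model.parameters():
#     param.requires_grad = True
```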
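And here is the training-loop sketch referenced above, combining gradient accumulation with a cosine-with-restarts schedule. The batch-size and epoch values mirror the hyperparameters listed; the learning rate, warm-up steps, cycle count, and loop structure are assumptions about the project's trainer:

```python
import torch
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

ACCUM_STEPS = 16   # raw batch size 2 -> effective batch size 32
EPOCHS = 80

# train_loader is assumed to use a collate_fn that pads pixel_values into a
# batch tensor and keeps the targets as a list of dicts, as DETR expects.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,                                       # assumption
    num_training_steps=EPOCHS * len(train_loader) // ACCUM_STEPS,
    num_cycles=3,                                               # restarts
)

model.train()
for epoch in range(EPOCHS):
    for step, (pixel_values, targets) in enumerate(train_loader):
        outputs = model(pixel_values=pixel_values, labels=targets)
        # Scale the loss so gradients average over the effective batch of 32.
        (outputs.loss / ACCUM_STEPS).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```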

Evaluation & Results

  • Quantitative metrics:
    • Stage 1: Eval loss ≈ 2.38
    • Stage 2: Eval loss ≈ 1.62
    • Stage 3: Eval loss ≈ 0.60
    Staged unfreezing with careful LR tuning cut the eval loss roughly four-fold (2.38 → 0.60).
  • Qualitative analysis: Visualisations overlay predicted bounding boxes on the diff images (see the sketch below). Successes show accurate moved-object detection, while common failures include excessive differences caused by camera movement and missed detections due to limited training data.
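A plausible sketch of that overlay step, using the image processor’s built-in post-processing. The 0.5 confidence threshold and the drawing style are arbitrary choices, and `pixel_values` / `diff_image` are assumed to come from the dataset code above:

```python
import torch
from PIL import ImageDraw

model.eval()
with torch.no_grad():
    outputs = model(pixel_values=pixel_values.unsqueeze(0))

# Convert DETR's normalised cxcywh predictions into absolute xyxy boxes.
target_sizes = torch.tensor([diff_image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(diff_image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    draw.rectangle(box.tolist(), outline="red", width=2)
    draw.text((box[0].item(), box[1].item()), f"{label.item()}: {score:.2f}", fill="red")
diff_image.save("prediction_overlay.png")
```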

Key Contributions

  • End-to-end pipeline: Includes ground-truth annotation tools, custom dataloaders, preprocessing with image diffs, model adaptation, staged training, and evaluation.
  • Performance gains: Reduced eval loss from 2.38 to 0.60, stabilising convergence on a small dataset.
  • Reproducibility: Fixed random seeds, modular configs, clear directory structure, and automated logging/checkpointing.

Lessons Learned & Future Work

This project underscored the importance of:

  • Thoughtful preprocessing (image diffs matter).
  • Careful learning-rate tuning.
  • Staged training strategies for small datasets.

Future directions:

  • Data augmentation (geometric + photometric).
  • Hyperparameter optimisation (Optuna, Ray Tune).
  • Advanced evaluation metrics (mAP).
  • Camera-motion compensation to reduce false positives.

📄 Full Technical Report and Code available at