MOT-TR: DETR Fine-Tuning for Moved-Object Detection
Overview
MOT-TR is a robust, reproducible pipeline designed to fine-tune the DETR (DEtection TRansformer) model for the task of detecting moved objects in pairs of images. Instead of treating object detection as a static problem, the project focuses on scene-change analysis — identifying objects that have been displaced between a “before” and “after” view.
The pipeline spans dataset construction, preprocessing, model adaptation, staged training, evaluation, and analysis. It combines PyTorch, Hugging Face Transformers, and Torchvision to deliver an end-to-end workflow that is both modular and easy to reproduce. You can build custom datasets, configure training via a simple config file, and produce detailed quantitative and qualitative results.
Dataset & Preprocessing
- Data organisation: Raw images are stored in `data/base/cv_data_hw2/data/`, while bounding-box annotations live in `data/matched_annotations/`. Each annotation file encodes one object per line as: `<class_id> <x_center> <y_center> <width> <height>`. All coordinates are normalised relative to the image dimensions.
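A line in this format can be parsed in a few lines of Python. The helper below is a hypothetical sketch (not part of the shipped code); the field names follow the annotation format above:

```python
def parse_annotation_line(line: str) -> dict:
    """Parse one YOLO-style annotation line into a labelled dict.

    Expected format: "<class_id> <x_center> <y_center> <width> <height>",
    with all coordinates normalised to [0, 1].
    """
    class_id, x_c, y_c, w, h = line.split()
    return {
        "class_id": int(class_id),
        "x_center": float(x_c),
        "y_center": float(y_c),
        "width": float(w),
        "height": float(h),
    }


def load_annotations(path: str) -> list:
    """Read every non-empty line of an annotation file."""
    with open(path) as f:
        return [parse_annotation_line(line) for line in f if line.strip()]
```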
- Image diff strategy: To highlight movement, the pipeline computes the pixel-wise difference between the “before” and “after” images and feeds this diff through a pre-trained image processor. This emphasises scene changes and proved more effective than feature-level diffs.
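A minimal version of the diff step can be written with NumPy; in the pipeline, the resulting diff image is what gets handed to the pre-trained image processor. Array shapes and dtypes here are assumptions:

```python
import numpy as np


def image_diff(before: np.ndarray, after: np.ndarray) -> np.ndarray:
    """Pixel-wise absolute difference of two aligned uint8 images.

    Regions that changed between the "before" and "after" views come out
    bright; static background stays near zero. The result is an ordinary
    HxWx3 uint8 image, so it can be fed to DETR's image processor like
    any other photo.
    """
    if before.shape != after.shape:
        raise ValueError("before/after images must share a shape")
    # Compute in a wider dtype to avoid uint8 wrap-around on subtraction.
    diff = np.abs(before.astype(np.int16) - after.astype(np.int16))
    return diff.astype(np.uint8)
```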
- Custom dataset class: The `MovedObjectDataset` class handles image-annotation pairing, applies diff preprocessing, and integrates with Hugging Face's image processor. The dataset is split into training and validation sets with fixed seeds for reproducibility.
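The skeleton of such a dataset might look like the sketch below. The directory layout, file-stem matching rule, and seed value are assumptions; the real class additionally runs the diff and the Hugging Face image processor inside `__getitem__`. A plain class with `__len__`/`__getitem__` is duck-type compatible with a PyTorch `DataLoader`:

```python
import os
import random


class MovedObjectDataset:
    """Sketch of a map-style dataset pairing diff images with annotations.

    Hypothetical layout: every annotation file <stem>.txt in ann_dir has a
    matching image <stem>.png in image_dir.
    """

    def __init__(self, image_dir: str, ann_dir: str):
        stems = sorted(
            os.path.splitext(f)[0]
            for f in os.listdir(ann_dir)
            if f.endswith(".txt")
        )
        self.pairs = [
            (os.path.join(image_dir, s + ".png"),
             os.path.join(ann_dir, s + ".txt"))
            for s in stems
        ]

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int):
        image_path, ann_path = self.pairs[idx]
        # Real code: load both views, diff them, run the image processor,
        # and parse the annotation file into labels + boxes.
        return image_path, ann_path


def split_indices(n: int, val_fraction: float = 0.2, seed: int = 42):
    """Deterministic train/val index split, reproducible via a fixed seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - val_fraction))
    return idx[:cut], idx[cut:]
```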
Model Architecture & Training
- Model: Fine-tunes `facebook/detr-resnet-50`, adapting its classification head to the number of custom classes. Transfer learning leverages pre-trained visual and spatial features.
- Training strategies:
- Train only the classification head (20 epochs).
- Unfreeze all layers and reduce the learning rate.
- Continue staged training with progressively smaller learning rates.
- Techniques:
- Gradient accumulation to simulate larger batch sizes on limited GPUs.
- Cosine learning-rate schedule with restarts for better convergence.
- Hyperparameters: Configurable via `code/config.py`. A typical run uses:
- Raw batch size: 2 (for a Tesla T4 GPU)
- Accumulation steps: 16 (effective batch size 32)
- Cosine-with-restarts scheduler
- ~80 epochs of training
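A config file for such a run might look like the fragment below. The constant names are illustrative guesses at the shape of `code/config.py`, not its actual contents; only the values come from the run described above:

```python
# Hypothetical sketch of code/config.py -- names are illustrative,
# values match the typical run described above.
RAW_BATCH_SIZE = 2                # what fits on a Tesla T4
ACCUM_STEPS = 16                  # gradient-accumulation steps
EFFECTIVE_BATCH_SIZE = RAW_BATCH_SIZE * ACCUM_STEPS  # = 32
NUM_EPOCHS = 80
LR_SCHEDULER = "cosine_with_restarts"
SEED = 42                         # assumed fixed seed for reproducibility
```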
- Logging & monitoring: Metrics (loss, precision, recall, F1) logged via TensorBoard. The best model is automatically checkpointed.
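Best-model checkpointing reduces to tracking the lowest eval loss seen so far. The class below is a dependency-free sketch of that bookkeeping; the actual save call (e.g. `torch.save` or `save_pretrained`) is left out and would go wherever `update()` returns True:

```python
class BestCheckpoint:
    """Track the best eval loss and signal when to save a checkpoint.

    Sketch of automatic best-model checkpointing: the training loop calls
    update() after each evaluation and saves the model only when it
    returns True.
    """

    def __init__(self):
        self.best = float("inf")

    def update(self, eval_loss: float) -> bool:
        if eval_loss < self.best:
            self.best = eval_loss
            return True
        return False
```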
Evaluation & Results
- Quantitative metrics:
- Stage 1: Eval loss ≈ 2.38
- Stage 2: Eval loss ≈ 1.62
- Stage 3: Eval loss ≈ 0.60
This represents a nearly four-fold reduction in eval loss (2.38 to 0.60) across the three stages, achieved through staged unfreezing and careful LR tuning.
- Qualitative analysis: Visualisations overlay predicted bounding boxes on diff images. Successes show accurate moved-object detection, while common failures include excessive differences caused by camera movement and missed detections due to limited training data.
Key Contributions
- End-to-end pipeline: Includes ground-truth annotation tools, custom dataloaders, preprocessing with image diffs, model adaptation, staged training, and evaluation.
- Performance gains: Reduced eval loss from 2.38 to 0.60, stabilising convergence on a small dataset.
- Reproducibility: Fixed random seeds, modular configs, clear directory structure, and automated logging/checkpointing.
Lessons Learned & Future Work
This project underscored the importance of:
- Thoughtful preprocessing (image diffs matter).
- Careful learning-rate tuning.
- Staged training strategies for small datasets.
Future directions:
- Data augmentation (geometric + photometric).
- Hyperparameter optimisation (Optuna, Ray Tune).
- Advanced evaluation metrics (mAP).
- Camera-motion compensation to reduce false positives.
