Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection
Quick Take
Contrastive-SDXL is a day-to-night augmentation framework that improves night-time pedestrian detection by preserving pedestrians and local semantic structure when daytime images are translated to night.
Key Points
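- Builds on SDXL-Turbo, a latent diffusion model, fine-tuned with Low-Rank Adaptation (LoRA) for day-to-night image translation.
- Introduces a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder, plus an object consistency loss that encourages pedestrian preservation.
- Detectors trained on the synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline.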
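To make the miss-rate figure concrete, here is a minimal sketch of how a miss rate and its reduction can be computed. The summary does not say whether the 6-7% is an absolute or a relative reduction, nor which evaluation protocol is used (pedestrian benchmarks often report a log-average miss rate), so the function names and all counts below are hypothetical and chosen only for illustration.

```python
def miss_rate(true_positives, false_negatives):
    """Fraction of ground-truth pedestrians the detector failed to find."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no ground-truth pedestrians to evaluate")
    return false_negatives / total


def reductions(baseline, improved):
    """Return (absolute, relative) reduction between two miss rates."""
    absolute = baseline - improved
    relative = absolute / baseline if baseline else 0.0
    return absolute, relative


# Hypothetical counts, NOT taken from the paper.
baseline_mr = miss_rate(true_positives=700, false_negatives=300)
augmented_mr = miss_rate(true_positives=765, false_negatives=235)
abs_drop, rel_drop = reductions(baseline_mr, augmented_mr)
print(f"baseline={baseline_mr:.3f} augmented={augmented_mr:.3f}")
print(f"absolute drop={abs_drop:.3f} relative drop={rel_drop:.3f}")
```

With these made-up counts the absolute drop is 6.5 percentage points while the relative drop is larger, which is why it matters which of the two a reported reduction refers to.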
Abstract: Night-time pedestrian detection remains challenging because labelled night-time data are limited and large illumination differences make detectors trained only on daytime data unreliable. Latent diffusion models (LDMs) provide a powerful basis for image-to-image translation and cross-domain augmentation, but their effectiveness in safety-critical perception depends on whether detector-relevant objects and local semantic structure are preserved when translating between source and target domains. In this work, we present Contrastive-SDXL, a day-to-night augmentation framework for night-time pedestrian detection built on SDXL-Turbo and fine-tuned using Low-Rank Adaptation (LoRA). To preserve semantic correspondence between daytime inputs and translated night-time images, we introduce a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder rather than generator encoder features. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, while an object consistency loss explicitly encourages pedestrian preservation. Contrastive-SDXL produces realistic night-time images, achieving a Fréchet Inception Distance (FID) of 22.5. Detectors trained with our synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline, approaching the performance of detectors trained on real night-time data. These results demonstrate that consistency-driven diffusion augmentation can effectively support safety-critical night-time pedestrian detection.
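The patch-wise contrastive idea in the abstract can be illustrated with a small InfoNCE-style sketch. It assumes each image has already been reduced to one feature vector per patch (in the paper these come from a pretrained DINOv2 encoder; here they are plain lists of floats), and that patch i of the translated image should match patch i of the daytime source while differing from the other source patches. The function names, the temperature value, and the toy features are assumptions for illustration; the attention-map guidance and the object consistency loss are not modelled, and this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def patch_contrastive_loss(src_feats, out_feats, temperature=0.07):
    """Mean InfoNCE loss over patches.

    src_feats[i] is the feature of patch i in the daytime source image and
    out_feats[i] the feature of the same location in the translated image.
    The positive pair is (out_feats[i], src_feats[i]); every other source
    patch acts as a negative.
    """
    if len(src_feats) != len(out_feats) or not src_feats:
        raise ValueError("need two equal-length, non-empty patch lists")
    keys = [l2_normalize(f) for f in src_feats]
    queries = [l2_normalize(f) for f in out_feats]
    total = 0.0
    for i, query in enumerate(queries):
        # Cosine similarity of this query to every source patch, scaled.
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denom - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)


# Toy features, hypothetical: three patches with 2-D embeddings.
src = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]    # structure preserved
shuffled = [aligned[1], aligned[2], aligned[0]]    # structure scrambled
print(patch_contrastive_loss(src, aligned))
print(patch_contrastive_loss(src, shuffled))
```

A translation that keeps each patch semantically close to its source location yields a lower loss than one whose content has moved, which is the property such a loss rewards during fine-tuning.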
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2605.16406 [cs.CV] (or arXiv:2605.16406v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16406 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Muhammad Khalid Dr
[v1] Wed, 13 May 2026 10:41:55 UTC (6,273 KB)
— Originally published at arxiv.org