Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation
Quick Take
The paper presents an RGB-thermal fusion architecture for semantic segmentation under adverse lighting, where RGB images alone provide insufficient information.
Key Points
- Utilizes dual ConvNeXt V2 backbones for multi-modal fusion.
- Introduces frequency-based fusion for early-stage thermal features.
- Lightest variant reaches 61.73% mIoU on MFNet and 86.24% on PST900 with 35.43M parameters, at lower computational cost than recent methods.
Article Content
From source RSS / original summary. arXiv:2605.26273v1. Abstract: Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem.
In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies.
For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism.
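The early-stage idea can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names (`gaussian_blur`, `split_frequencies`, `spatial_attention`, `frequency_fusion`), the `sigma` value, and the parameter-free attention and gate are all assumptions made for illustration; in the paper these branches are learned layers whose details the abstract does not give.

```python
import numpy as np

def gaussian_blur(feat, sigma=1.0):
    """Separable Gaussian blur over the spatial axes of a (C, H, W) array."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()  # normalise so flat regions are preserved
    h, w = feat.shape[1], feat.shape[2]
    padded = np.pad(feat, ((0, 0), (radius, radius), (radius, radius)), mode="reflect")
    # Horizontal pass, then vertical pass.
    tmp = sum(k[i] * padded[:, :, i:i + w] for i in range(2 * radius + 1))
    return sum(k[i] * tmp[:, i:i + h, :] for i in range(2 * radius + 1))

def split_frequencies(thermal, sigma=1.0):
    """Low-pass = Gaussian blur; high-pass = residual (edges, fine detail)."""
    low = gaussian_blur(thermal, sigma)
    return low, thermal - low

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Hypothetical stand-in for a learned branch: pool over channels, squash to (0, 1)."""
    pooled = 0.5 * (feat.mean(axis=0, keepdims=True) + feat.max(axis=0, keepdims=True))
    return sigmoid(pooled)  # shape (1, H, W)

def frequency_fusion(rgb, thermal, sigma=1.0):
    """Add attended thermal features to RGB features through a per-pixel gate."""
    low, high = split_frequencies(thermal, sigma)
    low_att = low * spatial_attention(low)             # smooth thermal patterns
    high_att = high * spatial_attention(np.abs(high))  # boundaries and fine detail
    refined = low_att + high_att
    # Assumed confidence gate: one value in (0, 1) per pixel.
    gate = sigmoid(np.abs(refined).mean(axis=0, keepdims=True))
    return rgb + gate * refined  # residual connection keeps the RGB path intact
```

Because the high-frequency part is defined as the residual of the blur, the two components always sum back to the input, and a zero thermal feature map leaves the RGB features unchanged.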
For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION
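For readers unfamiliar with the metric, mIoU is the per-class intersection over union, TP / (TP + FP + FN), averaged over classes. The sketch below computes it from a confusion matrix. The helper names `confusion_matrix` and `mean_iou` are illustrative, and skipping classes that appear in neither prediction nor ground truth is an assumption; the abstract does not describe the paper's exact evaluation protocol.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Drop pixels whose label lies outside [0, num_classes), e.g. an ignore label.
    valid = (target >= 0) & (target < num_classes)
    idx = target[valid] * num_classes + pred[valid]
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(cm):
    """Average IoU over the classes that occur in prediction or ground truth."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp  # TP + FP + FN per class
    present = union > 0
    return float((tp[present] / union[present]).mean())
```

In practice the confusion matrix is accumulated over the whole test set before `mean_iou` is applied once, rather than averaging per-image scores.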
