Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers
Quick Answer
The proposed framework uses VGGT, a 3D feed-forward transformer, for unpaired RGB-thermal novel view synthesis, achieving competitive thermal view synthesis while maintaining RGB fidelity.
Quick Take
The method avoids the precisely calibrated RGB-thermal pairs that existing approaches rely on. VGGT estimates camera poses for each modality independently, and the two pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher. A multi-modal 3D Gaussian Splatting approach learns directly from the unpaired images, and a new benchmarking framework evaluates both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
Key Points
- Introduces unpaired RGB-thermal novel view synthesis built on VGGT, a 3D feed-forward transformer architecture.
- Achieves competitive thermal view synthesis while maintaining RGB fidelity.
- Aligns independently estimated camera poses across modalities using the Procrustes algorithm with a cross-modal feature matcher, without paired calibration.
- Proposes a benchmarking framework that evaluates both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
- Reports experiments on diverse scenes and shows that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency.
Article Content
From source RSS / original summary
arXiv:2606.05491v1, announce type: new
Abstract: Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment.
To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images.
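The alignment step can be illustrated with a minimal sketch of similarity Procrustes (the closed-form Umeyama solution), which finds the scale, rotation, and translation that best map one 3D point set onto another. This is not the authors' implementation: the abstract does not say how correspondences are built from the feature matcher, so the sketch assumes matched 3D points are already available, and the function name `similarity_procrustes` is hypothetical.

```python
import numpy as np


def similarity_procrustes(src, dst):
    """Return (s, R, t) minimizing sum ||s * R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points (assumed given).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Center both point sets on their centroids.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered sets, then its SVD.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt

    # Optimal scale and translation follow from R in closed form.
    var_src = (src_c ** 2).sum() / len(src)
    s = (S * d).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Usage: recover a known transform from synthetic correspondences.
rng = np.random.default_rng(0)
points_a = rng.normal(size=(10, 3))
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
s_true, t_true = 2.5, np.array([1.0, -2.0, 0.5])
points_b = s_true * points_a @ R_true.T + t_true

s_est, R_est, t_est = similarity_procrustes(points_a, points_b)
```

A similarity transform (rather than a rigid one) is used here on the assumption that two independently estimated reconstructions may differ in global scale as well as in rotation and translation.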
Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
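The abstract does not name the metrics used in the benchmark. As an illustration of per-modality image-synthesis evaluation only, the sketch below computes PSNR, a common image-quality metric, between a rendered view and its reference. The choice of PSNR and the function name `psnr` are assumptions, and the sketch does not cover the multi-modal coherence part of the benchmark.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    rendered = np.asarray(rendered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


# Usage: a constant error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.zeros((4, 4))
rendered = np.full((4, 4), 0.1)
score = psnr(rendered, reference)
```

Under this assumption, the same function would be applied separately to the RGB renders and the thermal renders, giving one score per modality.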