Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

arXiv cs.CV·Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

6/5/2026

Quick Answer

The proposed framework uses VGGT, a 3D feed-forward transformer, to perform novel view synthesis from unpaired RGB and thermal images, reaching competitive thermal view synthesis while maintaining RGB fidelity.

Quick Take

The framework uses VGGT, a 3D feed-forward transformer, to estimate camera poses independently for the RGB and thermal image sets. The two pose sets are then aligned with the Procrustes algorithm and a cross-modal feature matcher, which removes the need for the calibrated RGB-thermal pairs or stereo setups that existing methods rely on. A multi-modal 3D Gaussian Splatting model is trained on the aligned, unpaired images and achieves competitive thermal view synthesis while maintaining RGB fidelity. The authors also introduce a benchmarking framework that evaluates both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.

Key Points

Introduces unpaired RGB-thermal novel view synthesis built on VGGT, a 3D feed-forward transformer that estimates camera poses independently for each modality.
Achieves competitive performance in thermal view synthesis while maintaining RGB fidelity.
Aligns the two pose sets with the Procrustes algorithm and a cross-modal feature matcher, so no paired calibration is required.
Proposes a benchmarking framework that evaluates both per-modality image synthesis and multi-modal coherence of the reconstructed scene.
Shows on diverse scenes that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency.

Article Content

From source RSS / original summary

arXiv:2606.05491v1

Abstract: Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment.

To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images.
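To make the alignment step concrete, here is a minimal sketch of a similarity Procrustes fit, deliberately simplified to 2D so that it runs on the Python standard library alone. It is an illustration under stated assumptions, not the authors' implementation: the paper aligns 3D pose sets, where the same least-squares problem is usually solved with an SVD, and the matched points would come from the cross-modal feature matcher. The function names, the point values, and the use of complex numbers to encode 2D points are all hypothetical choices made for this example.

```python
import cmath
import math


def procrustes_2d(src, dst):
    """Least-squares similarity transform mapping src onto dst.

    Points are complex numbers (x + yj). Returns (a, b) such that
    a * p + b approximates q for each matched pair (p, q). The complex
    number a encodes scale (its magnitude) and rotation (its phase).
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    # Centre both sets so that translation drops out of the fit.
    src_c = [p - src_mean for p in src]
    dst_c = [q - dst_mean for q in dst]
    denom = sum(abs(p) ** 2 for p in src_c)
    if denom == 0:
        raise ValueError("source points are all identical")
    # Closed-form minimiser of sum |a * p - q|^2 over complex a.
    a = sum(p.conjugate() * q for p, q in zip(src_c, dst_c)) / denom
    b = dst_mean - a * src_mean
    return a, b


def apply_transform(a, b, points):
    """Map points from the source frame into the target frame."""
    return [a * p + b for p in points]


# Hypothetical matched points: the "thermal" frame differs from the
# "RGB" frame by scale 2, a 30 degree rotation, and a translation.
true_a = cmath.rect(2.0, math.pi / 6)
true_b = 1.0 - 3.0j
thermal_pts = [0 + 0j, 1 + 0j, 1 + 2j, -1 + 1j]
rgb_pts = [true_a * p + true_b for p in thermal_pts]

a, b = procrustes_2d(thermal_pts, rgb_pts)
scale = abs(a)                            # recovers roughly 2.0
angle_deg = math.degrees(cmath.phase(a))  # recovers roughly 30 degrees
aligned = apply_transform(a, b, thermal_pts)
```

Because the example data are noise-free, the fit recovers the transform up to floating-point error; with real feature matches the same formula gives the least-squares compromise across all pairs.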

Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
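The abstract does not name the metrics used by the benchmark. As an assumption for illustration only, the sketch below scores per-modality image synthesis with PSNR, a metric commonly reported for novel view synthesis. The function name, the flat-list image representation, and the pixel values are hypothetical, and the sketch does not attempt to measure cross-modal coherence.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images, given as flat lists of intensities in [0, max_value]."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical held-out views, one per modality (assumed values).
views = {
    "rgb": ([0.5, 0.5, 0.5, 0.5], [0.4, 0.6, 0.5, 0.5]),
    "thermal": ([0.2, 0.8, 0.3, 0.7], [0.2, 0.8, 0.3, 0.7]),
}

# Score each modality separately, mirroring a per-modality evaluation.
scores = {name: psnr(render, truth) for name, (render, truth) in views.items()}
```

Reporting one score per modality makes it visible when a reconstruction renders one modality well at the expense of the other.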

