Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning
Quick Answer
This paper presents a sensor-guided adaptive contrastive learning framework for PointGoal navigation, leveraging LiDAR data to enhance visual representation learning.
Quick Take
The method significantly improves policy-level scene transfer across diverse indoor and outdoor environments, outperforming large pretrained vision models and standard contrastive baselines, while relying only on monocular RGB observations at deployment.
Key Points
- Introduces a geometry-aware similarity metric for contrastive learning.
- Decouples representation learning from policy optimization using a frozen encoder.
- Demonstrates significant improvements in scene transfer across indoor and outdoor settings.
- At deployment, the agent uses only monocular RGB and standard task-related inputs (goal position, proprioceptive signals), with no LiDAR.
- Releases a multimodal dataset for future research in navigation representation learning.
Article Content
From source RSS / original summary
arXiv:2606.05506v1 Announce Type: new
Abstract: We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance.
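The abstract does not give the exact form of the geometry-aware similarity metric or the temperature schedule, so the following is only a minimal sketch of how a LiDAR-guided contrastive objective could look. Everything specific here is an assumption made for illustration: the function name `lidar_guided_contrastive_loss`, the Gaussian kernel on scan distance, and the parameters `sigma`, `tau_min`, and `tau_max`. It should not be read as the authors' formulation.

```python
import numpy as np

def lidar_guided_contrastive_loss(z, lidar, sigma=1.0, tau_min=0.05, tau_max=0.5):
    """Soft-target contrastive loss guided by LiDAR similarity (illustrative only).

    z:     (N, D) visual embeddings from the RGB encoder
    lidar: (N, K) range scans for the same N frames (training-time only)
    """
    n = z.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Assumed geometry-aware similarity: Gaussian kernel on scan distance.
    d = np.linalg.norm(lidar[:, None, :] - lidar[None, :, :], axis=-1)
    g = np.exp(-(d ** 2) / (2.0 * sigma ** 2)) * off_diag

    # Soft targets: each row is a distribution over the other samples.
    targets = g / np.maximum(g.sum(axis=1, keepdims=True), 1e-12)

    # Assumed adaptive temperature: anchors with many geometric neighbours
    # get a larger (softer) temperature, bounded by [tau_min, tau_max].
    mean_sim = g.sum(axis=1) / (n - 1)
    tau = tau_min + (tau_max - tau_min) * mean_sim

    # Cosine-similarity logits scaled per anchor, with self-pairs masked out.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = (z @ z.T) / tau[:, None]
    logits = np.where(off_diag, logits, -np.inf)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy between LiDAR-derived targets and visual similarities.
    log_prob = np.where(off_diag, log_prob, 0.0)
    return float(-(targets * log_prob).sum(axis=1).mean())
```

Under these assumptions, the loss is low when frames with similar scans also have similar embeddings, regardless of how the frames look, which is one way to read "navigation-relevant structure rather than scene-specific appearance".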
The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features.
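The decoupling described above can be sketched as a frozen encoder whose output is concatenated with the task inputs, while only the policy weights change. This is a toy illustration under stated assumptions: `FrozenEncoder`, `policy_input`, and `reinforce_step` are hypothetical names, the encoder is a single linear layer, and a plain REINFORCE update on a linear-softmax policy stands in for whatever reinforcement learning algorithm the paper actually uses, which this summary does not specify.

```python
import numpy as np

class FrozenEncoder:
    """Stand-in for the pretrained RGB encoder; its weights are never updated."""

    def __init__(self, weights):
        self.weights = weights.copy()
        self.weights.setflags(write=False)  # guard against accidental updates

    def __call__(self, rgb):
        # Flatten the image and project it to an embedding vector.
        return np.tanh(rgb.reshape(-1) @ self.weights)

def policy_input(encoder, rgb, goal, proprio):
    """Concatenate the frozen visual embedding with the task inputs."""
    return np.concatenate([encoder(rgb), goal, proprio])

def reinforce_step(policy_w, obs, action, advantage, lr=0.01):
    """One REINFORCE update for a linear-softmax policy (swapped-in algorithm)."""
    logits = obs @ policy_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.eye(policy_w.shape[1])[action]
    # Gradient of log pi(action | obs) with respect to the policy weights.
    grad_log_pi = np.outer(obs, one_hot - probs)
    return policy_w + lr * advantage * grad_log_pi
```

Only `policy_w` is returned and updated; the encoder weights are read-only, so gradients from the policy objective cannot reshape the representation.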
Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors.
Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:
