Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception
Quick Answer
This study introduces a framework to enhance radar-camera BEV perception robustness by modeling source-domain variations, improving 3D detection performance across datasets without target-domain samples.
Quick Take
In cross-dataset radar-camera 3D detection experiments between View-of-Delft and TJ4DRadSet, the framework yields consistent gains across multiple BEV fusion backbones, and the gains hold when a small amount of target-domain data is available.
Key Points
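- Characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views.
- Compares the fused BEV representations of those views to capture how image-level variations affect multi-modal BEV features.
- Uses these variation patterns to regularize the detector so the learned fusion space stays stable under latent scene changes.
- Improves cross-dataset 3D detection between View-of-Delft and TJ4DRadSet across multiple BEV fusion backbones, without using target-domain samples.
- Applied only during training, leaving the inference pipeline unchanged.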
Article Content
From source RSS / original summary
arXiv:2606.11573v1 Announce Type: new
Abstract: Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples.
We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes.
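The view-synthesis and regularization steps can be pictured with the NumPy sketch below. It is not the paper's implementation: it swaps in Fourier amplitude mixing, a common frequency-domain augmentation, and a plain mean-squared penalty between feature maps. The function names `mix_amplitude` and `bev_consistency_penalty`, the `lam` parameter, and the array layouts are all assumptions made for illustration.

```python
import numpy as np


def mix_amplitude(image, reference, lam=0.5):
    """Synthesize a view of `image` by blending its Fourier amplitude with
    that of `reference` while keeping the phase of `image`.

    Both inputs are assumed to be float arrays of shape (H, W, C) in [0, 1].
    `lam` = 0 returns the original image; larger values move its amplitude
    spectrum toward the reference.
    """
    spec_img = np.fft.fft2(image, axes=(0, 1))
    spec_ref = np.fft.fft2(reference, axes=(0, 1))

    # Amplitude is treated here as the carrier of scene appearance,
    # phase as the carrier of spatial structure (an assumption of this sketch).
    amplitude = (1.0 - lam) * np.abs(spec_img) + lam * np.abs(spec_ref)
    phase = np.angle(spec_img)

    mixed = amplitude * np.exp(1j * phase)
    view = np.fft.ifft2(mixed, axes=(0, 1)).real
    return np.clip(view, 0.0, 1.0)


def bev_consistency_penalty(bev_original, bev_variant):
    """Mean squared difference between two fused BEV feature maps of equal
    shape, used as a stand-in for a stability regularizer."""
    diff = np.asarray(bev_original) - np.asarray(bev_variant)
    return float(np.mean(diff ** 2))
```

In a training loop, a penalty of this kind would be computed between the fused BEV features of the original and synthesized views and added to the detection loss with a weighting factor. Nothing in the sketch is needed at inference time.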
The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.