SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining
Quick Answer
SceneMiner introduces a unified BEV pipeline for mining challenging driving scenes without LiDAR, achieving mAP 0.4614 on 20 scene tags.
Quick Take
SceneMiner uses identity-preserving multi-task fine-tuning to remove cross-task interference: the existing mining heads stay bit-identical while only ~102k parameters are trained. Code is available at the anonymized repository linked in the abstract.
Key Points
- SceneMiner uses a frozen vision-language backbone for efficient scene mining.
- Achieves micro-F1 score of 0.5557 on multi-label scene tagging.
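- Emits a continuous physics-based risk score as one of its mining signals; the motion forecast is a byproduct, not a contribution.
- Cross-task interference is removed by zero-initializing every new sub-module and freezing every parameter that feeds the shared activation stream.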
- Code is publicly available for research and development.
Article Content
From source RSS / original summary
arXiv:2606.11507v1 Announce Type: new
Abstract: Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own.
We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution).
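As a rough illustration of text-prompted scenario search over a retrieval embedding, the sketch below ranks scenes by cosine similarity to a query vector. The similarity measure, the function names `cosine` and `retrieve`, and the toy 3-dimensional vectors are all assumptions for illustration; the summary does not say how SceneMiner scores or ranks matches.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, scene_vecs, top_k=2):
    """Return the ids of the top_k scenes most similar to the query."""
    ranked = sorted(
        scene_vecs,
        key=lambda sid: cosine(query_vec, scene_vecs[sid]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy embeddings (hypothetical); real ones would come from the embedding head.
scenes = {
    "scene_a": [0.9, 0.1, 0.0],
    "scene_b": [0.0, 1.0, 0.0],
    "scene_c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a text prompt
top = retrieve(query, scenes)
print(top)
```

In this toy setup the query points along the first axis, so the scenes whose embeddings lean most in that direction are returned first.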
Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream.
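A minimal sketch of why zero-initialization matters, under stated assumptions: a new sub-module is added residually to a shared stream, and a weight-frozen sibling head reads that stream. The residual adapter form, the names `make_adapter` and `frozen_head`, and all numeric values are hypothetical and not taken from the paper. The sketch covers only the initialization step; per the abstract, the full method additionally freezes every parameter that feeds the shared stream.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_adapter(dim, hidden, out_scale):
    """Hypothetical new sub-module; out_scale=0.0 zero-initializes its output projection."""
    W_in = [[0.1 * (i + j + 1) for j in range(dim)] for i in range(hidden)]
    W_out = [[out_scale] * hidden for _ in range(dim)]

    def apply(x):
        h = [max(0.0, v) for v in matvec(W_in, x)]    # ReLU hidden activations
        delta = matvec(W_out, h)                       # contribution to the shared stream
        return [xi + di for xi, di in zip(x, delta)]   # residual update
    return apply

def frozen_head(x):
    """Stand-in for a weight-frozen sibling head that reads the shared stream."""
    return sum(0.5 * xi for xi in x)

stream = [0.5, -1.0, 2.0, 0.25]
baseline = frozen_head(stream)

zero_init_out = frozen_head(make_adapter(4, 2, 0.0)(stream))
nonzero_out = frozen_head(make_adapter(4, 2, 0.1)(stream))

print(zero_init_out == baseline)  # stream untouched, frozen head output unchanged
print(nonzero_out == baseline)    # stream shifted, frozen head output changes
```

With a nonzero output projection the frozen head's output moves even though none of its own weights changed, which is the interference the abstract describes. With a zero output projection the added term is exactly 0.0, so the stream and the sibling head's output are unchanged.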
The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
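For readers unfamiliar with the two tagging metrics quoted above, the sketch below computes micro-F1 and mAP on a toy multi-label problem using their standard definitions. The 0.5 decision threshold, the AP variant (mean precision at each positive's rank), and the toy labels and scores are assumptions; the summary does not state the paper's exact evaluation protocol.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all (scene, tag) pairs."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_precision(labels, scores):
    """AP for one tag: mean of precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(y_true, y_score):
    """mAP: average the per-tag AP over all tags (columns)."""
    n_tags = len(y_true[0])
    aps = [
        average_precision([row[k] for row in y_true], [row[k] for row in y_score])
        for k in range(n_tags)
    ]
    return sum(aps) / n_tags

# Toy data (hypothetical): 4 scenes, 2 tags.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.4], [0.1, 0.3]]
y_pred = [[int(s >= 0.5) for s in row] for row in y_score]  # assumed threshold

f1 = micro_f1(y_true, y_pred)
m_ap = mean_average_precision(y_true, y_score)
print(f1, m_ap)
```

Micro-F1 depends on the chosen threshold, while mAP depends only on how each tag's scores rank the scenes, which is why the two numbers can differ noticeably for the same head.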