Context-Guided Semantic Alignment for Feature Fusion Networks
Quick Answer
This paper shows that the Feature Interaction Network (FINE) improves feature fusion in object detection by refining low-level features with high-level contextual guidance.
Quick Take
The Feature Interaction Network (FINE) enhances feature fusion in object detection by refining low-level features with high-level contextual guidance. It employs Alignment-Aware Token Sampling to reduce attention complexity, improving detection accuracy while maintaining efficiency across various detectors.
Key Points
- FINE refines low-level features using high-level contextual guidance.
- Introduces Alignment-Aware Token Sampling, which reduces attention complexity by an order of magnitude.
- Improves detection accuracy without compromising computational efficiency.
- Applicable to various object detectors for enhanced performance.
- Ensures selective enhancement of semantically relevant pixels.
Article Content
From source RSS / original summary
arXiv:2606.14005v1
Abstract: Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations.
In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude.
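To give a rough feel for why sampling tokens can cut attention cost, the sketch below counts query-key pairs for dense cross-level attention and for a sampled variant. The abstract does not describe the sampling scheme, so everything here is an assumption for illustration: the map sizes (80x80 and 20x20), the rule of keeping one low-level query per high-level cell, and the helper names `attention_pairs` and `sampled_query_count`.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries in one dense attention map."""
    return num_queries * num_keys


def sampled_query_count(height: int, width: int, stride: int) -> int:
    """Queries left after keeping one token per stride x stride block."""
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")
    return (height // stride) * (width // stride)


# Hypothetical sizes (not from the paper): an 80x80 low-level map
# and a 20x20 high-level map.
low_h, low_w = 80, 80
high_h, high_w = 20, 20
stride = low_h // high_h  # 4: one sampled query per high-level cell (assumed rule)

# Dense cross-level attention: every low-level token attends to every high-level token.
full = attention_pairs(low_h * low_w, high_h * high_w)

# Sampled variant: only the retained low-level queries attend to the high-level tokens.
sampled = attention_pairs(sampled_query_count(low_h, low_w, stride), high_h * high_w)

reduction = full / sampled
print(full, sampled, reduction)  # 2560000 160000 16.0
```

Under these assumed sizes the pair count drops by a factor of 16, which is in the range of the order-of-magnitude saving the abstract reports; the actual figure depends on the paper's sampling rule and feature resolutions.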
The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.
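The upsample-then-modulate step can be sketched on a single channel as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the interpolation mode or the exact residual form, so nearest-neighbour upsampling and the update `x * (1 + m)` are assumptions, and `upsample_nearest` and `residual_modulate` are hypothetical helper names.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Repeat every cell `factor` times along both axes (assumed interpolation)."""
    out: Grid = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def residual_modulate(features: Grid, modulation: Grid) -> Grid:
    """Return features * (1 + modulation), element by element (assumed residual form)."""
    return [
        [f * (1.0 + m) for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, modulation)
    ]


# One channel of a 4x4 low-level map and a 2x2 coarse modulation map (toy values).
low_level = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
coarse_map = [[0.0, 0.5], [1.0, 0.0]]

fine_map = upsample_nearest(coarse_map, factor=2)  # now 4x4, matching low_level
refined = residual_modulate(low_level, fine_map)

print(refined[0])  # [1.0, 2.0, 4.5, 6.0]
print(refined[2])  # [2.0, 4.0, 3.0, 4.0]
```

With this residual form, a modulation value of zero leaves a pixel unchanged, so only positions the map marks as relevant are amplified while the rest of the low-level detail passes through untouched.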