Context-Guided Semantic Alignment for Feature Fusion Networks

arXiv cs.CV·Hyungseop Lee, Jiho Lee, Woochul Kang

6/15/2026

Quick Answer

This paper proposes the Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that improves feature fusion in object detectors by refining low-level features with high-level contextual guidance before fusion.

Quick Take

FINE applies cross-level attention before fusion so that high-level context refines the low-level features. Its Alignment-Aware Token Sampling aligns corresponding spatial regions across scales and cuts attention complexity by an order of magnitude. The authors report consistent accuracy gains across various detectors without compromising efficiency.

Key Points

FINE refines low-level features using high-level contextual guidance.
Introduces Alignment-Aware Token Sampling, which reduces attention complexity by an order of magnitude.
Improves detection accuracy without compromising computational efficiency.
Applicable to various object detectors for enhanced performance.
Selectively enhances semantically relevant pixels while preserving sub-pixel localization accuracy.


Article Content

From source RSS / original summary

arXiv:2606.14005v1. Abstract: Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations.

In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude.
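
To give a feel for why sampling tokens can shrink the attention cost, the sketch below counts query-key pairs for dense cross-level attention and for one possible region-aligned sampling scheme. The abstract does not spell out how Alignment-Aware Token Sampling works, so the sampling rule here (one low-level token kept per region aligned with a high-level cell), the function names, and the 80x80 and 20x20 map sizes are all assumptions made for illustration, not the authors' method.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries in one attention map."""
    return num_queries * num_keys


def dense_vs_sampled(low_hw, high_hw):
    """Compare dense cross-level attention with a region-aligned sampled variant.

    low_hw, high_hw: (height, width) of the low- and high-level feature maps.
    Assumption (not from the paper): each high-level cell covers one aligned
    block of low-level pixels, and sampling keeps one low-level token per block.
    """
    low_h, low_w = low_hw
    high_h, high_w = high_hw
    if low_h % high_h or low_w % high_w:
        raise ValueError("low-level size must be a multiple of high-level size")

    n_low = low_h * low_w
    n_high = high_h * high_w

    # Dense: every low-level pixel queries every high-level token.
    dense = attention_pairs(n_low, n_high)
    # Sampled: one query per aligned region, so queries shrink to n_high.
    sampled = attention_pairs(n_high, n_high)
    return dense, sampled, dense // sampled


dense, sampled, ratio = dense_vs_sampled(low_hw=(80, 80), high_hw=(20, 20))
print(dense, sampled, ratio)
```

Under these assumptions the saving equals the squared stride between the two levels, so a stride of 4 gives a 16x reduction in score entries. The actual factor in FINE depends on details the abstract does not give.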

The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.
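
The modulation step can be sketched in a few lines of NumPy. The abstract does not give the exact formula, so this example assumes the residual form `out = low + low * upsample(mod)` with nearest-neighbour upsampling; both choices, along with the function names and tensor shapes, are hypothetical and chosen only to show how a coarse map can rescale fine features while leaving an identity path.

```python
import numpy as np


def upsample_nearest(m: np.ndarray, scale: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, h, w) map to (C, h*scale, w*scale)."""
    return m.repeat(scale, axis=1).repeat(scale, axis=2)


def residual_modulate(low: np.ndarray, mod: np.ndarray) -> np.ndarray:
    """Apply a coarse spatial-channel modulation map to low-level features.

    low: (C, H, W) low-level features.
    mod: (C, h, w) modulation map; H and W must be the same integer multiple
         of h and w.
    Assumed form (not confirmed by the paper): out = low + low * upsample(mod).
    """
    C, H, W = low.shape
    c, h, w = mod.shape
    if c != C or H % h or W % w or H // h != W // w:
        raise ValueError("mod must match low in channels and divide its size evenly")
    scale = H // h
    return low + low * upsample_nearest(mod, scale)


# Toy example: boost only the top-left region of channel 0 by 50%.
low = np.ones((2, 4, 4))
mod = np.zeros((2, 2, 2))
mod[0, 0, 0] = 0.5
out = residual_modulate(low, mod)
print(out[0])
```

With an all-zero modulation map the output equals the input, so the residual path leaves the original low-level features untouched wherever no enhancement is requested.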

