Feature fusion networks are fundamental components of modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency because the levels carry heterogeneous representations. In this paper, we propose the Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features with high-level contextual guidance through cross-level attention before fusion. To bridge the structural gap between levels while keeping computation efficient, we introduce Alignment-Aware Token Sampling, which matches corresponding spatial regions across scales and reduces the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism lets the network selectively enhance semantically relevant pixels while preserving the sub-pixel localization accuracy required for dense prediction tasks. FINE is applicable to a variety of detectors and consistently improves detection accuracy without compromising efficiency.
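The pipeline described above (token sampling, cross-level attention, upsampled modulation map, residual element-wise modulation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`fine_sketch`, `avg_pool`, `upsample_nearest`) are hypothetical, average pooling stands in for Alignment-Aware Token Sampling, a sigmoid gate stands in for the modulation map, and the learned query/key/value projections that a real module would use are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool(x, s):
    """Average-pool a (C, H, W) map by integer stride s (H and W divisible by s)."""
    c, h, w = x.shape
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def upsample_nearest(x, s):
    """Nearest-neighbour upsampling of a (C, H, W) map by integer factor s."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def fine_sketch(low, high):
    """Refine low-level features (C, H, W) using high-level features (C, H/s, W/s)."""
    c, h, w = low.shape
    _, hh, wh = high.shape
    s = h // hh  # scale ratio between the two pyramid levels

    # ASSUMPTION: token sampling is approximated by pooling the low-level map
    # onto the high-level grid, so each query summarises the spatial region
    # that corresponds to one high-level location.
    q = avg_pool(low, s).reshape(c, -1).T  # (N, C) queries, N = hh * wh
    k = v = high.reshape(c, -1).T          # (N, C) keys and values

    # Cross-level attention over N x N token pairs instead of (H*W) x N.
    attn = softmax(q @ k.T / np.sqrt(c))   # (N, N)
    ctx = attn @ v                         # (N, C) high-level context per region

    # ASSUMPTION: a sigmoid turns the context into a spatial-channel gate in (0, 1).
    mod = 1.0 / (1.0 + np.exp(-ctx))
    mod = mod.T.reshape(c, hh, wh)

    # Upsample the gate to the low-level resolution and apply it residually,
    # so the original low-level values are always retained.
    return low + low * upsample_nearest(mod, s)

rng = np.random.default_rng(0)
low = rng.standard_normal((8, 16, 16))   # finer pyramid level
high = rng.standard_normal((8, 8, 8))    # coarser pyramid level
out = fine_sketch(low, high)
print(out.shape)  # (8, 16, 16)
```

In this sketch the number of query tokens drops by a factor of s² (4 for s = 2), which shows where a saving can come from but does not reproduce the order-of-magnitude reduction reported above. The residual form `low + low * gate` keeps every low-level value and only rescales it, which is one way to read the claim that localization detail is preserved.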