Transformer-based architectures have become the dominant paradigm for global semantic perception, but they remain constrained by the spatial heterogeneity of natural images. Applying a uniform global receptive field across regions of varying information density degrades local features, particularly in dense conflict zones populated by very small targets. To address this limitation, we propose ViCrop-Det, a training-free inference framework that adaptively shrinks the spatial trust region. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det uses the detection decoder's cross-attention distribution as an endogenous probe. It computes Spatial Attention Entropy (SAE) as a heuristic measure of local spatial ambiguity and uses it for dynamic spatial routing, allocating a fixed computational budget only to regions that exhibit both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency local observations, ViCrop-Det resolves spatial ambiguity and recovers fine-grained features without architectural modifications. On VisDrone and DOTA-v1.5, ViCrop-Det consistently improves RT-DETR-R50 and Deformable DETR by 1 to 3 mAP@50 at a latency overhead of 20--23\%. On MS COCO, $AP_{S}$ improves while $AP_{M}$ and $AP_{L}$ remain stable, indicating fine-scale refinement that does not compromise the global spatial prior. Under compute-matched settings, the adaptive routing strategy outperforms uniform slicing baselines, yielding a better accuracy-speed trade-off.
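To make the routing idea concrete, the following is a minimal sketch of entropy-guided crop selection. It is not the ViCrop-Det implementation: the abstract does not specify how SAE is computed or how saliency and uncertainty are combined, so this sketch assumes that SAE is the normalized Shannon entropy of the attention mass inside a grid cell, that saliency is the cell's share of total attention, and that the two are combined by a simple product. The function names, the grid partition, and the `budget` parameter are all hypothetical.

```python
import numpy as np


def spatial_attention_entropy(attn, eps=1e-12):
    """Normalized Shannon entropy of a non-negative attention map.

    Assumed definition (not taken from the paper): values near 1 mean
    attention is spread evenly (ambiguous), values near 0 mean it is peaked.
    """
    p = np.asarray(attn, dtype=np.float64).ravel()
    p = p / (p.sum() + eps)
    h = -np.sum(p * np.log(p + eps))
    return float(h / np.log(p.size)) if p.size > 1 else 0.0


def select_crops(attn, grid=(4, 4), budget=2):
    """Return up to `budget` grid cells as (x0, y0, x1, y1) boxes.

    Cells are ranked by saliency * entropy, a hypothetical scoring rule that
    favors cells that are both strongly attended and ambiguous.
    """
    attn = np.asarray(attn, dtype=np.float64)
    height, width = attn.shape
    rows, cols = grid
    total = attn.sum() + 1e-12
    scored = []
    for i in range(rows):
        for j in range(cols):
            y0, y1 = i * height // rows, (i + 1) * height // rows
            x0, x1 = j * width // cols, (j + 1) * width // cols
            cell = attn[y0:y1, x0:x1]
            saliency = cell.sum() / total          # share of attention mass
            entropy = spatial_attention_entropy(cell)  # local ambiguity
            scored.append((saliency * entropy, (x0, y0, x1, y1)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [box for _, box in scored[:budget]]
```

In a full pipeline, the selected boxes would be cropped from the input image, re-run through the detector at higher effective resolution, and merged with the global predictions; those steps are outside the scope of this sketch.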