Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both architecture families struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs: it ranks among the top performers on four general-purpose and three aerial tracking benchmarks while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
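The quadratic dependence on channel width can be illustrated with a small cost model. The following is a minimal sketch under stated assumptions, not the tracker's actual fusion module or search procedure: it assumes a two-layer channel-mixing MLP applied at every pixel, a hypothetical expansion ratio of 4, and a hypothetical 16x16 feature map. The `expected_macs` helper shows a generic softmax relaxation over candidate widths, as commonly used in differentiable architecture search; the customized relaxation strategy of this work is not specified here and may differ.

```python
import math


def mlp_macs(channels: int, height: int, width: int, expansion: int = 4) -> int:
    """Multiply-accumulates of a two-layer channel MLP applied at every pixel.

    Assumed layout (hypothetical): channels -> channels * expansion -> channels.
    """
    hidden = channels * expansion
    per_pixel = channels * hidden + hidden * channels  # = 2 * expansion * channels**2
    return height * width * per_pixel


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_macs(width_logits, candidate_widths, height, width):
    """Expected cost under a softmax relaxation over candidate channel widths.

    Generic DNAS-style cost term (an assumption, not the paper's strategy):
    each candidate width is weighted by its softmax probability.
    """
    probs = softmax(width_logits)
    return sum(p * mlp_macs(c, height, width) for p, c in zip(probs, candidate_widths))


if __name__ == "__main__":
    narrow = mlp_macs(64, 16, 16)
    wide = mlp_macs(128, 16, 16)
    # Doubling the channel width quadruples the cost of the MLP block.
    print(narrow, wide, wide // narrow)
    # With equal logits, the expected cost is the mean of the candidate costs.
    print(expected_macs([0.0, 0.0], [64, 128], 16, 16))
```

Because the cost term grows with the square of the width, a search that treats width as just another discrete choice tends to be dominated by it, which is one plausible reading of why the abstract separates channel-width optimization from the remaining architectural choices.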