Recent advances in transformer architectures have prompted research into stereo transformers as a potential solution to binocular stereo matching. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. Their lack of focus on key corresponding (homonymous) points leaves their representations vulnerable to challenging conditions such as reflections and weak textures, and their slow inference speed hinders practical deployment. To overcome these difficulties, we present the Hadamard Attention Recurrent Stereo Transformer (HART), which incorporates the following components: 1) For faster inference, we introduce a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We design a Dense Attention Kernel (DAK) that amplifies the differences between relevant and irrelevant feature responses, allowing HART to focus on important details; DAK also converts zero elements to non-zero elements, mitigating the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing from the Hadamard product, we propose the MKOI module, which captures both global and local information by interleaving large- and small-kernel convolutions. Experimental results demonstrate the effectiveness of HART: in reflective areas, it ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
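To make the complexity claim concrete, the following is a minimal sketch of attention built from a Hadamard (element-wise) product rather than a full query-key similarity matrix. The `dak` kernel here is a hypothetical stand-in (a simple `exp`), chosen only because it maps zero inputs to non-zero responses as the abstract describes; the actual HART formulation of DAK and its normalization are not specified here and are assumptions, not the paper's code.

```python
import numpy as np

def dak(x):
    # Hypothetical dense-kernel stand-in: exp() maps every element,
    # including zeros, to a non-zero response, echoing the abstract's
    # claim that DAK mitigates the low-rank bottleneck this way.
    return np.exp(x)

def hadamard_attention(q, k, v):
    """Sketch of attention via a Hadamard (element-wise) product.

    q, k, v: (N, d) token features. The element-wise product replaces
    the N x N similarity matrix, so the cost is O(N * d) -- linear in
    the number of tokens N rather than quadratic.
    """
    a = dak(q) * dak(k)   # (N, d) per-element attention responses
    return a * v          # modulate values element-wise
```

Because no `N x N` matrix is ever formed, memory and compute both scale linearly with the number of tokens, which is the speed advantage the abstract claims; the missing cross-token (spatial) and cross-channel mixing is what the MKOI convolutions are said to restore.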