In light of advances in transformer technology, existing research posits stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and the quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. Their lack of focus on key homonymous points leaves their representations vulnerable to challenging conditions such as reflections and weak textures, and slow inference further hinders practical application. To overcome these difficulties, we present the \textbf{H}adamard \textbf{A}ttention \textbf{R}ecurrent Stereo \textbf{T}ransformer (HART), which incorporates the following components: 1) For faster inference, we present a Hadamard-product paradigm for the attention mechanism that achieves linear computational complexity. 2) We design a Dense Attention Kernel (DAK) that amplifies the difference between relevant and irrelevant feature responses, allowing HART to focus on important details; DAK also maps zero elements to non-zero values, mitigating the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing from the Hadamard product, we propose MKOI, which captures both global and local information by interleaving large- and small-kernel convolutions. Experimental results demonstrate the effectiveness of HART: in reflective areas, it ranked \textbf{1st} on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at \url{https://github.com/ZYangChen/HART}.
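To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of the Hadamard-product attention idea. It is an illustrative assumption, not the paper's exact formulation: the function name is hypothetical, and a simple exponential stands in for the Dense Attention Kernel (DAK), whose only property used here is that it keeps every attention response dense (non-zero).

```python
import numpy as np

def hadamard_attention(q, k, v):
    """Hadamard-product attention sketch (illustrative, not HART's exact DAK).

    q, k, v: arrays of shape (N, d). The element-wise product q * k replaces
    the N x N similarity matrix softmax(q @ k.T), so the cost is O(N * d)
    rather than O(N^2 * d).
    """
    s = q * k                # (N, d): never materializes an N x N matrix
    w = np.exp(s - s.max())  # strictly positive (dense) weights; a stand-in
                             # for DAK's zero-to-non-zero mapping
    return w * v             # (N, d)
```

Because no pairwise token-to-token matrix is formed, this variant trades the full spatial interaction of standard attention for linear cost, which is what motivates the additional MKOI convolutions described above.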