Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
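The core mechanism described above can be illustrated with a minimal sketch. This is an assumed reading of the abstract, not the paper's implementation: queries and keys are L2-normalized (directional alignment), a learnable diagonal vector `spectral_diag` reweights each channel before the dot product (spectral modulation), and an entropy indicator of the kind the paper mentions is computed from the resulting attention map. The function name, normalization scheme, and temperature parameter are all hypothetical.

```python
import numpy as np

def svda_attention(Q, K, V, spectral_diag, temperature=1.0):
    """Sketch of SVD-inspired attention: normalized query-key alignment
    modulated by a learnable diagonal. Assumed formulation, not the paper's."""
    # Directional alignment: row-wise L2 normalization of queries and keys.
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    # Spectral modulation: the learnable diagonal rescales each channel
    # of the normalized query before the key interaction.
    scores = (Qn * spectral_diag) @ Kn.T / temperature
    # Standard numerically stable softmax over keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def attention_entropy(weights, eps=1e-12):
    """One of the six indicator families named in the abstract: per-query
    Shannon entropy of the attention distribution (assumed definition)."""
    return -(weights * np.log(weights + eps)).sum(axis=-1)

# Toy usage: 4 tokens, 8 channels, identity diagonal recovers
# plain cosine-similarity attention.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out, attn = svda_attention(Q, K, V, np.ones(d))
ent = attention_entropy(attn)
```

With an identity diagonal the mechanism reduces to cosine attention; training the diagonal lets the model amplify or suppress individual spectral directions, which is what makes the attention map directly inspectable rather than requiring post-hoc attribution.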