Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit complexity: standard transformers are $\mathsf{TC}^0$-complete and cannot solve graph connectivity in constant depth, implying $\Omega(k)$ layers are necessary for $k$-hop reasoning regardless of model size or training data. We introduce RASA (Relation-Aware Sparse Attention), a minimal architectural modification that provides structural inductive bias for relational reasoning. RASA adds: (1) sparse adjacency masking that restricts attention to graph-connected positions, reducing the attention pattern search space from $O(2^{n^2})$ to $O(2^m)$ for graphs with $m$ edges; and (2) learnable edge-type biases that encode relation-specific attention preferences. While RASA does not circumvent asymptotic depth requirements, the exponential reduction in attention pattern space provides stronger inductive bias for learning graph-structured functions. Empirically, on the MetaQA knowledge graph QA benchmark, RASA achieves 97.7% accuracy on 3-hop questions, outperforming EmbedKGQA (94.8%) by 2.9 percentage points. Notably, RASA's advantage grows with reasoning depth, validating that structural inductive bias is most beneficial for complex multi-hop queries. Our results demonstrate that minimal architectural modifications, grounded in complexity-theoretic analysis, can substantially improve multi-hop reasoning.
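The two mechanisms described above — adjacency masking and learnable edge-type biases — can be sketched in a single attention step. This is a minimal single-head NumPy sketch, not the authors' implementation; the function name `rasa_attention` and its signature are illustrative, and we assume the adjacency matrix includes self-loops so every row has at least one attendable position.

```python
import numpy as np

def rasa_attention(q, k, v, adj, edge_type, type_bias):
    """Illustrative single-head attention with RASA-style modifications.

    q, k, v   : (n, d) query/key/value matrices for n positions
    adj       : (n, n) boolean adjacency mask; True where attention is
                allowed (assumed to include self-loops)
    edge_type : (n, n) integer edge-type ids (read only where adj is True)
    type_bias : (num_types,) learnable scalar bias per relation type
    Returns (output, weights), where weights is the (n, n) attention matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # standard scaled dot-product logits
    scores = scores + type_bias[edge_type]   # (2) relation-specific attention bias
    scores = np.where(adj, scores, -np.inf)  # (1) sparse adjacency masking
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over connected positions only
    return w @ v, w
```

Because non-adjacent positions receive a logit of $-\infty$, their softmax weight is exactly zero, so each position mixes information only from its graph neighbors; the per-type bias lets the model prefer or suppress specific relations before normalization.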