The dot-product attention mechanism, originally designed for natural language processing (NLP) tasks, is a cornerstone of modern Transformers. It captures semantic relationships between word pairs in a sentence by computing the overlap between queries and keys as a measure of similarity. In this work, we explore the suitability of Transformers, focusing on their attention mechanisms, for parametrizing variational wave functions that approximate the ground states of quantum many-body spin Hamiltonians. Specifically, we perform numerical simulations on the two-dimensional $J_1$-$J_2$ Heisenberg model, a common benchmark in the field of quantum many-body systems on a lattice. Comparing the standard attention mechanism with a simplified version that excludes queries and keys and relies solely on positions, we find that the simplified version achieves competitive results while reducing computational cost and parameter usage. Furthermore, by analyzing the attention maps generated by the standard attention mechanism, we show that the attention weights become effectively input-independent at the end of the optimization. We support the numerical results with analytical calculations, providing physical insight into why queries and keys should, in principle, be omitted from the attention mechanism when studying large systems. Interestingly, the same arguments can be extended to the NLP domain in the limit of long input sentences.
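To make the comparison concrete, the following is a minimal single-head sketch of the two variants in plain Python. It is illustrative only and is not the implementation used in this work: the function names, the table `pos_weights`, and the choice to leave the positional weights unnormalized (no softmax) are all assumptions made for the example. In the standard variant the attention weights are computed from the input through queries and keys; in the simplified variant they are a learnable table indexed only by the pair of positions.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Gaussian random matrix as a list of rows (hypothetical helper)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x, w):
    """Multiply each row of x (n x d_in) by w (d_in x d_out)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(r - m) for r in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, v):
    """out_i = sum_j weights[i][j] * v_j."""
    return [[sum(weights[i][j] * v[j][c] for j in range(len(v)))
             for c in range(len(v[0]))] for i in range(len(weights))]


def standard_attention(x, wq, wk, wv):
    """Dot-product attention: the weights depend on the input x."""
    q, k, v = project(x, wq), project(x, wk), project(x, wv)
    d = len(q[0])
    weights = [softmax([sum(qi[c] * kj[c] for c in range(d)) / math.sqrt(d)
                        for kj in k]) for qi in q]
    return mix(weights, v), weights


def positional_attention(x, pos_weights, wv):
    """Simplified attention (assumed form): the weights are an n x n
    table of parameters that depends only on positions (i, j), so no
    queries or keys are computed."""
    return mix(pos_weights, project(x, wv))
```

In this sketch, calling `standard_attention` on two different inputs with the same parameters yields two different attention maps, whereas `positional_attention` applies the same `pos_weights` to every input and is therefore linear in `x`. It also needs neither the query and key projections nor the pairwise query-key overlaps.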