Vision Transformers (ViTs) excel at semantic recognition but fail systematically on spatial reasoning tasks such as mental rotation. While these failures are often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a Group Homomorphism Problem, in which latent embeddings must preserve the algebraic structure of the physical transformations acting on images, we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is at least as hard as the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision can compute only functions in the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To validate this theoretical gap empirically, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
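To make the Word Problem concrete, the following is a minimal sketch rather than the paper's construction. It assumes the finite symmetric group $S_5$ (a standard non-solvable group) as a stand-in for $\mathrm{SO}(3)$, and the helper names `compose`, `inverse`, `word_is_identity`, and `random_identity_word` are hypothetical. Given a word $g_1 g_2 \cdots g_n$, the task is to decide whether the product equals the identity; the word length plays the role of compositional depth. How the LSA benchmark actually builds its tasks is not specified here.

```python
import random
from itertools import permutations

# A permutation of {0, ..., 4} is stored as a tuple of images.
# ASSUMPTION: S_5 is used as an illustrative non-solvable group.
IDENTITY = (0, 1, 2, 3, 4)


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def word_is_identity(word):
    """Decide the word problem by multiplying the word out left to right."""
    acc = IDENTITY
    for g in word:
        acc = compose(acc, g)
    return acc == IDENTITY


def random_identity_word(depth, rng):
    """Sample `depth` random elements and append the inverse of their product,
    giving a word of length depth + 1 that multiplies to the identity."""
    elements = list(permutations(range(5)))
    prefix = [rng.choice(elements) for _ in range(depth)]
    acc = IDENTITY
    for g in prefix:
        acc = compose(acc, g)
    return prefix + [inverse(acc)]


if __name__ == "__main__":
    rng = random.Random(0)
    word = random_identity_word(8, rng)
    print(len(word), word_is_identity(word))
```

The sequential loop in `word_is_identity` runs in time linear in the word length. The complexity claim in the abstract concerns whether a constant-depth circuit can shortcut this composition, not whether the product is hard to compute step by step.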