Vision Transformers (ViTs) excel at semantic recognition but exhibit systematic failures on spatial reasoning tasks such as mental rotation. While these failures are often attributed to data scale, we propose that the limitation arises from the intrinsic circuit complexity of the architecture. We formalize spatial understanding as learning a group homomorphism: mapping image sequences to a latent space that preserves the algebraic structure of the underlying transformation group. We demonstrate that for non-solvable groups (e.g., the 3D rotation group $\mathrm{SO}(3)$), maintaining such a structure-preserving embedding is computationally lower-bounded by the word problem, which is $\mathsf{NC^1}$-complete. In contrast, we prove that constant-depth ViTs with polynomial precision are strictly bounded by $\mathsf{TC^0}$. Under the conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, we establish a complexity boundary: constant-depth ViTs fundamentally lack the logical depth to efficiently capture non-solvable spatial structures. We validate this complexity gap via latent-space probing, demonstrating that ViT representations undergo structural collapse on non-solvable tasks as compositional depth increases.
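For concreteness, the structure-preserving condition referenced above can be stated as follows; this is a minimal formalization in illustrative notation (the encoder $\Phi$ and latent action $\rho$ are assumed symbols, not necessarily those used in the body of the paper):
\[
\Phi(g \cdot x) \;=\; \rho(g)\,\Phi(x),
\qquad
\rho(g_1 g_2) \;=\; \rho(g_1)\,\rho(g_2)
\quad \text{for all } g_1, g_2 \in G,\; x \in \mathcal{X},
\]
so that composing transformations in image space corresponds to composing their latent representatives. In particular, evaluating a length-$n$ composition $\rho(g_1)\cdots\rho(g_n)$ amounts to an instance of the word problem over $G$, which is where the $\mathsf{NC^1}$ lower bound enters for non-solvable $G$.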