HexFormer: Hyperbolic Vision Transformer with Exponential Map Aggregation

Data across modalities such as images, text, and graphs often contain hierarchical and relational structures, which are challenging to model within Euclidean geometry. Hyperbolic geometry provides a natural framework for representing such structures. Building on this property, this work introduces HexFormer, a hyperbolic vision transformer for image classification that incorporates exponential map aggregation within its attention mechanism. Two designs are explored: a fully hyperbolic ViT (HexFormer) and a hybrid variant (HexFormer-Hybrid) that combines a hyperbolic encoder with a Euclidean linear classification head. The exponential map aggregation yields more accurate and stable aggregated representations than standard centroid-based averaging, showing that simpler approaches retain competitive merit. Experiments across multiple datasets demonstrate consistent performance improvements over Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant achieving the strongest overall results. This study also analyzes gradient stability in hyperbolic transformers. The results reveal that hyperbolic models exhibit more stable gradients and reduced sensitivity to warmup strategies compared to Euclidean architectures, highlighting their robustness and efficiency in training. Overall, these findings indicate that hyperbolic geometry can enhance vision transformer architectures by improving both gradient stability and accuracy, and that relatively simple mechanisms such as exponential map aggregation can provide strong practical benefits.
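
To make the idea of exponential map aggregation concrete, the following is a minimal NumPy sketch of one plausible reading of it, not the authors' implementation. It assumes the Lorentz (hyperboloid) model with curvature -1, attention scores given by negative geodesic distance, and aggregation performed in the tangent space at the origin before mapping back with the exponential map. None of these choices is stated in the abstract, and all function names (exp_map_origin, log_map_origin, lorentz_inner, exp_map_attention) are hypothetical.

```python
import numpy as np

def exp_map_origin(u, eps=1e-9):
    """Map tangent vectors u (..., n) at the origin onto the hyperboloid (..., n+1)."""
    norm = np.linalg.norm(u, axis=-1, keepdims=True)
    time = np.cosh(norm)
    space = np.sinh(norm) * u / np.maximum(norm, eps)
    return np.concatenate([time, space], axis=-1)

def log_map_origin(x, eps=1e-9):
    """Map hyperboloid points x (..., n+1) to tangent vectors (..., n) at the origin."""
    space = x[..., 1:]
    norm = np.linalg.norm(space, axis=-1, keepdims=True)
    dist = np.arccosh(np.clip(x[..., :1], 1.0, None))
    return dist * space / np.maximum(norm, eps)

def lorentz_inner(x, y):
    """Pairwise Lorentzian inner products: x (m, n+1), y (p, n+1) -> (m, p)."""
    return -np.outer(x[:, 0], y[:, 0]) + x[:, 1:] @ y[:, 1:].T

def exp_map_attention(q, k, v):
    """Single-head attention over hyperboloid points (assumed formulation)."""
    # Assumption: scores are negative geodesic distances between queries and keys.
    dist = np.arccosh(np.clip(-lorentz_inner(q, k), 1.0, None))
    scores = -dist
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Aggregate values in the tangent space at the origin, then map back.
    tangent = weights @ log_map_origin(v)
    return exp_map_origin(tangent)

# Usage: four random tokens on the hyperboloid in 3 spatial dimensions.
rng = np.random.default_rng(0)
tokens = exp_map_origin(rng.normal(scale=0.5, size=(4, 3)))
out = exp_map_attention(tokens, tokens, tokens)
```

Because the last step is an exponential map, the output lies on the hyperboloid by construction. A centroid-based alternative would instead average the points in the ambient space and renormalize them onto the manifold.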
