The unsupervised task of Joint Alignment (JA) of images is beset by challenges such as high computational complexity, geometric distortions, and convergence to poor local (or even global) optima. Although Vision Transformers (ViTs) have recently provided valuable features for JA, they fall short of fully addressing these issues. Consequently, researchers frequently depend on expensive models and numerous regularization terms, resulting in long training times and challenging hyperparameter tuning. We introduce the Spatial Joint Alignment Model (SpaceJAM), a novel approach that addresses the JA task with efficiency and simplicity. SpaceJAM leverages a compact architecture with only 16K trainable parameters and uniquely operates without the need for regularization or atlas maintenance. Evaluations on the SPair-71K and CUB datasets demonstrate that SpaceJAM matches the alignment capabilities of existing methods while significantly reducing computational demands and achieving at least a 10x speedup. SpaceJAM sets a new standard for rapid and effective image alignment, making the process more accessible and efficient. Our code is available at: https://bgu-cs-vil.github.io/SpaceJAM/.