Recently, heatmap regression methods based on 1D landmark representations have shown strong performance in locating facial landmarks. However, previous methods have not explored in depth the potential of 1D landmark representations for sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and geometric patterns of landmarks through token communication in both the temporal and spatial dimensions. For temporal modeling, we propose a recurrent token mixing mechanism, an axis-landmark-positional embedding mechanism, and a confidence-enhanced multi-head attention mechanism to adaptively and robustly embed long-term landmark dynamics into their 1D representations. For structure modeling, we design intra-group and inter-group structure modeling mechanisms that encode component-level and global-level facial structure patterns to refine the 1D representations of landmarks, with token communication in the spatial dimension carried out by 1D convolutional layers. Experimental results on the 300VW and TF databases show that 1DFormer successfully models long-range sequential patterns as well as inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking.
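For readers unfamiliar with the term, the following minimal sketch illustrates what a 1D landmark representation can look like: each landmark is described by two 1D heatmaps, one per image axis, rather than by a single 2D heatmap. It is not the 1DFormer model. The Gaussian encoding, the argmax decoding, and the names `encode_1d` and `decode_1d` are assumptions made only for illustration.

```python
import numpy as np

def encode_1d(coords, size, sigma=2.0):
    """Turn (N, 2) landmark coordinates into two (N, size) 1D heatmaps.

    Hypothetical helper: a Gaussian bump is placed along each axis,
    centred on the landmark's x or y coordinate.
    """
    bins = np.arange(size, dtype=np.float64)  # positions 0 .. size-1
    hx = np.exp(-0.5 * ((bins[None, :] - coords[:, 0:1]) / sigma) ** 2)
    hy = np.exp(-0.5 * ((bins[None, :] - coords[:, 1:2]) / sigma) ** 2)
    return hx, hy

def decode_1d(hx, hy):
    """Recover integer (x, y) positions as the argmax of each 1D heatmap."""
    return np.stack([hx.argmax(axis=1), hy.argmax(axis=1)], axis=1)

# Two example landmarks on a 64 x 64 image (made-up coordinates).
landmarks = np.array([[10.0, 20.0], [33.0, 47.0]])
hx, hy = encode_1d(landmarks, size=64)
recovered = decode_1d(hx, hy)
```

Under these assumptions, each landmark contributes one vector per axis, so a sequence of frames yields a grid of 1D vectors indexed by time, landmark, and axis. Such vectors are the kind of tokens over which temporal and spatial mixing could then operate.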