In this paper, we examine three important issues in the practical use of state-of-the-art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require face normalization as a preprocessing step, which is accomplished by a separately trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre-trained network performs the optimal face normalization for landmark detection. We instead analyze the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, so that optimal face normalization and landmark detection are learned jointly. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space can further improve accuracy. To convert the predicted 3D landmarks into screen space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this makes it possible to predict the 3D face shape from a given image using only 2D landmarks as supervision, which is useful for determining landmark visibility, among other things. Finally, when a landmark detector is trained on multiple datasets at the same time, annotation inconsistencies across datasets force the network to produce a suboptimal average. We propose adding a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state of the art on standard benchmarks.
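The conversion from canonical 3D landmarks to screen space mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera with a single focal length and a principal point, and a head pose given as Euler angles plus a translation. The abstract does not specify the paper's actual parameterization, so the function names (`rotation_from_euler`, `project_landmarks`), the angle convention, and all numeric values below are hypothetical and are not the authors' implementation.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw, pitch, roll):
    """Build R = Rz(roll) * Ry(yaw) * Rx(pitch); angles in radians.

    The axis order is an assumption made for this sketch only.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def project_landmarks(points_3d, rotation, translation, focal, principal_point):
    """Project canonical 3D landmarks to 2D screen coordinates.

    Each point is first moved into camera space with the head pose
    (rotation, translation), then projected with pinhole intrinsics
    (focal length and principal point).
    """
    cx, cy = principal_point
    projected = []
    for p in points_3d:
        cam = [sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
               for i in range(3)]
        if cam[2] <= 0.0:
            raise ValueError("landmark lies behind the camera")
        u = focal * cam[0] / cam[2] + cx
        v = focal * cam[1] / cam[2] + cy
        projected.append((u, v))
    return projected


if __name__ == "__main__":
    # Hypothetical canonical landmarks (arbitrary units) and camera values.
    landmarks = [(0.0, 0.0, 0.0), (0.5, -0.5, 0.0)]
    pose = rotation_from_euler(0.0, 0.0, 0.0)
    print(project_landmarks(landmarks, pose, (0.0, 0.0, 2.0), 100.0, (64.0, 64.0)))
```

In a setup of this kind, a training loss would compare the projected 2D points with the 2D annotations. This is consistent with the abstract's statement that only 2D landmarks are needed as supervision, although the exact loss used in the paper is not given there.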