SegFace: Face Segmentation of Long-Tail Classes

Face parsing is the semantic segmentation of human faces into key facial regions such as the eyes, nose, and hair. It is a prerequisite for applications including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes, called long-tail classes, are overshadowed by more frequent classes, known as head classes. Existing methods, primarily CNN-based, tend to be dominated by head classes during training, resulting in suboptimal representations for long-tail classes. Previous works have largely overlooked the poor segmentation performance on long-tail classes. To address this issue, we propose SegFace, a simple and efficient approach built on a lightweight transformer-based model with learnable class-specific tokens. In the transformer decoder, each token focuses on its corresponding class, which enables independent modeling of each class. The proposed approach improves performance on long-tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low-compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset. Code: https://github.com/Kartik-3004/SegFace
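To illustrate why long-tail classes weigh so heavily on a class-averaged metric, the following is a minimal sketch of a mean F1 computation on a toy label map. The class names, pixel counts, and the helper `per_class_f1` are hypothetical and chosen only for illustration; the exact evaluation protocol used for CelebAMask-HQ and LaPa (for example, whether background is excluded) is not specified here and may differ.

```python
def per_class_f1(pred, target, num_classes):
    """Return one F1 score per class from two flat lists of class labels.

    Classes absent from both prediction and ground truth yield None
    (an assumed convention; evaluation protocols differ on this point).
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else None)
    return scores


# Toy image of 100 pixels with hypothetical classes:
# 0 = skin (head class), 1 = hair (head class), 2 = earring (long-tail class).
target = [0] * 60 + [1] * 36 + [2] * 4

# A prediction that gets every head-class pixel right but labels
# all four earring pixels as skin.
pred = [0] * 60 + [1] * 36 + [0] * 4

f1 = per_class_f1(pred, target, num_classes=3)
valid = [s for s in f1 if s is not None]
mean_f1 = sum(valid) / len(valid)
accuracy = sum(1 for p, t in zip(pred, target) if p == t) / len(target)

print(f"pixel accuracy: {accuracy:.2f}")
print(f"per-class F1:   {[round(s, 3) for s in valid]}")
print(f"mean F1:        {mean_f1:.3f}")
```

In this toy case, pixel accuracy is 0.96 while mean F1 drops to about 0.656, because the missed long-tail class contributes an F1 of 0 with the same weight as each head class. This is the sense in which improving long-tail classes raises the overall mean F1.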
