In this work, we introduce FaceXformer, an end-to-end unified transformer model for a comprehensive range of facial analysis tasks: face parsing, landmark detection, head pose estimation, attribute recognition, and estimation of age, gender, race, and landmark visibility. Conventional face analysis methods have often relied on task-specific designs and preprocessing techniques, which limits their integration into a unified architecture. Unlike these methods, FaceXformer leverages a transformer-based encoder-decoder architecture in which each task is treated as a learnable token, enabling multiple tasks to be integrated within a single framework. Moreover, we propose a parameter-efficient decoder, FaceX, which jointly processes face and task tokens, thereby learning generalized and robust face representations across different tasks. To the best of our knowledge, this is the first work to propose a single transformer-based model capable of handling all of these facial analysis tasks. We conduct a comprehensive analysis of effective backbones for unified face task processing and evaluate different task queries and the synergy between them. We compare against state-of-the-art specialized models and previous multi-task models in both intra-dataset and cross-dataset evaluations across multiple benchmarks. Additionally, our model effectively handles images "in-the-wild," demonstrating its robustness and generalizability across eight different tasks, all while maintaining real-time performance of 37 FPS.
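To make the task-as-token idea concrete, the following is a minimal sketch in plain Python, not the FaceX decoder itself. It assumes one token per task, a single attention head with no learned projections, random values standing in for learned parameters and backbone features, and a two-step update order; the names `TASKS`, `attend`, and `decode`, as well as the dimensions, are hypothetical and chosen only for illustration.

```python
import math
import random

# One token per task; the eight tasks are those listed in the abstract.
TASKS = ["parsing", "landmarks", "head_pose", "attributes",
         "age", "gender", "race", "visibility"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(keys_values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out


def decode(task_tokens, face_tokens):
    """Jointly update task and face tokens (assumed, simplified ordering)."""
    # Task tokens gather information from the face tokens ...
    new_task = attend(task_tokens, face_tokens)
    # ... and face tokens are then refined using the updated task tokens.
    new_face = attend(face_tokens, new_task)
    return new_task, new_face


rng = random.Random(0)
dim, num_patches = 16, 49  # hypothetical sizes
# Random stand-ins for learned task tokens and encoder (backbone) features.
task_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in TASKS]
face_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

task_out, face_out = decode(task_tokens, face_tokens)
task_embeddings = dict(zip(TASKS, task_out))  # one embedding per task
```

In a full model, each entry of `task_embeddings` would feed a small task-specific head (for example, a regressor for head pose or a classifier for attributes), so adding a task amounts to adding one token and one head rather than a separate network.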