Image-text-driven multi-modal deep learning models have demonstrated outstanding potential in many fields, and in practice, tasks centered on facial images have broad application prospects. This paper presents \textbf{FaceCaption-15M}, a large-scale, diverse, and high-quality dataset of facial images paired with natural language descriptions (facial image-text pairs), intended to facilitate research on face-centered tasks. FaceCaption-15M comprises more than 15 million pairs of facial images and natural language descriptions of facial features, making it the largest facial image-caption dataset to date. We conducted a comprehensive analysis of image quality, text naturalness, text complexity, and text-image relevance to demonstrate the superiority of FaceCaption-15M. To validate its effectiveness, we first trained a facial language-image pre-training model (FLIP, similar to CLIP) to align facial images with their corresponding captions in feature space. Then, using the resulting image and text encoders and fine-tuning only a linear layer, our FLIP-based models achieved state-of-the-art results on two challenging face-centered tasks. We release FaceCaption-15M to promote research on face-related tasks. All data, code, and models are publicly available at https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M
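The CLIP-style alignment that FLIP performs can be illustrated with a minimal sketch of the symmetric contrastive (InfoNCE) objective: matched image-caption pairs are pulled together in feature space while mismatched pairs within a batch are pushed apart. This is a generic NumPy illustration of the loss family, not the paper's actual implementation; the batch size, embedding dimension, and temperature value below are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used by CLIP-like models.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax followed by NLL of the true class.
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average of image-to-text and text-to-image retrieval losses.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy check: perfectly aligned pairs incur a lower loss than mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = clip_style_loss(emb, emb)         # each image matches its own caption
shuffled = clip_style_loss(emb, emb[::-1])  # captions paired with wrong images
```

The downstream evaluation described above (freezing both encoders and fine-tuning only a linear layer) then amounts to training a single linear classifier on top of these frozen embeddings, the standard linear-probe protocol.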