Image-text-driven multi-modal deep learning models have demonstrated outstanding potential in many fields, and in practice, tasks centered on facial images have broad application prospects. This paper presents \textbf{FaceCaption-15M}, a large-scale, diverse, and high-quality dataset of facial images paired with natural language descriptions (facial image-text pairs), intended to facilitate research on face-centered tasks. FaceCaption-15M comprises more than 15 million pairs of facial images and natural language descriptions of facial features, making it the largest facial image-caption dataset to date. We conducted a comprehensive analysis of image quality, text naturalness, text complexity, and text-image relevance to demonstrate the superiority of FaceCaption-15M. To validate its effectiveness, we first trained a facial language-image pre-training model (FLIP, similar to CLIP) to align facial images with their corresponding captions in feature space. Then, using the resulting image and text encoders and fine-tuning only a linear layer, our FLIP-based models achieved state-of-the-art results on two challenging face-centered tasks. We release FaceCaption-15M to promote research on face-related tasks. All data, code, and models are publicly available at https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M
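The CLIP-style alignment that FLIP performs can be illustrated with a minimal sketch of the symmetric contrastive (InfoNCE) objective: matched image-caption pairs are pulled together in feature space while mismatched pairs within a batch are pushed apart. This is a generic NumPy illustration of the loss family, not the paper's actual implementation; the batch size, embedding dimension, and temperature value below are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used by CLIP-like models.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax followed by NLL of the true class.
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average of image-to-text and text-to-image retrieval losses.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy check: perfectly aligned pairs incur a lower loss than mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = clip_style_loss(emb, emb)         # each image matches its own caption
shuffled = clip_style_loss(emb, emb[::-1])  # captions paired with wrong images
```

The downstream evaluation described above (freezing both encoders and fine-tuning only a linear layer) then amounts to training a single linear classifier on top of these frozen embeddings, the standard linear-probe protocol.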