Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. First, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on this structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. We further develop a robust face perception MLLM baseline, Face-LLaVA, by training it on our proposed face VQA data. Extensive experiments are conducted on various mainstream MLLMs and Face-LLaVA to test their face perception ability, and the results are also compared against human performance. The results reveal that existing MLLMs fall far short of satisfactory performance in understanding fine-grained facial attributes, whereas our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones such as GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
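To make the data organization concrete, below is a minimal sketch of how the hierarchical attribute structure (views containing attributes nested up to three levels, with leaf-level attribute values) and an evaluation VQA pair might be laid out. The abstract does not specify the actual schema, so every field name, attribute, and value shown here is an illustrative assumption, not the released format.

```python
# Hypothetical sketch of FaceBench's data layout. All names and
# values below are assumptions for illustration; the real schema
# is defined by the released dataset, not by this snippet.

# A fragment of the hierarchical attribute structure: a view holds
# attributes nested up to three levels, and leaf nodes list the
# candidate attribute values.
ATTRIBUTE_HIERARCHY = {
    "appearance": {                                   # view (assumed name)
        "eyes": {                                     # level-1 attribute (assumed)
            "eye_shape": ["round", "almond", "monolid"],  # level-2 attribute with values (assumed)
        },
    },
}

# An illustrative VQA pair as one record of the evaluation split.
vqa_pair = {
    "image": "images/000001.jpg",                     # assumed path convention
    "question": "What is the shape of the person's eyes?",
    "choices": ["round", "almond", "monolid"],
    "answer": "almond",
}
```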