Generating images of multiple humans performing complex actions while preserving their facial identities is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1,800 samples, including carefully curated text prompts describing human actions ranging from simple to complex. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images that accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques that incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation code will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
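To illustrate the Hungarian-matching step mentioned above, the sketch below pairs reference identities with faces detected in a generated image so that total ID similarity is maximized. This is a minimal illustration, not the benchmark's actual implementation: it assumes L2-normalized face embeddings (e.g. from a face-recognition model), and the function name and shapes are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def match_identities(ref_embs: np.ndarray, gen_embs: np.ndarray):
    """Pair reference identities with detected faces to maximize total
    cosine ID similarity.

    ref_embs: (R, D) reference face embeddings (unit-norm assumed)
    gen_embs: (G, D) embeddings of faces detected in the generated image
    Returns (ref_indices, face_indices, mean similarity over matched pairs).
    """
    # Cosine similarity between every reference/generated pair.
    sim = ref_embs @ gen_embs.T
    # linear_sum_assignment minimizes cost, so negate the similarity.
    rows, cols = linear_sum_assignment(-sim)
    return rows, cols, float(sim[rows, cols].mean())

# Toy example: two reference identities whose faces appear in swapped order.
refs = np.eye(2)          # identity 0 -> e0, identity 1 -> e1
gens = np.eye(2)[[1, 0]]  # detected faces in reverse order
rows, cols, mean_sim = match_identities(refs, gens)
# cols pairs identity 0 with face 1 and identity 1 with face 0.
```

Matching before scoring prevents a correct image from being penalized merely because faces were detected in a different order than the reference list.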