Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM), designed to serve as a foundation for human-scene vision-language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) we develop a captioning approach for human-centered images that captures human faces, bodies, and backgrounds, and construct a high-quality human-scene image-text dataset (HumanCaptionHQ, about 311k pairs) containing as much detailed information about humans as possible; and (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. We then evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance among multimodal models of comparable scale, particularly excelling in human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, together with the data introduced, will stimulate research in human-related fields.