Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM), designed to serve as a foundation for human-scene vision-language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) we develop a captioning approach for human-centered images that captures human faces, bodies, and backgrounds, and construct a high-quality human-scene image-text dataset (HumanCaptionHQ, about 311k pairs) containing as much detailed information about humans as possible; and (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. We then evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance among multimodal models of comparable scale, particularly excelling in human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, together with the data introduced, will stimulate research in human-related fields.