Recent Multi-modal Large Language Models (MLLMs) have made significant progress in video understanding. However, their performance on videos involving human actions remains limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. \textbf{HAICTrain} comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, \textbf{HAICBench} includes 500 manually annotated video-caption pairs and 1,400 QA pairs, enabling a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human action understanding abilities across 4 benchmarks, but also improves text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
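To make the standardized caption format concrete, the following is a hypothetical caption sketched purely for illustration; it is not drawn from HAICTrain or HAICBench, but it follows the two conventions described above: appearance attributes distinguish each individual, and actions and interactions are described in chronological order.
\begin{quote}
% Hypothetical example for illustration only; not an actual HAIC annotation.
A man in a red jacket and a woman in a blue dress stand in a kitchen. First, the man in the red jacket picks up a knife and chops vegetables on a cutting board. Then, the woman in the blue dress hands him a bowl. Finally, the man in the red jacket pours the chopped vegetables into the bowl while the woman in the blue dress stirs them.
\end{quote}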