Multimodal large language models (MLLMs) are flourishing, but they mainly focus on images, with less attention paid to videos, especially in sub-fields such as prompt engineering, video chain-of-thought (CoT), and instruction tuning on videos. Therefore, we explore the construction of video CoT datasets to advance video OpenQA and improve the reasoning capability of MLLMs. Unfortunately, building such video CoT datasets is not an easy task. Given that human annotation is too cumbersome and expensive, while machine-generated annotations are unreliable due to the hallucination issue, we develop an automatic annotation tool that combines machine and human experts under the active learning paradigm. Active learning is an interactive strategy between the model and human experts; in this way, the workload of human labeling can be reduced and the quality of the dataset can be guaranteed. With the help of the automatic annotation tool, we contribute three datasets, namely VideoCoT, TopicQA, and TopicCoT. Furthermore, we propose a simple but effective benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs. Extensive experiments demonstrate the effectiveness of our solution.
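For concreteness, the human-machine collaboration described above can be pictured as a simple loop: the model drafts a CoT rationale for each video, and only drafts whose confidence falls below a threshold are routed to a human expert for correction. The sketch below is illustrative only; the helper names (generate_rationale, human_revise) and the threshold value are assumptions for exposition, not the paper's actual tool.

```python
# Minimal sketch of an active-learning annotation loop (illustrative only).
# The helpers generate_rationale and human_revise are hypothetical placeholders;
# the real tool's interfaces and selection criterion may differ.

from typing import Callable, List, Tuple

def annotate_with_active_learning(
    videos: List[str],
    generate_rationale: Callable[[str], Tuple[str, float]],  # returns (CoT draft, confidence)
    human_revise: Callable[[str, str], str],                 # expert corrects a draft CoT
    threshold: float = 0.8,                                  # assumed confidence cutoff
) -> List[Tuple[str, str]]:
    """Machine drafts every rationale; only low-confidence drafts go to experts."""
    dataset = []
    for video in videos:
        draft, confidence = generate_rationale(video)
        if confidence < threshold:
            # Uncertain sample: route to a human expert, keeping labeling cost low
            # while guarding dataset quality against hallucinated rationales.
            draft = human_revise(video, draft)
        dataset.append((video, draft))
    return dataset
```

Under this scheme, human effort scales with the number of low-confidence drafts rather than with the full dataset, which is the source of the labeling-cost reduction claimed above.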