Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, such systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities, coupled with their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, in which existing annotations are augmented with LLM-generated QA pairs that are subsequently verified by humans. We then provide benchmark results to establish baseline performance on ProMQA. Our experiments reveal a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.