The rise of large-scale multimodal models has paved the way for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, whether these models are genuinely capable of stronger forms of generalization remains a pressing question, and one that has been largely underexplored in the multimodal setting. Our study addresses this question by examining sequential compositional generalization using \textsc{CompAct} (\underline{Comp}ositional \underline{Act}ivities)\footnote{Project Page: \url{http://cyberiada.github.io/CompAct}}, a carefully constructed, perceptually grounded dataset situated within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset combines raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. Crucially, our setup ensures that the individual concepts are consistently distributed across the training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.
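The split property described above (shared individual concepts, novel compositions at evaluation time) can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: each instance is reduced to a sequence of concept strings, a "composition" is taken to be a pair of adjacent concepts, and the toy data and the function names `atoms`, `compositions`, and `check_split` are hypothetical. The sketch only checks that evaluation concepts are covered by training, which is weaker than the consistent distribution the dataset guarantees.

```python
from typing import Iterable, Sequence, Set, Tuple

Example = Sequence[str]


def atoms(examples: Iterable[Example]) -> Set[str]:
    """Collect every individual concept that appears in any example."""
    return {concept for ex in examples for concept in ex}


def compositions(examples: Iterable[Example]) -> Set[Tuple[str, str]]:
    """Collect adjacent concept pairs (an assumed stand-in for 'compositions')."""
    return {pair for ex in examples for pair in zip(ex, ex[1:])}


def check_split(train: Sequence[Example], evaluation: Sequence[Example]):
    """Return (evaluation concepts unseen in training, compositions novel to evaluation)."""
    unseen_atoms = atoms(evaluation) - atoms(train)
    novel_compositions = compositions(evaluation) - compositions(train)
    return unseen_atoms, novel_compositions


# Hypothetical toy data, not drawn from CompAct.
train = [("take", "knife", "cut", "onion"), ("take", "pan", "wash", "pan")]
evaluation = [("take", "onion", "wash", "knife")]

unseen, novel = check_split(train, evaluation)
print("unseen concepts:", sorted(unseen))
print("novel compositions:", sorted(novel))
```

In a split of the kind the abstract describes, the first set would be empty and the second non-empty: every evaluation concept is familiar, but the ways the concepts are combined are not.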