Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) segments from in-the-wild videos; these videos show an enactment of the assembly actions in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that aligns videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset, IAW (Ikea assembly in the wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated with ground-truth alignments. We define two tasks on this dataset: first, nearest-neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps with the segments of each video. Extensive experiments on IAW demonstrate the superior performance of our approach against alternatives.
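The abstract does not specify the loss functions; as a rough, purely illustrative sketch of the kind of cross-modal contrastive objective such an alignment typically builds on, the following implements a standard symmetric InfoNCE loss between video-segment and diagram embeddings (the function name, embedding shapes, and temperature value are all assumptions, not the paper's method):

```python
import numpy as np

def info_nce(video_emb: np.ndarray, diagram_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch where row i of each matrix
    is a matched video/diagram pair. Illustrative only."""
    # L2-normalize each embedding so similarities are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    d = diagram_emb / np.linalg.norm(diagram_emb, axis=1, keepdims=True)
    logits = (v @ d.T) / temperature  # pairwise similarity matrix
    idx = np.arange(len(v))           # positives lie on the diagonal

    def cross_entropy(lg: np.ndarray) -> float:
        lg = lg - lg.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(lg) / np.exp(lg).sum(axis=1, keepdims=True)
        return float(-np.log(probs[idx, idx]).mean())

    # average the video->diagram and diagram->video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In such a setup, matched video/diagram pairs are pulled together while all other pairs in the batch act as negatives; the paper's contribution lies in additional losses beyond this baseline, which the sketch does not attempt to reproduce.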