The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR given its cleaner videos. At the same time, adverb prediction on AIR is challenging, demonstrating that there is considerable room for improvement.