Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by predicting contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature: atomic actions typically occur at a short timescale, while more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video learning approaches on downstream long-term forecasting tasks, including long-term action anticipation and video summary prediction. Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms state-of-the-art methods by significant margins. Notably, MVP obtains a relative accuracy gain of over 20% in video summary prediction over existing methods.
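As a rough illustration of the multiscale idea only, the sketch below builds one prediction target per timescale from the clips that follow an observed prefix. The abstract does not specify how targets are constructed or which loss is used, so the mean pooling, the timescales (1, 2, 4), and the name `multiscale_future_targets` are hypothetical choices for illustration, not the authors' method.

```python
import numpy as np


def multiscale_future_targets(clip_features, num_observed, timescales=(1, 2, 4)):
    """Pool the features of the next k future clips for each timescale k.

    Assumption: mean pooling stands in for whatever aggregation the
    actual method uses; the abstract does not say.
    """
    future = clip_features[num_observed:]
    targets = {}
    for k in timescales:
        if k > len(future):
            continue  # not enough future clips for this timescale
        targets[k] = future[:k].mean(axis=0)
    return targets


# Toy data: 8 clips with 16-dimensional features, the first 4 observed.
rng = np.random.default_rng(0)
clip_features = rng.normal(size=(8, 16))
targets = multiscale_future_targets(clip_features, num_observed=4)
```

Under these assumptions, a model would be trained to predict each entry of `targets` from the observed clips, so that short timescales capture atomic actions and longer ones capture more complex activities.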