Long-term action anticipation has become an important task for many applications such as autonomous driving and human-robot interaction. Unlike short-term anticipation, predicting more actions into the future poses a real challenge because uncertainty grows with longer horizons. While there has been significant progress in predicting more actions into the future, most proposed methods address the task in a deterministic setup and ignore the underlying uncertainty. In this paper, we propose a novel Gated Temporal Diffusion (GTD) network that models the uncertainty of both the observation and the future predictions. As the generator, we introduce a Gated Anticipation Network (GTAN) to model both observed and unobserved frames of a video in a mutual representation. On the one hand, using a mutual representation for past and future allows us to jointly model ambiguities in the observation and the future; on the other hand, GTAN can by design treat the observed and unobserved parts differently and steer the information flow between them. Our model achieves state-of-the-art results on the Breakfast, Assembly101 and 50Salads datasets in both stochastic and deterministic settings. Code: https://github.com/olga-zats/GTDA