Action anticipation involves forecasting future actions by connecting past events to future ones. However, this reasoning ignores the real-life hierarchy of events, which is considered to be composed of three main parts: past, present, and future. We argue that modeling these three parts and their dependencies could improve performance. Online action detection (OAD), on the other hand, is the task of predicting actions in a streaming manner, where only past and present information is available. Existing OAD approaches therefore miss semantics or future information, which limits their performance. In short, both tasks lack the complete set of knowledge (past-present-future), which makes it challenging to infer action dependencies and leads to low performance. To address this limitation, we propose to fuse both tasks into a single unified architecture. By combining action anticipation and online action detection, our approach supplies the future-information dependencies that online action detection otherwise lacks. This method, referred to as JOADAA, is a unified model that jointly performs action anticipation and online action detection. We validate the proposed model on three challenging datasets: THUMOS'14, a sparsely annotated dataset with one action per time step, and CHARADES and Multi-THUMOS, two densely annotated datasets with more complex scenarios. JOADAA achieves state-of-the-art results on these benchmarks for both tasks.