Streaming egocentric action anticipation: An evaluation scheme and approach

Egocentric action anticipation aims to predict the future actions of the camera wearer from observations of the past. While predictions about the future should be available before the predicted events take place, most approaches do not pay attention to the computational time required to make such predictions. As a result, current evaluation schemes assume that predictions are available right after the input video is observed, i.e., they presume a negligible runtime, which may lead to overly optimistic evaluations. We propose a streaming egocentric action anticipation evaluation scheme which assumes that predictions are performed online and made available only after the model has processed the current input segment, a delay that depends on the model's runtime. To evaluate all models at the same prediction horizon, we propose that slower models base their predictions on temporal segments sampled further ahead of time. Based on the observation that model runtime can affect performance in the considered streaming evaluation scenario, we further propose a lightweight action anticipation model based on feed-forward 3D CNNs which is optimized using knowledge distillation techniques with a novel past-to-future distillation loss. Experiments on three popular datasets, EPIC-KITCHENS-55, EPIC-KITCHENS-100 and EGTEA Gaze+, show that (i) the proposed evaluation scheme induces a different ranking of state-of-the-art methods compared to classic evaluations, (ii) lightweight approaches tend to outperform more computationally expensive ones, and (iii) the proposed model based on feed-forward 3D CNNs and knowledge distillation outperforms the state of the art in the streaming egocentric action anticipation scenario.
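The effect of runtime on the observed segment can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the paper's exact protocol: the function names, the `Model` class, and all numeric values are hypothetical, and the runtime is simply subtracted from the time at which the observation would end in a classic evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical description of an anticipation model."""
    name: str
    runtime_s: float  # assumed time needed to process one input segment, in seconds


def classic_observation_end(action_start_s: float, anticipation_s: float) -> float:
    """Classic evaluation: runtime is treated as zero, so the observed
    segment may end exactly `anticipation_s` before the action starts."""
    return action_start_s - anticipation_s


def streaming_observation_end(
    action_start_s: float, anticipation_s: float, runtime_s: float
) -> float:
    """Streaming evaluation (simplified): the prediction must be ready
    `anticipation_s` before the action, so the observed segment has to
    end `runtime_s` earlier than in the classic setting."""
    if runtime_s < 0:
        raise ValueError("runtime_s must be non-negative")
    return classic_observation_end(action_start_s, anticipation_s) - runtime_s


# Illustrative values only: an action starting at 10 s, anticipated 1 s ahead.
fast = Model(name="lightweight", runtime_s=0.25)
slow = Model(name="heavy", runtime_s=0.5)

fast_end = streaming_observation_end(10.0, 1.0, fast.runtime_s)
slow_end = streaming_observation_end(10.0, 1.0, slow.runtime_s)
print(fast.name, fast_end)
print(slow.name, slow_end)
```

In this sketch both models are scored at the same prediction horizon, but the slower one sees footage that ends earlier, which is one way a heavier model can lose its advantage under streaming evaluation.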
