Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.