Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over both zero-shot frame-based and zero-shot graph-based inference. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models' latent reasoning strengths. The code will be made public.
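To make the idea of a graph demonstration concrete, the following is a minimal sketch of how a Temporal Action Graph might be held in memory, serialized to text, and assembled into a few-shot prompt. The schema and every name in it (`ActionEdge`, `TemporalActionGraph`, `build_few_shot_prompt`, the `subject`/`verb`/`obj`/`window` fields, and the text layout) are illustrative assumptions, not the authors' released format; the narrative-generation and graph-extraction stages, which call a VLM, are not shown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionEdge:
    """One hand-object interaction (hypothetical schema, not the paper's)."""
    subject: str  # acting entity, e.g. "right_hand"
    verb: str     # open-vocabulary verb, e.g. "grasp"
    obj: str      # open-vocabulary object, e.g. "knife"
    window: int   # index of the short temporal window the narrative covered


@dataclass
class TemporalActionGraph:
    """Time-ordered collection of interaction edges for one clip."""
    edges: list[ActionEdge] = field(default_factory=list)

    def add(self, subject: str, verb: str, obj: str, window: int) -> None:
        self.edges.append(ActionEdge(subject, verb, obj, window))

    def to_text(self) -> str:
        # Sort by window; sorted() is stable, so edges that share a
        # window keep their insertion order.
        ordered = sorted(self.edges, key=lambda e: e.window)
        return "\n".join(
            f"t{e.window}: ({e.subject}) -[{e.verb}]-> ({e.obj})"
            for e in ordered
        )


def build_few_shot_prompt(
    demos: list[tuple[TemporalActionGraph, str]],
    query: TemporalActionGraph,
) -> str:
    """Concatenate labeled graph demonstrations, then the unlabeled query."""
    parts = [
        f"Graph:\n{graph.to_text()}\nAction: {label}"
        for graph, label in demos
    ]
    parts.append(f"Graph:\n{query.to_text()}\nAction:")
    return "\n\n".join(parts)
```

Under these assumptions, the prompt returned by `build_few_shot_prompt` is plain text, so it could be passed to any language model without frames; passing an empty `demos` list gives the zero-shot graph-based variant.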