Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, and lack the fine-grained spatio-temporal reasoning needed to capture the underlying dynamics of HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for the Objects and Object Parts questions, enabling the evaluation of object- and part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, far below human performance (97%). Further analysis reveals remaining challenges in understanding spatial relationships, motion, and part-level geometry. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.