Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectories provide grounding cues beyond appearance alone.
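The abstract does not spell out the fusion design, so the following is only a minimal sketch of what cross-attention fusion with adaptive gating could look like, not the authors' implementation. It assumes single-head attention without learned projections, a scalar sigmoid gate per time step, and a residual connection; the names `cross_attend`, `gated_fusion`, `gate_w`, and `gate_b` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(video_feats, hand_feats):
    """Each video-text feature (query) attends over all hand features (keys and values)."""
    scale = math.sqrt(len(hand_feats[0]))
    attended = []
    for q in video_feats:
        weights = softmax([dot(q, k) / scale for k in hand_feats])
        attended.append([
            sum(w * h[i] for w, h in zip(weights, hand_feats))
            for i in range(len(q))
        ])
    return attended

def gated_fusion(video_feats, hand_feats, gate_w, gate_b):
    """Assumed gating: g = sigmoid(gate_w . [v; a] + gate_b), output = v + g * a."""
    attended = cross_attend(video_feats, hand_feats)
    fused, gates = [], []
    for v, a in zip(video_feats, attended):
        # v + a concatenates the two lists, so gate_w has twice the feature dimension
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, v + a) + gate_b)))
        gates.append(g)
        fused.append([vi + g * ai for vi, ai in zip(v, a)])
    return fused, gates

# Toy inputs: 2 video-text steps and 3 hand-trajectory steps, feature dimension 2
video = [[1.0, 0.0], [0.0, 1.0]]
hand = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused, gates = gated_fusion(video, hand, gate_w=[0.1, -0.2, 0.3, 0.4], gate_b=0.0)
```

With a gate of this form, a time step whose hand features are uninformative can fall back to the original video--text feature, because the gate can shrink toward zero and leave the residual path unchanged.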