Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult because most body parts are invisible from the egocentric view. Prior approaches either rely mainly on head trajectories, which leaves the body pose ambiguous, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both the head trajectory and hand cues that are only intermittently visible due to field-of-view limits and occlusions, as on real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models these real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field of view are crucial for accurate motion reconstruction, and therefore employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, marking a practical step toward reliable in-the-wild egocentric 3D motion understanding.
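For intuition, the sketch below illustrates the kind of field-of-view-driven hand-visibility augmentation the abstract alludes to. It is a minimal assumption-laden illustration, not the HaMoS implementation: given 3D hand positions in the head camera's frame, a hand cue is kept only when it lies in front of the camera and within a simplified circular field of view, and random per-frame dropout stands in for transient occlusions. All names (`hand_visibility_mask`, `fov_deg`, `occlusion_rate`) are hypothetical.

```python
# Minimal sketch of a FOV-based hand-visibility augmentation (hypothetical,
# not the authors' code). Hands outside the camera frustum or randomly
# "occluded" are masked out, yielding the intermittent hand cues that a
# hand-aware motion model would have to cope with at training time.
import numpy as np

def hand_visibility_mask(hands_cam, fov_deg=110.0, occlusion_rate=0.2, rng=None):
    """hands_cam: (T, 2, 3) hand positions in the camera frame (+z forward).
    Returns a (T, 2) boolean mask: True where the hand cue is observable."""
    rng = np.random.default_rng() if rng is None else rng
    half_fov = np.deg2rad(fov_deg) / 2.0

    z = hands_cam[..., 2]
    in_front = z > 1e-6
    # Angle between the optical axis and the ray to the hand; a single
    # angular threshold approximates the FOV as circular for simplicity.
    norm = np.linalg.norm(hands_cam, axis=-1) + 1e-9
    angle = np.arccos(np.clip(z / norm, -1.0, 1.0))
    in_fov = in_front & (angle < half_fov)

    # Random per-frame dropout mimics transient occlusions (held objects,
    # the body itself, etc.).
    occluded = rng.random(in_fov.shape) < occlusion_rate
    return in_fov & ~occluded

# Example: two hands over four frames; the second hand drifts out of view
# and behind the head, so its cue disappears mid-sequence.
hands = np.array([[[0.2, -0.3, 0.5], [0.3, -0.2, 0.4]],
                  [[0.2, -0.3, 0.5], [0.5, -0.2, 0.1]],
                  [[0.2, -0.3, 0.5], [0.6, -0.1, -0.2]],
                  [[0.2, -0.3, 0.5], [0.6, 0.0, -0.4]]])
print(hand_visibility_mask(hands, rng=np.random.default_rng(0)))
```

Applied to motion-capture sequences rendered from diverse virtual head cameras, a mask like this lets one synthesize realistic intermittent hand observations without needing datasets that pair real egocentric video with ground-truth body motion.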