EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers an accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling focus either on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, and neglect dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, such as multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To address this, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting to segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that exploits the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations of both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and dynamic Gaussian methods on challenging in-the-wild videos, and we also qualitatively demonstrate the high quality of the reconstructed models.
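
As a rough illustration of what tracking rigid object motion means for an explicit Gaussian representation, the sketch below applies a single rigid transform to the Gaussians flagged as belonging to an object and leaves the background Gaussians untouched. It relies only on the standard fact that a rotation R and translation t map a Gaussian with mean mu and covariance Sigma to mean R mu + t and covariance R Sigma R^T. The function name, the array layout, and the boolean object mask are assumptions made for this example; this is not the EgoGaussian implementation.

```python
import numpy as np


def rigid_transform_gaussians(means, covs, object_mask, R, t):
    """Move object Gaussians by one rigid motion; keep the background static.

    Hypothetical helper for illustration only.
    means:       (N, 3) Gaussian centers
    covs:        (N, 3, 3) Gaussian covariance matrices
    object_mask: (N,) boolean, True where a Gaussian belongs to the object
    R, t:        (3, 3) rotation matrix and (3,) translation vector
    """
    new_means = means.copy()
    new_covs = covs.copy()
    # mu' = R mu + t, written for row vectors as mu @ R^T + t
    new_means[object_mask] = means[object_mask] @ R.T + t
    # Sigma' = R Sigma R^T, broadcast over the selected Gaussians
    new_covs[object_mask] = R @ covs[object_mask] @ R.T
    return new_means, new_covs


# Two Gaussians: the first is treated as the object, the second as background.
means = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
covs = np.stack([np.diag([1.0, 4.0, 9.0]), np.eye(3)])
object_mask = np.array([True, False])

# Rotate 90 degrees about the z axis, then lift by one unit along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 1.0])

moved_means, moved_covs = rigid_transform_gaussians(means, covs, object_mask, R, t)
```

Because the Gaussians are discrete primitives, the object/background split is just a per-Gaussian label here, and the per-clip object pose reduces to six degrees of freedom shared by all object Gaussians.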