Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. This limitation is particularly evident in ADL tasks, where understanding detailed human-object interactions and human-centric motion is crucial for applications such as elderly monitoring and cognitive assessment. To address this, we leverage the complementary nature of egocentric views to enhance LVLMs' understanding of exocentric ADL videos. To this end, we propose an online ego2exo distillation approach to learn ego-augmented exo representations in LVLMs. While effective, this approach requires paired ego-exo training data, which is impractical to collect in real-world ADL scenarios. We therefore develop EgoMimic, a skeleton-guided method that generates mimicked ego views from exocentric videos. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, as demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed EgoPerceptionMCQ benchmark, designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at https://github.com/dominickrei/EgoExo4ADL.
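As an illustrative sketch only (not the authors' implementation), the core idea of online ego2exo distillation can be expressed as aligning the LVLM's exocentric visual features with features extracted from a paired (or EgoMimic-generated) egocentric view. The function name, tensor shapes, teacher detachment, and cosine-based objective below are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def ego2exo_distillation_loss(exo_features: torch.Tensor,
                              ego_features: torch.Tensor) -> torch.Tensor:
    """Hypothetical distillation objective: pull exocentric token features
    toward (detached) egocentric teacher features via cosine similarity.

    exo_features, ego_features: (batch, num_tokens, dim) visual token features
    produced for the exocentric video and the paired or mimicked ego view.
    """
    exo = F.normalize(exo_features, dim=-1)
    ego = F.normalize(ego_features.detach(), dim=-1)  # teacher branch receives no gradient
    # 1 - cosine similarity, averaged over tokens and batch
    return (1.0 - (exo * ego).sum(dim=-1)).mean()

# Usage sketch (assumed): combined with the LVLM's language-modeling loss during fine-tuning.
# total_loss = lm_loss + lambda_distill * ego2exo_distillation_loss(exo_feats, ego_feats)
```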