Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a \textit{Visuomotor Coordination Representation} (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.