As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user's position. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on the user's trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance that decouples learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
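To illustrate the decoupling of learning from control described above, the following is a minimal, hypothetical sketch of classifier-free guidance applied to a gaze-score condition at inference time. All names (`velocity_model`, `gaze_score`, `guidance_scale`) are illustrative assumptions, not the paper's actual API; a toy stand-in model is used in place of the trained flow matching network.

```python
def cfg_velocity(velocity_model, x_t, t, cond, gaze_score, guidance_scale):
    """Blend conditional and unconditional flow-matching velocities.

    During training, the gaze condition is randomly dropped so the model
    learns both a conditional and an unconditional velocity field; at
    inference, the two are combined with a user-chosen guidance scale.
    """
    v_cond = velocity_model(x_t, t, cond, gaze_score)   # gaze-conditioned
    v_uncond = velocity_model(x_t, t, cond, None)       # gaze dropped
    # guidance_scale > 1 pushes motion toward stronger eye contact;
    # guidance_scale < 1 weakens the gaze condition's influence.
    return [vu + guidance_scale * (vc - vu)
            for vc, vu in zip(v_cond, v_uncond)]

# Toy stand-in for the trained model: treats gaze_score as an additive bias.
def toy_model(x_t, t, cond, gaze_score):
    g = gaze_score if gaze_score is not None else 0.0
    return [xi * (1 - t) + g for xi in x_t]

out = cfg_velocity(toy_model, [0.5, -0.2], 0.3, None, 1.0, guidance_scale=2.0)
```

Because the gaze score is only a conditioning signal, the same trained model serves every eye-contact preference; no retraining is needed to change intensity.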