Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and inference of user intent, which conflicts with the strict latency budget of real-time interaction. We present \emph{ProAct}, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency \emph{Behavioral System} for streaming multimodal interaction from a slower \emph{Cognitive System} that performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.
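The asynchronous intention injection described above can be illustrated with a minimal sketch. The paper does not specify an implementation; the class and intention names below (`IntentionChannel`, `"offer_help"`, etc.) are hypothetical, chosen only to show the core idea: the slow Cognitive System publishes its latest intention into a non-blocking shared channel, while the fast Behavioral System reads whatever intention is current on each tick without ever waiting on the deliberative loop.

```python
import threading


class IntentionChannel:
    """Non-blocking handoff between the two systems (illustrative).

    The slow Cognitive System publishes its latest high-level intention;
    the fast Behavioral System reads the current one without waiting.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._intention = None  # e.g. "offer_help"; names are hypothetical

    def publish(self, intention):
        # Called by the slow deliberative loop, at its own pace.
        with self._lock:
            self._intention = intention

    def latest(self):
        # Called by the fast streaming loop every tick; never blocks
        # beyond a brief lock, so motion generation stays low-latency.
        with self._lock:
            return self._intention


def behavioral_step(observation, intention):
    # Stand-in for the streaming motion generator: reactive gesture by
    # default, proactive gesture once an intention has been injected.
    return f"proactive:{intention}" if intention else f"reactive:{observation}"


# Deterministic simulation of a few ticks of the fast loop, with the
# slow system injecting an intention mid-stream at tick 2.
channel = IntentionChannel()
outputs = []
for tick, obs in enumerate(["wave", "nod", "nod", "wave"]):
    if tick == 2:
        channel.publish("offer_help")  # asynchronous injection
    outputs.append(behavioral_step(obs, channel.latest()))

print(outputs)
# The stream switches from reactive to proactive behavior mid-sequence,
# without the fast loop ever blocking on the slow one.
```

In the actual system the proactive branch would condition the streaming flow-matching model on the intention via ControlNet rather than emit a label, but the handoff pattern (latest-value channel, polled each tick) is the same.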