Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to use video inputs effectively to steer human reaction synthesis, producing reaction motions that do not match the content of the video sequences. We show that this limitation arises from a severe relational distortion between visual observations and reaction types. To address this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism that mitigates relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement, which uses the rectified visual cues to further steer the refinement of generated reaction motions, improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code will be released at https://github.com/zhouyuan888888/MuSteerNet.