ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Modeling such behavior is essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation: generating the ego agent's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data scattered across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen pairs segment-level generation with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions while generalizing effectively across diverse interaction scenarios.
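The online loop described above (plan a segment, then adjust each frame as new cues arrive) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `generate_segment`, `refine_frame`, and `react_online`, the flat pose vectors, the segment length, and the blending weight are all hypothetical, and the learned generator and refinement modules are replaced here by simple linear interpolation purely to show the control flow.

```python
from typing import List, Sequence

Pose = List[float]  # hypothetical: a pose is a flat vector of floats


def generate_segment(last_pose: Sequence[float], cue: Sequence[float],
                     seg_len: int) -> List[Pose]:
    """Stand-in for a segment-level generator (assumption, not the paper's model).

    Plans seg_len future poses by interpolating linearly from the last
    pose toward the cue observed when the segment starts.
    """
    return [
        [p + (c - p) * (t + 1) / seg_len for p, c in zip(last_pose, cue)]
        for t in range(seg_len)
    ]


def refine_frame(planned: Sequence[float], cue: Sequence[float],
                 alpha: float = 0.5) -> Pose:
    """Stand-in for frame-wise refinement (assumption).

    Blends the planned pose with the cue observed at this frame.
    """
    return [(1.0 - alpha) * p + alpha * c for p, c in zip(planned, cue)]


def react_online(initial_pose: Sequence[float],
                 cues: Sequence[Sequence[float]],
                 seg_len: int = 4) -> List[Pose]:
    """Emit one reaction pose per incoming cue.

    A new segment is planned only every seg_len frames; every frame in
    between is adjusted cheaply using the newest cue.
    """
    last_pose: Pose = list(initial_pose)
    segment: List[Pose] = []
    output: List[Pose] = []
    for step, cue in enumerate(cues):
        idx = step % seg_len
        if idx == 0:
            # Costly step, run once per segment.
            segment = generate_segment(last_pose, cue, seg_len)
        # Cheap step, run every frame with the latest observation.
        last_pose = refine_frame(segment[idx], cue)
        output.append(last_pose)
    return output


if __name__ == "__main__":
    cues = [[1.0, 0.0]] * 6
    for pose in react_online([0.0, 0.0], cues):
        print(pose)
```

In this toy version the expensive planning call runs once per segment while the per-frame adjustment stays cheap, which is the trade-off the abstract attributes to combining segment-level generation with frame-level refinement.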
