MARRS: Masked Autoregressive Unit-based Reaction Synthesis

This work addresses a challenging task, human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Autoregressive modeling approaches with vector quantization (VQ) currently achieve remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss and low codebook utilization. In addition, while dividing the body into separate units can be beneficial, its computational complexity needs to be considered, and the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. First, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units and encodes each independently. Second, we propose Action-Conditioned Fusion (ACF), which randomly masks a subset of reactive tokens and extracts body-specific and hand-specific information from the active tokens. Third, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between the body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.
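
As an illustration only, the random token masking and the cross-unit modulation described above can be sketched in a few lines of NumPy. The abstract does not specify how either step is implemented, so everything here is an assumption: the function names `random_mask` and `modulate`, the shared mask for both units, the zero vector used as a mask token, and the scale-and-shift form of the modulation are hypothetical choices, not the authors' design.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_tokens) entries set to True."""
    num_masked = int(round(mask_ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask

def modulate(target, source, w_scale, w_shift):
    """Assumed scale-and-shift modulation: parameters are predicted from
    `source` tokens and applied to `target` tokens, position by position."""
    scale = source @ w_scale  # (T, D)
    shift = source @ w_shift  # (T, D)
    return target * (1.0 + scale) + shift

rng = np.random.default_rng(0)
T, D = 8, 4  # hypothetical sequence length and latent width
body = rng.normal(size=(T, D))  # stand-in for continuous body-unit latents
hand = rng.normal(size=(T, D))  # stand-in for continuous hand-unit latents

# Mask the same reactive positions in both units (an assumption).
mask = random_mask(T, mask_ratio=0.5, rng=rng)
mask_token = np.zeros(D)  # a learned embedding in a real model
body_in = np.where(mask[:, None], mask_token, body)
hand_in = np.where(mask[:, None], mask_token, hand)

# Each unit is modulated by the other; weights are random placeholders
# and are shared across both directions only for brevity.
w_scale = rng.normal(scale=0.1, size=(D, D))
w_shift = rng.normal(scale=0.1, size=(D, D))
body_mod = modulate(body_in, hand_in, w_scale, w_shift)
hand_mod = modulate(hand_in, body_in, w_scale, w_shift)
```

With zero-valued weights, `modulate` returns its target unchanged, so in this sketch the modulation acts as a residual adjustment on top of each unit's own tokens.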
