Predicting human motion is crucial to safe and effective close human-robot collaboration in the intelligent remanufacturing systems of the future. Existing work falls into two groups: methods that focus on accuracy and predict a single future motion, and methods that generate diverse predictions from observations. The former fail to address the uncertainty and multi-modal nature of human motion, while the latter often produce motion sequences that deviate too far from the ground truth or are unrealistic given the observed history. To tackle these issues, we propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction that generates samples that are more likely to occur while maintaining a certain level of diversity. Our model uses a Transformer backbone with long skip connections between shallow and deep layers. Additionally, we employ the discrete cosine transform to model motion sequences in the frequency space, which improves performance. In contrast to prior diffusion-based models, which rely on extra modules such as cross-attention and adaptive layer normalization to condition the prediction on past observed motion, we treat all inputs, including conditions, as tokens, yielding a more lightweight model than existing approaches. Extensive experiments on benchmark datasets validate the effectiveness of our human motion prediction model.
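To illustrate what modeling a motion sequence in the frequency space can look like, the following is a minimal sketch, not the TransFusion implementation. It assumes an orthonormal DCT-II applied along the time axis and a truncation to the lowest-frequency coefficients; the abstract does not state the DCT variant or the number of coefficients kept. The function names (`dct_basis`, `to_frequency`, `to_time`) and the toy sequence shape are hypothetical.

```python
import numpy as np


def dct_basis(num_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (num_frames, num_frames).

    Row k holds the k-th cosine basis function sampled at every frame.
    (Assumption: DCT-II with orthonormal scaling.)
    """
    n = np.arange(num_frames)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * num_frames))
    basis *= np.sqrt(2.0 / num_frames)
    basis[0] /= np.sqrt(2.0)  # the constant row needs extra scaling
    return basis


def to_frequency(motion: np.ndarray, keep: int) -> np.ndarray:
    """Project a (frames, features) sequence onto the first `keep` DCT rows."""
    basis = dct_basis(motion.shape[0])
    return basis[:keep] @ motion


def to_time(coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Reconstruct a (num_frames, features) sequence from truncated coefficients."""
    basis = dct_basis(num_frames)
    return basis[: coeffs.shape[0]].T @ coeffs


if __name__ == "__main__":
    # Hypothetical toy data: 20 frames, 3 joints with xyz coordinates each.
    t = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
    motion = np.sin(2 * np.pi * t * np.arange(1, 10).reshape(1, -1) / 9.0)

    coeffs = to_frequency(motion, keep=10)
    recon = to_time(coeffs, num_frames=20)
    print("coefficient shape:", coeffs.shape)
    print("reconstruction error:", np.linalg.norm(recon - motion))
```

Because the basis is orthonormal, keeping all coefficients reconstructs the sequence exactly, and dropping high-frequency rows can only increase the error. A model operating on the truncated coefficients therefore works with a more compact representation of the motion.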