Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.
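To make the idea of amplifying cross-attention scores concrete, the following is a minimal sketch and not the MultiAct implementation. It assumes a single motion-frame query attending over the text tokens of a prompt, and it strengthens chosen tokens by adding `log(gain)` to their pre-softmax scores, which is equivalent to multiplying their unnormalized attention weights by `gain`. The function name `amplify_attention`, the example prompt, the score values, and the choice of target token are all hypothetical; the abstract does not specify how tokens, layers, or gains are selected beyond stating that an auxiliary decision scheme chooses them per prompt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_attention(logits, target_tokens, gain):
    """Return attention weights after boosting selected token scores.

    logits:        pre-softmax cross-attention scores, one per text token
    target_tokens: set of token indices to strengthen (assumed given)
    gain:          multiplicative boost on unnormalized weights; > 1 strengthens
    """
    if gain <= 0:
        raise ValueError("gain must be positive")
    boost = math.log(gain)
    boosted = [x + boost if i in target_tokens else x
               for i, x in enumerate(logits)]
    return softmax(boosted)


# Hypothetical composite prompt in which "walks" dominates "waving".
tokens = ["a", "person", "walks", "while", "waving"]
logits = [0.1, 0.5, 2.0, 0.0, 0.8]  # illustrative values only

before = softmax(logits)
after = amplify_attention(logits, target_tokens={4}, gain=3.0)

for tok, b, a in zip(tokens, before, after):
    print(f"{tok:>7}: {b:.3f} -> {a:.3f}")
```

In this toy setting the weight on "waving" rises and the weight on "walks" falls, while the weights still sum to one. In a real motion generator the same adjustment would be applied inside selected cross-attention layers at inference time, and which layers and tokens to target is exactly the prompt-specific choice the abstract attributes to the auxiliary decision scheme.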