KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
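
To make the idea of "constraints as memory inside self-attention" concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. It assumes the control memories are simply appended as extra key/value rows, which the abstract does not specify; the function name, the projections `w_ck` and `w_cv`, and the `ctrl` features (standing in for the output of a trajectory encoder) are all hypothetical. The query, key, and value projections of the backbone are treated as frozen, and only the two control projections would be trained.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_with_kv_memory(x, w_q, w_k, w_v, ctrl, w_ck, w_cv):
    """Single-head self-attention with extra control-conditioned K/V rows.

    x:             (n_tokens, d) motion tokens, e.g. one per frame-part site
    ctrl:          (n_ctrl, d_c) trajectory features (hypothetical encoder output)
    w_q, w_k, w_v: (d, d) backbone projections, assumed frozen
    w_ck, w_cv:    (d_c, d) injection projections, assumed trainable
    """
    # The query stream comes only from the motion tokens and is left untouched.
    q = x @ w_q
    # Assumption: control memories are appended as additional key/value rows.
    k = np.concatenate([x @ w_k, ctrl @ w_ck], axis=0)
    v = np.concatenate([x @ w_v, ctrl @ w_cv], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # (n_tokens, n_tokens + n_ctrl)
    return attn @ v, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_c, n_tokens, n_ctrl = 8, 4, 6, 3
    x = rng.normal(size=(n_tokens, d))
    ctrl = rng.normal(size=(n_ctrl, d_c))
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    w_ck, w_cv = (rng.normal(size=(d_c, d)) for _ in range(2))
    out, attn = self_attention_with_kv_memory(x, w_q, w_k, w_v, ctrl, w_ck, w_cv)
    print(out.shape, attn.shape)
```

In this sketch, each motion token decides through its own query how much to read from the control rows, so passing an empty `ctrl` array reduces the layer to ordinary self-attention over the motion tokens.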
