Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed, and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
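The interplay between the coherence reward and the discriminator can be illustrated with a minimal sketch. This is not the paper's implementation: the additive combination, the weight `adv_weight`, the use of a sigmoid over a scalar discriminator logit, and all function names are assumptions made purely for illustration. The abstract states only that the policy maximizes the discriminator output in addition to coherence rewards.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def discriminator_loss(data_logits, policy_logits):
    """Binary cross-entropy for the discriminator (assumed form).

    Trajectories from the data distribution are labeled 1,
    policy-generated trajectories are labeled 0.
    """
    eps = 1e-12  # guards against log(0)
    data_terms = [-math.log(sigmoid(z) + eps) for z in data_logits]
    policy_terms = [-math.log(1.0 - sigmoid(z) + eps) for z in policy_logits]
    return (sum(data_terms) + sum(policy_terms)) / (len(data_terms) + len(policy_terms))


def shaped_reward(coherence_reward, policy_logit, adv_weight=0.5):
    """Coherence reward plus a weighted adversarial bonus (assumed additive form).

    The bonus is the discriminator's estimate, in (0, 1), that the
    trajectory came from the data distribution.
    """
    return coherence_reward + adv_weight * sigmoid(policy_logit)


# Hypothetical numbers: a collapsed trajectory (e.g. one chord repeated) scores
# high on coherence but is easily flagged by the discriminator, whereas a varied
# trajectory scores slightly lower on coherence but looks like real data.
collapsed = shaped_reward(coherence_reward=1.0, policy_logit=-3.0)
varied = shaped_reward(coherence_reward=0.9, policy_logit=1.0)
print(varied > collapsed)
```

With `adv_weight=0.0` the ordering flips and the collapsed trajectory wins, which is the reward-hacking failure mode the adversarial term is meant to counteract. Because the discriminator is retrained on fresh policy trajectories, the bonus keeps tracking whatever shortcut the policy currently exploits.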
