Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We recast this problem as preference alignment, analogous to aligning LLMs with human intent, and introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized with a stable, clipped trust-region surrogate. The reward, derived from a progressively aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in text-, audio-, and image-queried separation, with notable improvements in both signal metrics and semantic quality. Our code is available at https://github.com/mars-sep/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
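To make the optimization concrete, the following is a minimal sketch of a factorized Beta mask policy trained with a clipped trust-region (PPO-style) surrogate, as described above. All shapes, parameter values, and the scalar advantage stand-in are illustrative assumptions, not the authors' implementation; the actual reward would come from the preference reward model.

```python
import torch
from torch.distributions import Beta

# Hypothetical sketch: a factorized Beta policy over time-frequency mask
# values in [0, 1], optimized with a clipped PPO-style surrogate.
torch.manual_seed(0)

T, F = 4, 6  # time frames, frequency bins (illustrative sizes)

# In practice a network predicts per-bin Beta parameters; here we use
# random positive tensors as stand-ins.
alpha = torch.rand(T, F) + 1.0
beta = torch.rand(T, F) + 1.0
policy = Beta(alpha, beta)  # one independent Beta per T x F bin

# Sample a soft mask; independence across bins means log-probs sum.
mask = policy.sample()
log_prob = policy.log_prob(mask).sum()

# "Old" policy from the previous update (here: a perturbed copy).
old_policy = Beta(alpha * 1.05, beta)
old_log_prob = old_policy.log_prob(mask).sum()

# Clipped trust-region surrogate: -min(r*A, clip(r, 1-eps, 1+eps)*A),
# where A would be the advantage from the preference reward model.
advantage = torch.tensor(1.0)  # stand-in scalar advantage
eps = 0.2
ratio = torch.exp(log_prob - old_log_prob)
surrogate = -torch.min(
    ratio * advantage,
    torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage,
)
```

The Beta support [0, 1] matches the range of a spectrogram mask directly, so no squashing function is needed; the clipping keeps each policy update inside a trust region around the previous policy.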