Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples and code are available.
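As a rough illustration of the hierarchical decoding idea, the following Python sketch scores candidate token sequences at several window sizes, prunes the weakest candidates after each resolution, and returns the survivors ranked by accumulated score. It is a minimal sketch under stated assumptions, not the MSpoof-TTS implementation: the abstract does not specify the detectors, the resolutions, or the pruning schedule, so `toy_scorer`, `windowed_score`, `prune_and_rerank`, `keep_ratio`, and the window sizes are all hypothetical, and a simple repetition heuristic stands in for a trained spoof detector.

```python
import math
from typing import Callable, List, Sequence

Tokens = Sequence[int]
# Assumed convention: a scorer returns a higher value for more natural-looking segments.
Scorer = Callable[[Tokens], float]


def toy_scorer(segment: Tokens) -> float:
    """Hypothetical stand-in for a trained spoof detector.

    Penalizes adjacent repeated tokens; returns a value in [0, 1].
    """
    if len(segment) < 2:
        return 1.0
    repeats = sum(a == b for a, b in zip(segment, segment[1:]))
    return 1.0 - repeats / (len(segment) - 1)


def windowed_score(tokens: Tokens, window: int, scorer: Scorer) -> float:
    """Average the scorer over non-overlapping windows of `window` tokens."""
    if not tokens:
        raise ValueError("empty token sequence")
    segments = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    return sum(scorer(seg) for seg in segments) / len(segments)


def prune_and_rerank(
    candidates: List[Tokens],
    resolutions: Sequence[int],
    scorer: Scorer,
    keep_ratio: float = 0.5,
) -> List[Tokens]:
    """Progressively prune candidates, one resolution at a time.

    After each resolution, the surviving candidates are sorted by their
    accumulated score and only the top `keep_ratio` fraction (at least one)
    is kept. The returned list is ordered best first.
    """
    alive = list(range(len(candidates)))
    total = [0.0] * len(candidates)
    for window in resolutions:
        for i in alive:
            total[i] += windowed_score(candidates[i], window, scorer)
        alive.sort(key=lambda i: total[i], reverse=True)
        keep = max(1, math.ceil(len(alive) * keep_ratio))
        alive = alive[:keep]
    return [candidates[i] for i in alive]


if __name__ == "__main__":
    hypotheses = [
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 2, 3, 4, 5, 6, 7, 8],
        [1, 2, 2, 3, 3, 4, 4, 5],
        [1, 2, 3, 3, 4, 5, 6, 7],
    ]
    print(prune_and_rerank(hypotheses, resolutions=[2, 4, 8], scorer=toy_scorer))
```

In this toy setup the fine-grained window of 2 tokens misses repeats that straddle window boundaries, while the coarser window of 4 catches them, which is one reason to combine several resolutions instead of relying on a single one. In a real system each resolution would presumably have its own trained detector, and the candidates would come from sampling the codec language model.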