Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so that early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate within a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the prompted setting at a low number of function evaluations (NFE), outperforming unguided and guidance-only samplers that use substantially more steps.
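To make the sampler structure described above concrete, the following is a minimal toy sketch in Python with NumPy. It is not the authors' implementation. The `MASK` id, the `model(x, conditional)` interface, the linear unmasking schedule (jump probability `h / (1 - t)`), the constant `remask_rate`, and the rule that remasking is switched off on the final step are all assumptions made for illustration. Prompt-matched coupling is approximated only by keeping the prompt tokens fixed, and the guidance rule is the generic logit-extrapolation form rather than a confirmed detail of the method.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask state (assumption, not from the paper)


def guided_logits(cond, uncond, w):
    """Extrapolate from unconditional toward conditional logits with weight w."""
    return (1.0 + w) * cond - w * uncond


def sample_infill(model, prompt, length, steps=8, w=1.0, remask_rate=0.5, seed=0):
    """Toy masked-infilling sampler with guidance and token-to-mask revisions.

    model(x, conditional) is a hypothetical callable returning logits of
    shape (length, vocab) for the current token sequence x.
    """
    rng = np.random.default_rng(seed)
    x = np.full(length, MASK, dtype=np.int64)
    n_prompt = len(prompt)
    x[:n_prompt] = prompt                      # prompt tokens are never resampled
    editable = np.arange(length) >= n_prompt
    h = 1.0 / steps

    for k in range(steps):
        t = k * h
        last = k == steps - 1

        # Guided categorical distribution over codec tokens at every position.
        logits = guided_logits(model(x, True), model(x, False), w)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        proposal = np.array([rng.choice(probs.shape[1], p=p) for p in probs])

        # Jump probabilities for one leap of size h (assumed linear schedule).
        p_unmask = 1.0 if last else min(1.0, h / (1.0 - t))
        p_remask = 0.0 if last else remask_rate * h  # assumed constraint

        # Decide all jumps from the state at the start of the step.
        masked = x == MASK
        u = rng.random(length)
        unmask = masked & editable & (u < p_unmask)
        remask = ~masked & editable & (u < p_remask)

        x[unmask] = proposal[unmask]           # mask -> token
        x[remask] = MASK                       # token -> mask (revision)

    return x
```

In this sketch every position jumps independently within a step, which is the tau-leaping idea in its simplest form. Forcing `p_unmask = 1` and `p_remask = 0` on the last step guarantees that no mask tokens remain in the output. Calling `sample_infill` with any function that returns an array of logits of shape `(length, vocab)` is enough to exercise it.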