LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech

Forced alignment (FA) predicts start and end timestamps for words or characters in speech, but existing methods are language-specific and prone to cumulative temporal shifts. Speech large language models (SLLMs) offer multilingual speech understanding and long-sequence processing, which makes them promising for FA in multilingual, crosslingual, and long-form speech settings. However, directly applying the next-token prediction paradigm of SLLMs to FA leads to hallucinations and slow inference. To bridge this gap, we propose LLM-ForcedAligner, which reformulates FA as a slot-filling task: timestamps are treated as discrete indices, and special timestamp tokens are inserted into the transcript as slots. Conditioned on the speech embeddings and the transcript with slots, the SLLM directly predicts the time index at each slot. During training, causal attention masking is combined with non-shifted input and label sequences, so each slot predicts its own timestamp index from itself and the preceding context, and the loss is computed only at slot positions. Dynamic slot insertion enables FA at arbitrary positions. The formulation also supports non-autoregressive inference, which avoids hallucinations and improves speed. Experiments across multilingual, crosslingual, and long-form speech scenarios show that LLM-ForcedAligner achieves a 69% to 78% relative reduction in accumulated averaging shift compared with prior methods. Checkpoint and inference code are available at https://github.com/QwenLM/Qwen3-ASR.
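
The slot-filling formulation can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: the slot token name `<ts>`, the 0.08 s frame duration, the ignore value `-100`, and all helper names are assumptions made for illustration. It shows slots inserted around each word, labels that stay aligned with their own positions (non-shifted), a loss restricted to slot positions, and decoding that reads every slot from a single set of logits rather than generating tokens one by one.

```python
import math

SLOT = "<ts>"    # hypothetical slot token name (assumption)
IGNORE = -100    # label value for positions excluded from the loss (assumption)


def insert_slots(words):
    """Place a start slot and an end slot around every word."""
    seq = []
    for w in words:
        seq += [SLOT, w, SLOT]
    return seq


def to_index(seconds, frame_sec=0.08):
    """Quantise a time in seconds to a discrete index (frame size is assumed)."""
    return round(seconds / frame_sec)


def build_labels(seq, times, frame_sec=0.08):
    """Non-shifted labels: labels[i] is the target for position i itself.

    Only slot positions carry a target; every other position is ignored.
    """
    it = iter(times)
    return [to_index(next(it), frame_sec) if tok == SLOT else IGNORE
            for tok in seq]


def slot_loss(logits, labels):
    """Mean cross-entropy over slot positions only."""
    total, n = 0.0, 0
    for row, y in zip(logits, labels):
        if y == IGNORE:
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
        n += 1
    return total / n


def decode(seq, logits):
    """Non-autoregressive decoding: one argmax per slot from one forward pass."""
    return [max(range(len(row)), key=row.__getitem__)
            for tok, row in zip(seq, logits) if tok == SLOT]


# Toy usage: two words with start/end times in seconds.
seq = insert_slots(["hello", "world"])
labels = build_labels(seq, [0.0, 0.4, 0.4, 0.8])
```

In a real model the `logits` rows would come from the SLLM's output at each sequence position; multiplying a decoded index by the frame duration converts it back to seconds.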
