We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time-anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to prior work that primarily focuses on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on the AMI and AliMeeting benchmarks demonstrate that our method achieves consistent reductions in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly in handling complex speech overlaps. Moreover, TagSpeech employs a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, resulting in strong performance at low computational cost.
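As a rough illustration of how interleaved time anchors can serialize "who spoke what and when" into a single SOT-style training target, consider the minimal sketch below. The token format (`<t=...>`, `<spkN>`) and the 0.1 s anchor resolution are assumptions for illustration only and are not taken from the paper's actual vocabulary or tokenizer.

```python
# Hypothetical sketch of an SOT-style target with interleaved time anchors.
# Token names (<t=...>, <spkN>) and the anchor resolution are illustrative,
# not the actual TagSpeech vocabulary.

def serialize_turns(turns, anchor_resolution=0.1):
    """Flatten speaker-attributed segments into one serialized target string,
    ordered by start time, with a time-anchor token before each segment."""
    pieces = []
    for seg in sorted(turns, key=lambda s: s["start"]):
        # Quantize the segment start time to the anchor resolution.
        start = round(seg["start"] / anchor_resolution) * anchor_resolution
        pieces.append(f"<t={start:.1f}>")          # interleaved time anchor
        pieces.append(f"<spk{seg['speaker']}>")    # speaker tag (speaker stream)
        pieces.append(seg["text"])                 # transcript (semantic stream)
    return " ".join(pieces)

turns = [
    {"speaker": 1, "start": 0.0, "text": "okay let's begin"},
    {"speaker": 2, "start": 2.4, "text": "sure, I have the slides"},
]
print(serialize_turns(turns))
# <t=0.0> <spk1> okay let's begin <t=2.4> <spk2> sure, I have the slides
```

In this view, the time anchors act as shared synchronization points: the semantic stream and the speaker stream both attach to the same anchor token, which is what enables fine-grained speaker-content alignment during decoding.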