We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
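To make the idea of Spoken Time Markers concrete, the following minimal Python sketch shows one way such markers could be interleaved with a transcript, together with a toy duration-adherence score. It is an illustration under stated assumptions, not the procedure used by TiCo: the word-level end times, the fixed marker interval, and the helper names `insert_time_markers` and `duration_reward` are all hypothetical, and the abstract does not specify how markers are placed or how the reinforcement learning reward is defined.

```python
from typing import List, Tuple


def insert_time_markers(words: List[Tuple[str, float]], interval: float = 5.0) -> str:
    """Interleave elapsed-time markers with a transcript.

    `words` holds (word, end_time_in_seconds) pairs; how these timings are
    obtained (e.g. forced alignment) is an assumption outside this sketch.
    A marker such as "<10.6 seconds>" is emitted after the first word whose
    end time reaches each multiple of `interval`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    pieces: List[str] = []
    next_mark = interval
    for word, end_time in words:
        pieces.append(word)
        if end_time >= next_mark:
            pieces.append(f"<{end_time:.1f} seconds>")
            # Skip every boundary this word has already crossed.
            while next_mark <= end_time:
                next_mark += interval
    return " ".join(pieces)


def duration_reward(actual: float, target: float) -> float:
    """Hypothetical score: 1.0 at an exact match, falling linearly to 0.0
    once the relative duration error reaches 100%."""
    if target <= 0:
        raise ValueError("target must be positive")
    return max(0.0, 1.0 - abs(actual - target) / target)


if __name__ == "__main__":
    timed_words = [("hello", 0.4), ("there", 0.9), ("friend", 1.6)]
    print(insert_time_markers(timed_words, interval=1.0))
    print(duration_reward(actual=12.0, target=15.0))
```

In this sketch the marker reports the actual end time of the word that crossed the boundary rather than the boundary itself, which mirrors the non-round example `<10.6 seconds>` given above; a fixed-grid variant would be an equally plausible design.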