In this paper, we show that when spoken language models (SLMs) are instructed at the beginning of a multi-turn conversation to speak in a specific style, they fail to maintain that style after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs and find that none of them maintains a consistent speaking style when instructed to do so. We further show that when SLMs are asked in later turns to recall the style instruction, they can recall it, yet they still fail to express the style throughout the conversation. Explicitly asking the model to recall the style instruction partially mitigates style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.