Spoken language models (SLMs) have advanced rapidly in recent years, accompanied by a growing number of evaluation benchmarks. However, most existing benchmarks emphasize task completion and capability scaling, while remaining poorly aligned with how users interact with SLMs in real-world spoken conversations. Effective spoken interaction requires not only accurate understanding of user intent and content, but also the ability to respond with appropriate interactional strategies. In this paper, we present TELEVAL, a dynamic, user-centered benchmark for evaluating SLMs in realistic Chinese spoken interaction scenarios. TELEVAL consolidates evaluation into two core aspects. Reliable Content Fulfillment assesses whether models can comprehend spoken inputs and produce semantically correct responses. Interactional Appropriateness evaluates whether models act as socially capable interlocutors, requiring them not only to generate human-like, colloquial responses, but also to implicitly incorporate paralinguistic cues for natural interaction. Experiments reveal that, despite strong performance on semantic and knowledge-oriented tasks, current SLMs still struggle to produce natural and interactionally appropriate responses, highlighting the need for more interaction-faithful evaluation.