Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (e.g., ChatGPT and Bing Copilot) and task-oriented (e.g., customer service chatbots) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but also is more interpretable, as it scores user satisfaction via learned rubrics with a detailed breakdown.
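The abstract describes scoring user satisfaction via learned rubrics with an interpretable, itemized breakdown. A minimal sketch of how such rubric-based scoring could be aggregated is shown below; the rubric items, their weights, and the aggregation scheme are illustrative assumptions for exposition, not the learned rubrics or scoring rule from the paper (in SPUR, the rubric items themselves would be produced by the iterative supervised prompting step, and an LLM would judge which items a conversation matches).

```python
# Illustrative sketch of rubric-based satisfaction scoring.
# The rubric items and weights below are hypothetical examples,
# NOT the rubrics learned by SPUR.

SAT_RUBRIC = {
    "expresses_gratitude": 1.0,      # e.g. "thanks, that worked"
    "confirms_task_success": 1.5,    # e.g. "exactly what I needed"
}
DSAT_RUBRIC = {
    "repeats_question": -1.0,        # user re-asks after an unhelpful answer
    "expresses_frustration": -1.5,   # e.g. "that's not what I asked"
}

def score_satisfaction(matched_items: dict[str, bool]) -> tuple[float, dict[str, float]]:
    """Aggregate per-rubric-item judgments (as might be returned by an
    LLM judge) into an overall score plus an interpretable breakdown."""
    rubric = {**SAT_RUBRIC, **DSAT_RUBRIC}
    breakdown = {item: w for item, w in rubric.items() if matched_items.get(item)}
    return sum(breakdown.values()), breakdown

# Mock of what an LLM judge might report for one conversation:
matched = {"expresses_gratitude": True, "repeats_question": True}
total, breakdown = score_satisfaction(matched)
# The breakdown shows *why* the score is what it is, item by item.
```

The breakdown dictionary is what makes this style of scoring interpretable: each contribution to the final score is tied to a named, human-readable rubric item rather than an opaque embedding dimension.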