The popularity of conversational digital assistants has made large amounts of conversational data available, which can be used to improve user experience and generate personalized responses. Building these assistants with popular large language models such as ChatGPT also requires additional emphasis on prompt engineering and evaluation methods. Textual similarity metrics are a key ingredient of such analysis and evaluation. Although many similarity metrics have been proposed in the literature, they have not proven effective for task-oriented conversations because they do not take advantage of unique conversational features. To address this gap, we present TaskDiff, a novel conversational similarity metric that uses different dialogue components (utterances, intents, and slots) and their distributions to compute similarity. Extensive experimental evaluation of TaskDiff on a benchmark dataset demonstrates its superior performance and improved robustness over other related approaches.
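To make the idea of comparing conversations by their components more concrete, the following is a minimal illustrative sketch, not the TaskDiff formulation itself, which the abstract does not specify. It assumes a hypothetical `Conversation` record holding utterances, intent labels, and slot names, and it swaps in simple stand-ins: token-level Jaccard distance for utterances and total variation distance between empirical label distributions for intents and slots. The equal weights and all function names are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Conversation:
    """Hypothetical container for the three dialogue components."""
    utterances: List[str]
    intents: List[str]
    slots: List[str] = field(default_factory=list)


def _distribution(items: Sequence[str]) -> Dict[str, float]:
    """Empirical probability distribution over the labels in `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def distribution_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Total variation distance between two label distributions, in [0, 1]."""
    if not a and not b:
        return 0.0  # both empty: nothing to distinguish
    if not a or not b:
        return 1.0  # one side has labels, the other has none
    p, q = _distribution(a), _distribution(b)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


def utterance_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard distance over lowercase whitespace tokens (a crude stand-in)."""
    tokens_a = {tok for utt in a for tok in utt.lower().split()}
    tokens_b = {tok for utt in b for tok in utt.lower().split()}
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def component_distance(
    c1: Conversation,
    c2: Conversation,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of utterance, intent, and slot distances (assumed weights)."""
    w_utt, w_int, w_slot = weights
    return (
        w_utt * utterance_distance(c1.utterances, c2.utterances)
        + w_int * distribution_distance(c1.intents, c2.intents)
        + w_slot * distribution_distance(c1.slots, c2.slots)
    )


if __name__ == "__main__":
    flight_paris = Conversation(["book a flight to paris"], ["book_flight"], ["destination"])
    flight_rome = Conversation(["book a flight to rome"], ["book_flight"], ["destination"])
    music = Conversation(["play some jazz"], ["play_music"], ["genre"])
    print(component_distance(flight_paris, flight_rome))  # small: only wording differs
    print(component_distance(flight_paris, music))        # large: all components differ
```

In this sketch, two conversations that share intents and slots but differ slightly in wording score as close, whereas a purely text-based metric would see only the surface difference. This is the kind of conversational signal the abstract argues generic similarity metrics miss.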