Task-oriented conversational datasets often lack topic variability and linguistic diversity. With the advent of Large Language Models (LLMs) pretrained on extensive, multilingual, and diverse text data, these limitations appear to have been overcome. Nevertheless, the generalisability of LLMs to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. Annotating each conversation with all of its constituent instances in view gives a broader perspective of the dialogue as a whole. The resulting annotated data also provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed before these models can be leveraged in a production setting.