Assessing the performance of systems that classify Multi-Party Conversations (MPCs) is challenging due to the interplay between the linguistic and structural characteristics of conversations. Conventional evaluation methods often overlook variation in model behavior across different levels of structural complexity in interaction graphs. In this work, we propose a methodological pipeline to investigate model performance across specific structural attributes of conversations. As a proof of concept, we focus on the Response Selection and Addressee Recognition tasks to diagnose model weaknesses. To this end, we extract representative diagnostic subdatasets, with a fixed number of users and high structural variety, from a large open corpus of online MPCs. We further frame our work in terms of data minimization: we avoid using original usernames to preserve privacy, and we propose alternatives to using original text messages. Results show that response selection relies more on the textual content of conversations, whereas addressee recognition requires capturing their structural dimension. Using an LLM in a zero-shot setting, we further show that sensitivity to prompt variations is task-dependent.
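To make the subdataset extraction step concrete, the sketch below shows one possible way to filter an MPC corpus down to conversations with a fixed number of users while capping each reply-graph shape, with usernames anonymized for data minimization. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the conversation schema and all helper names (`anonymize`, `reply_graph`, `structural_signature`, `extract_diagnostic_subdataset`) are hypothetical, and the sorted degree sequence is only a cheap stand-in for whatever structural descriptor the authors use.

```python
# Minimal sketch (not the paper's pipeline): build interaction graphs
# from MPCs, anonymize usernames, and keep a structurally varied
# subdataset with a fixed number of users. The conversation schema and
# all helper names here are hypothetical.
from collections import Counter

import networkx as nx


def anonymize(conversation):
    """Data minimization: replace original usernames with neutral ids."""
    mapping = {}
    for msg in conversation["messages"]:
        for key in ("speaker", "addressee"):
            user = msg.get(key)
            if user is not None:
                mapping.setdefault(user, f"U{len(mapping) + 1}")
                msg[key] = mapping[user]
    return conversation


def reply_graph(conversation):
    """Directed interaction graph: one node per user, one edge per
    (speaker -> addressee) reply."""
    g = nx.DiGraph()
    for msg in conversation["messages"]:
        g.add_node(msg["speaker"])
        if msg.get("addressee"):
            g.add_edge(msg["speaker"], msg["addressee"])
    return g


def structural_signature(g):
    """Coarse structural descriptor: the sorted degree sequence, used
    here as a cheap proxy for the graph's shape."""
    return tuple(sorted(d for _, d in g.degree()))


def extract_diagnostic_subdataset(corpus, n_users=5, per_structure=100):
    """Keep conversations with exactly `n_users` participants, capping
    each structural signature so the subdataset stays varied rather than
    dominated by the most frequent conversation shapes."""
    kept, seen = [], Counter()
    for conv in corpus:
        conv = anonymize(conv)
        g = reply_graph(conv)
        if g.number_of_nodes() != n_users:
            continue
        sig = structural_signature(g)
        if seen[sig] < per_structure:
            seen[sig] += 1
            kept.append(conv)
    return kept
```

Capping per signature (rather than sampling uniformly at random) is the design choice that gives the subdataset its structural variety: frequent conversation shapes stop accumulating once their quota is filled, so rarer interaction-graph structures remain represented.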