Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.