Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogue (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less satisfactory. In this work, we propose FnCTOD, a novel approach that solves DST with LLMs through function calling. This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning. Our experimental results demonstrate that the approach achieves exceptional performance with both modestly sized open-source and proprietary LLMs: with in-context prompting, it enables various 7B- or 13B-parameter models to surpass the previous state of the art (SOTA) achieved by ChatGPT, and it improves ChatGPT's performance, beating the previous SOTA by 5.6% average joint goal accuracy (JGA). Individual results for GPT-3.5 and GPT-4 are boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a small collection of diverse task-oriented dialogues, we can equip modestly sized models, specifically a 13B-parameter LLaMA2-Chat model, with function-calling capabilities and DST performance comparable to ChatGPT while maintaining their chat capabilities. The code is publicly available at https://github.com/facebookresearch/FnCTOD
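The function-calling framing of DST can be sketched as follows: each task domain is exposed to the LLM as a "function" whose arguments are the slots to track, and the call the model emits is parsed back into a dialogue state. This is a minimal illustrative sketch; the schema, function names, and JSON format below are assumptions for exposition, not the paper's exact prompt or output format.

```python
import json

# Hypothetical function schema for one TOD domain (illustrative only):
# the domain becomes a "function" whose parameters are the trackable slots.
HOTEL_SCHEMA = {
    "name": "find_hotel",
    "description": "Search for a hotel matching the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "description": "Area of town"},
            "price_range": {
                "type": "string",
                "enum": ["cheap", "moderate", "expensive"],
            },
            "stars": {"type": "string", "description": "Star rating"},
        },
    },
}


def parse_function_call(model_output: str) -> dict:
    """Turn an LLM's emitted function call into a dialogue-state dict.

    Assumes the model was prompted to answer with a JSON object of the
    form {"function": ..., "arguments": {...}}; only filled slots are kept.
    """
    call = json.loads(model_output)
    domain = call["function"].removeprefix("find_")  # e.g. "find_hotel" -> "hotel"
    slots = {k: v for k, v in call["arguments"].items() if v}
    return {domain: slots}


# Example: the model has tracked two slots from the conversation so far.
output = (
    '{"function": "find_hotel", '
    '"arguments": {"area": "north", "price_range": "cheap", "stars": ""}}'
)
state = parse_function_call(output)
# state == {"hotel": {"area": "north", "price_range": "cheap"}}
```

Casting DST as argument filling lets the same chat-tuned model reuse its function-calling interface for state tracking: adapting to a new domain only requires supplying a new schema in the prompt, which is what enables the zero-shot behavior described above.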