We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems. It facilitates systematic evaluation and comparison of ToD systems built by fine-tuning Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, the toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both the local utterance level and the global dialogue level, and (ii) a microservice-based backend that improves efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel at producing diverse and likeable responses. However, we also identify significant challenges for LLMs in adhering to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems, and that it will lower the currently still high entry barriers in the field.