The advent and rapid development of neural networks have revolutionized research on dialogue systems and, in turn, raised various challenges for their automatic evaluation. Automatic evaluation of open-domain dialogue systems remains an open challenge that has attracted the attention of many researchers. Despite consistent efforts to improve the correlation of automatic metrics with human evaluation, there have been very few attempts to assess their robustness across multiple domains and dimensions. Moreover, the focus has been mainly on English. These challenges motivate the development of automatic evaluation metrics that are reliable across domains, dimensions, and languages. This track of the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submissions and results for the two proposed subtasks.