The advent and rapid development of neural networks have revolutionized research on dialogue systems and, in turn, raised new challenges for their automatic evaluation. Automatic evaluation of open-domain dialogue systems remains an open challenge that has attracted the attention of many researchers. Despite consistent efforts to improve the correlation of automatic metrics with human evaluation, there have been very few attempts to assess their robustness across multiple domains and dimensions. Moreover, existing work focuses mainly on the English language. These challenges call for automatic evaluation metrics that are reliable across domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submissions and results for the two proposed subtasks.