Visually-grounded dialogue systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework makes it difficult to assess progress in this field. To this end, we propose \textbf{VDialogUE}, a \textbf{V}isually-grounded \textbf{Dialog}ue benchmark for \textbf{U}nified \textbf{E}valuation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, to provide a comprehensive assessment of a model's performance across all tasks, we develop a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process~(AHP) method. Additionally, we present a straightforward yet efficient baseline model, named \textbf{VISIT}~(\textbf{VIS}ually-grounded d\textbf{I}alog \textbf{T}ransformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialogue systems and lead to more sophisticated and effective pre-trained models.