Though Dialogue State Tracking (DST) is a core component of spoken dialogue systems, recent work on this task mostly deals with chat corpora, disregarding the discrepancies between spoken and written language. In this paper, we propose OLISIA, a cascade system which integrates an Automatic Speech Recognition (ASR) model and a DST model. We introduce several adaptations in the ASR and DST modules to improve integration and robustness to spoken conversations. With these adaptations, our system ranked first in DSTC11 Track 3, a benchmark to evaluate spoken DST. We conduct an in-depth analysis of the results and find that normalizing the ASR outputs, adapting the DST inputs through data augmentation, and increasing the size of the pre-trained models all play an important role in reducing the performance discrepancy between written and spoken conversations.