Dialogue state tracking (DST) plays a crucial role in extracting information in task-oriented dialogue systems. However, prior research has been limited to the textual modality, primarily due to the shortage of authentic human audio datasets. We address this gap by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them on our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to the audio modality, we introduce a novel metric, PhonemeF1, that captures pronunciation similarity. Experimental results show that models trained solely on synthetic datasets can generalize to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.