This work demonstrates that substantial gains in zero-shot dialogue state tracking (DST) accuracy can be achieved by increasing the diversity of training data through synthetic data generation. Because data collection is costly, current DST training resources cover only a small number of application domains and slot types, which limits how well trained models adapt to new domains. The presented work overcomes this challenge with a novel, fully automatic data generation approach for creating synthetic zero-shot DST training resources. Unlike previous approaches for generating DST data, the presented approach generates entirely new application domains and then dialogues within them, complete with silver dialogue state annotations and slot descriptions. This approach is used to create the D0T dataset for training zero-shot DST models, which covers an unprecedented 1,000+ domains. Experiments on the MultiWOZ benchmark indicate that training models on diverse synthetic data yields a performance improvement of +6.7% Joint Goal Accuracy, achieving results competitive with much larger models.
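For readers unfamiliar with the reported metric, Joint Goal Accuracy is commonly defined as the fraction of dialogue turns whose predicted state matches the gold state on every slot. The following is a minimal sketch under that common definition, not the evaluation code used in this work; the function name `joint_goal_accuracy`, the dictionary-based state representation, and the example slot names are assumptions made for illustration, and real evaluation scripts typically also normalize slot values before comparison.

```python
from typing import Dict, List

# Assumed representation: a dialogue state maps slot names to values.
State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly.

    A turn counts as correct only if every slot and value matches, so a
    single wrong or missing slot makes the whole turn incorrect.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical two-turn example: the second turn has one wrong slot value.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
]

print(joint_goal_accuracy(predicted_states, gold_states))
```

Because a turn is scored as all-or-nothing, the metric is strict: here one of the two turns is fully correct, so the score is one half even though three of the four predicted slot values across both turns are right.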