We demonstrate substantial performance gains in zero-shot dialogue state tracking (DST) by increasing the diversity of training data through synthetic data generation. Because data collection is costly, existing DST datasets cover only a small number of application domains and slot types, which restricts how well models trained on them adapt to new domains. This work addresses that limitation with a novel, fully automatic data generation approach that creates synthetic zero-shot DST datasets. Unlike previous methods, our approach can generate dialogues across a very wide range of application domains, complete with silver-standard dialogue state annotations and slot descriptions. We use this technique to create the D0T dataset for training zero-shot DST models, which covers an unprecedented 1,000+ domains. Experiments on the MultiWOZ benchmark show that training models on diverse synthetic data improves Joint Goal Accuracy by 6.7%, achieving results competitive with models 13.5 times larger than ours.
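For readers unfamiliar with the metric reported above, Joint Goal Accuracy is commonly defined as the fraction of dialogue turns in which the predicted dialogue state matches the reference state exactly, across all slots. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function names, the dictionary representation of a state, and the lowercase/whitespace normalization are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed representation: a dialogue state maps slot names to values.
DialogueState = Dict[str, str]


def normalize(state: DialogueState) -> DialogueState:
    """Lowercase and strip slots and values; drop slots with empty values.

    This normalization is an illustrative assumption, not a fixed standard.
    """
    return {
        slot.strip().lower(): value.strip().lower()
        for slot, value in state.items()
        if value.strip()
    }


def joint_goal_accuracy(
    predicted: List[DialogueState], gold: List[DialogueState]
) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have one state per turn")
    if not gold:
        return 0.0
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical three-turn dialogue; slot names are made up for the example.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted_states = [
    {"hotel-area": "North"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4"},  # misses the parking slot
]

jga = joint_goal_accuracy(predicted_states, gold_states)
```

Because a turn counts as correct only when every slot matches, a single missed or spurious slot makes the whole turn wrong, which is why the metric is strict compared with per-slot accuracy.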