Structured (dictionary-like) data is challenging for left-to-right language models, which can struggle with structured entities for reasons that include formatting and sensitivity to the order in which attributes are presented. Tabular generative models have a different set of limitations, such as a lack of flexibility. We introduce Diffusion Models of Structured Knowledge (DiSK), a new architecture and training approach specialized for structured data. DiSK handles text, categorical, and continuous numerical data using a Gaussian mixture model approach, which improves precision when dealing with numbers. It employs diffusion training to model relationships between properties. Experiments on over 15 datasets across diverse domains demonstrate state-of-the-art performance in tabular data modeling, synthesis, and imputation. DiSK provides an effective inductive bias for generative modeling and manipulation of structured data. The techniques we propose could open the door to improved knowledge manipulation in future language models.
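As an illustration of why a Gaussian mixture can help with numerical precision, the sketch below scores a scalar value under a one-dimensional mixture and returns its negative log-likelihood. The abstract does not specify how DiSK parameterizes or trains its mixture, so this is a generic textbook formulation and not the authors' implementation; the function names `gmm_log_likelihood` and `gmm_nll` and all parameter values are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, stds):
    """Log-density of a scalar x under a 1-D Gaussian mixture.

    Hypothetical helper for illustration; not taken from DiSK.
    weights must be positive and sum to 1, stds must be positive.
    """
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        # Log of the Gaussian normalizing constant 1 / (sigma * sqrt(2*pi)).
        log_norm = -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
        # Log of (mixture weight * component density at x).
        log_terms.append(math.log(w) + log_norm - 0.5 * ((x - mu) / sigma) ** 2)
    # Log-sum-exp over components for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def gmm_nll(x, weights, means, stds):
    """Negative log-likelihood, usable as a per-value training loss."""
    return -gmm_log_likelihood(x, weights, means, stds)


if __name__ == "__main__":
    # Assumed toy mixture with two modes; values are made up for the example.
    weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
    for value in (0.0, 2.5, 5.0):
        print(value, gmm_nll(value, weights, means, stds))
```

With more than one component, the mixture can place sharp modes at several typical values. A single Gaussian cannot do this, which is one plausible reading of the precision claim above.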