In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves modeling capability while reducing complexity relative to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal, joint) sequence representation, we propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations among neighboring objects, such as alignment, size, and overlap. On standard generative layout benchmarks, Dolfin notably improves performance across several metrics (FID, alignment, overlap, MaxIoU, and DocSim scores), while also enhancing transparency and interoperability. Moreover, Dolfin's applications extend beyond layout generation: it is also suitable for modeling geometric structures, such as line segments. Our experiments present both qualitative and quantitative results that demonstrate the advantages of Dolfin.