Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a designed two-stage coarse-to-fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring this knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance across multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several of the latest general-purpose LLMs. Our code, models, and dataset will be publicly released.