Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse uses such as linguistic corpus compilation. Such documents, typically delivered as PDFs, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF. To address this problem, we introduce EditTrans, a hybrid editing-generation model that identifies a queue of to-be-edited text from a PDF before generating markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of arXiv documents. In our evaluations, EditTrans reduced transformation latency by up to 44.5% compared to end-to-end decoder transformer models while maintaining transformation quality. Our code and reproducible dataset-production scripts are open-sourced.