Lack of generalizability, in which a model trained on one dataset fails to produce accurate results on a different dataset, is a known problem in document layout analysis. As a result, models trained to locate important page objects in scientific literature, such as figures, tables, captions, and math formulas, often cannot be applied successfully to new domains. Several solutions have been proposed, including newer deep learning models, larger hand-annotated datasets, and large synthetic datasets, but so far there is no "magic bullet" for transferring a model trained on a particular domain or historical time period to a new one. Here we present our ongoing work in transferring our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example to highlight some of the generalizability problems facing the document layout analysis community and to discuss several challenges and possible solutions. All code for this work is available on The Reading Time Machine GitHub repository (https://github.com/ReadingTimeMachine/htrc_short_conf).