With the rapid growth of the internet over the past decade, efficiently extracting valuable information from vast resources has become increasingly important. This is crucial for establishing a comprehensive digital ecosystem, particularly for research surveys and comprehension. These tasks rest on accurate extraction and deep mining of data from scientific documents, which is essential for building a robust data infrastructure. However, parsing raw data and extracting data from complex scientific documents remain ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. Rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning requires costly annotation of the complex content types within scientific documents. Additionally, few studies have thoroughly defined and explored the hierarchical layout within scientific documents. The lack of a comprehensive definition of a document's internal structure and elements indirectly impairs the accuracy of text classification and object recognition. By analyzing the standard layout and typesetting used in the specified publication, we propose a new document layout analysis framework called CTBR (Compartment & Text Blocks Refinement). First, we divide scientific documents into three hierarchical levels: base domain, compartment, and text blocks. Next, we conduct an in-depth exploration and classification of the meanings of text blocks. Finally, we use the text block classification results to implement object recognition within scientific documents based on rule-based compartment segmentation.
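The three-level hierarchy and the final two steps of the pipeline can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the `bbox` field, the label strings, and the regular expressions are hypothetical and are not taken from CTBR itself. The abstract does not say how text blocks are classified, so a few toy regex rules stand in for that step.

```python
from dataclasses import dataclass, field
import re


@dataclass
class TextBlock:
    text: str
    bbox: tuple  # hypothetical (x0, y0, x1, y1) page coordinates
    label: str = "unknown"


@dataclass
class Compartment:
    name: str  # hypothetical region name, e.g. "body"
    blocks: list = field(default_factory=list)


@dataclass
class BaseDomain:
    page: int
    compartments: list = field(default_factory=list)


def classify_block(block):
    """Toy stand-in for text block classification (not the paper's method)."""
    text = block.text.strip()
    if re.match(r"^(Fig\.|Figure)\s*\d+", text):
        return "figure_caption"
    if re.match(r"^Table\s*\d+", text):
        return "table_caption"
    if re.match(r"^\d+(\.\d+)*\s+[A-Z]", text):
        return "heading"
    return "paragraph"


def recognize_objects(domain):
    """Label every block, then collect caption blocks per compartment."""
    found = []
    for comp in domain.compartments:
        for block in comp.blocks:
            block.label = classify_block(block)
            if block.label.endswith("_caption"):
                found.append((comp.name, block.label, block.bbox))
    return found
```

In this sketch a caption block serves as the anchor for an object such as a figure or table inside its compartment. A fuller version would presumably combine the caption's position with the compartment boundaries to locate the object itself, but the abstract gives no details on that step.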