With the rapid development of the internet over the past decade, efficiently extracting valuable information from vast resources has become increasingly important for establishing a comprehensive digital ecosystem, particularly for research surveys and comprehension. These tasks rest on the accurate extraction and deep mining of data from scientific documents, which is essential for building a robust data infrastructure. However, parsing raw data and extracting data from complex scientific documents remain ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. Rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning requires annotating the complex content types found in scientific documents, which can also be costly. Additionally, few studies have thoroughly defined and explored the hierarchical layout of scientific documents. The lack of a comprehensive definition of a document's internal structure and elements indirectly reduces the accuracy of text classification and object recognition tasks. Starting from an analysis of the standard layout and typesetting used in a given publication, we propose a new document layout analysis framework called CTBR (Compartment & Text Blocks Refinement). First, we divide scientific documents into three hierarchical levels: base domain, compartment, and text blocks. Next, we explore and classify the meanings of text blocks in depth. Finally, we use the text block classification results to perform object recognition within scientific documents on top of rule-based compartment segmentation.
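The three-level hierarchy described above (base domain, compartment, text blocks) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the CTBR implementation: the class names, the bounding-box convention, the centre-point containment rule, and the toy prefix-based classifier are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical page coordinates: (x0, y0, x1, y1)
BBox = Tuple[float, float, float, float]


@dataclass
class TextBlock:
    text: str
    bbox: BBox
    label: str = "unknown"  # filled in by the classification step


@dataclass
class Compartment:
    name: str  # e.g. "left_column" (hypothetical name)
    bbox: BBox
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class BaseDomain:
    page: int
    compartments: List[Compartment] = field(default_factory=list)


def contains(outer: BBox, inner: BBox) -> bool:
    """True when the centre of `inner` lies inside `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def assign_blocks(domain: BaseDomain, blocks: List[TextBlock]) -> None:
    """Assumed rule: put each block in the first compartment containing its centre."""
    for block in blocks:
        for comp in domain.compartments:
            if contains(comp.bbox, block.bbox):
                comp.blocks.append(block)
                break


def classify(block: TextBlock) -> str:
    """Toy stand-in for text block classification (not the paper's classifier)."""
    lowered = block.text.lower()
    if lowered.startswith(("figure", "fig.")):
        return "figure_caption"
    if lowered.startswith("table"):
        return "table_caption"
    return "body"


# Usage: a two-column page with one caption and one body paragraph.
page = BaseDomain(page=1, compartments=[
    Compartment("left_column", (0, 0, 300, 800)),
    Compartment("right_column", (300, 0, 600, 800)),
])
blocks = [
    TextBlock("Figure 1. Overview", (320, 100, 580, 120)),
    TextBlock("We propose a layout analysis framework.", (20, 100, 280, 200)),
]
assign_blocks(page, blocks)
for b in blocks:
    b.label = classify(b)
```

In this sketch the compartment boundaries are fixed by hand, which mirrors the idea of deriving them from a publication's standard layout; a real system would presumably obtain them from the typesetting rules of the target venue.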