With the widespread use of the internet, it has become increasingly important to efficiently extract specific information from large collections of academic articles. Data mining techniques are generally used for this task. However, mining academic articles is challenging because it requires automatically extracting specific patterns from documents with complex, unstructured layouts. Current data mining methods for academic articles employ either rule-based (RB) or machine learning (ML) approaches. Rule-based methods incur a high coding cost for articles with complex typesetting. Pure machine learning methods, on the other hand, require costly annotation of the complex content types within a paper, and they can wrongly extract patterns that rule-based methods would recognize easily. To overcome these issues, we analyze the standard layout and typesetting of the specified publication and apply specific methods to specific characteristics of its articles. We have developed the Text Block Refinement Framework (TBRF), a novel hybrid of machine learning and rule-based schemes. We validated it on articles from the well-known ACL proceedings. The experiments show that our approach achieved over 95% classification accuracy and 90% detection accuracy for tables and figures.
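The abstract does not describe how TBRF combines its two components, so the following is only a minimal sketch of the general hybrid idea: cheap rules label the text blocks they can recognize with certainty, and everything else falls through to a learned classifier. All names here (`TextBlock`, `rule_label`, `classify`, `toy_model`), the caption pattern, the label set, and the font-size feature are assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TextBlock:
    """A block of text extracted from a page (hypothetical structure)."""
    text: str
    font_size: float  # assumed layout feature; not taken from the paper


# Assumed rule: captions start with "Table N:" / "Figure N." or similar.
CAPTION_RE = re.compile(r"^(Table|Figure)\s+\d+\s*[:.]")


def rule_label(block: TextBlock) -> Optional[str]:
    """Return a label if a hand-written rule fires, else None."""
    stripped = block.text.strip()
    match = CAPTION_RE.match(stripped)
    if match:
        return "table_caption" if match.group(1) == "Table" else "figure_caption"
    if re.fullmatch(r"\d+", stripped):
        return "page_number"
    return None


def classify(block: TextBlock, ml_fallback: Callable[[TextBlock], str]) -> str:
    """Apply rules first; defer to the learned model only when no rule fires."""
    label = rule_label(block)
    return label if label is not None else ml_fallback(block)


def toy_model(block: TextBlock) -> str:
    """Stand-in for a trained classifier (an assumption for this sketch)."""
    return "heading" if block.font_size >= 12.0 else "body"


if __name__ == "__main__":
    blocks = [
        TextBlock("Table 1: Results on the test set", 9.0),
        TextBlock("1 Introduction", 12.0),
        TextBlock("We propose a hybrid framework.", 10.0),
    ]
    for b in blocks:
        print(classify(b, toy_model), "|", b.text)
```

Putting the rules ahead of the model reflects the motivation stated above: patterns that a rule recognizes easily never reach the classifier, so they cannot be mislabeled by it, and annotation effort can be reserved for the block types that rules handle poorly.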