With the widespread use of the internet, efficiently extracting specific information from the vast body of academic articles has become increasingly important. Data mining techniques are generally employed for this task. However, data mining for academic articles is challenging because it requires automatically extracting specific patterns from documents with complex, unstructured layouts. Current data mining methods for academic articles employ rule-based (RB) or machine learning (ML) approaches. Rule-based methods incur a high coding cost for articles with complex typesetting. Purely machine learning methods, on the other hand, require costly annotation of the complex content types within a paper. Furthermore, relying on machine learning alone can lead to incorrect extraction of patterns that rule-based methods recognize easily. To overcome these issues, we analyze the standard layout and typesetting of the target publication and apply specific methods to specific characteristics of its articles. We have developed a novel Text Block Refinement Framework (TBRF), a hybrid of machine learning and rule-based schemes. We validated it on the well-known ACL proceedings articles. The experiments show that our approach achieves over 95% classification accuracy and 90% detection accuracy for tables and figures.
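To make the hybrid idea concrete, the following is a minimal sketch of a rule-first, ML-fallback text block classifier. It is not the TBRF implementation: the `TextBlock` type, the label names, the regular expressions, and the `toy_model` stand-in are all hypothetical, chosen only to illustrate how patterns that are easy to capture with rules might be handled before a learned classifier is consulted.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TextBlock:
    """A text block with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed layout conventions: captions start with "Figure N:" or "Table N:",
# and section headings start with a dotted number followed by a capital letter.
CAPTION_RE = re.compile(r"^(Figure|Table)\s+\d+\s*[:.]")
SECTION_RE = re.compile(r"^\d+(\.\d+)*\s+[A-Z]")


def rule_label(block: TextBlock) -> Optional[str]:
    """Return a label if a hand-written rule fires, otherwise None."""
    first_line = block.text.strip().split("\n", 1)[0]
    match = CAPTION_RE.match(first_line)
    if match:
        return "figure_caption" if match.group(1) == "Figure" else "table_caption"
    if SECTION_RE.match(first_line) and len(first_line) < 80:
        return "section_heading"
    return None


def classify(
    block: TextBlock, ml_fallback: Callable[[TextBlock], str]
) -> Tuple[str, str]:
    """Try the rules first; defer to the learned model only when no rule fires."""
    label = rule_label(block)
    if label is not None:
        return label, "rule"
    return ml_fallback(block), "ml"


def toy_model(block: TextBlock) -> str:
    """Stand-in for a trained classifier (assumption: always predicts body text)."""
    return "body_text"


if __name__ == "__main__":
    blocks = [
        TextBlock("Table 2: Results on the dev set.", 72, 100, 290, 112),
        TextBlock("3.1 Data Preparation", 72, 130, 290, 142),
        TextBlock("We train the model for 10 epochs.", 72, 150, 290, 200),
    ]
    for b in blocks:
        print(classify(b, toy_model))
```

In this arrangement the learned model only sees blocks that no rule claims, which is one way to limit both the rule-coding effort and the annotation effort; other orderings, such as using rules to correct the model's output, are equally plausible.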