Because of the large number of scientific publications released every day, manually reviewing each one is impossible, so automatic extraction of key information is desirable. In this paper, we examine STEREO, a tool that extracts statistics from scientific papers using regular expressions. By adapting an existing regular expression inclusion algorithm to our use case, we reduce the number of regular expressions used in STEREO by about $33.8\%$. From the condensed rule set, we identify common patterns that can guide the creation of new rules. We also apply STEREO, which was previously trained in the life-sciences and medical domain, to a new scientific domain, namely Human-Computer Interaction (HCI), and re-evaluate it. Our results indicate that statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of APA-conformant statistics was found in the HCI domain. Additionally, we compare extraction from PDF and LaTeX source files and find LaTeX to be more reliable for extraction.
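
To make the idea of regular-expression-based statistics extraction concrete, the following minimal Python sketch matches a single APA-style t-test report such as "t(28) = 2.31, p < .05". The pattern `APA_T_TEST` and the helper `extract_t_tests` are hypothetical names introduced purely for illustration. They are not taken from STEREO's rule set, and real rules would likely need to cover many more notational variants.

```python
import re

# Hypothetical pattern for one APA-style t-test report, e.g. "t(28) = 2.31, p < .05".
# Illustration only; this is NOT one of STEREO's actual rules.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # degrees of freedom in parentheses
    r"\s*=\s*(?P<t>-?\d*\.?\d+)"               # t value, optionally negative
    r"\s*,\s*p\s*(?P<rel>[<=>])"               # relation between p and the threshold
    r"\s*(?P<p>\d*\.\d+)"                      # p value, leading zero optional
)


def extract_t_tests(text):
    """Return one dict of named groups per APA-style t-test found in text."""
    return [m.groupdict() for m in APA_T_TEST.finditer(text)]


if __name__ == "__main__":
    sample = "The effect was significant, t(28) = 2.31, p < .05."
    for hit in extract_t_tests(sample):
        print(hit)
```

Named groups keep the extracted values (degrees of freedom, test statistic, p value) addressable by role rather than by position, which is one plausible way to turn matches into structured records.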