We present ACL OCL, a scholarly corpus built from the ACL Anthology to support open scientific research in the computational linguistics domain. Compared with the previous ARC and AAN versions, ACL OCL includes structured full texts with logical sections, references to figures, and links to a large knowledge resource (Semantic Scholar). ACL OCL contains 74k scientific papers and 210k extracted figures, covering material up to September 2022. To observe how the computational linguistics domain has developed, we detect the topics of all OCL papers with a supervised neural model. We observe that the "Syntax: Tagging, Chunking and Parsing" topic is shrinking significantly, while "Natural Language Generation" is resurging. Our dataset is open and available to download from HuggingFace at https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus.
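The topic trend described above (one topic shrinking, another resurging) can be read as a change in each topic's share of papers per year. The sketch below is only an illustration of that computation, assuming each paper has already been assigned a single topic label and a publication year. The function name `topic_share_by_year` and the toy records are hypothetical; they are not taken from ACL OCL, and this is not the supervised neural model used for topic detection.

```python
from collections import Counter, defaultdict


def topic_share_by_year(records):
    """Return {year: {topic: fraction of that year's papers}}.

    `records` is an iterable of (year, topic) pairs, one per paper.
    """
    counts = defaultdict(Counter)
    for year, topic in records:
        counts[year][topic] += 1

    shares = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[year] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Toy, made-up records (NOT drawn from ACL OCL), purely to show the computation.
toy_records = [
    (2010, "Syntax: Tagging, Chunking and Parsing"),
    (2010, "Syntax: Tagging, Chunking and Parsing"),
    (2010, "Natural Language Generation"),
    (2010, "Other"),
    (2022, "Syntax: Tagging, Chunking and Parsing"),
    (2022, "Natural Language Generation"),
    (2022, "Natural Language Generation"),
    (2022, "Other"),
]

shares = topic_share_by_year(toy_records)
for year in sorted(shares):
    print(year, shares[year])
```

Comparing a topic's share across years, rather than its raw paper count, keeps the trend from being dominated by the overall growth in the number of papers.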