Extracting the main points, key insights, and other important information from scientific papers, referred to here as aspects, could facilitate scientific literature reviews. The aim of our research is therefore to create a tool for automatic aspect extraction from Russian-language scientific texts of any domain. In this paper, we present a cross-domain dataset of scientific texts in Russian, annotated with aspects such as Task, Contribution, Method, and Conclusion, as well as a baseline algorithm for aspect extraction based on the multilingual BERT model fine-tuned on our data. We find that aspect representation differs somewhat across domains; nevertheless, although our model was trained on a limited number of scientific domains, our cross-domain experiments show that it is able to generalize to new domains. The code and the dataset are available at \url{https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts}.