This paper describes the creation of a multimodal dataset of Russian-language scientific papers and the evaluation of existing language models on the task of automatic text summarization. The dataset's distinguishing feature is that it is multimodal: it includes texts, tables, and figures. The paper reports the results of experiments with two language models: GigaChat from Sber and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at https://github.com/iis-research-team/summarization-dataset.