Training summarization models requires substantial amounts of training data. However, for less-resourced languages such as Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus that undergo thorough cleaning, preprocessing, and deduplication. In addition to the abstractive summarization data, we generate sentence-level labels for extractive summarization using sentence similarity. We train baseline models for both extractive and abstractive summarization on the collected dataset. To demonstrate the effectiveness of the trained models, we perform both quantitative and qualitative evaluation. Our dataset, models, and code are publicly available, encouraging replication, further research, and real-world applications across various domains.
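As an illustration of how sentence-level extractive labels can be derived from sentence similarity, the following is a minimal sketch and not the labeling procedure used for HunSum-2. The abstract does not specify the similarity measure, so bag-of-words cosine similarity is used here purely as a stand-in assumption; the function names (`tokenize`, `cosine`, `extractive_labels`) and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w is Unicode-aware in Python 3,
    # so accented Hungarian letters are kept inside tokens.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def extractive_labels(article_sentences, summary_sentences):
    # Assumption: for each summary sentence, the single most similar
    # article sentence is labeled 1; all other sentences stay 0.
    labels = [0] * len(article_sentences)
    article_vecs = [Counter(tokenize(s)) for s in article_sentences]
    for summary_sentence in summary_sentences:
        summary_vec = Counter(tokenize(summary_sentence))
        scores = [cosine(summary_vec, vec) for vec in article_vecs]
        if scores and max(scores) > 0:
            labels[scores.index(max(scores))] = 1
    return labels


# Hypothetical toy example (not taken from the dataset).
article = [
    "A kormány új törvényt fogadott el.",
    "Az időjárás holnap napos lesz.",
    "A törvény januárban lép hatályba.",
]
summary = ["Új törvényt fogadott el a kormány."]
print(extractive_labels(article, summary))  # → [1, 0, 0]
```

In practice, a learned sentence-embedding similarity could replace the bag-of-words cosine without changing the labeling loop; only the `cosine` inputs would differ.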