This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. Recent neural network models have shown strong performance in extracting information from semi-structured web pages. However, these models are predominantly applied to domains like e-commerce and are pre-trained on English data, which complicates their application to web pages in other languages. We prepared a multilingual dataset of 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic) from 161 websites. The dataset is publicly available on GitHub. We fine-tuned the pre-trained state-of-the-art model MarkupLM to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality. Additionally, we pre-trained another state-of-the-art model, DOM-LM, on multilingual data and fine-tuned it on our dataset. We compared both fine-tuned models to existing open-source news data extraction tools; both achieved superior extraction quality.