Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier to continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items, encompassing two tasks: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
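To make the two tasks and the reported scale concrete, the following is a minimal illustrative sketch, not the dataset's actual schema or any model's method. The `Item` class, the `segment` helper, the attribute key `"price"`, and the toy page content are all hypothetical; only the three totals come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One product on a list page (hypothetical structure for illustration)."""
    nodes: list[str]                                            # text nodes belonging to this product
    attributes: dict[str, str] = field(default_factory=dict)    # task (2) output


def segment(nodes: list[str], boundaries: list[int]) -> list[Item]:
    """Task (1): split a page's text nodes into items.

    Each boundary is assumed to be the index of the first node of an item.
    Nodes before the first boundary (e.g. a page intro) belong to no item.
    """
    starts = sorted(boundaries)
    ends = starts[1:] + [len(nodes)]
    return [Item(nodes[s:e]) for s, e in zip(starts, ends)]


# Toy page: an intro followed by two products (made-up content).
page = ["Best blenders", "Blender A", "$30", "Blender B", "$45"]
items = segment(page, boundaries=[1, 3])

# Task (2), faked here by position; a real model would predict this per node.
for item in items:
    item.attributes["price"] = item.nodes[-1]

# Averages implied by the totals reported in the abstract.
NUM_ITEMS, NUM_PAGES, NUM_ATTRIBUTES = 52_898, 6_694, 156_014
items_per_page = NUM_ITEMS / NUM_PAGES            # roughly 7.9
attributes_per_item = NUM_ATTRIBUTES / NUM_ITEMS  # roughly 2.95
```

Under these assumptions, the reported totals work out to about eight items per page and about three annotated attributes per item.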