Web automation has the potential to change how users interact with the digital world by assisting them and simplifying tasks through computational methods. Central to this is the web element nomination task, which entails identifying unique elements on webpages. The development of algorithms for web automation, however, is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Klarna Product Page Dataset, a diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features 51,701 manually labeled product pages from 8,175 e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Klarna Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three contributions. First, we find that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods. Second, we introduce a training refinement procedure in which the aforementioned GCN identifies a small number of relevant elements on each page, and these elements are then passed to a large language model for the final nomination. This procedure improves nomination accuracy by 16.8 percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field, the abundance of training methodologies suitable for element nomination, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
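The two-stage nomination described above (a GCN shortlists a few elements, then a large language model makes the final choice) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-layer architecture, the untrained weight matrices, and the `pick_with_llm` callable standing in for the language model are all assumptions made for illustration, and only NumPy is used.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph (standard GCN propagation)."""
    a = np.eye(num_nodes)  # self-loops keep each node's own features
    for parent, child in edges:
        a[parent, child] = 1.0
        a[child, parent] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_scores(features, edges, w_hidden, w_out):
    """Two graph-convolution layers that produce one relevance score per element."""
    a_hat = normalized_adjacency(edges, features.shape[0])
    hidden = np.maximum(a_hat @ features @ w_hidden, 0.0)  # ReLU
    return (a_hat @ hidden @ w_out).ravel()


def shortlist(scores, k):
    """Indices of the k highest-scoring elements, best first."""
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()


def nominate(features, edges, w_hidden, w_out, k, pick_with_llm):
    """Stage 1: GCN shortlist. Stage 2: a hypothetical LLM callable picks one candidate."""
    candidates = shortlist(gcn_scores(features, edges, w_hidden, w_out), k)
    return pick_with_llm(candidates)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy DOM tree with 5 elements: node 0 is the root, given as (parent, child) pairs.
    dom_edges = [(0, 1), (0, 2), (1, 3), (1, 4)]
    x = rng.normal(size=(5, 8))        # assumed per-element feature vectors
    w1 = rng.normal(size=(8, 16))      # untrained weights, for illustration only
    w2 = rng.normal(size=(16, 1))
    # Stand-in for the LLM step: simply take the top-ranked candidate.
    print(nominate(x, dom_edges, w1, w2, k=3, pick_with_llm=lambda c: c[0]))
```

In this sketch the shortlist size `k` controls how many elements would be serialized into the language model's prompt, which is one plausible reason a shortlist step helps: the model only has to choose among a handful of candidates rather than every element on the page.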