Generalized Entity Matching (GEM), which aims to judge whether two records represented in different formats refer to the same real-world entity, is an essential task in data management. The prompt tuning paradigm for pre-trained language models (PLMs), including the recent PromptEM model, effectively addresses low-resource GEM in practical applications, offering a robust solution when labeled data is scarce. However, existing prompt tuning models for GEM still face two challenges: prompt design and the information gap. This paper introduces an augmented prompt tuning framework to address these challenges, comprising two main improvements. The first is an augmented contextualized soft token-based prompt tuning method that extracts a guiding soft token to benefit the PLMs' prompt tuning, and the second is a cost-effective information augmentation strategy leveraging large language models (LLMs). Our approach performs well on low-resource GEM tasks. Extensive experiments show that our basic model, without information augmentation, outperforms existing methods based on moderate-size PLMs by more than 5.24% on average, and that our model with information augmentation achieves performance comparable to fine-tuned LLMs at less than 14% of the API cost.
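To make the soft-token idea concrete, the sketch below is a minimal PyTorch/Transformers illustration, not the paper's actual architecture: it prepends learnable soft-token embeddings to the serialized record pair and trains only those tokens plus a small classification head on top of a frozen PLM. The class name SoftPromptMatcher, the roberta-base backbone, the token count n_soft, and the binary head are all illustrative assumptions.

```python
# Minimal sketch of soft-token prompt tuning for entity matching.
# Assumptions: SoftPromptMatcher, roberta-base, n_soft, and the binary
# head are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SoftPromptMatcher(nn.Module):
    def __init__(self, plm_name="roberta-base", n_soft=10):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        for p in self.plm.parameters():  # freeze the PLM; only soft tokens and head train
            p.requires_grad = False
        hidden = self.plm.config.hidden_size
        # learnable soft tokens prepended to every serialized record pair
        self.soft = nn.Parameter(torch.randn(n_soft, hidden) * 0.02)
        self.head = nn.Linear(hidden, 2)  # match / non-match

    def forward(self, input_ids, attention_mask):
        tok_emb = self.plm.get_input_embeddings()(input_ids)
        b = input_ids.size(0)
        soft = self.soft.unsqueeze(0).expand(b, -1, -1)
        emb = torch.cat([soft, tok_emb], dim=1)
        # extend the attention mask to cover the prepended soft tokens
        soft_mask = torch.ones(
            b, soft.size(1), dtype=attention_mask.dtype, device=attention_mask.device
        )
        mask = torch.cat([soft_mask, attention_mask], dim=1)
        out = self.plm(inputs_embeds=emb, attention_mask=mask).last_hidden_state
        return self.head(out[:, 0])  # classify from the first soft-token position

# Usage: serialize the two records as a text pair and score the match.
tok = AutoTokenizer.from_pretrained("roberta-base")
pair = tok("name: iPhone 12", "title: Apple iPhone 12 64GB",
           return_tensors="pt", truncation=True)
model = SoftPromptMatcher()
logits = model(pair["input_ids"], pair["attention_mask"])
```

Because only the soft tokens and the head receive gradients, this style of tuning keeps the trainable parameter count small, which is what makes it attractive in the low-resource setting the abstract describes.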