Column type annotation is vital for tasks such as data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether an existing pre-trained LM-based model can be adapted to a new (i.e., target) data lake so as to minimize the annotations required on the new data lake. However, this adaptation faces several challenges: bridging the source-target knowledge gap, selecting informative target data for annotation, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
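To illustrate the general idea behind cluster-based data selection (not LakeHopper's actual implementation), the sketch below embeds each unannotated column, clusters the embeddings, and selects the column nearest each centroid for annotation, so that the labeled budget covers diverse regions of the target data lake. The toy character-frequency embedding and the `select_for_annotation` helper are hypothetical stand-ins; the framework described above would use pre-trained LM representations instead.

```python
# Hedged sketch of cluster-based selection of unannotated columns.
# Assumption: each column is a list of cell value strings; embeddings
# here are toy letter-frequency vectors, not LM representations.
import math
import random

def embed(column_values):
    """Toy embedding: L2-normalized letter-frequency vector (26 dims)."""
    vec = [0.0] * 26
    for value in column_values:
        for ch in value.lower():
            if 'a' <= ch <= 'z':
                vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster went empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def select_for_annotation(columns, k):
    """Pick one representative column index per cluster to annotate."""
    embs = [embed(vals) for vals in columns]
    centroids = kmeans(embs, k)
    picks = {min(range(len(embs)), key=lambda i: dist2(embs[i], c))
             for c in centroids}
    return sorted(picks)

# Usage on three toy columns: two name-like, one city-like.
columns = [
    ["alice", "bob", "carol"],
    ["dave", "erin", "frank"],
    ["nyc", "london", "tokyo"],
]
picks = select_for_annotation(columns, k=2)
```

The representatives in `picks` would then be sent for annotation, giving the fine-tuning step a small but diverse labeled sample of the target data lake.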