Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; they require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when the types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a state-of-the-art DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
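To make the four stages named in the abstract concrete (context sampling, prompt serialization, model querying, and label remapping), the following is a minimal illustrative sketch in Python. It is not the released ArcheType implementation: the function names, the prompt wording, the uniform sampling strategy, the string-similarity remapping via `difflib`, and the `query_model` callable are all assumptions chosen for illustration.

```python
import difflib
import random
from typing import Callable, Sequence


def sample_context(column: Sequence[str], k: int = 5, seed: int = 0) -> list[str]:
    """Context sampling (assumed strategy): up to k distinct non-empty values."""
    unique = sorted({v.strip() for v in column if v and v.strip()})
    rng = random.Random(seed)
    return unique if len(unique) <= k else rng.sample(unique, k)


def serialize_prompt(samples: Sequence[str], labels: Sequence[str]) -> str:
    """Prompt serialization (hypothetical wording, not the paper's prompt)."""
    return (
        "Column values: " + ", ".join(samples) + "\n"
        "Choose the best type from: " + ", ".join(labels) + "\n"
        "Answer with the type only."
    )


def remap_label(raw: str, labels: Sequence[str], fallback: str) -> str:
    """Label remapping (assumed strategy): map free-form output onto the label set."""
    cleaned = raw.strip().lower()
    lowered = {label.lower(): label for label in labels}
    if cleaned in lowered:
        return lowered[cleaned]
    # Fall back to the closest label by string similarity, if any is close enough.
    close = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=0.6)
    return lowered[close[0]] if close else fallback


def annotate_column(
    column: Sequence[str],
    labels: Sequence[str],
    query_model: Callable[[str], str],  # hypothetical: any function from prompt to text
    fallback: str = "unknown",
) -> str:
    """Run the four stages end to end for a single column."""
    samples = sample_context(column)
    prompt = serialize_prompt(samples, labels)
    raw_answer = query_model(prompt)  # model querying
    return remap_label(raw_answer, labels, fallback)
```

Because `query_model` is an ordinary callable, the sketch can be exercised with a stub in place of a real language model; the label set is supplied at call time rather than fixed at training time, which is the property the zero-shot setting relies on.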