StraTyper: Automated Semantic Type Discovery and Multi-Type Annotation for Dataset Collections

Understanding dataset semantics is crucial for effective search, discovery, and integration pipelines. To this end, column type annotation (CTA) methods associate columns of tabular datasets with semantic types that accurately describe their contents, using pre-trained deep learning models or Large Language Models (LLMs). However, existing approaches require users to specify a closed set of semantic types either at training or inference time, hindering their application to domain-specific datasets where pre-defined labels often lack adequate coverage and specificity. Furthermore, real-world datasets frequently contain columns with values belonging to multiple semantic types, violating the single-type assumption of existing CTA methods. While proprietary LLMs have shown effectiveness for CTA, they incur high monetary costs and produce inconsistent outputs for similar columns, leading to type redundancy that negatively affects downstream applications. To address these challenges, we introduce StraTyper, a cost-effective method for column type discovery (CTD) and column multi-type annotation (CMTA) in dataset collections. StraTyper eliminates the need for pre-defined semantic labels by systematically employing LLMs to discover types tailored to the dataset collection at hand. Through strategic column clustering, controlled type generation, and iterative cascading discovery, StraTyper balances type precision with annotation coverage while minimizing LLM costs. Our experimental evaluation, both manual and LLM-assisted, on real-world benchmarks demonstrates that StraTyper discovers accurate types for both numerical and non-numerical data, achieves substantial cost savings compared to commercial LLMs, and effectively handles multi-typed columns. We further show that StraTyper's annotations improve downstream tasks, including join discovery and schema matching, outperforming LLM-only baselines.
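To make the cost argument concrete, the following is a minimal sketch of the general cluster-then-annotate idea: group similar columns first, then request one type per group instead of one per column. It is not StraTyper's actual algorithm, which the abstract does not specify. The Jaccard-overlap clustering, the `threshold` value, and the `propose_type` stub (a stand-in for an LLM call) are all assumptions made purely for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between the value sets of two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_columns(columns, threshold=0.5):
    """Greedy single-pass clustering (illustrative assumption).

    A column joins the first cluster whose representative (its first
    member) overlaps with it by at least `threshold`; otherwise it
    starts a new cluster.
    """
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            if jaccard(values, columns[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def propose_type(values):
    """Hypothetical stand-in for an LLM type-generation call."""
    if all(v.isdigit() and len(v) == 4 for v in values):
        return "year"
    return "city"


def annotate(columns, type_fn, threshold=0.5):
    """Call `type_fn` once per cluster and share the label across members."""
    annotations, calls = {}, 0
    for cluster in cluster_columns(columns, threshold):
        label = type_fn(columns[cluster[0]])
        calls += 1
        for name in cluster:
            annotations[name] = label
    return annotations, calls


# Toy collection: two city-like columns from different tables, one year column.
columns = {
    "t1.city": ["Paris", "Rome", "Oslo"],
    "t2.town": ["Rome", "Oslo", "Paris", "Bern"],
    "t3.year": ["1999", "2004", "2010"],
}
annotations, calls = annotate(columns, propose_type)
```

In this toy collection the two city-like columns fall into one cluster, so the stub is invoked twice for three columns and both columns receive the same label, which also avoids the near-duplicate types that per-column prompting can produce.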
