Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available; these can be applied to semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this benchmark performance and the applicability of these models in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak supervision to adapt a hybrid type predictor to new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a lightweight machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the F1 score improves for both new and existing types. AdaTyper approaches an average precision of 0.6 after seeing only 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.
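To make the idea of a hybrid predictor with few-shot adaptation concrete, the following is a minimal sketch and not AdaTyper's actual implementation. Everything in it is an assumption made for illustration: the `RULES` patterns, the four hand-crafted features in `featurize`, the nearest-centroid fallback standing in for the lightweight model, and the `adapt` method, which approximates weak supervision by assigning the new type to unlabeled columns whose features closely resemble the user-provided examples.

```python
import math
import re
from collections import defaultdict

# Hypothetical rule set: a type fires when most values match its pattern.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.I),
    "year": re.compile(r"^(19|20)\d{2}$"),
}


def featurize(values):
    """Map a column to a small numeric vector (illustrative features only)."""
    vals = [str(v) for v in values] or [""]
    n = len(vals)
    chars = sum(len(v) for v in vals) or 1
    return [
        sum(c.isdigit() for v in vals for c in v) / chars,  # digit ratio
        sum(c.isalpha() for v in vals for c in v) / chars,  # letter ratio
        min(sum(len(v) for v in vals) / n / 20.0, 1.0),     # scaled mean length
        len(set(vals)) / n,                                 # distinct-value ratio
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HybridTyper:
    """Rules first, then a nearest-centroid fallback (stand-in for an ML model)."""

    def __init__(self, rule_threshold=0.8):
        self.rule_threshold = rule_threshold
        self.examples = defaultdict(list)  # type name -> list of feature vectors

    def fit(self, labeled_columns):
        """Store feature vectors for (values, type) pairs."""
        for values, label in labeled_columns:
            self.examples[label].append(featurize(values))

    def _centroid(self, label):
        vecs = self.examples[label]
        return [sum(dim) / len(vecs) for dim in zip(*vecs)]

    def predict(self, values):
        vals = [str(v) for v in values]
        # Step 1: a rule wins if enough values match its pattern.
        for label, pattern in RULES.items():
            if vals:
                hits = sum(bool(pattern.match(v)) for v in vals)
                if hits / len(vals) >= self.rule_threshold:
                    return label
        # Step 2: otherwise fall back to the most similar type centroid.
        if not self.examples:
            return None
        vec = featurize(vals)
        return max(self.examples, key=lambda t: cosine(vec, self._centroid(t)))

    def adapt(self, new_type, seed_columns, unlabeled_columns, min_sim=0.98):
        """Add a type from a few seed columns plus weakly labeled look-alikes.

        Returns the number of unlabeled columns that received a weak label.
        """
        seeds = [featurize(col) for col in seed_columns]
        self.examples[new_type].extend(seeds)
        added = 0
        for col in unlabeled_columns:
            vec = featurize(col)
            if max(cosine(vec, s) for s in seeds) >= min_sim:
                self.examples[new_type].append(vec)  # weak label, not verified
                added += 1
        return added
```

In this sketch, a call such as `typer.adapt("product_code", seed_columns, unlabeled_columns)` extends the predictor at inference time without retraining anything from scratch. The similarity threshold `min_sim` controls the trade-off between how many weak labels are generated and how noisy they are.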