Data profilers play a crucial role in the preprocessing phase of data analysis by identifying quality issues such as missing, extreme, or erroneous values. Traditionally, profilers have relied solely on statistical methods, which produce many false positives and false negatives. For example, a purely statistical profiler may flag missing values as errors even when the data's semantic context makes those absences expected and normal. To address these limitations, we introduce Cocoon, a data profiling system that integrates large language models (LLMs) to imbue statistical profiling with semantics. Cocoon augments traditional profiling with a three-step process: Semantic Context, Semantic Profile, and Semantic Review. Our user studies on real-world datasets show that Cocoon is highly effective at discerning whether an anomaly is a genuine error requiring correction or an acceptable variation given the data's semantics.
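To make the missing-value example concrete, the following is a minimal sketch of how a statistical finding might be revisited in light of semantic context. It is not Cocoon's implementation: the names `MissingFinding`, `statistical_missing_profile`, `semantic_review`, and `stub_judge` are hypothetical, the 20% threshold is an arbitrary assumption, and the LLM call is replaced by a hand-written rule so that the example runs on its own.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class MissingFinding:
    """Result of profiling one column for missing values (hypothetical type)."""
    column: str
    missing_rate: float
    flagged: bool
    reason: str


def statistical_missing_profile(
    column: str, values: Sequence[Optional[str]], threshold: float = 0.2
) -> MissingFinding:
    """Purely statistical check: flag the column if too many values are None."""
    rate = sum(v is None for v in values) / len(values) if values else 0.0
    flagged = rate > threshold
    reason = "missing rate above threshold" if flagged else "missing rate acceptable"
    return MissingFinding(column, rate, flagged, reason)


def semantic_review(
    finding: MissingFinding,
    context: str,
    is_expected: Callable[[str, str], bool],
) -> MissingFinding:
    """Clear a statistical flag when the judge deems the absence expected.

    `is_expected` stands in for a semantic judge such as an LLM (assumption).
    """
    if finding.flagged and is_expected(finding.column, context):
        return replace(finding, flagged=False, reason="missing values expected given context")
    return finding


def stub_judge(column: str, context: str) -> bool:
    """Rule-based placeholder for an LLM judgment (assumption, not a real model)."""
    return column == "termination_date" and "current employees" in context


# Illustrative data: employees who are still employed have no termination date.
values = ["2020-01-31", None, None, "2021-06-30", None]
context = "HR table that includes current employees"

raw = statistical_missing_profile("termination_date", values)
reviewed = semantic_review(raw, context, stub_judge)
print(raw)
print(reviewed)
```

In this sketch the statistical pass flags `termination_date` because most of its values are missing, and the review step clears the flag once the context indicates that current employees are expected to have no termination date. Replacing `stub_judge` with a real model call would be the natural extension, at the cost of making the outcome depend on that model's judgment.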