Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information, and they inform future research efforts targeted at NLP-enhanced database tuning and data profiling.
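To make the prediction task concrete, the following minimal sketch shows one way ground-truth labels for column pairs could be derived from the data itself. It is an illustration under stated assumptions, not the benchmark's actual construction: Pearson correlation stands in for the "various correlation metrics" mentioned above, and the function names, the threshold of 0.9, and the sample column names are all hypothetical. A language model would then see only the two column names and try to predict the resulting label.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / sqrt(var_x * var_y)


def label_column_pairs(table, threshold=0.9):
    """Return (name_a, name_b, is_correlated) for every pair of columns.

    `table` maps column names to lists of numbers. The threshold is a
    hypothetical choice for illustration, not a value from the paper.
    """
    labeled = []
    for name_a, name_b in combinations(table, 2):
        r = pearson(table[name_a], table[name_b])
        labeled.append((name_a, name_b, abs(r) >= threshold))
    return labeled


# Hypothetical toy table; real inputs would be columns of a Kaggle data set.
table = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "zip_code": [3.0, 1.0, 4.0, 1.0],
}
pairs = label_column_pairs(table)
for name_a, name_b, is_correlated in pairs:
    print(name_a, name_b, is_correlated)
```

In this toy table, only the pair `height_cm` / `height_in` is labeled as correlated, which matches what a reader would guess from the names alone. That intuition is what the study tests at scale.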