Missing data is one of the central problems to address in achieving data quality. Although imputation-based methods are designed to achieve data completeness, their efficacy diminishes as the percentage of missing values increases. Further, extant approaches often struggle to handle mixed-type datasets, typically supporting either numerical or categorical data but not both. In this work, we propose LLMDR, an automatic data recovery framework that operates in two stages. In Stage I, the DBSCAN clustering algorithm is employed to select the most representative samples. In Stage II, multiple LLMs are employed for data recovery, considering both the local and the global representative samples; the framework then invokes a consensus algorithm to recommend a more accurate value based on the values the other LLMs propose from the local and global effective samples. Experimental results demonstrate that the proposed framework works effectively on various mixed-type datasets in terms of accuracy, KS statistic, SMAPE, and MSE. Further, we show the advantage of the consensus mechanism for the final recommendation on mixed-type data.
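To make the consensus step concrete, the following is a minimal sketch of how candidate values from several LLMs could be combined for a single missing cell in a mixed-type table. The abstract does not specify the consensus rule, so the rule used here (median for numerical candidates, majority vote for categorical ones) is an assumption, and the function name `consensus` is hypothetical rather than part of LLMDR.

```python
from collections import Counter
from statistics import median


def consensus(candidates):
    """Combine the values proposed by several models for one missing cell.

    Assumed rule (not taken from the paper): numerical candidates are
    combined by their median, categorical candidates by majority vote.
    """
    # Ignore models that returned no proposal.
    values = [v for v in candidates if v is not None]
    if not values:
        return None

    # Numerical column: the median is robust to a single outlying proposal.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return median(values)

    # Categorical column: pick the most frequent label. Ties are resolved
    # in favour of the label that was proposed first.
    counts = Counter(str(v) for v in values)
    return counts.most_common(1)[0][0]


# Hypothetical proposals from three models for a numerical and a categorical cell.
numeric_value = consensus([10.0, 12.0, 11.0])
category_value = consensus(["red", "blue", "red"])
```

A rule of this kind handles both column types through one entry point, which is one plausible reading of why a consensus mechanism helps on mixed-type data; the actual LLMDR algorithm may weight models or samples differently.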