Missing values are pervasive in real-world tabular data and can significantly impair downstream analysis. Imputing them is especially challenging in text-rich tables, where dependencies are implicit, complex, and dispersed across long textual fields. Recent work has explored using Large Language Models (LLMs) for data imputation, yet existing approaches typically process entire tables or loosely related contexts, which can compromise accuracy, scalability, and explainability. We introduce LDI, a novel framework that leverages LLMs through localized reasoning: for each missing value, it selects a compact, contextually relevant subset of attributes and tuples. This targeted selection reduces noise, improves scalability, and provides transparent attribution by revealing the dependency relations that justify each selected attribute and the evidence behind each retrieved tuple. The attribution makes clear not only which data influenced a prediction, but also why that data was chosen. Through extensive experiments on real and synthetic datasets, we demonstrate that LDI consistently outperforms state-of-the-art imputation methods, achieving up to 8% higher accuracy with hosted LLMs and even greater gains with small local models. The improved interpretability and robustness also make LDI well-suited for high-stakes data management applications. Our code and datasets are publicly available at https://github.com/soroushomidvar/LDI.
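To make the idea of localized context selection concrete, the following is a minimal sketch and not the LDI implementation. It assumes a toy restaurant table and swaps in simple token-overlap (Jaccard) heuristics where the paper relies on dependency discovery and tuple retrieval. All names here (`select_attributes`, `retrieve_tuples`, `build_prompt`, `ROWS`) are hypothetical, and no LLM is actually called; the sketch only assembles the compact prompt that would be sent to one.

```python
import re
from itertools import combinations

# Toy table (made up for illustration); the last row is missing "cuisine".
ROWS = [
    {"name": "Luigi's Trattoria", "description": "Wood-fired pizza and fresh pasta",
     "phone": "555-0101", "cuisine": "Italian"},
    {"name": "Sakura House", "description": "Sushi rolls and ramen bowls",
     "phone": "555-0102", "cuisine": "Japanese"},
    {"name": "Bella Napoli", "description": "Neapolitan pizza and homemade pasta",
     "phone": "555-0103", "cuisine": "Italian"},
    {"name": "Tokyo Garden", "description": "Sushi and tempura platters",
     "phone": "555-0104", "cuisine": "Japanese"},
    {"name": "Pasta Fresca", "description": "Fresh pasta with classic sauces",
     "phone": "555-0105", "cuisine": None},
]

def tokens(value):
    """Lowercased alphanumeric tokens of a cell value."""
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two cell values."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0

def select_attributes(rows, target, max_attrs=2, min_score=0.05):
    """Assumed heuristic: keep attributes whose text is more similar between
    rows sharing a target value than between rows with different values."""
    complete = [r for r in rows if r[target] is not None]
    scores = {}
    for attr in rows[0]:
        if attr == target:
            continue
        same, diff = [], []
        for a, b in combinations(complete, 2):
            bucket = same if a[target] == b[target] else diff
            bucket.append(jaccard(a[attr], b[attr]))
        if same and diff:
            scores[attr] = sum(same) / len(same) - sum(diff) / len(diff)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [a for a in ranked if scores[a] >= min_score][:max_attrs]

def retrieve_tuples(rows, query, attrs, target, k=2):
    """Return the k most similar complete rows as (similarity, row, evidence),
    where evidence is the sorted list of tokens shared with the query."""
    if not attrs:
        return []
    scored = []
    for r in rows:
        if r is query or r[target] is None:
            continue
        sim = sum(jaccard(query[a], r[a]) for a in attrs) / len(attrs)
        shared = set().union(*(tokens(query[a]) & tokens(r[a]) for a in attrs))
        scored.append((sim, r, sorted(shared)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def build_prompt(query, attrs, target, retrieved):
    """Assemble a compact prompt from the selected attributes and tuples only."""
    lines = [f"Fill in the missing '{target}' value.", "Examples:"]
    for _, row, _ in retrieved:
        fields = "; ".join(f"{a}: {row[a]}" for a in attrs)
        lines.append(f"- {fields} -> {target}: {row[target]}")
    fields = "; ".join(f"{a}: {query[a]}" for a in attrs)
    lines.append(f"Query: {fields} -> {target}: ?")
    return "\n".join(lines)

attrs = select_attributes(ROWS, "cuisine")
retrieved = retrieve_tuples(ROWS, ROWS[4], attrs, "cuisine")
print(build_prompt(ROWS[4], attrs, "cuisine", retrieved))
```

On this toy table the sketch keeps only `description` and drops `name` and `phone`, which carry no signal about `cuisine`. The shared tokens returned with each retrieved row serve as a simple stand-in for the per-tuple evidence the abstract describes.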