Missing values are pervasive in real-world tabular data and can significantly impair downstream analysis. Imputing them is especially challenging in text-rich tables, where dependencies are implicit, complex, and dispersed across long textual fields. Recent work has explored using Large Language Models (LLMs) for data imputation, yet existing approaches typically process entire tables or loosely related contexts, which can compromise accuracy, scalability, and explainability. We introduce LDI, a novel framework that leverages LLMs through localized reasoning, selecting a compact, contextually relevant subset of attributes and tuples for each missing value. This targeted selection reduces noise, improves scalability, and provides transparent attribution by revealing which data influenced each prediction. Through extensive experiments on real and synthetic datasets, we demonstrate that LDI consistently outperforms state-of-the-art imputation methods, achieving up to 8% higher accuracy with hosted LLMs and even greater gains with local models. The improved interpretability and robustness also make LDI well-suited for high-stakes data management applications.
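To make the idea of localized reasoning concrete, the sketch below illustrates one plausible way to assemble a compact, per-value prompt from a small subset of attributes and neighboring tuples. It is only an illustration under stated assumptions: the abstract does not specify LDI's actual selection or prompting procedure, and all names here (build_localized_prompt, k_attrs, k_tuples, the overlap heuristic) are hypothetical.

```python
# Illustrative sketch only: LDI's concrete attribute/tuple selection and
# prompting steps are not given in the abstract. All names and the simple
# overlap heuristic below are hypothetical stand-ins.

def build_localized_prompt(table: list[dict], row_idx: int, target_attr: str,
                           k_attrs: int = 3, k_tuples: int = 5) -> str:
    """Assemble a compact prompt from a locally relevant subset of the table."""
    row = table[row_idx]
    # Stand-in for attribute selection: keep a few non-missing attributes
    # of the incomplete row other than the one being imputed.
    context_attrs = [a for a, v in row.items()
                     if a != target_attr and v not in (None, "")][:k_attrs]

    # Stand-in for tuple selection: rank other rows by how many of the
    # selected attribute values they share with the incomplete row.
    def overlap(other: dict) -> int:
        return sum(other.get(a) == row.get(a) for a in context_attrs)

    neighbors = sorted((r for i, r in enumerate(table) if i != row_idx),
                       key=overlap, reverse=True)[:k_tuples]

    # Build a short, targeted prompt instead of serializing the whole table.
    lines = ["Fill in the missing value.",
             f"Target attribute: {target_attr}",
             "Incomplete row: " + ", ".join(f"{a}={row[a]}" for a in context_attrs),
             "Related rows:"]
    lines += ["  " + ", ".join(f"{a}={r.get(a)}" for a in context_attrs + [target_attr])
              for r in neighbors]
    return "\n".join(lines)
```

Because only the selected attributes and tuples appear in the prompt, the same subset can be reported back to the user as the evidence behind each imputed value, which is the attribution and scalability benefit the abstract describes.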