Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
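The FairScore idea described above can be illustrated as a set-overlap computation: given atomic facts extracted from the predicted answer and from the cited cells, precision and recall follow from their intersection. This is a minimal sketch under assumed conventions; the fact-extraction step and all function names are illustrative, not taken from the paper.

```python
# Hedged sketch of a FairScore-style reference-less attribution metric.
# Assumes atomic facts have already been extracted (e.g., by a model)
# from the predicted answer and from the cited cells; the extraction
# step itself is not shown. Names here are illustrative only.

def attribution_scores(answer_facts: set, cell_facts: set):
    """Estimate attribution precision and recall via atomic-fact overlap.

    precision: fraction of facts derived from the cited cells that also
               appear in the answer (i.e., cells are not over-cited);
    recall:    fraction of answer facts supported by the cited cells.
    """
    supported = answer_facts & cell_facts
    precision = len(supported) / len(cell_facts) if cell_facts else 0.0
    recall = len(supported) / len(answer_facts) if answer_facts else 0.0
    return precision, recall


# Toy example with invented facts (not from any benchmark table):
answer_facts = {"team=Leeds", "wins=12"}             # facts in the answer
cell_facts = {"team=Leeds", "wins=12", "losses=3"}   # facts from cited cells
p, r = attribution_scores(answer_facts, cell_facts)
# one cited fact ("losses=3") is unused, so precision < 1;
# every answer fact is supported, so recall = 1.0
```

No human cell labels enter the computation, which is what makes the metric reference-less: both fact sets come from the system's own prediction.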