Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), a multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We evaluate FRTR on six LLMs: with Claude Sonnet 4.5 it achieves 74% answer accuracy on FRTR-Bench, a substantial improvement over prior state-of-the-art approaches, which reach only 24%. On the SpreadsheetLLM benchmark, FRTR achieves 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
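The hybrid retrieval step fuses a lexical ranking (e.g. BM25) with a dense embedding-similarity ranking via Reciprocal Rank Fusion, which scores each candidate as the sum of 1/(k + rank) over the input lists. A minimal sketch of RRF follows; the function name `rrf_fuse`, the row identifiers, and the conventional constant k = 60 are illustrative assumptions, not the paper's implementation:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across ranked lists.

    `rankings` is a list of ranked lists of document IDs (best first);
    k dampens the influence of top-ranked outliers (60 is a common default).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort IDs by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retrieval results over spreadsheet-row chunks:
lexical = ["row_17", "row_03", "row_42"]   # e.g. BM25 order
dense   = ["row_42", "row_17", "row_99"]   # e.g. embedding-similarity order
fused = rrf_fuse([lexical, dense])
print(fused)  # → ['row_17', 'row_42', 'row_03', 'row_99']
```

Because RRF operates only on ranks, it needs no score normalization across the lexical and dense retrievers, which is why it is a common choice for hybrid retrieval.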