Octopus: A Lightweight Entity-Aware System for Multi-Table Data Discovery and Cell-Level Retrieval

Tabular data constitute a dominant form of information in modern data lakes and repositories, yet discovering the relevant tables to answer user questions remains challenging. Existing data discovery systems assume that each question can be answered by a single table and often rely on resource-intensive offline preprocessing, such as model training or large-scale content indexing. In practice, however, many questions require information spread across multiple tables (either independently or through joins), and users often seek specific cell values rather than entire tables. In this paper, we present Octopus, a lightweight, entity-aware, and training-free system for multi-table data discovery and cell-level value retrieval. Instead of embedding entire questions, Octopus identifies fine-grained entities (column mentions and value mentions) from natural-language queries using an LLM parser. It then matches these entities to table headers through a compact embedding index and scans table contents directly for value occurrences, eliminating the need for heavy content indexing or costly offline stages. The resulting fine-grained alignment not only improves table retrieval accuracy but also facilitates efficient downstream NL2SQL execution by reducing token usage and redundant LLM calls. To evaluate Octopus, we introduce a new benchmark covering both table- and cell-level discovery under multi-table settings, including five datasets for independent discovery and two for join-based discovery. Experimental results show that Octopus consistently outperforms existing systems while achieving substantially lower computational and token costs. Code is available at https://github.com/wenzhilics/octopus.
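
The entity-aware flow described above (parse entities, match column mentions to headers, scan contents for value mentions) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ParsedQuery`, `header_similarity`, and `discover` are hypothetical, the output of the LLM parser is hard-coded rather than produced by a model, and a character-level string ratio stands in for the embedding index purely to keep the example self-contained.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ParsedQuery:
    """Entities an LLM parser might extract from a question (assumed shape)."""
    column_mentions: list[str]
    value_mentions: list[str]


@dataclass
class TableHit:
    table: str
    columns: list[str]            # headers matched by a column mention
    cells: list[tuple[int, str]]  # (row index, header) of matched values
    score: int


def header_similarity(mention: str, header: str) -> float:
    """Stand-in for embedding similarity: string ratio in [0, 1]."""
    return SequenceMatcher(None, mention.lower(), header.lower()).ratio()


def discover(parsed: ParsedQuery, tables: dict, threshold: float = 0.6) -> list[TableHit]:
    """Score each table by matched headers and by value mentions found in its cells.

    `tables` maps a table name to a (headers, rows) pair.
    """
    wanted = {v.lower() for v in parsed.value_mentions}
    hits = []
    for name, (headers, rows) in tables.items():
        # Step 1: align column mentions with headers only (no content index).
        columns = [
            h for h in headers
            if any(header_similarity(m, h) >= threshold for m in parsed.column_mentions)
        ]
        # Step 2: scan cell contents directly for value mentions.
        cells = []
        found = set()
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                text = str(cell).lower()
                if text in wanted:
                    cells.append((r, headers[c]))
                    found.add(text)
        score = len(columns) + len(found)
        if score > 0:
            hits.append(TableHit(name, columns, cells, score))
    return sorted(hits, key=lambda h: h.score, reverse=True)


# Toy data; the question would be something like "What is Alice's salary?"
tables = {
    "employees": (["name", "department", "salary"],
                  [["Alice", "Sales", 50000], ["Bob", "Engineering", 70000]]),
    "offices": (["city", "building"],
                [["Paris", "HQ"], ["Berlin", "Annex"]]),
}
parsed = ParsedQuery(column_mentions=["salary"], value_mentions=["Alice"])
results = discover(parsed, tables)
for hit in results:
    print(hit.table, hit.columns, hit.cells, hit.score)
```

Because each hit records the matched headers and the coordinates of matched cells, a downstream NL2SQL step could be given only those tables and columns instead of full schemas, which is the kind of token reduction the abstract refers to.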
