Unified Data Discovery across Query Modalities and User Intents

Data discovery, the retrieval of relevant tables from a data lake in response to user queries, is a fundamental building block for downstream analytics. In practice, data discovery must support different query modalities, including natural language (NL) statements and tables, and accommodate diverse user intents, ranging from open-ended enrichment to task-driven inference for applications such as table question answering and fact verification. However, most existing methods are designed for a single query modality or a specific user intent, limiting their generalizability. We propose UniDisc, a unified data discovery framework that supports both NL statements and tables as queries and generalizes across diverse user intents without intent-specific representations or relevance modeling. UniDisc learns a common cross-modal representation model that produces unified representations for queries of different modalities and candidate tables, enabling uniform relevance assessment across discovery scenarios. Since learning such a model typically requires large labeled collections of query-table pairs, which are expensive to obtain, UniDisc instead exploits contextual signals naturally available in data lakes. Specifically, it models NL statements and tables as nodes in a heterogeneous graph with multiple edge types, and applies dual-view neighbor aggregation and joint optimization to learn robust, context-aware representations under limited supervision. These representations support flexible relevance estimation during retrieval. Experiments on seven datasets show that UniDisc consistently outperforms strong baselines on both NL- and table-based discovery.
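To make the idea of uniform relevance assessment concrete, the following is a minimal sketch, not UniDisc's actual model. It assumes a single encoder that maps both NL statements and tables into one vector space, so that one scoring function ranks candidate tables for either query modality. The encoder here is a toy bag-of-tokens stand-in for the learned representation model, and the graph construction, dual-view neighbor aggregation, and joint optimization are omitted entirely. The names `embed`, `table_tokens`, `embed_query`, `rank_tables`, and `LAKE`, as well as the table layout, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def embed(tokens):
    """Toy stand-in for a learned encoder: L2-normalized sparse bag of tokens."""
    counts = Counter(tok.lower() for tok in tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def table_tokens(table):
    """Flatten a table (hypothetical dict layout) into header and cell tokens."""
    toks = []
    for col in table["columns"]:
        toks.extend(str(col).split())
    for row in table["rows"]:
        for cell in row:
            toks.extend(str(cell).split())
    return toks


def embed_query(query):
    """Map either query modality (NL string or table dict) into the same space."""
    tokens = query.split() if isinstance(query, str) else table_tokens(query)
    return embed(tokens)


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def rank_tables(query, lake, k=2):
    """Score every candidate table with the same function, whatever the query type."""
    q = embed_query(query)
    scored = [(cosine(q, embed(table_tokens(t))), name) for name, t in lake.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then name
    return [(name, score) for score, name in scored[:k]]


# Tiny illustrative data lake (made-up contents).
LAKE = {
    "cities": {"columns": ["city", "country", "population"],
               "rows": [["Paris", "France", 2100000], ["Lyon", "France", 520000]]},
    "movies": {"columns": ["title", "director", "year"],
               "rows": [["Alien", "Scott", 1979]]},
}

nl_results = rank_tables("population of France", LAKE)
table_results = rank_tables({"columns": ["title", "year"], "rows": [["Alien", 1979]]}, LAKE)
```

Both calls go through the same `rank_tables` path: only `embed_query` inspects the query modality, which mirrors the abstract's point that a shared representation removes the need for modality-specific or intent-specific relevance models.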
