Data discovery, the task of retrieving relevant tables from a data lake in response to user queries, is a fundamental building block for downstream analytics. In practice, data discovery must support different query modalities, including natural language (NL) statements and tables, and accommodate diverse user intents, ranging from open-ended enrichment to task-driven inference for applications such as table question answering and fact verification. However, most existing methods are designed for a single query modality or a specific user intent, limiting their generalizability. We propose UniDisc, a unified data discovery framework that supports both NL statements and tables as queries and generalizes across diverse user intents without intent-specific representations or relevance modeling. UniDisc learns a common cross-modal representation model that produces unified representations for queries of different modalities and candidate tables, enabling uniform relevance assessment across discovery scenarios. Since learning such a model typically requires large labeled collections of query-table pairs, which are expensive to obtain, UniDisc instead exploits contextual signals naturally available in data lakes. Specifically, it models NL statements and tables as nodes in a heterogeneous graph with multiple edge types, and applies dual-view neighbor aggregation and joint optimization to learn robust, context-aware representations under limited supervision. These representations support flexible relevance estimation during retrieval. Experiments on seven datasets show that UniDisc consistently outperforms strong baselines on both NL- and table-based discovery.