Data users urgently need effective ways to search for the datasets they require, yet the metadata available for identifying relevant datasets is often very limited. Third-party data is emerging as a valuable resource for addressing this challenge. We introduce a new architecture for data exploration that uses a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The system integrates large language models (LLMs) with external vector databases to identify semantic relationships among diverse types of datasets. The framework offers a new method for evaluating semantic similarity among heterogeneous data sources and for improving data exploration. We report experimental results on four tasks: 1) recommending similar datasets, 2) suggesting combinable datasets, 3) estimating tags, and 4) predicting variables. The results show that RAG can improve the selection of relevant datasets, particularly datasets from different categories, compared with conventional metadata approaches. However, performance varied across tasks and models, which underscores the importance of choosing techniques suited to the specific use case. The findings suggest that the approach holds promise for data exploration and discovery, although the estimation tasks require further refinement.
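To make the retrieval step of such a pipeline concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a toy bag-of-words vector in place of an LLM embedding, an in-memory dictionary in place of an external vector database, and a hypothetical three-entry catalog (`CATALOG`); the function names `embed`, `cosine`, `retrieve`, and `build_prompt` are illustrative only. It ranks dataset metadata by cosine similarity to a query and assembles the top matches into a prompt that could be passed to a language model.

```python
import math
import re
from collections import Counter

# Hypothetical metadata catalog (assumption: not taken from the paper).
CATALOG = {
    "bus_ridership": "daily bus ridership counts by route and city district",
    "rainfall": "hourly rainfall measurements from weather stations",
    "bike_share": "bike share trips by station and city district",
}


def embed(text):
    """Toy stand-in for an LLM embedding: a sparse term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, catalog, k=2):
    """Return the names of the k catalog entries most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(meta)), name) for name, meta in catalog.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def build_prompt(query, catalog, k=2):
    """Assemble retrieved metadata into a prompt for a language model."""
    names = retrieve(query, catalog, k)
    context = "\n".join(f"- {name}: {catalog[name]}" for name in names)
    return (
        f"Candidate datasets:\n{context}\n\n"
        f"Task: recommend datasets similar to: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("city district transit ridership", CATALOG))
```

In a real deployment the `embed` function would be replaced by a learned embedding model and the dictionary scan by a nearest-neighbour query against a vector store; the ranking-then-prompting structure is the only part this sketch is meant to illustrate.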