As LLM-driven autonomous agents evolve to perform complex, multi-step tasks that require integrating multiple datasets, discovering relevant data sources becomes a key bottleneck. Beyond the sheer volume of available data sources, data-source selection is difficult because the semantics of data are highly nuanced, and judging them requires considering many aspects of the data. To address this, we introduce the Metadata Reasoner, an agentic approach to metadata reasoning designed to identify a small set of data sources that is both sufficient and minimal for a given analytical task. The Metadata Reasoner uses a table-search engine to retrieve candidate tables, then autonomously consults various aspects of the available metadata to determine whether the candidates fit the requirements of the task. We demonstrate the effectiveness of the Metadata Reasoner through a series of empirical studies. Evaluated on the real-world KramaBench datasets for data selection, our approach achieves an average F1-score of 83.16%, outperforming state-of-the-art baselines by a substantial margin of 32 percentage points. Furthermore, evaluations on a newly created synthetic benchmark based on the BIRD data lake show that the Metadata Reasoner is highly robust to redundant and low-quality tables that may be present in the data lake. In this noisy environment, it maintains an average F1-score of 85.5% for selecting the right datasets and achieves a 99% success rate in avoiding low-quality data.
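To make the reported metric concrete, the following is a minimal sketch of how a set-based F1-score for data-source selection could be computed. It is not taken from the paper: the function names, the table names, and the per-task averaging are illustrative assumptions. Under this reading, recall of 1.0 corresponds to a sufficient selection (every required source is included) and precision of 1.0 to a minimal one (no unnecessary source is included).

```python
from typing import Iterable, List, Tuple


def selection_scores(
    selected: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one task's selected data sources.

    Hypothetical helper: the source names are treated as an unordered set,
    which is an assumption about how selection is scored.
    """
    selected_set, relevant_set = set(selected), set(relevant)
    true_positives = len(selected_set & relevant_set)

    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_f1(tasks: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-task F1 over (selected, relevant) pairs."""
    scores = [selection_scores(sel, rel)[2] for sel, rel in tasks]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Illustrative table names only; they do not come from any benchmark.
    tasks = [
        (["sales", "stores", "weather"], ["sales", "stores"]),  # one extra table
        (["patients"], ["patients", "visits"]),                 # one missing table
    ]
    for sel, rel in tasks:
        print(selection_scores(sel, rel))
    print(mean_f1(tasks))
```

In the first illustrative task the selection is sufficient but not minimal, so recall is perfect while precision drops; in the second it is minimal but not sufficient, so the reverse holds. F1 penalizes both kinds of error.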