Answering real-world user queries, such as product search, often requires accurate retrieval of information from semi-structured knowledge bases or databases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous work has mostly studied textual and relational retrieval as separate topics. To address this gap, we develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. We design a novel pipeline to synthesize natural and realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers. Moreover, we conduct a rigorous human evaluation to validate the quality of our benchmark, which covers a variety of practical applications, including product recommendations, academic paper searches, and precision medicine inquiries. Our benchmark serves as a comprehensive testbed for evaluating the performance of retrieval systems, with an emphasis on retrieval approaches driven by large language models (LLMs). Our experiments suggest that the STARK datasets present significant challenges to current retrieval and LLM systems, indicating the need for more capable retrieval systems that can handle both textual and relational aspects.