Answering complex real-world queries, such as complex product search, often requires accurate retrieval from semi-structured knowledge bases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, prior work has largely studied textual and relational retrieval as separate tasks. To address this gap, we develop STaRK, a large-scale Semi-structured retrieval benchmark on Textual and Relational Knowledge bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of the synthesized queries, and we further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STaRK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by large language models (LLMs). Our experiments suggest that STaRK poses significant challenges to current retrieval and LLM systems, highlighting the need for more capable semi-structured retrieval systems. The benchmark data and code are available at https://github.com/snap-stanford/STaRK.