Answering complex real-world queries, such as complex product searches, often requires accurate retrieval from semi-structured knowledge bases that blend unstructured information (e.g., textual descriptions of products) with structured information (e.g., entity relations of products). However, previous work has mostly studied textual and relational retrieval as separate topics. To address this gap, we develop STaRK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of the synthesized queries. We further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STaRK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by large language models (LLMs). Our experiments suggest that STaRK presents significant challenges to current retrieval and LLM systems, indicating the need for more capable retrieval systems. The benchmark data and code are available at https://github.com/snap-stanford/stark.