KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance

E-commerce search serves as a central interface connecting user demands with massive product inventories, and it plays a vital role in daily life. In real-world applications, however, it faces several challenges: highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to capture user intent and fine-grained product semantics accurately. Recent advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Together, these issues constrain research on LLM-based e-commerce search. To address these limitations, we construct and release KuaiSearch, which is, to the best of our knowledge, the largest e-commerce search dataset currently available. KuaiSearch is built on real user search interactions from the Kuaishou platform. It preserves authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and systematically spans three key stages of the search pipeline: recall, ranking, and relevance judgment. We analyze KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmark experiments across several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.
