KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance

E-commerce search is a central interface that connects user demands with massive product inventories, and it plays a vital role in daily life. In real-world applications, however, it faces several challenges, including highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to capture user intent and fine-grained product semantics accurately. In recent years, significant advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Collectively, these issues constrain research on LLM-based e-commerce search. To address these limitations, we construct and release KuaiSearch, which is, to the best of our knowledge, the largest e-commerce search dataset currently available. KuaiSearch is built on real user search interactions from the Kuaishou platform. It preserves authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and systematically spans three key stages of the search pipeline: recall, ranking, and relevance judgment. We analyze KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmarks on several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.
