Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb

Wendy Ran Wei, Hao Li, Weiwei Guo, Xiaowei Liu, Xueyin Chen, Dillon Davis, Malay Haldar, Soumyadip Banerjee, Kedar Bellare, Huiji Gao, Stephanie Moyerman, Sanjeev Katariya

Deploying natural language search systems presents a critical cold-start challenge: there are no real user queries from which to learn linguistic patterns, and no relevance labels with which to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation, which produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with a KL divergence of 12.03 relative to real user queries; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even the seed queries (0.09). Experiments show that our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing a discriminative signal for model improvement. We deploy production pipelines that generate synthetic examples daily for embedding-based retrieval and ranking evaluation.
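The two metrics quoted above, KL divergence between query-length distributions and pairwise accuracy on contrastive pairs, can be illustrated with a minimal sketch. This is not the authors' implementation. The whitespace tokenization, the additive smoothing constant, the length cap, the natural-log base, the direction KL(real || synthetic), and the function names `length_distribution`, `kl_divergence`, and `pairwise_accuracy` are all assumptions made for illustration.

```python
import math
from collections import Counter


def length_distribution(queries, max_len=20, eps=1e-6):
    """Smoothed histogram over query lengths, measured in whitespace tokens.

    Assumption: lengths above max_len share the last bucket, and eps is
    added to every bucket so that no probability is exactly zero.
    """
    counts = Counter(min(len(q.split()), max_len) for q in queries)
    total = sum(counts.values()) + eps * (max_len + 1)
    return [(counts.get(k, 0) + eps) / total for k in range(max_len + 1)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pairwise_accuracy(score_pairs):
    """Fraction of (positive_score, negative_score) pairs ranked correctly.

    Assumption: each pair holds a model's scores for the listing the query
    was written for and for its contrastive counterpart.
    """
    correct = sum(1 for pos, neg in score_pairs if pos > neg)
    return correct / len(score_pairs)


# Toy data, invented for illustration only.
real = ["beach house with pool", "pet friendly cabin", "loft near downtown"]
concise = ["cabin with hot tub", "quiet studio near park", "family villa"]
verbose = [
    "I am looking for a spacious and quiet family home with a large pool "
    "and a garden that is close to the beach and good restaurants",
]

p_real = length_distribution(real)
kl_concise = kl_divergence(p_real, length_distribution(concise))
kl_verbose = kl_divergence(p_real, length_distribution(verbose))
acc = pairwise_accuracy([(0.9, 0.1), (0.2, 0.8), (0.7, 0.3), (0.6, 0.4)])
```

Under these assumptions, a lower KL value means the synthetic query lengths sit closer to the reference distribution, and a lower pairwise accuracy on an evaluation set means the pairs are harder for the model being scored.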
