To improve relevance scoring on Pinterest Search, we integrate Large Language Models (LLMs) into our search relevance model, leveraging carefully designed text representations to predict the relevance of Pins effectively. Our approach uses search queries alongside content representations that include captions extracted from a generative visual language model. These are further enriched with link-based text data, historically high-quality engaged queries, user-curated boards, Pin titles, and Pin descriptions, creating robust models for predicting search relevance. We use a semi-supervised learning approach to scale training data efficiently, expanding beyond the expensive human-labeled data available. By utilizing multilingual LLMs, our system extends training data to unseen languages and domains, even though the initial data and annotator expertise were confined to English. Furthermore, we distill the LLM-based model into real-time servable model architectures and features. We provide comprehensive offline experimental validation of our proposed techniques and demonstrate the gains achieved by the final system deployed at scale.
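The distillation step described above can be illustrated with a minimal sketch: a lightweight "student" model is fit to soft relevance scores produced by a larger LLM-based "teacher", so that only the cheap student needs to run at serving time. All names, feature vectors, and teacher scores below are hypothetical illustrations, not Pinterest's actual models or data; the student here is a tiny logistic regression trained by SGD on soft labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def distill(features, teacher_scores, lr=0.5, epochs=200):
    """Fit a tiny logistic-regression 'student' to soft teacher labels
    by SGD on the cross-entropy between student and teacher scores."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(features, teacher_scores):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t  # gradient of cross-entropy w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

# Toy (query, Pin) feature vectors and hypothetical soft relevance
# scores from an LLM-based teacher (values in [0, 1]).
feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
teacher = [0.95, 0.90, 0.10, 0.20]

w, b = distill(feats, teacher)
student = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in feats]
```

In a production setting the same idea applies with a neural student, a KL or cross-entropy loss against teacher logits, and teacher scores generated offline over large unlabeled (query, Pin) pairs, which is also how the semi-supervised expansion beyond human labels works.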