Traditional vision search, like search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image passes through feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that match their preferences. However, the multi-view representation discrepancy of the same object between the query and product images, together with conflicting optimization objectives across these stages, makes it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, we propose OneVision, an end-to-end generative framework that addresses these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which aligns the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. OneVision then adopts a multi-stage semantic alignment scheme that maintains strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with the online MCA while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic-ID-centric generative architecture can unify retrieval and personalization while simplifying the serving pathway.
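To make the quantization idea concrete, the following is a minimal sketch of generic residual quantization, the mechanism underlying VRQ-style semantic IDs: each level picks the nearest codeword, subtracts it, and passes the residual to the next level, so a continuous embedding becomes a short sequence of discrete IDs. The codebooks here are random placeholders, not the paper's learned, vision-aligned ones.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode vector x as a sequence of discrete IDs via residual quantization.

    At each level, select the nearest codeword, record its index, and
    subtract it so the next level quantizes what remains. This is a
    generic sketch of the RQ idea, not the paper's VRQ implementation.
    """
    ids = []
    residual = np.asarray(x, dtype=float).copy()
    for cb in codebooks:                      # cb has shape (K, d)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))             # nearest codeword at this level
        ids.append(k)
        residual = residual - cb[k]           # pass leftover error downstream
    return ids, residual

# Toy usage with 3 levels of 16 random codewords in 8 dimensions.
rng = np.random.default_rng(0)
d = 8
codebooks = [rng.normal(size=(16, d)) for _ in range(3)]
x = rng.normal(size=d)
sem_id, err = residual_quantize(x, codebooks)
```

The resulting `sem_id` (one index per level) plays the role of the semantic ID that a generative model can emit token by token; in VRQ the codebooks would additionally be trained so that different views of the same object map to the same ID prefix.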