Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiments in building a domain-aware Japanese text-embedding approach to improve search quality at Mercari, Japan's largest C2C marketplace. We fine-tune on purchase-driven query-title pairs, using role-specific prefixes to model the asymmetry between queries and items. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when PCA compression is replaced by Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. In addition, an initial online A/B test shows statistically significant improvements in revenue per user and search-flow efficiency, while transaction frequency is maintained. These results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
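To make the two mechanisms named above concrete, the following is a minimal sketch of role-specific prefixing and Matryoshka-style truncation at retrieval time. It is an illustration under stated assumptions, not the system described in this paper: the prefix strings, the function names, and the toy vectors are all hypothetical, and the encoder itself is omitted. The sketch only shows the general idea that an embedding trained with Matryoshka Representation Learning can be shortened by keeping its leading coordinates and re-normalizing, with no fitted projection such as PCA.

```python
import math

# Hypothetical prefix strings; the actual prefixes are not given in the text.
ROLE_PREFIXES = {"query": "query: ", "item": "item: "}


def with_role(text, role):
    """Prepend a role marker so one encoder can treat queries and items differently."""
    return ROLE_PREFIXES[role] + text


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_at_dim(query_vec, item_vec, dim):
    """Cosine similarity computed on the first `dim` coordinates only."""
    q = truncate_and_normalize(query_vec, dim)
    d = truncate_and_normalize(item_vec, dim)
    return sum(a * b for a, b in zip(q, d))


if __name__ == "__main__":
    # Toy 4-dimensional vectors standing in for encoder outputs (illustrative only).
    query_embedding = [0.9, 0.1, 0.3, -0.2]
    item_embedding = [0.8, 0.2, -0.1, 0.4]
    print(with_role("example query", "query"))
    for dim in (2, 4):
        print(dim, cosine_at_dim(query_embedding, item_embedding, dim))
```

In this sketch, the index would store only the truncated item vectors, so storage and scoring cost scale with the chosen `dim` rather than with the full embedding width.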