In this article, we present our approach to single-modality visual representation learning. Understanding visual representations of product content is vital for recommendation, search, and advertising applications in e-commerce. We detail and contrast techniques for efficiently fine-tuning large-scale visual representation learning models in low-resource settings, covering several pretrained backbone architectures from both the convolutional neural network and vision transformer families. We highlight the challenges of e-commerce applications at scale and describe our efforts to train, evaluate, and serve visual representations more efficiently. We present ablation studies that evaluate the offline performance of the representations on several downstream tasks, including our visually similar ad recommendations. To this end, we present a novel text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from machine learning systems deployed in production at Etsy.