In this article, we present our approach to single-modality vision representation learning. Understanding vision representations of product content is vital for recommendation, search, and advertising applications in e-commerce. We detail and contrast techniques for efficiently fine-tuning large-scale vision representation learning models in low-resource settings, covering several pretrained backbone architectures from both the convolutional neural network and vision transformer families. We highlight the challenges that e-commerce applications face at scale and describe the efforts to train, evaluate, and serve visual representations more efficiently. We present ablation studies for several downstream tasks, including our visually similar ad recommendations, and evaluate the offline performance of the derived visual representations on these tasks. To this end, we present a novel text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from machine learning systems deployed in production at Etsy.