An Empirical Study of Self-supervised Learning with Wasserstein Distance

In this study, we investigate self-supervised learning (SSL) with the 1-Wasserstein distance on a tree structure, also known as the Tree-Wasserstein distance (TWD), where TWD is defined as the L1 distance between two tree-embedded vectors. SSL methods often use cosine similarity as the objective function; the use of the Wasserstein distance as an objective, however, has not been well studied. Training with the Wasserstein distance is numerically challenging. This study therefore empirically investigates strategies for optimizing SSL with the Wasserstein distance and identifies a stable training procedure. More specifically, we evaluate combinations of two types of TWD (total variation and ClusterTree) with several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We propose a simple yet effective Jeffrey divergence-based regularization method to stabilize optimization. Through experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a simple combination of the softmax function and TWD performs significantly worse than the standard SimCLR. Moreover, a simple combination of TWD and SimSiam fails to train the model. We find that model performance depends on the combination of TWD and probability model, and that the Jeffrey divergence regularization helps model training. Finally, we show that an appropriate combination of TWD and probability model outperforms cosine similarity-based representation learning.
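To make the two central quantities concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common formulation in which a tree is encoded by a 0/1 matrix `B` (row e marks the leaves below edge e) and a vector of edge weights `w`, so that TWD is the L1 norm of `w * (B @ (mu - nu))`. It also assumes that the total variation case corresponds to `B = I` with all weights 1/2, and it uses the standard definition of the Jeffrey divergence as the symmetrized KL divergence. The function names, the toy trees, and the `eps` clipping constant are all hypothetical, and the abstract does not say how the regularizer is weighted in the training loss.

```python
import numpy as np


def softmax(z):
    """Map logits to a probability vector (numerically stabilized)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def tree_wasserstein(mu, nu, B, w):
    """TWD as an L1 distance between tree-embedded vectors.

    B: (n_edges, n_leaves) 0/1 matrix; B[e, i] = 1 if leaf i lies below edge e.
    w: (n_edges,) non-negative edge weights.
    """
    return np.abs(w * (B @ (mu - nu))).sum()


def jeffrey_divergence(p, q, eps=1e-12):
    """Symmetrized KL divergence: KL(p||q) + KL(q||p)."""
    # eps is an assumed clipping constant to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum((p - q) * (np.log(p) - np.log(q)))


# Assumed total variation case: every leaf hangs off the root with weight 1/2.
n = 4
B_tv = np.eye(n)
w_tv = np.full(n, 0.5)

# Hypothetical two-level tree: leaves {0, 1} under node a, leaves {2, 3} under node b.
B_tree = np.vstack([
    [1, 1, 0, 0],  # edge root-a
    [0, 0, 1, 1],  # edge root-b
    np.eye(n),     # one edge per leaf
]).astype(float)
w_tree = np.ones(B_tree.shape[0])

# Two "views" of a sample, represented by softmax outputs of toy logits.
p = softmax(np.array([2.0, 0.5, -1.0, 0.0]))
q = softmax(np.array([1.5, 1.0, -0.5, 0.0]))

twd_tv = tree_wasserstein(p, q, B_tv, w_tv)
twd_tree = tree_wasserstein(p, q, B_tree, w_tree)
reg = jeffrey_divergence(p, q)
```

With unit edge weights, the two-level tree assigns a distance of 2 to point masses on sibling leaves and 4 to point masses on leaves in different subtrees, which matches the path lengths in the tree. In the assumed total variation case the distance between any two distinct point masses is 1.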

