Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful representations, and why it is better than traditional approaches such as probabilistic models, is still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning: when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform on par with posterior inference using the correct model, while outperforming posterior inference using misspecified models.