Pretrained language models have long been known to be subpar at capturing sentence- and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains an unsolved problem. This is largely due to the discreteness of the subword units produced by language-model tokenization, which prevents small input perturbations from forming semantics-preserving positive pairs. In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process. Drawing on the cognitive and linguistic sciences, we introduce an unsupervised visual sentence representation learning framework that employs visually grounded text perturbation methods such as typos and word-order shuffling. These perturbations resonate with human cognitive patterns and allow perturbations of text to be perceived as continuous. Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision, achieving performance on semantic textual similarity (STS) comparable to existing state-of-the-art NLP methods. Additionally, we reveal our method's inherent zero-shot cross-lingual transferability and a unique leapfrogging pattern across languages during iterative training. To our knowledge, this is the first representation learning method that does not rely on traditional language models to understand sentence and document semantics, marking a step closer to human-like textual comprehension. Our code is available at https://github.com/gowitheflow-1998/Pixel-Linguist
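The two perturbations named above, typos and word-order shuffling, can be illustrated with a minimal string-level sketch. This is an assumption-laden illustration, not the implementation in the linked repository: the function names `swap_typo`, `shuffle_words`, and `make_positive_pair` are hypothetical, the typo is modeled as a single swap of adjacent interior letters, and the step of rendering the perturbed text as an image is not shown.

```python
import random


def swap_typo(sentence: str, rng: random.Random) -> str:
    """Swap two adjacent interior letters of one randomly chosen word.

    Hypothetical typo model: the first and last letters of the word stay
    fixed. Whitespace is normalized to single spaces as a side effect.
    """
    words = sentence.split()
    # Only words with at least 4 characters have two interior letters to swap.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    chars = list(words[i])
    j = rng.randrange(1, len(chars) - 2)  # j and j + 1 are both interior
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    words[i] = "".join(chars)
    return " ".join(words)


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated words reordered."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def make_positive_pair(sentence: str, seed: int = 0) -> tuple[str, str]:
    """Build two perturbed views of one sentence to treat as a positive pair."""
    rng = random.Random(seed)
    return swap_typo(sentence, rng), shuffle_words(sentence, rng)


if __name__ == "__main__":
    typo_view, shuffled_view = make_positive_pair(
        "pretrained models capture sentence semantics", seed=0
    )
    print(typo_view)
    print(shuffled_view)
```

In a visual setting, both views would then be rendered as images before encoding, so that a small edit to the text produces a small change in the pixels rather than a different sequence of subword tokens.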