Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendation with the web's unlabeled texts. A recent empirical study finds that high-quality representations align with the key tokens of the input text, uncovering a potential connection between the representation space and the vocabulary space. Inspired by these findings, we revisit generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on a token target prediction task, using a carefully constructed target token distribution as the supervisory signal. To construct a high-quality target token distribution, we analyze token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) meaningful tokens that appear in the text and (2) semantically derived tokens beyond it. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from the data or from the LLM backbone, respectively. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with LLM2Vec, the state-of-the-art embedder trained with unsupervised contrastive learning. Our analysis further shows that the vocabulary and representation spaces are optimized jointly and converge toward the optimal solution during training, providing new ideas and insights for future work.
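To make the token target prediction task concrete, the following is a minimal toy sketch, not the paper's actual method: a data-driven target distribution places uniform mass on vocabulary tokens that occur in the input text, and the model is trained with cross-entropy against that distribution. The toy vocabulary, the uniform target construction, and the plain-Python softmax are all illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def data_driven_target(text_tokens, vocab):
    # Toy stand-in for a data-driven target: uniform probability mass
    # over vocabulary tokens that appear in the input text.
    mask = [1.0 if tok in text_tokens else 0.0 for tok in vocab]
    total = sum(mask)
    return [m / total for m in mask]

def token_target_loss(logits, target):
    # Cross-entropy between the model's predicted token distribution
    # and the synthetic target token distribution.
    pred = softmax(logits)
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))

# Hypothetical five-token vocabulary and model logits over it.
vocab = ["search", "engine", "query", "ranking", "banana"]
target = data_driven_target({"search", "query"}, vocab)
logits = [2.0, 0.1, 1.5, 0.2, -1.0]
loss = token_target_loss(logits, target)
```

A model-derived target would instead be produced by the LLM backbone itself (e.g., its predictive distribution), supplying semantically related tokens that never appear in the input text.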