Extracting dense representations for terms and phrases is an important task for knowledge discovery platforms that target highly technical fields. Dense representations serve as features for downstream components and have applications ranging from ranking search results to summarization. Common approaches to creating dense representations include training domain-specific embeddings with self-supervised setups and using sentence encoder models trained on similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but they impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach not only match the quality of sentence encoders in technical domains but are also 5 times smaller and up to 10 times faster, even on high-end GPUs.
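The reconstruction objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: it swaps in a hashed character n-gram averaging model for the small character-based model, uses a hypothetical three-dimensional toy table in place of a large pre-trained embedding matrix, and all names (`char_ngrams`, `CharEncoder`, `pretrained`) are assumptions made for illustration. The point is only that a model reading characters can be fitted to existing vectors and then produce a vector for a term that was never in the vocabulary.

```python
import random


def char_ngrams(text, n=3):
    """Character n-grams of a term, with boundary markers added."""
    padded = f"<{text}>"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class CharEncoder:
    """Toy encoder: the output is the mean of hashed character n-gram vectors."""

    def __init__(self, dim, buckets=512, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.buckets = buckets
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(buckets)
        ]

    def _bucket_ids(self, text):
        # Deterministic hash; Python's built-in hash() is salted per process.
        return [
            sum(ord(c) * 31 ** k for k, c in enumerate(gram)) % self.buckets
            for gram in char_ngrams(text)
        ]

    def encode(self, text):
        ids = self._bucket_ids(text)
        return [sum(self.table[i][d] for i in ids) / len(ids) for d in range(self.dim)]

    def train_step(self, text, target, lr=0.5):
        """One SGD step on the squared error between encode(text) and target."""
        ids = self._bucket_ids(text)
        pred = self.encode(text)
        err = [p - t for p, t in zip(pred, target)]
        for i in ids:
            for d in range(self.dim):
                self.table[i][d] -= lr * 2 * err[d] / len(ids)
        return sum(e * e for e in err)  # loss measured before the update


# Hypothetical stand-in for a large pre-trained embedding matrix.
pretrained = {
    "protein": [0.9, 0.1, 0.0],
    "proteins": [0.8, 0.2, 0.0],
    "polymer": [0.1, 0.9, 0.0],
    "polymers": [0.2, 0.8, 0.0],
}

model = CharEncoder(dim=3)
losses = []
for epoch in range(200):
    total = 0.0
    for term, vector in pretrained.items():
        total += model.train_step(term, vector)
    losses.append(total)

# A term absent from the table still receives a vector, built from its n-grams.
oov_vector = model.encode("proteomics")
```

Because the encoder only ever sees characters, inference does not require the original embedding matrix, which is the property that allows the student model to be much smaller than the table it was trained to reconstruct.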