Phrase representations play an important role in data science and natural language processing, benefiting tasks such as Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method fine-tunes pre-trained language models for phrasal embeddings using contrastive learning. However, we identify two areas for improvement. First, these pre-trained models tend to be unnecessarily complex and must be pre-trained on a corpus with context sentences. Second, leveraging phrase type and morphology yields phrase representations that are both more precise and more flexible. We propose an improved framework that learns phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design three granularities of data augmentation to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. The code is available at \faGithub~ \url{https://github.com/tigerchen52/PEARL}.
\end{abstract}