Phrase representations play an important role in data science and natural language processing, benefiting tasks such as Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method fine-tunes pre-trained language models with contrastive learning to obtain phrase embeddings. However, we have identified two areas for improvement. First, these pre-trained models tend to be unnecessarily complex and must be pre-trained on a corpus of context sentences. Second, leveraging phrase type and morphology yields phrase representations that are both more precise and more flexible. We propose an improved framework that learns phrase representations in a context-free fashion. The framework uses phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design data augmentation at three granularities to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates better phrase embeddings than previous methods while using a smaller model.
[PEARL-small]: https://huggingface.co/Lihuchen/pearl_small
[PEARL-base]: https://huggingface.co/Lihuchen/pearl_base
[Code and Dataset]: https://github.com/tigerchen52/PEARL
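The abstract does not spell out how the contrastive objective and the auxiliary phrase type classification task are combined. As a rough illustration only, the sketch below assumes an InfoNCE-style contrastive loss with in-batch negatives plus a weighted cross-entropy term for phrase type. The function names (`info_nce`, `type_cross_entropy`, `joint_loss`), the `temperature` and `aux_weight` values, and the weighted-sum form are all hypothetical choices for this example, not details taken from the PEARL code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    the other positives in the batch act as negatives (an assumption)."""
    losses = []
    for i, anchor in enumerate(anchors):
        sims = [cosine(anchor, p) / temperature for p in positives]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)


def type_cross_entropy(type_logits, type_labels):
    """Mean cross-entropy for the auxiliary phrase type classifier."""
    per_example = [-log_softmax(logits)[y]
                   for logits, y in zip(type_logits, type_labels)]
    return sum(per_example) / len(per_example)


def joint_loss(anchors, positives, type_logits, type_labels,
               aux_weight=0.5, temperature=0.05):
    """Hypothetical combination: contrastive loss plus a weighted auxiliary loss."""
    contrastive = info_nce(anchors, positives, temperature)
    auxiliary = type_cross_entropy(type_logits, type_labels)
    return contrastive + aux_weight * auxiliary
```

In this sketch, `anchors` would hold embeddings of the original phrases and `positives` embeddings of their augmented variants, so the loss is small when each phrase is closest to its own augmentation. In practice these quantities would be computed on batched tensors inside a training framework rather than on Python lists.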