Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce LENS, the first lexicon-based embeddings leveraging LLMs to achieve competitive performance on these tasks. LENS consolidates the vocabulary space through token embedding clustering to address token redundancy in LLM vocabularies. To further improve performance, we investigate bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexical matching over redundant vocabularies by assigning each embedding dimension to a specific token cluster, in which semantically similar tokens are grouped together. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact representations whose dimensionality is comparable to that of their dense counterparts. Furthermore, LENS inherently supports efficient embedding-dimension pruning without specialized training objectives such as Matryoshka Representation Learning. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
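To make the mechanism concrete, below is a minimal sketch of the core idea described above, not the authors' released implementation: the LLM's output token embeddings are clustered with k-means so that redundant, semantically similar tokens share one embedding dimension, and a text's vocabulary-sized logits are max-pooled within each cluster to form the lexicon-based embedding. The model name, cluster count, last-token pooling, and ReLU sparsification are all illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any decoder-only LLM with a language-modeling head works here.
model_name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# 1) Consolidate the vocabulary: cluster the output token embeddings so that
#    semantically similar (often redundant) tokens share one dimension.
emb = model.get_output_embeddings().weight.detach().float().cpu().numpy()  # (V, d)
n_clusters = 4096  # assumption: target embedding dimensionality
cluster_of = torch.tensor(
    KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(emb),  # (V,)
    dtype=torch.long,
)

# 2) Encode text and map the vocabulary-sized logits into cluster space by
#    max-pooling within each cluster (one of several possible pooling choices).
@torch.no_grad()
def lexicon_embedding(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    logits = model(**ids).logits[0, -1].float()  # (V,) last-token logits
    out = torch.full((n_clusters,), float("-inf"))
    out.scatter_reduce_(0, cluster_of, logits, reduce="amax")  # max per cluster
    # Assumption: ReLU sparsification, as in SPLADE-style lexicon embeddings.
    return torch.relu(out)

# Lexical-match score between two texts in the clustered vocabulary space.
query = lexicon_embedding("what is matryoshka representation learning?")
doc = lexicon_embedding("Matryoshka Representation Learning trains nested embeddings.")
print(torch.dot(query, doc))
```

Because each dimension corresponds to a whole token cluster rather than a single token, the resulting vector is already compact, and dropping low-activation dimensions is a natural pruning strategy that needs no Matryoshka-style objective.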