In natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet when it comes to representing sets of words, conventional vector-based approaches often struggle with expressiveness and lack essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and the corresponding set operations within pre-trained word embedding spaces. By grounding our approach in linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions in continuous spaces. Moreover, we enable the computation of the F-score directly from word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.
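To make the core idea concrete, the following is a minimal sketch (not the paper's actual implementation) of the quantum-logic view described above: a word set is represented by the linear subspace spanned by its word embeddings, and a soft membership score for a query word is the squared norm of the projection of its unit-normalized vector onto that subspace. All function names and the toy random "embeddings" here are illustrative assumptions.

```python
import numpy as np

def subspace_basis(vectors: np.ndarray) -> np.ndarray:
    """Orthonormal basis (as columns) of the span of the row vectors.

    Illustrative sketch: the subspace representing a word set is the
    span of its stacked word embeddings, obtained via SVD.
    """
    u, s, _ = np.linalg.svd(vectors.T, full_matrices=False)
    rank = int(np.sum(s > 1e-10))  # keep directions with non-trivial singular values
    return u[:, :rank]

def membership(query: np.ndarray, basis: np.ndarray) -> float:
    """Soft membership in [0, 1]: squared norm of the projection of the
    unit-normalized query vector onto the subspace."""
    q = query / np.linalg.norm(query)
    proj = basis @ (basis.T @ q)  # orthogonal projection onto the subspace
    return float(proj @ proj)

# Toy example with random stand-in "embeddings" (hypothetical data).
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 8))      # three words, 8-dimensional embeddings
basis = subspace_basis(emb)
print(membership(emb[0], basis))   # a word in the set projects fully onto the subspace
```

Because the projection is orthogonal, a vector lying inside the subspace scores 1, an orthogonal vector scores 0, and vectors in between receive graded scores, which is what makes the membership function "soft" in the continuous space.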