We introduce the sub-sentence encoder, a contrastively learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e., atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications such as retrieving supporting facts for fine-grained text attribution and recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders incur the same level of inference cost and space complexity as sentence encoders.
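To make the idea of per-proposition embeddings concrete, the following is a minimal sketch and not the model described above. It assumes, purely for illustration, that each proposition is marked by a binary mask over the tokens of the sequence, that a proposition embedding is the masked mean of contextual token embeddings, and that training uses an InfoNCE-style loss with in-batch negatives. The names `pool_propositions` and `info_nce` and the temperature value are hypothetical; the token embeddings stand in for the output of any contextual encoder.

```python
import numpy as np


def pool_propositions(token_embs, masks):
    """Mean-pool token embeddings under each binary proposition mask (assumed scheme).

    token_embs: (num_tokens, dim) contextual token embeddings.
    masks:      (num_props, num_tokens) array of 0/1, one row per proposition.
    Returns a (num_props, dim) array of L2-normalized proposition embeddings.
    """
    masks = masks.astype(token_embs.dtype)
    summed = masks @ token_embs
    # Avoid division by zero for an empty mask.
    counts = np.clip(masks.sum(axis=1, keepdims=True), 1.0, None)
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`.

    All other rows of `positives` act as in-batch negatives.
    The temperature is an illustrative value, not taken from the text.
    """
    logits = anchors @ positives.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    token_embs = rng.normal(size=(6, 8))       # stand-in for encoder output
    masks = np.array([[1, 1, 1, 0, 0, 0],      # proposition 1 covers tokens 0-2
                      [0, 0, 1, 1, 1, 1]])     # proposition 2 covers tokens 2-5
    props = pool_propositions(token_embs, masks)
    print(props.shape)
    print(info_nce(props, props))
```

Under these assumptions, a single pass of the underlying encoder serves every proposition in the sequence, and each proposition is stored as one fixed-length vector, just as a sentence would be.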