This work aims to build a text embedder that can capture characteristics of texts specified by user instructions. Despite the tremendous potential of such user-oriented embeddings, no previous approach provides a concrete solution. This paper offers a new viewpoint that treats the instruction as a question about the input text and encodes the expected answers to obtain the representation accordingly. Intuitively, texts with the same (implicit) semantics share similar answers following the instruction, and thus similar embeddings. Specifically, we propose InBedder, which instantiates this embed-via-answering idea by fine-tuning language models solely on abstractive question answering tasks. InBedder demonstrates significantly improved instruction-following capabilities on our proposed instruction-awareness and instruction-robustness tests, when applied to both large language models (LLMs, e.g., llama-2-7b) and smaller encoder-based LMs (e.g., roberta-large). In addition, our qualitative analysis of clustering outcomes, obtained by applying different instructions to the same corpus, demonstrates a high degree of interpretability.
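To make the embed-via-answering intuition concrete, the following is a minimal schematic sketch, not InBedder's actual implementation: in the paper, a language model fine-tuned on abstractive QA produces the answer representation, whereas here a hypothetical hard-coded `answer` function and a bag-of-words encoder stand in for the LM, purely to illustrate why texts with the same implicit semantics end up with similar embeddings under a given instruction.

```python
from collections import Counter
import math

def answer(instruction: str, text: str) -> str:
    """Stand-in for a QA-tuned LM: maps (instruction, text) to an answer.
    Hard-coded lookup table for illustration only."""
    toy_lm = {
        ("What is the sentiment?", "The movie was wonderful."): "positive",
        ("What is the sentiment?", "I loved every minute of it."): "positive",
        ("What is the sentiment?", "A dull, lifeless film."): "negative",
    }
    return toy_lm[(instruction, text)]

def embed(s: str) -> Counter:
    """Toy encoder: bag-of-words counts (a real system would use
    the LM's hidden states over the generated answer)."""
    return Counter(s.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def embed_via_answering(instruction: str, text: str) -> Counter:
    # Key idea: encode the expected answer to the instruction,
    # not the raw input text.
    return embed(answer(instruction, text))

inst = "What is the sentiment?"
e1 = embed_via_answering(inst, "The movie was wonderful.")
e2 = embed_via_answering(inst, "I loved every minute of it.")
e3 = embed_via_answering(inst, "A dull, lifeless film.")

# Two lexically different but semantically alike texts share an answer,
# hence a higher similarity than the semantically opposite pair.
print(cosine(e1, e2))  # → 1.0
print(cosine(e1, e3))  # → 0.0
```

Changing the instruction (e.g., to "What is the topic?") would change the answers and therefore the induced geometry of the embedding space, which is what makes the resulting clusterings instruction-dependent and interpretable.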