LLM agents that retrieve external knowledge typically generate a search query as text and then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, and it is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank-distillation losses, our method retains 97\% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 relative to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
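To make the training setup concrete, the following is a minimal PyTorch sketch of the three-loss objective described above, under assumptions of our own: the `ProjectionHead` architecture, the loss weights, the temperature, and all dimensions are illustrative, not the paper's actual implementation. The alignment term is rendered as a cosine loss against the teacher embedding, the contrastive term as in-batch InfoNCE, and the rank-distillation term as a KL divergence between student and teacher similarity distributions.

```python
# Hypothetical sketch of the projection head and combined training loss.
# All names, weights, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Lightweight head mapping LLM hidden states into the retriever's embedding space."""
    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim), e.g. the last-token hidden state of the LLM.
        return F.normalize(self.proj(h), dim=-1)

def combined_loss(student_q, teacher_q, doc_emb, teacher_scores,
                  tau=0.05, w_align=1.0, w_con=1.0, w_rank=1.0):
    """Alignment + contrastive + rank-distillation objective (weights are assumptions)."""
    # Alignment: pull the projected query toward the teacher model's query embedding.
    align = (1.0 - F.cosine_similarity(student_q, teacher_q, dim=-1)).mean()

    # Contrastive (InfoNCE): the i-th query should score its own positive
    # document highest among the in-batch candidates.
    logits = student_q @ doc_emb.T / tau                 # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, labels)

    # Rank distillation: match the teacher's soft ranking over the candidates.
    rank = F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(teacher_scores / tau, dim=-1),
                    reduction="batchmean")

    return w_align * align + w_con * contrastive + w_rank * rank

# Toy usage with random tensors standing in for real model outputs.
head = ProjectionHead(hidden_dim=4096, embed_dim=768)
h = torch.randn(8, 4096)                               # LLM hidden states
teacher_q = F.normalize(torch.randn(8, 768), dim=-1)   # teacher query embeddings
doc_emb = F.normalize(torch.randn(8, 768), dim=-1)     # positive document embeddings
teacher_scores = teacher_q @ doc_emb.T                 # teacher similarity matrix
loss = combined_loss(head(h), teacher_q, doc_emb, teacher_scores)
loss.backward()
```

Only the head's parameters need gradients in this sketch, which is consistent with the "lightweight" framing: the LLM backbone and the document index can stay frozen while the projection is trained.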