Conversational search uses multi-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking a severe data sparsity problem: users can conduct the same conversation in many different ways, and these alternative conversations are never recorded. Consequently, these models often struggle to generalize to the diverse conversations found in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). ConvAug first generates multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug.
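The abstract does not specify the exact form of the contrastive learning objective. As an illustration only, the following minimal sketch assumes an InfoNCE-style loss, a common choice for this kind of training, in which the embedding of an original conversation is pulled toward the embedding of an augmented version of it and pushed away from the embeddings of negative conversations. The function names `cosine` and `info_nce_loss`, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical and are not taken from ConvAug.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding (illustrative only).

    anchor:    embedding of the original conversation
    positive:  embedding of an augmented conversation with the same intent
    negatives: embeddings of conversations that should not match the anchor
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit


# Toy embeddings (assumed values, chosen only to make the example runnable).
original = [1.0, 0.0]
augmented = [0.9, 0.1]
unrelated = [[-1.0, 0.0], [0.0, 1.0]]
loss = info_nce_loss(original, augmented, unrelated)
print(f"loss = {loss:.6f}")
```

Under this formulation the loss is small when the augmented conversation sits close to the original in embedding space and grows when a negative is closer than the positive. A harder negative (one more similar to the anchor) therefore contributes a larger training signal, which is one plausible reading of why selecting challenging samples would matter.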