There has been a surge in the use of large language model (LLM) conversational agents that generate responses grounded in long-term history spanning multiple sessions. However, existing long-term open-domain dialogue datasets lack complex, real-world personalization and fail to capture implicit reasoning, where relevant information is embedded in subtle, syntactically indirect, or semantically distant connections rather than stated explicitly. In such cases, traditional retrieval methods fail to surface the relevant context, and long-context modeling becomes inefficient under the sheer volume of persona-related detail. To address this gap, we introduce ImplexConv, a large-scale long-term conversation dataset of 2,500 examples, each containing approximately 100 conversation sessions, designed to study implicit reasoning in personalized dialogues. We also propose TaciTree, a novel hierarchical tree framework that structures conversation history into multiple levels of summarization. Instead of brute-force searching all data, TaciTree enables an efficient, level-based retrieval process in which models refine their search by progressively selecting relevant details. Our experiments demonstrate that TaciTree significantly improves the ability of LLMs to reason over long-term conversations with implicit contextual dependencies.
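To make the level-based retrieval idea concrete, the sketch below shows a minimal TaciTree-style hierarchy: raw session turns sit at the leaves, summaries sit at the internal levels, and retrieval descends the tree keeping only a small beam of relevant nodes per level instead of scoring every session. The node layout, the beam width, and the keyword-overlap scorer are illustrative assumptions only; the paper's actual summarization and scoring models are not specified here, and a real system would use a semantic (e.g., embedding-based) scorer to bridge implicit connections that share no surface words.

```python
# Minimal sketch of a TaciTree-style summarization hierarchy (illustrative,
# not the authors' implementation).
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    summary: str                                        # summary text at this level
    children: List["Node"] = field(default_factory=list)
    raw_turns: List[str] = field(default_factory=list)  # leaf payload: utterances


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def relevance(query: str, text: str) -> float:
    """Toy lexical overlap scorer; a stand-in for an embedding/LLM scorer."""
    q = _tokens(query)
    return len(q & _tokens(text)) / max(len(q), 1)


def retrieve(root: Node, query: str, beam: int = 1) -> List[str]:
    """Level-based retrieval: keep only the `beam` most relevant nodes at
    each level and descend, instead of brute-force scoring every session."""
    frontier = [root]
    while any(n.children for n in frontier):
        candidates = [c for n in frontier for c in n.children]
        candidates.sort(key=lambda c: relevance(query, c.summary), reverse=True)
        frontier = candidates[:beam]
    return [turn for leaf in frontier for turn in leaf.raw_turns]


if __name__ == "__main__":
    allergy = Node("User mentions a shellfish allergy after a dinner party",
                   raw_turns=["I broke out in hives after eating shrimp."])
    hiking = Node("User plans a weekend hiking trip",
                  raw_turns=["I'm thinking of hiking the ridge trail Saturday."])
    root = Node("All sessions", children=[
        Node("Sessions about the user's health and allergy history",
             children=[allergy]),
        Node("Sessions about hobbies, travel, and outdoor plans",
             children=[hiking]),
    ])
    # Descends to the allergy session by matching level summaries first.
    print(retrieve(root, "suggest a seafood restaurant given my allergy"))
```

The design point the sketch illustrates is the cost profile: each query touches only a beam-bounded slice of every level rather than all ~100 sessions, so retrieval scales with tree depth instead of history length.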