In the field of natural language processing, open-domain chatbots have emerged as an important research topic. However, a major limitation of existing open-domain chatbot research is its singular focus on short single-session dialogue, which neglects the potential need to understand contextual information from the multiple consecutive sessions that precede an ongoing dialogue. Among the elements that compose the context in multi-session conversation settings, the time intervals between sessions and the relationships between speakers are particularly important. Despite their importance, current research has not sufficiently addressed these dialogue components. In this paper, we introduce a new 1M multi-session dialogue dataset, called Conversation Chronicles, for implementing a long-term conversation setup in which time intervals and fine-grained speaker relationships are incorporated. Following recent work, we use a large language model to produce the data. Extensive human evaluation shows that dialogue episodes in Conversation Chronicles reflect these properties while maintaining coherent and consistent interactions across all sessions. We also propose a dialogue model, called ReBot, which consists of chronological summarization and dialogue generation modules and uses only about 630M parameters. When trained on Conversation Chronicles, ReBot demonstrates long-term context understanding with a high human engagement score.
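To make the setup described in the abstract more concrete, the following is a minimal sketch of how a multi-session episode with time intervals and a speaker relationship might be represented, and how summaries of earlier sessions could be combined with the ongoing session before response generation. It is an illustration under stated assumptions, not the actual Conversation Chronicles schema or the ReBot implementation: the names `Session`, `Episode`, `build_generation_context`, the field names, and the `summarize` callable are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Session:
    # Free-text time elapsed since the previous session (hypothetical field).
    time_since_previous: str
    # Ordered (speaker, utterance) pairs.
    turns: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Episode:
    # Speaker relationship label (hypothetical field; values are made up here).
    relationship: str
    sessions: List[Session] = field(default_factory=list)


def build_generation_context(
    episode: Episode, summarize: Callable[[Session], str]
) -> str:
    """Combine chronological summaries of past sessions with the current one.

    `summarize` stands in for a summarization module; any callable that maps
    a Session to a string works for this sketch.
    """
    if not episode.sessions:
        raise ValueError("episode must contain at least one session")

    # All sessions except the last are treated as history.
    *past, current = episode.sessions

    lines = [f"Relationship: {episode.relationship}"]
    # Summaries are emitted in chronological order.
    for index, session in enumerate(past, start=1):
        lines.append(f"Session {index} summary: {summarize(session)}")
    lines.append(f"Time since last session: {current.time_since_previous}")
    lines.extend(f"{speaker}: {text}" for speaker, text in current.turns)
    return "\n".join(lines)
```

In this sketch the returned string would serve as the input to a separate dialogue generation module. Keeping summarization behind a plain callable makes it easy to swap a trained summarizer for a trivial stand-in during testing.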