Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. Through experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions lengthen, highlighting the need for further advances in lifelong learning.