LLMs are increasingly used as long-running conversational agents, yet every major benchmark that evaluates their memory treats user information as static facts to be stored and retrieved. That is the wrong model. People change their minds, and over extended interactions, phenomena such as opinion drift, over-alignment, and confirmation bias become significant. We introduce BeliefShift, a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large, under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively are poor at resisting drift, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).