Large language models (LLMs) are increasingly used as sources of historical information, motivating the need for scalable audits of how they handle contested events and politically charged narratives in settings that mirror real user interactions. We introduce \texttt{HistoricalMisinfo}, a curated dataset of $500$ contested events from $45$ countries, each paired with a factual reference narrative and a documented revisionist reference narrative. To approximate real-world usage, we instantiate each event in $11$ prompt scenarios that reflect common communication settings (e.g., questions, textbooks, social posts, policy briefs). Using an LLM-as-a-judge protocol that compares model outputs against the two references, we evaluate LLMs spanning a range of model architectures under two conditions: (i) neutral user prompts that ask for factually accurate information, and (ii) robustness prompts in which the user explicitly requests the revisionist version of the event. Under neutral prompts, models generally stay closer to the factual references, though the resulting scores should be interpreted as reference-alignment signals rather than definitive evidence of human-interpretable revisionism. Robustness prompting yields a strong and consistent effect: when the user requests the revisionist narrative, all evaluated models show sharply higher revisionism scores, indicating limited resistance or self-correction. \texttt{HistoricalMisinfo} provides a practical foundation for benchmarking robustness to revisionist framing and for guiding future work on more precise automatic evaluation of contested historical claims, supporting the sustainable integration of AI systems into society. Our code is available at https://github.com/francescortu/PreservingHistoricalTruth.
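The comparison at the core of the protocol can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge here is a toy token-overlap stub standing in for a real LLM-judge call, and all function names are hypothetical. The key idea it shows is scoring each model output against both references and reporting the difference as a revisionism signal.

```python
# Hedged sketch of the two-reference judging step (hypothetical names;
# a real pipeline would replace judge_alignment with an LLM-judge call).

def judge_alignment(output: str, reference: str) -> float:
    """Toy stand-in for an LLM judge: token-overlap alignment in [0, 1]."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(out_tokens & ref_tokens) / len(ref_tokens)

def revisionism_score(output: str, factual_ref: str, revisionist_ref: str) -> float:
    """Signed score in [-1, 1]: positive when the output aligns more closely
    with the revisionist reference than with the factual one."""
    return (judge_alignment(output, revisionist_ref)
            - judge_alignment(output, factual_ref))
```

Under this scheme, a negative score indicates the output tracks the factual reference; the robustness condition in the abstract corresponds to scores shifting positive when the user requests the revisionist narrative.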