Large language models (LLMs) are increasingly used as sources of historical information, motivating the need for scalable audits of contested events and politically charged narratives in settings that mirror real user interactions. We introduce \textsc{\texttt{HistoricalMisinfo}}, a curated dataset of $500$ contested events from $45$ countries, each paired with a factual reference narrative and a documented revisionist reference narrative. To approximate real-world usage, we instantiate each event in $11$ prompt scenarios that reflect common communication settings (e.g., questions, textbooks, social posts, policy briefs). Using an LLM-as-a-judge protocol that compares model outputs to the two references, we evaluate LLMs spanning a range of model architectures under two conditions: (i) neutral user prompts that ask for factually accurate information, and (ii) robustness prompts in which the user explicitly requests the revisionist version of the event. Under neutral prompts, models are generally closer to the factual references, though the resulting scores should be interpreted as reference-alignment signals rather than definitive evidence of human-interpretable revisionism. Robustness prompting yields a strong and consistent effect: when the user requests the revisionist narrative, all evaluated models show sharply higher revisionism scores, indicating limited resistance or self-correction. \textsc{\texttt{HistoricalMisinfo}} provides a practical foundation for benchmarking robustness to revisionist framing and for guiding future work on more precise automatic evaluation of contested historical claims, supporting the sustainable integration of AI systems into society.