Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) across different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona drawn from a set of personas modeling the general population. We converted the responses into sentence embeddings, computed cross-persona cosine similarities, and took their weighted average to obtain a factual consistency score. In the 100-persona experiments, scores ranged from 0.7896 to 0.9065, with a mean of 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is the most consistent, while several lightweight models rank lowest. Consistency also varies by topic: the job market is the least consistent topic, G7 world leaders the most consistent, and issues such as vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape factual consistency. We release our code and an interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
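To make the scoring pipeline concrete, the following is a minimal sketch of the embed-then-compare step described above, not the authors' released code. It assumes one response string per persona, an illustrative embedding model ("all-MiniLM-L6-v2"), and a uniform average over persona pairs in place of the paper's weighted average.

```python
# Sketch of a cross-persona factual consistency score (illustrative, simplified).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def consistency_score(persona_responses: list[str]) -> float:
    """Average cross-persona cosine similarity of sentence embeddings."""
    model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model choice
    embeddings = model.encode(persona_responses)            # one vector per persona response
    sims = cosine_similarity(embeddings)                    # pairwise similarity matrix
    # Keep only distinct persona pairs (upper triangle, excluding the diagonal), then average.
    iu = np.triu_indices(len(persona_responses), k=1)
    return float(sims[iu].mean())

# Example: three personas answering the same factual prompt.
responses = [
    "The G7 comprises Canada, France, Germany, Italy, Japan, the UK, and the US.",
    "G7 members are Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States.",
    "The G7 is a forum of seven major economies including the US, UK, and Japan.",
]
print(f"Consistency score: {consistency_score(responses):.4f}")
```

A score near 1.0 indicates that personas received near-identical factual content; lower scores indicate persona-dependent divergence.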