The emergent capabilities of large language models (LLMs) have prompted interest in using them as surrogates for human subjects in opinion surveys. However, prior evaluations of LLM-based opinion simulation have relied heavily on costly, domain-specific survey data, and mixed empirical results leave the reliability of such simulations in question. To enable cost-effective, early-stage evaluation, we introduce a quality control assessment designed to test the viability of LLM-simulated opinions on Likert-scale tasks without requiring large-scale human data for validation. This assessment comprises two key tests, \emph{logical consistency} and \emph{alignment with stakeholder expectations}, offering a low-cost, domain-adaptable validation tool. We apply our quality control assessment to an opinion simulation task relevant to AI-assisted content moderation and fact-checking workflows -- a socially impactful use case -- and evaluate seven LLMs using a baseline prompt engineering method (backstory prompting) as well as fine-tuning and in-context learning variants. None of the models or methods passes the full assessment, and our analysis reveals several failure modes. We conclude with a discussion of the risk management implications and release \texttt{TopicMisinfo}, a benchmark dataset pairing human annotations with LLM annotations simulated by various models and approaches, to support future research.