Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation (modifying prompts or weights so that the model answers truthfully) and lie detection (classifying whether a given response is false). Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating that they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models, including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.