Large language models (LLMs) reproduce misinformation by learning the linguistic patterns that make falsehoods persuasive, such as hedging, false presuppositions, and citation fabrication, rather than merely memorizing false facts. We propose model immunization: supervised fine-tuning on curated (false claim, correction) pairs injected as small "vaccine doses" (5-10\% of tokens) alongside truthful data. Unlike post-hoc filtering or preference-based alignment, immunization provides direct negative supervision on labeled falsehoods. Across four open-weight model families, immunization improves TruthfulQA accuracy by 12 points and misinformation rejection by 30 points with negligible capability loss. We outline design requirements (dosage, labeling, quarantine, and diversity) and call for standardized vaccine corpora and benchmarks that test generalization, making immunization a routine component of responsible LLM development.
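To make the dosing procedure concrete, the sketch below shows one way (false claim, correction) pairs could be interleaved into an SFT corpus at a fixed token budget. It is a minimal illustration under stated assumptions, not the paper's pipeline: the record format, the count_tokens and mix_vaccine helpers, and the 7\% dose are hypothetical choices made for exposition; any tokenizer exposing an encode method would do.

\begin{verbatim}
# Minimal sketch of "vaccine dose" mixing for supervised fine-tuning.
# Assumptions (not from the paper): record format, helper names, and
# the specific dose fraction are illustrative only.
import random

DOSE_FRACTION = 0.07  # target share of vaccine tokens (within 5-10%)

def count_tokens(example, tokenizer):
    """Token length of one training example."""
    return len(tokenizer.encode(example["text"]))

def mix_vaccine(truthful, vaccine_pairs, tokenizer, dose=DOSE_FRACTION):
    """Interleave (false claim, correction) pairs into truthful SFT
    data until they make up roughly `dose` of the total token budget."""
    total = sum(count_tokens(ex, tokenizer) for ex in truthful)
    # Vaccine tokens V satisfying V / (total + V) = dose:
    budget = int(dose / (1 - dose) * total)
    random.shuffle(vaccine_pairs)
    mixed, used = list(truthful), 0
    for pair in vaccine_pairs:
        # Labeling/quarantine: frame the falsehood explicitly so the
        # model receives direct negative supervision rather than raw
        # misinformation mixed into the training stream.
        example = {"text": f"False claim: {pair['claim']}\n"
                           f"Correction: {pair['correction']}"}
        used += count_tokens(example, tokenizer)
        if used > budget:
            break
        mixed.append(example)
    random.shuffle(mixed)
    return mixed
\end{verbatim}

The explicit "False claim:" / "Correction:" framing is what separates a labeled dose from simply adding misinformation to the training mix, and is one plausible reading of the labeling and quarantine requirements named in the abstract.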