Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. However, prior work primarily evaluates unlearning in static, single-turn settings, leaving the robustness of forgetting under realistic interactive use underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge that appears forgotten under static evaluation can often be recovered through interaction. Moreover, while stronger unlearning improves apparent robustness, it often yields behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need to ensure stable forgetting in interactive settings.