Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the model's true "forgetting scope." We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work that relies on external generators, BiForget exploits the target model itself, eliciting data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that BiForget achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, on the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}0.05$ while halving the total data size compared to state-of-the-art methods. Ultimately, BiForget facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
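To illustrate the seed-guided prompting idea, the following is a minimal, hypothetical sketch in which the target model itself is queried with seed entities from the forget domain so that the synthesized passages reflect its internal knowledge distribution. The model name, prompt template, sampling settings, and the helper `elicit_forget_candidates` are illustrative assumptions, not the exact configuration used by BiForget.

```python
# Minimal sketch: seed-guided elicitation from the target model (assumptions noted above).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder target model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def elicit_forget_candidates(seed: str, n_samples: int = 4) -> list[str]:
    """Ask the target model to recall what it knows about a seed entity."""
    prompt = f"Write a detailed passage about {seed}:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling promotes diversity across candidates
        temperature=0.9,
        max_new_tokens=256,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens and keep only the generated continuation.
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
        for o in outputs
    ]

# Domain-level seeds (e.g., entities from the Harry Potter domain).
seeds = ["Hermione Granger", "the Marauder's Map", "Quidditch"]
forget_set = [text for s in seeds for text in elicit_forget_candidates(s)]
```

Because the candidates are sampled from the target model rather than an external generator, a relevance/diversity filter over `forget_set` would be the natural next step in such a pipeline.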