Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Many safety and alignment failures of large language models (LLMs) arise in out-of-distribution (OOD) situations: unusual prompt or response patterns that model developers did not foresee. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). Finding failures that are truly OOD is difficult for off-the-shelf models, which are trained on vast safety datasets. We sidestep this problem by including in MOOD a restricted training set, which we use to train our own monitors, together with seven test sets of diverse alignment failures that lie outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To address this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that combining a guard model with Mahalanobis-distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model with an OOD detector: incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem. We release the code and data for our experiments at https://github.com/Dylan102938/mood-bench.
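
To make the idea of pairing a guard model with OOD detectors concrete, the following is a minimal sketch rather than the MOOD implementation. It assumes that each input already has an embedding vector, per-token log-probabilities from some language model, and a guard-model probability of misalignment. The function names, the quantile-based threshold calibration, and the "flag if any detector fires" rule are all illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np


def fit_mahalanobis(train_embeddings):
    """Estimate the mean and inverse covariance of in-distribution embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # Small ridge term (an assumption) so the covariance is safely invertible.
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(embedding, mean, cov_inv):
    """Mahalanobis distance of one embedding from the training distribution."""
    diff = embedding - mean
    return float(np.sqrt(diff @ cov_inv @ diff))


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return float(np.exp(-np.mean(token_logprobs)))


def calibrate_threshold(train_scores, quantile=0.99):
    """Hypothetical calibration: pick a high quantile of training scores."""
    return float(np.quantile(train_scores, quantile))


def monitor_flags(guard_prob, maha, ppl, guard_thr, maha_thr, ppl_thr):
    """Assumed combination rule: flag if the guard model OR either OOD detector fires."""
    return bool(guard_prob >= guard_thr or maha >= maha_thr or ppl >= ppl_thr)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(500, 4))  # synthetic stand-in for training embeddings
    mean, cov_inv = fit_mahalanobis(train)
    train_scores = [mahalanobis_score(x, mean, cov_inv) for x in train]
    maha_thr = calibrate_threshold(train_scores)

    far_input = mean + 10.0  # an embedding far from the training data
    score = mahalanobis_score(far_input, mean, cov_inv)
    # The guard model is unsure (0.1), but the OOD detector can still raise a flag.
    print(monitor_flags(0.1, score, ppl=3.0, guard_thr=0.5,
                        maha_thr=maha_thr, ppl_thr=50.0))
```

An OR-style rule like this can only raise recall relative to the guard model alone, at the cost of more false positives, so in practice the thresholds would need to be tuned against an acceptable false-positive rate.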
