Although debiased large language models (LLMs) handle familiar or low-bias prompts well, they often fail on unfamiliar, high-bias prompts. Using out-of-distribution (OOD) detection, we show that these high-bias prompts induce a distribution shift that degrades the performance of static models. To enable real-time correction, we propose CAP-TTA, a test-time adaptation framework that triggers context-aware LoRA updates only when a bias-risk score exceeds a preset threshold. A diagonal preconditioner, precomputed offline, keeps optimization fast and stable. Across multiple benchmarks and human evaluations, CAP-TTA reduces toxicity and bias scores with significantly lower latency than standard optimizers (e.g., AdamW or SGD). It also prevents catastrophic forgetting and substantially improves narrative fluency over state-of-the-art baselines without compromising debiasing performance.
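The gating and preconditioning described above can be illustrated with a minimal sketch. It is not the CAP-TTA implementation: the names `preconditioned_step` and `gated_adapt` are hypothetical, a flat list of floats stands in for the LoRA parameters, and scaling each gradient entry by the inverse of a diagonal term is only one assumed form of the preconditioner.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def preconditioned_step(params: Vector, grads: Vector, precond: Vector,
                        lr: float = 0.1, eps: float = 1e-8) -> Vector:
    """One gradient step with each coordinate scaled by 1 / (d_i + eps).

    `precond` is assumed to hold positive diagonal entries computed offline.
    """
    return [p - lr * g / (d + eps) for p, g, d in zip(params, grads, precond)]


def gated_adapt(params: Vector, risk_score: float,
                grad_fn: Callable[[Vector], Vector], precond: Vector,
                threshold: float = 0.5, lr: float = 0.1,
                steps: int = 1) -> Tuple[Vector, bool]:
    """Adapt only when the bias-risk score exceeds the threshold.

    Returns the (possibly updated) parameters and a flag telling whether
    an update was triggered. Low-risk inputs leave the parameters untouched.
    """
    if risk_score <= threshold:
        return list(params), False
    updated = list(params)
    for _ in range(steps):
        updated = preconditioned_step(updated, grad_fn(updated), precond, lr)
    return updated, True


if __name__ == "__main__":
    # Toy quadratic loss 0.5 * sum(d_i * p_i^2); its gradient is d_i * p_i.
    curvature = [4.0, 0.25]
    grad = lambda p: [d * x for d, x in zip(curvature, p)]
    print(gated_adapt([1.0, -2.0], 0.2, grad, curvature))          # no update
    print(gated_adapt([1.0, -2.0], 0.9, grad, curvature, lr=1.0))  # update
```

In this toy setting the diagonal matches the loss curvature exactly, so a single step with `lr=1.0` lands near the minimum in both coordinates despite their very different scales. That is the intuition for why a good preconditioner can need fewer steps than an unscaled gradient update.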