Although debiased large language models (LLMs) excel at handling known or low-bias prompts, they often fail on unfamiliar, high-bias prompts. Using out-of-distribution (OOD) detection, we show that these high-bias prompts induce a distribution shift that degrades the performance of static models. To enable real-time correction, we propose CAP-TTA, a test-time adaptation framework. CAP-TTA triggers context-aware LoRA updates only when a bias-risk score exceeds a set threshold. A diagonal preconditioner, precomputed offline, keeps the optimization fast and stable. Across multiple benchmarks and human evaluations, CAP-TTA reduces toxicity and bias scores with significantly lower latency than standard optimization methods (e.g., AdamW or SGD). Furthermore, it prevents catastrophic forgetting and substantially improves narrative fluency over state-of-the-art baselines without compromising debiasing performance.
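The gating and preconditioning described above can be illustrated with a minimal sketch. It is not the CAP-TTA implementation: the abstract does not specify how the bias-risk score, the adaptation loss, or the preconditioner are computed, so the class name `GatedPreconditionedUpdater`, the flat parameter list standing in for LoRA weights, and all numeric values are assumptions made for illustration. The sketch shows only the two ideas named in the text: skip the update unless the risk score exceeds a threshold, and scale each gradient coordinate by a fixed diagonal preconditioner.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GatedPreconditionedUpdater:
    """Hypothetical sketch of a threshold-gated, diagonally preconditioned step."""

    precond: List[float]      # diagonal preconditioner, assumed precomputed offline
    threshold: float = 0.5    # assumed bias-risk threshold
    lr: float = 0.1           # assumed step size
    eps: float = 1e-8         # guards against division by zero

    def step(
        self, params: List[float], grads: List[float], risk_score: float
    ) -> Tuple[List[float], bool]:
        """Return (parameters, updated?) for one test-time prompt."""
        # Gate: low-risk prompts leave the adapter parameters untouched.
        if risk_score <= self.threshold:
            return list(params), False
        # Preconditioned step: divide each gradient entry by its diagonal term.
        new_params = [
            p - self.lr * g / (d + self.eps)
            for p, g, d in zip(params, grads, self.precond)
        ]
        return new_params, True


updater = GatedPreconditionedUpdater(precond=[1.0, 4.0])
params = [1.0, -2.0]   # stand-in for flattened LoRA parameters
grads = [0.5, 0.5]     # stand-in for gradients of an adaptation loss

low_params, low_updated = updater.step(params, grads, risk_score=0.2)
high_params, high_updated = updater.step(params, grads, risk_score=0.9)
print(low_updated, low_params)
print(high_updated, high_params)
```

Because the preconditioner is fixed in advance, each triggered update needs no running moment estimates of the kind AdamW maintains, which is one plausible reading of the latency advantage claimed in the text; the sketch does not measure or verify that claim.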