Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at https://aisuko.github.io/fair-tot/.