We investigate how well Large Language Models (LLMs) detect implicit and explicit hate speech, comparing models with minimal safety alignment (uncensored) against more heavily aligned (censored) counterparts when both are deployed with political personas. While uncensored models are often framed as offering a less constrained perspective, our results reveal a trade-off: censored models outperform their uncensored counterparts in both accuracy and robustness, achieving 69.0\% versus 64.1\% strict accuracy, but this higher performance is accompanied by greater resistance to persona-based influence, whereas uncensored models are more malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming disparities in performance across targeted groups, which raise fairness concerns, and systemic overconfidence that renders the models' self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency. Taken together, these results point to censorship-as-deployed, rather than safety alignment in isolation, as the more appropriate frame for interpreting model differences.