This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents that represent specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot, few-shot, and chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness, one that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.