This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents that represent specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot, few-shot, and chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We strengthen the rigour of our performance evaluation by adopting balanced accuracy as a central metric of classification fairness, since it accounts for the trade-off between the true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
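For reference, balanced accuracy is the mean of the true positive rate (sensitivity) and the true negative rate (specificity), which is why it captures the trade-off described above. A minimal sketch (the function name and example labels are illustrative, not taken from the paper):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate (sensitivity) and true negative rate (specificity)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn)  # share of hateful examples caught
    tnr = tn / (tn + fp)  # share of benign examples kept
    return (tpr + tnr) / 2

# Hypothetical labels: 3 of 4 hateful and 1 of 2 benign examples classified correctly
print(balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]))  # → 0.625
```

Unlike plain accuracy, this score does not reward a classifier for exploiting class imbalance, which matters when toxic examples are rare for some target groups.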