Technological advances in the Internet and online social networks have brought many benefits to humanity. At the same time, this growth has fueled the spread of hate speech, a major global threat. To improve the trustworthiness of black-box models used for hate speech detection, post-hoc approaches such as LIME, SHAP, and LRP provide explanations after the classification model has been trained. In contrast, multi-task approaches based on the HateXplain benchmark learn to explain and classify simultaneously. However, results from HateXplain-based algorithms show that the predicted attention varies considerably when it should remain stable. This attention variability can lead to inconsistent interpretations, unstable predictions, and learning difficulties. To address this problem, we propose the BiAtt-BiRNN-HateXplain (Bidirectional Attention BiRNN HateXplain) model, which is easier to explain than LLMs, whose complexity conflicts with the need for transparency, and which accounts for the sequential structure of the input data during explanation thanks to a BiRNN layer. Thus, if the explanation is estimated correctly through multi-task learning (a joint explainability and classification objective), the model can classify better and commit fewer unintended-bias errors against targeted communities. Experimental results on the HateXplain data show a clear improvement in detection performance and explainability, together with a reduction in unintended bias.
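To make the multi-task idea concrete, the sketch below shows one way such a model could be wired up: a bidirectional GRU produces token representations, a per-token attention head yields the predicted explanation, and a joint loss combines classification cross-entropy with a KL term that aligns the predicted attention with HateXplain's human rationales. This is a minimal illustrative sketch assuming PyTorch; the class name, hyperparameters, and loss weighting (`BiAttBiRNNSketch`, `embed_dim`, `lam`, etc.) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiAttBiRNNSketch(nn.Module):
    """Illustrative BiRNN classifier whose attention weights are
    supervised against token-level rationales (multi-task setup)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional RNN captures the sequential structure of the input
        # in both directions, which the explanation can then draw on.
        self.birnn = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden_dim, 1)   # per-token attention logits
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids, mask):
        h, _ = self.birnn(self.embed(token_ids))             # (B, T, 2H)
        scores = self.attn_score(h).squeeze(-1)              # (B, T)
        scores = scores.masked_fill(mask == 0, -1e9)         # ignore padding
        attn = torch.softmax(scores, dim=-1)                 # predicted explanation
        context = torch.bmm(attn.unsqueeze(1), h).squeeze(1) # attention-weighted sum
        return self.classifier(context), attn

def multitask_loss(class_logits, attn, labels, rationale, mask, lam=0.5):
    """Joint loss: classification cross-entropy plus an attention-supervision
    term that pushes the predicted attention toward the human rationale mask."""
    cls_loss = F.cross_entropy(class_logits, labels)
    # Normalise the binary rationale into a target attention distribution.
    target = rationale * mask
    target = target / target.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    attn_loss = F.kl_div(torch.log(attn.clamp(min=1e-8)), target,
                         reduction="batchmean")
    return cls_loss + lam * attn_loss
```

In this sketch, `lam` trades off the two objectives: setting it to zero recovers a plain attention classifier, while larger values constrain the attention to follow the annotated rationales, which is the mechanism the abstract credits with stabilising the explanations and reducing unintended bias.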