Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labelled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs...
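The explainability confidence score described above could, under one reasonable reading, be computed as an agreement measure between the SHAP and LIME token-attribution vectors for the same input. The sketch below is a minimal illustration of that idea, not the paper's actual metric: `explanation_confidence` and the attribution values are hypothetical, and cosine similarity (rescaled to [0, 1]) stands in for whatever comparison HEARTS uses.

```python
import math

def explanation_confidence(shap_vals, lime_vals):
    """Hypothetical agreement score between two token-attribution
    vectors: cosine similarity rescaled from [-1, 1] to [0, 1].
    High values mean SHAP and LIME broadly agree on which tokens
    drive the stereotype prediction."""
    dot = sum(a * b for a, b in zip(shap_vals, lime_vals))
    norm_a = math.sqrt(sum(a * a for a in shap_vals))
    norm_b = math.sqrt(sum(b * b for b in lime_vals))
    cos = dot / (norm_a * norm_b)
    return (cos + 1.0) / 2.0

# Hypothetical token-level attributions for one input sentence;
# in practice these would come from SHAP and LIME explainers run
# on the fine-tuned ALBERT-V2 classifier.
shap_vals = [0.42, -0.05, 0.31, 0.02]
lime_vals = [0.38, -0.01, 0.27, 0.05]
score = explanation_confidence(shap_vals, lime_vals)
```

A rank-based measure such as Spearman correlation over the attributions would be an equally plausible choice if only the ordering of important tokens matters, not their magnitudes.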