Ensuring fairness in natural language processing for moral sentiment classification is challenging, particularly under the cross-domain shifts that transformer models increasingly face in deployment. Using the Moral Foundations Twitter Corpus (MFTC) and the Moral Foundations Reddit Corpus (MFRC), this work evaluates BERT and DistilBERT on multi-label classification under in-domain and cross-domain protocols. Aggregate performance can mask disparities: transfer is markedly asymmetric, with Twitter→Reddit degrading micro-F1 by 14.9% versus only 1.5% for Reddit→Twitter. Per-label analysis reveals fairness violations hidden by overall scores; notably, the authority label exhibits Demographic Parity Differences of 0.22-0.23 and Equalized Odds Differences of 0.40-0.41. To address this gap, we introduce the Moral Fairness Consistency (MFC) metric, which quantifies the cross-domain stability of moral foundation detection. MFC shows strong empirical validity: it is perfectly negatively correlated with Demographic Parity Difference (ρ = -1.000, p < 0.001) while remaining independent of standard performance metrics. Across labels, loyalty shows the highest consistency (MFC = 0.96) and authority the lowest (MFC = 0.78). These findings establish MFC as a complementary, diagnosis-oriented metric for fairness-aware evaluation of moral reasoning models, supporting more reliable deployment across heterogeneous linguistic contexts.
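The per-label fairness figures above rely on two standard group-fairness measures, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD). A minimal sketch of how they can be computed label by label follows. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes thresholded binary predictions per moral foundation label and a hypothetical group attribute `group` with at least two values, each containing both positive and negative gold examples. EOD is taken as the worst case of the true-positive-rate and false-positive-rate gaps, which matches the default aggregation of Fairlearn's `equalized_odds_difference`. All function names here are hypothetical.

```python
import numpy as np


def demographic_parity_difference(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))


def equalized_odds_difference(y_true, y_pred, group):
    """Worst case of the TPR gap and the FPR gap across groups.

    Assumes every group has at least one gold positive and one gold negative;
    otherwise the corresponding rate is undefined (NaN).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    tprs, fprs = [], []
    for g in np.unique(group):
        in_group = group == g
        tprs.append(y_pred[in_group & (y_true == 1)].mean())  # true-positive rate
        fprs.append(y_pred[in_group & (y_true == 0)].mean())  # false-positive rate
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))


def per_label_fairness(Y_true, Y_pred, group, labels):
    """Apply both measures to each column of a multi-label prediction matrix."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return {
        name: (
            demographic_parity_difference(Y_pred[:, j], group),
            equalized_odds_difference(Y_true[:, j], Y_pred[:, j], group),
        )
        for j, name in enumerate(labels)
    }
```

For instance, `per_label_fairness(Y_true, Y_pred, group, ["loyalty", "authority"])` returns a `(DPD, EOD)` pair for each label, where `Y_true` and `Y_pred` are 0/1 arrays of shape `(n_examples, n_labels)`. Reporting the measures per label, rather than pooled, is what allows a disparity confined to a single label such as authority to surface.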