As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. On structured datasets from finance, healthcare, and socioeconomics, we evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise-framework/raise.
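To make the aggregation concrete, the sketch below shows one plausible way a Responsibility Score could combine the four dimension scores. The abstract does not state RAISE's actual aggregation rule, so the weighted-average form, the function name `responsibility_score`, and the example score values are all illustrative assumptions, not the framework's implementation.

```python
def responsibility_score(scores, weights=None):
    """Aggregate per-dimension scores into a single Responsibility Score.

    Hypothetical sketch: assumes each dimension is already normalized to
    [0, 1] and that the aggregation is a weighted average. The real RAISE
    aggregation rule may differ.

    scores  -- dict mapping dimension name to a score in [0, 1]
    weights -- optional dict of non-negative weights (defaults to equal)
    """
    dims = ("explainability", "fairness", "robustness", "sustainability")
    if weights is None:
        weights = {d: 1.0 for d in dims}
    total = sum(weights[d] for d in dims)
    return sum(weights[d] * scores[d] for d in dims) / total


# Illustrative (made-up) dimension scores for a single model:
model_scores = {
    "explainability": 0.82,
    "fairness": 0.75,
    "robustness": 0.68,
    "sustainability": 0.40,
}
print(responsibility_score(model_scores))  # equal-weight average
```

A weighted form lets practitioners encode domain priorities, e.g. weighting fairness more heavily in a healthcare deployment than in a recommender system.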