High-quality explanations strengthen our understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are post-hoc explainers that can provide token-level insights. However, explanations of the same input may vary greatly due to the underlying biases of different methods. Users aware of this issue may distrust their utility, while unaware users may place undue trust in them. In this work, we look beyond the superficial inconsistencies between attribution methods and structure their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (the what and the where in the input) for two transformers: first in a controlled, pseudo-random classification task on artificial data, then in a semi-controlled causal relation detection task on natural data. Our model comparison reveals a trade-off between lexical and position bias: models that score high on one tend to score low on the other. We also find signs that anomalous explanations are more likely to be biased.
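For readers unfamiliar with Integrated Gradients, the sketch below illustrates its core computation: attributions are the input-minus-baseline difference scaled by the average gradient along the straight-line path from baseline to input. This is a minimal NumPy sketch on a toy analytic function, not the paper's experimental setup; the function names and the midpoint-rule approximation are illustrative choices.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions.

    f_grad:   function returning the gradient of the model output w.r.t. its input
    x:        input vector
    baseline: reference input (often a zero vector)
    """
    # Midpoint-rule Riemann approximation of the path integral from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model: f(x) = sum(x_i^2), so grad f = 2x; the exact IG attribution
# from a zero baseline is x_i^2 per dimension.
f_grad = lambda x: 2 * x
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(f_grad, x, np.zeros_like(x))
```

By the completeness axiom, the attributions sum to `f(x) - f(baseline)`; here they recover `x**2` per token/dimension, which is one way to sanity-check an implementation.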