Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this view can be made literal rather than metaphorical. We study the special case in which a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighting of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring a single reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
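As a minimal sketch of the correspondence invoked above (the symbols $\pi_0$, $r$, $\beta$, $e$, and $c$ are illustrative notation, not the paper's): the KL-regularized soft update and the Bayesian posterior take the forms
$$
\pi^*(a) \;\propto\; \pi_0(a)\, e^{r(a)/\beta},
\qquad
p(a \mid e) \;\propto\; p(a)\, p(e \mid a).
$$
Matching the two with $\pi_0(\cdot) = p(\cdot)$ forces $r(a) = \beta \log p(e \mid a) + c(e)$ for some baseline $c$ depending only on the evidence, so the update pins down reward differences $r(a) - r(a')$ within a context while the absolute level, the context-specific baseline $c(e)$, stays free. For two pieces of evidence, the chain rule $p(e_1, e_2 \mid a) = p(e_1 \mid a)\, p(e_2 \mid a, e_1) = p(e_2 \mid a)\, p(e_1 \mid a, e_2)$ is one way to read the coherence constraint tying together the reward descriptions obtained under different conditioning orders.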