SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR


Siddharth Aphale, Kelly Liu

From arXiv; 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026.

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards with success probability $p$ and group size $g$, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B (five depths, three seeds) and DeepSeek-Coder-6.7B (four matched depths, three seeds). On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early-GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
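The variance expression and the threshold quoted in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes rewards within a group are i.i.d. Bernoulli($p$), that the variance is the biased (divide-by-$g$) estimate, and that $p^*(g)$ is the success probability at which a group is all-zero with probability $1/2$, i.e. $p^*(g) = 1 - 2^{-1/g}$. The abstract does not define $p^*(g)$; this assumed definition is used because it reproduces the quoted value $p^*(8) = 0.083$. All function names are hypothetical.

```python
def expected_group_variance(p: float, g: int) -> float:
    """Expected biased (divide-by-g) variance of g i.i.d. Bernoulli(p) rewards.

    This matches the abstract's expression p(1-p)(g-1)/g.
    """
    return p * (1.0 - p) * (g - 1) / g


def prob_degenerate_group(p: float, g: int) -> float:
    """Probability that all g rewards are identical (all 0 or all 1).

    Such a group has zero within-group variance, so it contributes
    no group-relative advantage signal.
    """
    return (1.0 - p) ** g + p ** g


def p_star_assumed(g: int) -> float:
    """ASSUMED threshold: the p at which P(all g rewards are 0) equals 1/2.

    Solving (1 - p)^g = 1/2 gives p = 1 - 2^(-1/g).
    """
    return 1.0 - 0.5 ** (1.0 / g)


if __name__ == "__main__":
    g = 8
    p_star = p_star_assumed(g)
    print(f"assumed p*({g}) = {p_star:.3f}")
    for p in (0.02, p_star, 0.3, 0.5):
        print(
            f"p={p:.3f}  E[var]={expected_group_variance(p, g):.4f}  "
            f"P(degenerate)={prob_degenerate_group(p, g):.3f}"
        )
```

Under these assumptions, a checkpoint whose success rate sits below the threshold yields mostly all-zero groups, which is consistent with the abstract's statement that such groups carry no group-relative signal.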
