A growing failure mode in agent evaluation and training is that models achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, so their measured performance overstates their true task-solving ability and the scores become unreliable. We propose CapCode, a framework for constructing coding datasets with randomized tests in which the best performance achievable without cheating is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: a score substantially above the cap is implausible for an honest solver and therefore provides evidence of cheating. To discourage cheating, we propose CapReward, a reward design based on the CapCode principle that discourages optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
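To make the capped-score reasoning concrete, the following is a minimal sketch and not the procedure used in this work. It assumes, purely for illustration, that an honest solver passes each randomized test independently with probability at most `cap`, so the number of passed tests is bounded by a binomial distribution. The names `tail_probability`, `looks_like_cheating`, `capped_reward`, and the threshold `alpha` are hypothetical, and the reward shape is only one possible way to discourage scores above the cap.

```python
from math import comb


def tail_probability(passed: int, total: int, cap: float) -> float:
    """P(X >= passed) for X ~ Binomial(total, cap).

    Assumption: an honest solver passes each test independently
    with probability at most `cap`.
    """
    return sum(
        comb(total, k) * cap**k * (1.0 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )


def looks_like_cheating(passed: int, total: int, cap: float,
                        alpha: float = 0.01) -> bool:
    """Flag a result whose score is implausibly far above the cap.

    `alpha` is a hypothetical significance threshold.
    """
    return tail_probability(passed, total, cap) < alpha


def capped_reward(score: float, cap: float) -> float:
    """Hypothetical reward: grows with the score up to the cap,
    then decreases by the amount the score exceeds the cap."""
    if score <= cap:
        return score
    return cap - (score - cap)
```

Under these assumptions, a model that passes every one of 100 tests against a cap of 0.8 would be flagged, while one that passes 80 of them would not. The penalty above the cap is a design choice for the sketch; a reward that simply flattens at the cap would remove the incentive to exceed it without actively penalizing it.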