A growing failure mode in agent evaluation and training is that models achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, which produces deceptively strong results. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: a score substantially above the cap is implausible for an honest solver and therefore provides evidence of cheating. To discourage cheating during training, we propose CapReward, a reward design based on the CapCode principle that penalizes optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
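The capped-score idea can be illustrated with a minimal sketch. It is not the paper's construction: it assumes, purely for illustration, that each randomized test is independent and that an honest solver passes it with probability at most `cap`, so that the number of passed tests is bounded by a binomial distribution. The names `tail_probability`, `looks_like_cheating`, `capped_reward`, and the threshold `alpha` are hypothetical. The `capped_reward` function shows one simple way to make scores above the cap unattractive and should not be read as the definition of CapReward.

```python
from math import comb


def tail_probability(passed: int, total: int, cap: float) -> float:
    """P[X >= passed] for X ~ Binomial(total, cap).

    Assumption (not from the paper): tests are independent and an honest
    solver passes each one with probability at most `cap`.
    """
    return sum(
        comb(total, k) * cap**k * (1 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )


def looks_like_cheating(passed: int, total: int, cap: float,
                        alpha: float = 0.01) -> bool:
    """Flag a result that is implausibly far above the cap.

    `alpha` is a hypothetical significance threshold.
    """
    return tail_probability(passed, total, cap) < alpha


def capped_reward(score: float, cap: float) -> float:
    """Illustrative reward that peaks at the cap (hypothetical form).

    Scores up to the cap are rewarded as-is; any excess above the cap
    is subtracted, so exceeding the cap never pays off.
    """
    if score <= cap:
        return score
    return cap - (score - cap)


if __name__ == "__main__":
    # With a cap of 0.8, passing 80 of 100 tests is unremarkable,
    # while passing all 100 is very unlikely for an honest solver.
    print(looks_like_cheating(80, 100, 0.8))
    print(looks_like_cheating(100, 100, 0.8))
    print(capped_reward(1.0, 0.8))
```

Under these assumptions, a perfect score on many tests becomes strong evidence of a shortcut rather than a sign of a strong model, which is the interpretation the capped design is meant to provide.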