When training powerful AI systems to perform complex tasks, it can be difficult to provide training signals that are robust to optimization. One concern is \textit{measurement tampering}, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, we are given a set of text inputs, measurements intended to determine whether some outcome occurred, and a base model that can accurately predict those measurements. The goal is to determine, for examples where all measurements indicate that the outcome occurred, whether the outcome actually occurred or whether the measurements were tampered with. We demonstrate techniques that outperform simple baselines on most datasets but do not achieve maximum performance. We believe there is significant room for improvement in both techniques and datasets, and we are excited for future work tackling measurement tampering.
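To make the task setup concrete, the following is a minimal sketch of how a detector could be scored under this framing. It is illustrative only and assumes several things the abstract does not state: the names `Example`, `evaluate_detector`, and `score_fn` are hypothetical, the ground-truth `outcome` label is assumed to be available at evaluation time, and AUROC is used here as one plausible metric for separating real positives from tampered ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    """One data point (hypothetical structure, for illustration only)."""
    text: str                      # text input shown to the model
    measurements: Sequence[bool]   # each measurement's verdict on the outcome
    outcome: bool                  # ground truth, assumed known only for evaluation


def auroc(scores: List[float], labels: List[bool]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_detector(examples: List[Example],
                      score_fn: Callable[[Example], float]) -> float:
    """Score a detector only on examples where every measurement is positive.

    Within that subset, examples with outcome=True are real positives and
    examples with outcome=False are assumed to result from tampering.
    """
    candidates = [ex for ex in examples if all(ex.measurements)]
    scores = [score_fn(ex) for ex in candidates]
    labels = [ex.outcome for ex in candidates]
    return auroc(scores, labels)
```

In this sketch, `score_fn` stands in for any detection technique that returns a higher value when it believes the outcome really occurred. Examples with at least one negative measurement are excluded from scoring, since the question of interest only arises when all measurements agree that the outcome occurred.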