When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust to optimization. One concern is \textit{measurement tampering}, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, we are given sets of text inputs and measurements intended to determine whether some outcome occurred, as well as a base model that can accurately predict those measurements. The goal is to determine, for examples where all measurements indicate that the outcome occurred, whether the outcome actually occurred or whether the measurements were tampered with. We demonstrate techniques that outperform simple baselines on most datasets but do not achieve maximum performance. We believe there is significant room for improvement in both techniques and datasets, and we are excited for future work tackling measurement tampering.
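To make the task setup concrete, the following is a minimal sketch of how a detector might be scored in this setting. It is an illustration under stated assumptions, not the evaluation code of this work: the `Example` fields, the `score` produced by some hypothetical detector, and the choice of AUROC as the summary metric are all assumptions that the text above does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    """One data point; all field names are hypothetical."""
    text: str                      # text input
    measurements: Sequence[bool]   # each measurement's verdict on whether the outcome occurred
    outcome: bool                  # ground truth, assumed available only for evaluation
    score: float                   # hypothetical detector score; higher means "more likely real"


def all_positive(examples: List[Example]) -> List[Example]:
    """Keep only examples where every measurement says the outcome occurred."""
    return [ex for ex in examples if all(ex.measurements)]


def auroc(real_scores: List[float], fake_scores: List[float]) -> float:
    """Probability that a random real positive outscores a random fake positive.

    Ties count as half a win. Quadratic time, which is fine for a sketch.
    """
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake positive")
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def tampering_detection_auroc(examples: List[Example]) -> float:
    """Score a detector on separating real positives from tampered (fake) positives."""
    candidates = all_positive(examples)
    real = [ex.score for ex in candidates if ex.outcome]
    fake = [ex.score for ex in candidates if not ex.outcome]
    return auroc(real, fake)
```

Examples with at least one negative measurement are excluded before scoring, because the question posed above only concerns cases where all measurements indicate that the outcome occurred.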