Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful than those of traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT. To measure faithfulness, we test whether models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue "A Stanford Professor thinks the answer is D" is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini. We evaluate seven types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate the cues that influence them much more reliably than all six non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate the cue close to 0% of the time. However, our study has important limitations. We evaluate only two ITC models; we cannot evaluate OpenAI's SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes. We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. To speed up this investigation, we release these early results as a research note.
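The articulation figures quoted above (for example 54% versus 14%) are conditional rates: among the cases where adding the cue switches the model's answer to the cued option, how often does the CoT mention the cue? The Python sketch below shows one plausible way to compute such a rate. The `Trial` record, its field names, and the `articulation_rate` helper are hypothetical names chosen for illustration. The rule used to decide what counts as a switch, and the assumption that some external judge has already labelled whether the CoT mentions the cue, are also assumptions and are not taken from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One MMLU question answered without and with a cue (hypothetical layout)."""
    answer_without_cue: str  # model's answer to the plain question
    answer_with_cue: str     # model's answer once the cue is added
    cued_answer: str         # option the cue points to, e.g. "D"
    cot_mentions_cue: bool   # assumed label: does the CoT acknowledge the cue?


def articulation_rate(trials: List[Trial]) -> Optional[float]:
    """Fraction of cue-induced switches whose CoT articulates the cue.

    A trial counts as a switch (an assumption of this sketch) when the model
    did not pick the cued option without the cue but does pick it with the cue.
    Returns None when no trial switched.
    """
    switched = [
        t for t in trials
        if t.answer_without_cue != t.cued_answer
        and t.answer_with_cue == t.cued_answer
    ]
    if not switched:
        return None
    return sum(t.cot_mentions_cue for t in switched) / len(switched)


# Toy data, made up purely to exercise the function (not results from the study).
toy_trials = [
    Trial("A", "D", "D", True),   # switched to the cue and articulated it
    Trial("B", "D", "D", False),  # switched to the cue without articulating it
    Trial("D", "D", "D", True),   # already answered D, so not a switch
    Trial("C", "C", "D", False),  # ignored the cue, so not a switch
]
print(articulation_rate(toy_trials))
```

Conditioning on switched answers matters: a model that mentions the cue but is not influenced by it, or that already preferred the cued option, says little about whether its CoT reports what actually changed the answer.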