We study red teaming of LLMs on elementary calculation and algebraic tasks to evaluate how different prompting techniques affect output quality. We present a framework for procedurally generating numerical questions and puzzles, and compare results with and without several red teaming techniques. Our findings suggest that, although structured reasoning and worked-out examples slow the deterioration in answer quality, the gpt-3.5-turbo and gpt-4 models are not well suited to elementary calculation and reasoning tasks, even when red teamed.