Large language models (LLMs) for code are typically trained to align with natural language instructions so that they closely follow the user's intentions and requirements. In many practical scenarios, however, these models struggle to navigate the boundary between helpfulness and safety, especially when given highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic analyzes the given task and the corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools such as web search and a code interpreter. We engage the dual critic system in both the code generation stage and the code execution stage, providing preemptive and post-hoc guidance to LLMs, respectively. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach provides advanced critiques covering both safety and helpfulness analysis, significantly improving the quality of the output code ($+10\%$ absolute improvement across all models).
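The two-stage, dual-critic flow described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions and not the INDICT implementation: the names `Actor`, `Critic`, `Executor`, `critic_dialogue`, and `generate_with_critics` are hypothetical, the critics are plain callables standing in for LLM calls, and the external-knowledge lookups (code snippets, web search, code interpreter) are left out. The exact order in which the critics speak and how their critiques are fed back to the model are also assumptions.

```python
from typing import Callable, List

# Hypothetical signatures, chosen for illustration only.
Actor = Callable[[str, str], str]        # (task, guidance) -> code
Critic = Callable[[str, str, str], str]  # (task, code, context) -> critique
Executor = Callable[[str], str]          # code -> execution observation


def critic_dialogue(task: str, code: str, safety_critic: Critic,
                    helpful_critic: Critic, context: str = "",
                    rounds: int = 1) -> str:
    """Alternate the two critics; each sees the other's latest critique."""
    transcript: List[str] = []
    for _ in range(rounds):
        safety_note = safety_critic(task, code, context)
        helpful_note = helpful_critic(task, code, safety_note)
        transcript += [f"[safety] {safety_note}",
                       f"[helpfulness] {helpful_note}"]
        context = helpful_note  # feeds the next round's safety critic
    return "\n".join(transcript)


def generate_with_critics(task: str, actor: Actor, safety_critic: Critic,
                          helpful_critic: Critic, executor: Executor) -> str:
    """Assumed flow: draft, preemptive critique, revise, run, post-hoc critique, revise."""
    # Generation stage: critique the first draft before anything is run.
    draft = actor(task, "")
    preemptive = critic_dialogue(task, draft, safety_critic, helpful_critic)
    revised = actor(task, preemptive)

    # Execution stage: run the revision and critique it given what was observed.
    observation = executor(revised)
    post_hoc = critic_dialogue(task, revised, safety_critic, helpful_critic,
                               context=observation)
    return actor(task, post_hoc)
```

In this sketch the actor is called three times: once for the initial draft, once with the preemptive critiques, and once with the post-hoc critiques that take the execution result into account. A real system would replace each callable with a prompted model and a sandboxed executor.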