Code Soliloquies for Accurate Calculations in Large Language Models

High-quality conversational datasets are integral to the successful development of Intelligent Tutoring Systems (ITS) that employ a Large Language Model (LLM) backend. When used to fine-tune the LLM backend, these datasets significantly enhance the quality of interactions between students and the ITS. A common strategy for developing these datasets is to generate synthetic student-teacher dialogues using GPT-4. However, challenges arise when these dialogues demand complex calculations, which are common in subjects like physics. Despite its advanced capabilities, GPT-4 does not reliably handle even simple multiplication tasks, which significantly limits its utility for these subjects. To address these challenges, this paper introduces a stateful prompt design. Our approach generates a mock conversation between a student and a tutorbot, with both roles simulated by GPT-4. Each student response triggers a soliloquy (an inner monologue) in the GPT-tutorbot, which assesses whether its response would require calculations. If so, the tutorbot scripts the required code in Python and then uses the resulting output to construct its response to the student. Our approach notably enhances the quality of synthetic conversation datasets, especially for calculation-intensive subjects. Our findings show that our Higgs model, a LLaMA model fine-tuned on datasets generated through our stateful prompt design, proficiently uses Python for computations. Consequently, fine-tuning on our datasets enriched with code soliloquies improves both the accuracy and the computational reliability of Higgs' responses.
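
The soliloquy step described above can be illustrated with a minimal sketch. This is not the paper's prompt or implementation: the function names (`run_python`, `tutorbot_turn`, `mock_soliloquy`), the dictionary layout of a soliloquy, and the example physics values are all hypothetical, and the GPT-4 call is replaced by a hard-coded stand-in so the example runs on its own.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a code string and return whatever it printed.

    Assumption: plain exec() with no sandboxing, acceptable only for a sketch.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def tutorbot_turn(student_message, soliloquy_fn):
    """One tutorbot turn: soliloquy, optional code execution, then the reply."""
    # The soliloquy decides whether the reply depends on a calculation.
    soliloquy = soliloquy_fn(student_message)
    if soliloquy["needs_calculation"]:
        # The computed output, not the model's own arithmetic, goes into the reply.
        result = run_python(soliloquy["code"])
        reply = soliloquy["reply_template"].format(result=result)
    else:
        reply = soliloquy["reply_template"]
    return {"soliloquy": soliloquy, "reply": reply}


def mock_soliloquy(student_message):
    """Hypothetical stand-in for the GPT-4 soliloquy; a real system would query the model."""
    if any(ch.isdigit() for ch in student_message):
        return {
            "needs_calculation": True,
            "code": "mass = 12.5\nacceleration = 4.0\nprint(mass * acceleration)",
            "reply_template": "Let's check: F = m * a gives {result} N.",
        }
    return {
        "needs_calculation": False,
        "code": None,
        "reply_template": "Which formula do you think applies here?",
    }


turn = tutorbot_turn("I got F = 12.5 * 4 = 48 N, is that right?", mock_soliloquy)
print(turn["reply"])
```

In this sketch the tutorbot's numeric claim comes from executed Python rather than from text generation, which is the property the soliloquy design aims for; how the actual prompts encode the soliloquy and its state is not specified in this abstract.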
