Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs). To address this, we developed LLaMa-SciQ, an educational chatbot designed to help college students solve and understand MCQs in STEM fields. We first fine-tune candidate models and align them to human preferences. After comparing the performance of Mistral-7B and LLaMa-8B, we selected the latter as the base model due to its higher evaluation accuracy. To further improve accuracy, we implement Retrieval-Augmented Generation (RAG) and apply quantization to compress the model, reducing inference time and making the system more accessible to students. On mathematical reasoning, LLaMa-SciQ achieves 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset. However, RAG does not improve performance and even degrades it, likely due to retriever issues or the model's unfamiliarity with the retrieved context. Despite this, the quantized model loses only about 5% in performance, demonstrating significant efficiency gains.
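To make the quantization claim concrete, the following is a minimal, self-contained sketch of the core idea behind weight quantization (symmetric per-tensor int8 here); it is an illustrative toy, not the actual compression pipeline used for LLaMa-SciQ:

```python
def quantize_int8(weights):
    """Map float weights symmetrically onto the int8 range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]

# Toy weight vector standing in for a model tensor.
weights = [0.8, -1.27, 0.03, 0.5]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half the quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing 8-bit integers instead of 16- or 32-bit floats shrinks the model and speeds up inference, at the cost of the bounded rounding error shown above; the abstract's reported ~5% accuracy drop reflects this trade-off at model scale.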