In the rapidly evolving field of natural language processing, translating linguistic descriptions into mathematical formulations of optimization problems remains a formidable challenge that demands intricate understanding and processing capabilities from Large Language Models (LLMs). This study compares prominent LLMs, including GPT-3.5, GPT-4, and Llama-2-7b, in zero-shot and one-shot settings for this task. Our findings show GPT-4's superior performance, particularly in the one-shot scenario. A central part of this research is the introduction of "LM4OPT", a progressive fine-tuning framework for Llama-2-7b that utilizes noisy embeddings and specialized datasets. However, this research highlights a notable gap in contextual understanding between smaller models such as Llama-2-7b and their larger counterparts, especially when processing lengthy and complex input contexts. Our empirical investigation on the NL4Opt dataset shows that GPT-4 surpasses the baseline performance established by previous research, achieving an F1-score of 0.63 based solely on the natural-language problem description, without relying on any additional named entity information. GPT-3.5 follows closely, and both outperform the fine-tuned Llama-2-7b. These findings not only benchmark the current capabilities of LLMs in a novel application area but also lay the groundwork for future improvements in the mathematical formulation of optimization problems from natural language input.
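To make the reported metric concrete, the following is a minimal sketch of how an F1-score over predicted and reference formulations could be computed. The abstract does not state the matching procedure, so this sketch assumes that each formulation is a list of canonicalized declaration strings (objective and constraints) compared by exact match and micro-averaged over problems. The function name `declaration_f1` and the example declarations are hypothetical and are not taken from the NL4Opt dataset.

```python
from collections import Counter


def declaration_f1(predicted, gold):
    """Micro-averaged F1 over per-problem lists of declaration strings.

    Assumption: declarations are already canonicalized, so exact string
    equality is a meaningful match criterion.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_counts, ref_counts = Counter(pred), Counter(ref)
        # Multiset intersection counts declarations present in both lists.
        overlap = sum((pred_counts & ref_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(ref_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one problem, one wrong constraint out of three.
gold = [["max 3x + 2y", "x + y <= 10", "x >= 0"]]
predicted = [["max 3x + 2y", "x + y <= 10", "y >= 0"]]
score = declaration_f1(predicted, gold)
```

In this example two of the three predicted declarations match the reference, so precision and recall are both 2/3 and the F1-score is 2/3.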