LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in the operations research domain focus solely on objective values and lack detailed annotations of the modeling process, such as variable definitions, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on the StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions.
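The core idea of reward-guided beam search over a reasoning tree can be sketched as follows. This is a minimal illustration, not the paper's implementation: `expand` and `process_reward` are hypothetical stand-ins for, respectively, LLM-generated modeling steps and a learned process reward model scoring partial solutions.

```python
import heapq

def expand(state):
    # Hypothetical branching: each partial solution yields three children.
    # In practice these would be candidate next modeling steps from an LLM.
    return [state + [c] for c in ("a", "b", "c")]

def process_reward(state):
    # Stand-in scorer: in BPP-Search, a learned process reward model
    # would rate the quality of a partial mathematical model.
    return state.count("a")

def beam_search(init, beam_width=2, depth=3):
    beam = [init]
    for _ in range(depth):
        candidates = [child for s in beam for child in expand(s)]
        # Keep only the top-scoring partial solutions at each level,
        # avoiding exhaustive enumeration of the full tree.
        beam = heapq.nlargest(beam_width, candidates, key=process_reward)
    return beam[0]

best = beam_search([])
```

Pruning to a fixed beam width at every depth is what keeps the search cost linear in depth rather than exponential; the pairwise preference component of BPP-Search (not shown here) would further refine ranking among closely scored candidates.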