The literature contains widespread claims about the emergent reasoning capabilities of pretrained large language models (LLMs). However, recent studies have found that their ability to plan remains questionable. Through experiments with GPT-2, we empirically demonstrate that a finetuned baseline performs poorly because the plans it generates violate the preconditions of actions. To improve the planning capabilities of a finetuned LLM, we train a verifier that classifies an action as valid or invalid in a given state. We obtain examples of invalid actions by randomly sampling actions from the same dataset, and use them to train the verifier to check action applicability. By combining diverse sampling from the generator with a verifier that prunes invalid trajectories, we show significant gains in success rate on the Blocksworld domain. Additionally, we show that a verifier obtained by finetuning the GPT-2 generator itself generalizes better than one finetuned from the base GPT-2. Lastly, we investigate the role of the sampling temperature, which controls the exploration-exploitation tradeoff.
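The two mechanisms described in the abstract, building invalid-action examples by random sampling and pruning sampled actions with a verifier, can be illustrated with a minimal sketch. Everything named here is hypothetical: the string encoding of states and actions, the functions `build_verifier_examples` and `sample_and_prune`, and the `is_valid` callable that stands in for the trained verifier. The sketch also assumes that a randomly drawn replacement action is inapplicable, which may occasionally be false and would then yield a noisy negative label.

```python
import random

def build_verifier_examples(pairs, seed=0):
    """Turn (state, action) pairs from valid plans into labelled examples.

    Label 1 marks the original action; label 0 marks a different action
    drawn at random from the same dataset (assumed, not checked, to be invalid).
    """
    rng = random.Random(seed)
    actions = sorted({action for _, action in pairs})
    examples = []
    for state, action in pairs:
        examples.append((state, action, 1))
        candidates = [a for a in actions if a != action]
        if candidates:
            examples.append((state, rng.choice(candidates), 0))
    return examples

def sample_and_prune(propose, is_valid, state, num_samples, seed=0):
    """Draw candidate actions from a generator and keep the distinct ones
    that the verifier accepts. `propose` and `is_valid` are placeholders
    for the finetuned generator and the trained verifier."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        action = propose(state, rng)
        if is_valid(state, action) and action not in kept:
            kept.append(action)
    return kept

# Toy Blocksworld-style data (hypothetical encoding).
PAIRS = [
    ("on(a,b) clear(a) handempty", "unstack a b"),
    ("holding(a) clear(b)", "put-down a"),
    ("ontable(a) clear(a) handempty", "pick-up a"),
]

if __name__ == "__main__":
    for example in build_verifier_examples(PAIRS):
        print(example)
    # Stand-ins: a uniform random generator and a hand-written validity rule.
    propose = lambda state, rng: rng.choice(["pick-up a", "put-down a"])
    is_valid = lambda state, action: action == "pick-up a"
    print(sample_and_prune(propose, is_valid, PAIRS[2][0], num_samples=10))
```

In a full system the generator would sample whole trajectories at a chosen temperature, and a trajectory would be discarded as soon as the verifier rejects one of its actions; the sketch prunes single actions only to keep the idea visible.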