Many human activities, e.g. cooking, repairing, and manufacturing, can be abstracted into a sequence of actions described in natural text. Such action sequences depend heavily on execution order, and a disordered sequence causes subsequent task execution by robots or AI agents to fail. To assess the order reasoning capability of current neural models in sequential tasks, we propose a challenging benchmark named STEPS. STEPS comprises two subtask settings: the first asks whether a given next step in a recipe is reasonable, and the second asks the model to select the reasonable next step in a multiple-choice question. We describe the data construction and task formulations, and benchmark most of the prominent Large Language Models (LLMs). The experimental results demonstrate that 1) commonsense reasoning about action order in sequential tasks is challenging for LLMs to resolve via zero-shot prompting or few-shot in-context learning; and 2) prompting methods still lag significantly behind tuning-based methods on STEPS.
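The two subtask settings can be illustrated with a minimal sketch. The abstract does not specify the data schema or prompt wording, so every class name, field name, and prompt template below is a hypothetical assumption made for illustration, not the STEPS format itself.

```python
from dataclasses import dataclass
from typing import List, Sequence


# All names and fields here are hypothetical; the abstract does not
# specify the actual STEPS data schema.
@dataclass
class NextStepExample:
    """Subtask 1 (assumed shape): judge whether a candidate next step is reasonable."""
    context: List[str]  # recipe steps performed so far, in order
    candidate: str      # proposed next step
    label: bool         # True if the candidate is a reasonable next step


@dataclass
class MultiChoiceExample:
    """Subtask 2 (assumed shape): pick the reasonable next step among options."""
    context: List[str]  # recipe steps performed so far, in order
    choices: List[str]  # candidate next steps
    answer: int         # index of the reasonable choice


def format_context(steps: Sequence[str]) -> str:
    """Number the steps from 1 so their order is explicit in the prompt."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))


def binary_prompt(ex: NextStepExample) -> str:
    """Build an illustrative zero-shot prompt for the binary setting."""
    return (
        f"Recipe so far:\n{format_context(ex.context)}\n"
        f"Proposed next step: {ex.candidate}\n"
        "Is this a reasonable next step? Answer yes or no."
    )


def multi_choice_prompt(ex: MultiChoiceExample) -> str:
    """Build an illustrative zero-shot prompt for the multiple-choice setting."""
    options = "\n".join(
        f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(ex.choices)
    )
    return (
        f"Recipe so far:\n{format_context(ex.context)}\n"
        f"Which step should come next?\n{options}\n"
        "Answer with a single letter."
    )


def accuracy(predictions: Sequence, gold: Sequence) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or len(gold) == 0:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Under these assumptions, the same `accuracy` helper scores both settings: boolean predictions against `label` in the first, and predicted indices against `answer` in the second.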