We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with tasks that require domain-specific knowledge, which leads to higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt a broader range of problems but fail by proposing physically infeasible actions. Finally, we provide a detailed error analysis of LLMs and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.