NovPhy: A Testbed for Physical Reasoning in Open-world Environments

As AI systems increasingly interact with the physical environment, there is growing interest in giving them physical reasoning capabilities. But is physical reasoning alone enough to operate in a real physical environment? In the real world, we constantly face novel situations that we have not encountered before, and humans are adept at adapting to them. Similarly, an agent must be able to function under the impact of novelties to operate properly in an open-world physical environment. To facilitate the development of such AI systems, we propose a new testbed, NovPhy, that requires an agent to reason about physical scenarios in the presence of novelties and take actions accordingly. The testbed consists of tasks that require agents to detect and adapt to novelties in physical scenarios. To create the tasks, we develop eight novelties representing a diverse novelty space and apply them to five commonly encountered scenarios in a physical environment. Following this design, we evaluate two capabilities of an agent: its performance on a novelty when that novelty is applied to different physical scenarios, and its performance on a physical scenario when different novelties are applied to it. We conduct a thorough evaluation with human players, learning agents, and heuristic agents. The evaluation shows that human performance far exceeds that of the agents. Some agents, even those with good performance on normal tasks, perform significantly worse when a novelty is present, and the agents that can adapt to novelties typically adapt more slowly than humans. We promote the development of intelligent agents capable of performing at the human level or above when operating in open-world physical environments. Testbed website: https://github.com/phy-q/novphy
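The two evaluated capabilities can be read as the two marginals of a novelty-by-scenario grid. The sketch below is only an illustration under stated assumptions, not the testbed's actual evaluation code: the names `NOVELTIES`, `SCENARIOS`, `per_novelty`, and `per_scenario` are hypothetical, the per-cell score is assumed to be a pass rate in [0, 1], and the demo values are placeholders rather than results.

```python
from statistics import mean

# Hypothetical labels: the abstract specifies eight novelties and five
# scenarios but does not name them here.
NOVELTIES = [f"novelty_{i}" for i in range(1, 9)]
SCENARIOS = [f"scenario_{j}" for j in range(1, 6)]


def per_novelty(pass_rate):
    """Average each novelty's score over all scenarios it is applied to."""
    return {n: mean(pass_rate[(n, s)] for s in SCENARIOS) for n in NOVELTIES}


def per_scenario(pass_rate):
    """Average each scenario's score over all novelties applied to it."""
    return {s: mean(pass_rate[(n, s)] for n in NOVELTIES) for s in SCENARIOS}


# Placeholder grid (NOT real results): every cell gets the same pass rate.
demo = {(n, s): 0.5 for n in NOVELTIES for s in SCENARIOS}

novelty_view = per_novelty(demo)    # one score per novelty (8 entries)
scenario_view = per_scenario(demo)  # one score per scenario (5 entries)
```

Here `pass_rate` is assumed to map each `(novelty, scenario)` pair to an agent's score on the corresponding task set; any other per-cell metric could be substituted without changing the aggregation.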
