Mindstorms in Natural Language-Based Societies of Mind

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber

From arXiv: 9 pages in main text + 7 pages of references + 38 pages of appendices; 14 figures in main text + 13 in appendices; 7 tables in appendices

Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents, all communicating through the same universal symbolic language, are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents, some of which may be humans. With this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.
