Recent vision-language models show strong perceptual ability, but their implicit reasoning is hard to explain and prone to hallucination on complex queries. Compositional methods improve interpretability, yet most rely on a single agent or a hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent visual reasoning system structured as a hierarchical finite-state automaton whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding a transparent execution history. To supervise the hyper agent's transition policy, we build transition-trajectory trees and transform them into memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, serving as the transition policy, understands both the query and the capabilities of the agents, and efficiently chooses the optimal agent for each step. Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
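The control flow described above can be sketched as follows. This is a minimal illustration under assumed names (`Agent`, `hyper_policy`, `run_mata`, and the `perceiver`/`reasoner` agents are all hypothetical, not MATA's actual components): each agent is a state of the hyper automaton running a fixed rule-based sub-automaton, all agents read and write one shared memory, and a policy function (standing in for the finetuned LLM) maps the current memory to the next state.

```python
# Minimal sketch of a hierarchical automaton with a shared memory.
# All names here are illustrative assumptions, not MATA's real API.
from typing import Callable, Dict, List

Memory = Dict[str, object]

class Agent:
    """A hyper-automaton state that runs a tiny rule-based sub-automaton."""
    def __init__(self, name: str, steps: List[Callable[[Memory], None]]):
        self.name = name
        self.steps = steps  # micro-control: a fixed, rule-based step sequence

    def run(self, memory: Memory) -> None:
        for step in self.steps:  # deterministic sub-automaton execution
            step(memory)
        # Record the visit, yielding a transparent execution history.
        memory.setdefault("history", []).append(self.name)

def hyper_policy(memory: Memory) -> str:
    """Stand-in for the finetuned LLM: maps shared memory to the next state."""
    if "caption" not in memory:
        return "perceiver"
    if "answer" not in memory:
        return "reasoner"
    return "DONE"

def run_mata(query: str, agents: Dict[str, Agent]) -> Memory:
    memory: Memory = {"query": query}
    state = hyper_policy(memory)
    while state != "DONE":  # top-level transitions chosen by the policy
        agents[state].run(memory)
        state = hyper_policy(memory)
    return memory

agents = {
    "perceiver": Agent("perceiver", [lambda m: m.update(caption="a red cube")]),
    "reasoner": Agent("reasoner", [lambda m: m.update(answer="red")]),
}
result = run_mata("What color is the cube?", agents)
print(result["history"], result["answer"])  # → ['perceiver', 'reasoner'] red
```

In this toy run the policy routes the query through a perception state and then a reasoning state before halting; in MATA the rule-based `hyper_policy` stub is replaced by the LLM finetuned on the memory-to-next-state pairs.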