We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition), on which a conventional low-level actor performs focused beam search guided by a critic. Selected candidate actions are evaluated with batched critic forward passes and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than state-of-the-art baselines on large network topologies, without significant growth in memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
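To make the caching mechanism concrete, the following is a minimal sketch (not the authors' implementation) of the two pieces the abstract describes: scoring per-node embeddings with a learned projection to pick a top-k partition, and memoizing batched critic Q-values in an LRU cache keyed by a quantized state projection plus a local action identifier, with conservative k-hop invalidation. All names (`select_topk_partition`, `CriticCache`, `quantize`, and their parameters) are hypothetical illustrations of the stated design, not identifiers from the paper.

```python
from collections import OrderedDict

import numpy as np


def select_topk_partition(node_embeddings, scorer_weights, k):
    """Score per-node structural embeddings with a learned linear
    projection and keep the k highest-scoring devices."""
    scores = node_embeddings @ scorer_weights  # shape: (num_nodes,)
    return np.argsort(scores)[-k:]


def quantize(projection, bin_size=0.05):
    """Quantize a compact state projection so that near-identical
    states map to the same hashable cache key."""
    return tuple(np.round(projection / bin_size).astype(int).tolist())


class CriticCache:
    """LRU cache of critic Q-values keyed by
    (quantized state projection, local action id)."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._store = OrderedDict()   # (proj_key, action_id) -> q_value
        self._node_to_keys = {}       # node id -> cache keys depending on it

    def get(self, proj_key, action_id):
        key = (proj_key, action_id)
        if key in self._store:
            self._store.move_to_end(key)  # refresh LRU recency
            return self._store[key]
        return None  # cache miss: caller runs a batched critic forward

    def put(self, proj_key, action_id, q_value, touched_nodes):
        key = (proj_key, action_id)
        self._store[key] = q_value
        self._store.move_to_end(key)
        # Record which nodes this entry depends on, so a local change
        # can later evict only the affected entries.
        for node in touched_nodes:
            self._node_to_keys.setdefault(node, set()).add(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently used

    def invalidate_k_hop(self, changed_node, graph, k=2):
        """Conservatively evict every entry that depends on a node
        within k hops of a changed node (graph: node -> neighbor ids)."""
        frontier, seen = {changed_node}, {changed_node}
        for _ in range(k):
            frontier = {nbr for n in frontier for nbr in graph.get(n, ())} - seen
            seen |= frontier
        for node in seen:
            for key in self._node_to_keys.pop(node, set()):
                self._store.pop(key, None)
```

Under these assumptions, the invalidation is "conservative" in the sense that any cached Q-value whose dependent nodes fall within the k-hop neighborhood of a topology or state change is dropped, trading some extra critic recomputation for correctness of the cached values.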