Tensor networks (TNs) have recently been shown to be able to represent the expected return of a single-agent finite Markov decision process (FMDP). The TN represents a distribution model, in which all possible trajectories are considered. When these ideas are extended to a multi-agent setting, distribution models suffer from the curse of dimensionality: the number of possible trajectories grows exponentially with the number of agents. The key advantage of using TNs in this setting is that a large number of established optimisation and decomposition techniques specific to TNs can be applied to ensure that the most efficient representation is found. In this report, these methods are used to form a TN that represents the expected return of a multi-agent reinforcement learning (MARL) task. This model is then applied to a two-agent random walker example, where the policy is shown to be correctly optimised using a density matrix renormalisation group (DMRG) technique. Finally, I demonstrate the use of an exact decomposition technique, which reduces the number of elements in the tensors by 97.5% without any loss of information.
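To illustrate what an exact (lossless) decomposition means here, the following is a minimal sketch using a singular value decomposition in NumPy. It is not the decomposition used in the report: the matrix `A`, its size `d`, its rank `true_rank`, and the tolerance `tol` are all assumptions chosen for illustration. The idea is that when a tensor has exact low-rank structure, discarding the singular values that are numerically zero yields smaller factors that reproduce the original up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a d x d matrix with exact rank `true_rank`,
# standing in for a tensor reshaped across a bond between two sites.
d, true_rank = 40, 2
A = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))

# Singular value decomposition of the matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only singular values above a relative tolerance (assumed value).
tol = 1e-10 * s[0]
r = int(np.sum(s > tol))

# Two smaller factors whose product reproduces A.
left = U[:, :r] * s[:r]   # shape (d, r), singular values absorbed
right = Vt[:r, :]         # shape (r, d)

original = A.size
compressed = left.size + right.size
reduction = 1 - compressed / original
error = np.max(np.abs(A - left @ right))

print(f"retained rank: {r}")
print(f"elements: {original} -> {compressed} ({reduction:.1%} fewer)")
print(f"max reconstruction error: {error:.2e}")
```

In this toy case the element count drops from 1600 to 160 because only two singular values are non-zero. That figure depends entirely on the assumed size and rank and is unrelated to the 97.5% reported above.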