In imperfect information games, the evaluation of a game state depends not only on the observable world but also on hidden parts of the environment. Since access to the obstructed information would trivialise state evaluation but is unavailable, one approach to such problems is to estimate the value of the imperfect information state as a combination of all states in its information set, i.e., all states consistent with the current imperfect information. In this work, the goal is to learn a function that maps an imperfect game information state to its expected value. However, constructing a perfect training set, i.e., enumerating the whole information set for numerous imperfect states, is often infeasible. To compute exact expected values for an imperfect information game like \textit{Reconnaissance Blind Chess}, one would need to evaluate thousands of chess positions just to obtain the training target for a single state. Still, the expected value of a state can be approximated with adequate accuracy from a much smaller set of evaluations. Thus, in this paper, we empirically investigate how a budget of perfect information game evaluations should be distributed among training samples to maximise the return. Our results show that sampling a small number of states per position, roughly 3 in our experiments, for a larger number of distinct positions is preferable to repeatedly sampling more states for fewer positions. Thus, we find that in our case, the quantity of different training samples seems to matter more than higher target quality.
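The budget-allocation trade-off described above can be illustrated with a small Monte Carlo sketch. All names and the noise model below are illustrative assumptions, not the paper's actual setup: each position's hidden true value stands in for the expectation over its information set, and one noisy draw stands in for one perfect information evaluation of a sampled consistent state.

```python
import random
import statistics

random.seed(0)

def sample_evaluation(true_value, noise=1.0):
    # One perfect information evaluation of a state drawn from the
    # information set; the Gaussian noise models the variation in value
    # across the different states consistent with the observation.
    return random.gauss(true_value, noise)

def build_training_targets(budget, samples_per_position):
    # Distribute a fixed evaluation budget: fewer samples per position
    # means noisier targets but more distinct positions in the training set.
    n_positions = budget // samples_per_position
    data = []
    for _ in range(n_positions):
        true_value = random.gauss(0.0, 1.0)  # hypothetical hidden expected value
        target = statistics.mean(
            sample_evaluation(true_value) for _ in range(samples_per_position)
        )
        data.append((true_value, target))
    return data

# With a budget of 3000 evaluations, 3 samples per position yields
# 1000 training pairs, while 30 samples per position yields only 100.
few_samples = build_training_targets(3000, 3)
many_samples = build_training_targets(3000, 30)
print(len(few_samples), len(many_samples))
```

Under this toy model, the question the paper studies empirically is whether the larger but noisier training set (`few_samples`) leads to a better learned value function than the smaller, more accurate one (`many_samples`).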