The 20 Questions (Q20) game is a well-known game that encourages deductive reasoning and creativity. In the game, the answerer first thinks of an object, such as a famous person or a kind of animal. The questioner then tries to identify the object by asking up to 20 questions. In a Q20 game system, the user acts as the answerer while the system acts as the questioner, which requires a good question-selection strategy to identify the correct object and win the game. However, the optimal question-selection policy is hard to derive because of the complexity and volatility of the game environment. In this paper, we propose a novel policy-based Reinforcement Learning (RL) method that enables the questioner agent to learn the optimal question-selection policy through continuous interaction with users. To facilitate training, we also propose a reward network that estimates more informative rewards. Compared to previous methods, our RL method is robust to noisy answers and does not rely on a Knowledge Base of objects. Experimental results show that our RL method clearly outperforms an entropy-based engineered system and achieves competitive performance in a noise-free simulation environment.
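The policy-based question-selection idea can be sketched as a minimal REINFORCE loop in a toy simulated environment. Everything here is an illustrative assumption, not the paper's actual model: the object pool, the binary answer matrix, the linear softmax policy, and the hand-coded terminal reward (a stand-in for the learned reward network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Q20 setup (hypothetical): a small pool of objects and yes/no
# questions, with answers[o, q] giving object o's true answer to q.
n_objects, n_questions, n_turns = 8, 16, 5
answers = rng.integers(0, 2, size=(n_objects, n_questions))

# Linear softmax policy over questions, conditioned on the running
# state (asked-question flags + received-answer flags + bias term).
dim = 2 * n_questions + 1
theta = np.zeros((dim, n_questions))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def play_episode(target):
    """Roll out one game; return log-prob gradients and the reward."""
    state = np.zeros(dim)
    state[-1] = 1.0  # bias feature so the first turn is trainable
    grads, asked = [], set()
    for _ in range(n_turns):
        probs = softmax(state @ theta)
        q = rng.choice(n_questions, p=probs)
        # Gradient of log pi(q | state) for a linear softmax policy.
        one_hot = np.zeros(n_questions)
        one_hot[q] = 1.0
        grads.append(np.outer(state, one_hot - probs))
        asked.add(q)
        state[q] = 1.0
        state[n_questions + q] = answers[target, q]
    # Terminal reward: +1 if the collected answers uniquely identify
    # the target among all objects, else -1. In the paper this signal
    # would instead come from a learned reward network.
    idx = sorted(asked)
    pattern = answers[target, idx]
    matches = (answers[:, idx] == pattern).all(axis=1).sum()
    return grads, (1.0 if matches == 1 else -1.0)

# REINFORCE update with a running-mean baseline to reduce variance.
baseline, lr = 0.0, 0.05
for step in range(2000):
    target = rng.integers(n_objects)
    grads, reward = play_episode(target)
    baseline += 0.01 * (reward - baseline)
    for g in grads:
        theta += lr * (reward - baseline) * g
```

The agent is rewarded only at the end of an episode, which mirrors the sparse-reward setting that motivates the paper's reward network: a learned estimator can supply denser, more informative feedback than this binary win/lose signal.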