Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogues depending on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem, resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world settings such as large-catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
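To make the cost argument concrete, the following is a minimal sketch of the self-normalised importance sampling ingredient only, assuming a softmax policy whose score for an item is the inner product of a user vector and an item embedding. The function names, array sizes and the uniform proposal are illustrative assumptions and are not taken from the paper; in particular, the uniform proposal stands in for a proposal built with maximum inner product search, which is not implemented here.

```python
import numpy as np


def exact_expected_embedding(user, items):
    """E_{a ~ pi}[item embedding] under a softmax policy.

    Scores every item, so the cost is linear in the catalogue size.
    """
    scores = items @ user
    probs = np.exp(scores - scores.max())  # subtract max for stability
    probs /= probs.sum()
    return probs @ items


def snis_expected_embedding(user, items, n_samples, rng):
    """Self-normalised importance sampling estimate of the same quantity.

    Assumption: items are drawn from a uniform proposal (a stand-in for
    a MIPS-based proposal), so only n_samples items are scored.
    """
    idx = rng.integers(0, items.shape[0], size=n_samples)
    scores = items[idx] @ user
    # With a uniform proposal the proposal probability is constant and
    # cancels when the weights are normalised to sum to one.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ items[idx]


rng = np.random.default_rng(0)
items = rng.normal(size=(5_000, 8)) * 0.3  # hypothetical item embeddings
user = rng.normal(size=8)                  # hypothetical user vector

exact = exact_expected_embedding(user, items)
approx = snis_expected_embedding(user, items, n_samples=2_000, rng=rng)
error = float(np.linalg.norm(exact - approx))
```

The expectation computed here is the catalogue-wide term that appears in the gradient of the log-probability of a softmax policy. The estimator is biased but consistent, and its accuracy depends on how well the proposal covers the high-scoring items, which is presumably why a uniform proposal alone would be a poor choice for very peaked policies.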