An increasingly important building block of large-scale machine learning systems is returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries complete quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
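For readers unfamiliar with the Plackett-Luce baseline named above, the following is a minimal sketch of how such a policy samples a slate and scores it. It illustrates only the baseline policy class, not the relaxation proposed in this paper. The function names `sample_slate` and `slate_log_prob`, and the use of one plain score per item, are assumptions made for illustration. Sampling uses the standard Gumbel-top-k trick, which draws an ordered sample without replacement from a Plackett-Luce distribution.

```python
import math
import random


def sample_slate(scores, k, rng):
    """Sample an ordered slate of k distinct items from a Plackett-Luce policy.

    Gumbel-top-k trick: add independent Gumbel(0, 1) noise to every score
    and keep the k items with the largest perturbed scores, in that order.
    """
    perturbed = []
    for item, score in enumerate(scores):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score + gumbel, item))
    perturbed.sort(reverse=True)
    return [item for _, item in perturbed[:k]]


def slate_log_prob(scores, slate):
    """Log-probability of an ordered slate under a Plackett-Luce policy.

    Each position is a softmax over the items not yet placed, so the
    log-probability is a sum of (score - logsumexp of remaining scores).
    """
    remaining = set(range(len(scores)))
    total = 0.0
    for item in slate:
        top = max(scores[j] for j in remaining)
        lse = top + math.log(sum(math.exp(scores[j] - top) for j in remaining))
        total += scores[item] - lse
        remaining.remove(item)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [2.0, 0.5, 1.0, -1.0, 0.0]  # hypothetical per-item scores
    slate = sample_slate(scores, k=3, rng=rng)
    print(slate, slate_log_prob(scores, slate))
```

As written, both functions touch every item in the action space for each slate, which gives a sense of why this policy class becomes expensive when the number of items reaches the millions.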