In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem in which a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes each product and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can depend exponentially on the number of attributes. In this paper, we propose an optimistic algorithm and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving on the guarantees of existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
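To make the choice model concrete, the following is a minimal sketch of MNL purchase probabilities under a linear mean utility, together with the expected revenue of one offered assortment. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function names `mnl_choice_probs` and `expected_revenue`, the per-product `revenues`, and the no-purchase option with utility normalized to zero are all assumptions introduced here.

```python
import math


def mnl_choice_probs(features, theta):
    """Return MNL choice probabilities for one offered assortment.

    features: list of attribute vectors, one per offered product.
    theta:    parameter vector; mean utility of a product is the dot product
              of its attributes with theta (the linear-utility assumption).
    Index 0 of the result is the no-purchase option, whose utility is
    assumed to be 0 (an assumption of this sketch).
    """
    utilities = [0.0] + [
        sum(x_j * t_j for x_j, t_j in zip(x, theta)) for x in features
    ]
    # Subtract the maximum utility before exponentiating for numerical stability.
    u_max = max(utilities)
    weights = [math.exp(u - u_max) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def expected_revenue(features, theta, revenues):
    """Expected revenue of the assortment; revenues[i] is a hypothetical
    per-product revenue for the i-th offered product."""
    probs = mnl_choice_probs(features, theta)
    # Skip index 0: the no-purchase option earns no revenue.
    return sum(p * r for p, r in zip(probs[1:], revenues))
```

In a bandit setting, `theta` is unknown and must be estimated from the observed choices, so a learner would evaluate quantities like these with an estimate (or an optimistic surrogate) in place of the true parameter.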