In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem in which a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that each product is described by a set of attributes and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can depend exponentially on the number of attributes. In this paper, we propose an optimistic algorithm and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, a significant improvement over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
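To make the choice model concrete, the following is a minimal Python sketch of MNL choice probabilities when mean utility is linear in product attributes, together with the expected revenue of an offered assortment. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the no-purchase option with utility zero, the per-product `revenues`, the cardinality limit `max_size`, and all function names are hypothetical and do not come from the text above. The brute-force search in `best_assortment` is only practical for a handful of products and stands in for the optimization step; it is not the convex relaxation mentioned in the abstract.

```python
import math
from itertools import combinations


def mnl_choice_probs(assortment, features, theta):
    """MNL purchase probabilities for the offered items.

    Mean utility of item i is the inner product of features[i] and theta.
    Assumption: a no-purchase option with utility 0 is always available;
    its probability is stored under the key None.
    """
    weights = {
        i: math.exp(sum(x * t for x, t in zip(features[i], theta)))
        for i in assortment
    }
    denom = 1.0 + sum(weights.values())  # the 1.0 is exp(0) for no-purchase
    probs = {i: w / denom for i, w in weights.items()}
    probs[None] = 1.0 / denom
    return probs


def expected_revenue(assortment, features, theta, revenues):
    """Expected one-round revenue of an assortment under the MNL model."""
    probs = mnl_choice_probs(assortment, features, theta)
    return sum(revenues[i] * probs[i] for i in assortment)


def best_assortment(features, theta, revenues, max_size):
    """Brute-force search over all assortments of size 1..max_size.

    Assumption: a cardinality limit max_size; enumeration is exponential
    in the number of products and is only meant for tiny examples.
    """
    items = range(len(features))
    candidates = (
        subset
        for k in range(1, max_size + 1)
        for subset in combinations(items, k)
    )
    return max(
        candidates,
        key=lambda s: expected_revenue(s, features, theta, revenues),
    )
```

In a learning setting, `theta` is unknown and would be replaced by an estimate (or an optimistic choice within a confidence set) that is updated from the observed purchases in each round.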