The Use of Binary Choice Forests to Model and Estimate Discrete Choices

Problem definition. In retailing, discrete choice models (DCMs) are commonly used to capture the choice behavior of customers when offered an assortment of products. When estimating DCMs using transaction data, flexible models (such as machine learning models or nonparametric models) are typically hard to interpret and hard to estimate, while tractable models (such as the multinomial logit model) tend to misspecify the complex behavior represented in the data. Methodology/results. In this study, we use a forest of binary decision trees to represent DCMs. This approach is based on random forests, a popular machine learning algorithm. The resulting model is interpretable: the decision trees can explain the decision-making process of customers during the purchase. We show that our approach can predict the choice probability of any DCM consistently and thus never suffers from misspecification. Moreover, our algorithm can predict choices for assortments unseen in the training data. Its mechanism and errors can be analyzed theoretically. We also prove that the random forest can recover the preference rankings of customers thanks to splitting criteria such as the Gini index and the information gain ratio. Managerial implications. The framework has unique practical advantages. It can capture customers' behavioral patterns such as irrationality or sequential searches when purchasing a product. It handles nonstandard formats of training data that result from aggregation. It can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product. It can also incorporate price information and customer features. Our numerical experiments using synthetic and real data show that using random forests to estimate customer choices can outperform existing methods.
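
To make the idea of a binary choice forest concrete, the following is a minimal sketch, not the authors' implementation. It assumes each assortment is encoded as a 0/1 vector of product presence, each tree splits on whether a product is offered using the Gini index, and the forest averages leaf frequencies over bootstrap samples. The function names (`build_tree`, `fit_forest`, `forest_predict`, `simulate`), the two synthetic customer rankings, and the choice to consider every product at every split (plain bagging, no feature subsampling) are all illustrative assumptions.

```python
import itertools
import random
from collections import Counter


def gini(choices):
    """Gini impurity of a list of chosen-product labels."""
    n = len(choices)
    return 1.0 - sum((c / n) ** 2 for c in Counter(choices).values())


def build_tree(rows, n_products):
    """Grow a binary tree; rows are (presence_bits, chosen_product) pairs."""
    choices = [choice for _, choice in rows]
    best = None
    if len(set(choices)) > 1:  # only try to split impure nodes
        for j in range(n_products):
            absent = [r for r in rows if not r[0][j]]
            present = [r for r in rows if r[0][j]]
            if not absent or not present:
                continue  # product j does not separate these rows
            # Size-weighted Gini impurity of the two children.
            score = (len(absent) * gini([c for _, c in absent])
                     + len(present) * gini([c for _, c in present]))
            if best is None or score < best[0]:
                best = (score, j, absent, present)
    if best is None:
        return {"leaf": Counter(choices)}
    _, j, absent, present = best
    return {"product": j,
            "absent": build_tree(absent, n_products),
            "present": build_tree(present, n_products)}


def tree_predict(tree, bits):
    """Follow the presence bits to a leaf and return its choice frequencies."""
    while "leaf" not in tree:
        tree = tree["present"] if bits[tree["product"]] else tree["absent"]
    total = sum(tree["leaf"].values())
    return {k: v / total for k, v in tree["leaf"].items()}


def fit_forest(rows, n_products, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample of the transactions."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(rows) for _ in rows], n_products)
            for _ in range(n_trees)]


def forest_predict(forest, bits):
    """Average the per-tree choice probabilities."""
    probs = Counter()
    for tree in forest:
        for k, p in tree_predict(tree, bits).items():
            probs[k] += p / len(forest)
    return dict(probs)


def simulate(n_products=3, copies=20):
    """Synthetic data (assumed): two equally likely customer types, each
    buying the highest-ranked product available in the assortment."""
    rankings = [(0, 1, 2), (2, 1, 0)]
    rows = []
    for bits in itertools.product([0, 1], repeat=n_products):
        if not any(bits):
            continue  # skip the empty assortment
        for ranking in rankings:
            choice = next(p for p in ranking if bits[p])
            rows.extend([(bits, choice)] * copies)
    return rows


rows = simulate()
forest = fit_forest(rows, n_products=3)
print(forest_predict(forest, (1, 1, 1)))  # all three products offered
```

In this toy setting, the full assortment should receive probabilities of roughly one half each for products 0 and 2, matching the two simulated customer types. Real transaction data would be noisier, and a production model would more likely rely on an established random forest library than on this hand-rolled sketch.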
