The crossed random-effects model is widely used in applied statistics, with applications in longitudinal studies, e-commerce, and recommender systems, among others. However, these models scale poorly: the computational cost grows superlinearly with the number of data points $N$, typically as $N^{3/2}$ or worse. Our motivation comes from the recommender system of an online clothing retailer. Our dataset comprises over 700,000 clients, 5,000 items, and 5,000,000 measurements. At this scale, fitting crossed random effects by maximum likelihood is computationally inefficient, which limits the applicability of the approach in large-scale settings. To address the scalability issue, Ghosh et al. (2022a) and Ghosh et al. (2022b) studied linear and logistic regression models with fixed-effect features based on client and item variables and random intercept terms for clients and items. In this study, we consider a more general version of the problem that allows random effect sizes (slopes). This extension lets us capture the variability in effect size among both clients and items. We develop a scalable solution to this problem and empirically demonstrate the consistency of our estimates: as the number of data points increases, our estimates converge towards the true parameters. To validate our approach, we run the proposed algorithm on Stitch Fix data.