Crossed random-effects models are widely used in applied statistics, with applications in longitudinal studies, e-commerce, and recommender systems, among others. However, these models scale poorly: the computational cost of fitting them typically grows superlinearly in the number of data points N, as N^(3/2) or worse. Our motivation for addressing this issue comes from the recommender system employed by an online clothing retailer. Our dataset comprises over 700,000 clients, 5,000 items, and 5,000,000 measurements. At this scale, fitting crossed random effects by maximum likelihood is computationally inefficient, which limits the applicability of the approach in large-scale settings. To tackle this scalability problem, Ghosh et al. (2022a) and Ghosh et al. (2022b) studied linear and logistic regression models with fixed-effect features based on client and item variables and with random intercept terms for clients and items. In this study, we consider a more general version of the problem that also allows random slopes (random effect sizes), so that the effect of a feature may vary across clients and across items. We develop a scalable solution to this problem and demonstrate empirically that our estimates are consistent: as the number of data points increases, the estimates converge to the true parameters. To validate our approach, we apply the proposed algorithm to Stitch Fix data.
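To make the model class concrete, the following is a minimal sketch of how data from a crossed random-effects model with random slopes could be simulated. It is an illustration under stated assumptions, not the fitting algorithm proposed in this work: the function name `simulate_crossed`, its parameters, the Gaussian random effects with a single standard deviation per factor, and the uniform sampling of client-item pairs are all hypothetical choices made for the example. Each response is generated as y = x^T (beta + a_i + b_j) + e, where a_i is the random coefficient vector of client i and b_j that of item j; the first coordinate of x is an intercept, so the random-intercept model is the special case with one feature.

```python
import random


def simulate_crossed(n_clients, n_items, n_obs, beta,
                     sd_client, sd_item, sd_noise, seed=0):
    """Simulate sparse observations from a crossed random-effects model.

    Assumed (hypothetical) model: y = x^T (beta + a_i + b_j) + e, with
    independent Gaussian entries for a_i, b_j and e. Returns a list of
    (client, item, features, response) tuples.
    """
    rng = random.Random(seed)
    p = len(beta)
    # One random coefficient vector per client and per item
    # (first entry acts as the random intercept, the rest as random slopes).
    a = [[rng.gauss(0.0, sd_client) for _ in range(p)] for _ in range(n_clients)]
    b = [[rng.gauss(0.0, sd_item) for _ in range(p)] for _ in range(n_items)]
    data = []
    for _ in range(n_obs):
        # Only a small fraction of client-item pairs is observed.
        i = rng.randrange(n_clients)
        j = rng.randrange(n_items)
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(p - 1)]
        mean = sum(x[k] * (beta[k] + a[i][k] + b[j][k]) for k in range(p))
        data.append((i, j, x, mean + rng.gauss(0.0, sd_noise)))
    return data


if __name__ == "__main__":
    sample = simulate_crossed(n_clients=1000, n_items=50, n_obs=5000,
                              beta=[0.5, 1.0], sd_client=0.3,
                              sd_item=0.2, sd_noise=1.0)
    print(len(sample), "observations; first record:", sample[0])
```

Because each client and each item appears in many observations, the responses are correlated along both factors at once, which is the structure that makes likelihood-based fitting expensive as N grows.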