Scalable solution to crossed random effects model with random slopes

Crossed random-effects models are widely used in applied statistics, with applications in longitudinal studies, e-commerce, and recommender systems, among others. However, these models scale poorly: computational time grows superlinearly in the number of data points N, typically as N^(3/2) or worse. Our motivation comes from the recommender system of an online clothing retailer. Our dataset comprises over 700,000 clients, 5,000 items, and 5,000,000 measurements. At this scale, the computational cost of fitting crossed random effects by maximum likelihood becomes a significant concern, limiting the applicability of the approach in large-scale settings. To address scalability, Ghosh et al. (2022a) and Ghosh et al. (2022b) studied linear and logistic regression models with fixed-effect features based on client and item variables and random intercept terms for clients and items. In this study, we consider a more general version of the problem that allows random slopes, that is, random effect sizes. This extension captures the variability in effect size across both clients and items. We develop a scalable solution to this problem and demonstrate empirically that our estimates are consistent: as the number of data points increases, they converge to the true parameters. To validate our approach, we apply the proposed algorithm to Stitch Fix data.
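For orientation, one common way to write a linear crossed random-effects model with random slopes is sketched below. The notation is an assumption made for illustration and is not taken from the abstract: i indexes clients, j indexes items, x_ij is a feature vector, beta is the fixed effect, and a_i and b_j are client-level and item-level random slopes.

```latex
% Illustrative notation only (assumed, not from the source).
% Y_{ij}: response for client i and item j (observed for a subset of pairs)
% x_{ij}: p-dimensional feature vector; \beta: fixed-effect coefficients
% a_i, b_j: client and item random slopes; e_{ij}: noise
\begin{align*}
Y_{ij} &= x_{ij}^{\top}\bigl(\beta + a_i + b_j\bigr) + e_{ij}, \\
a_i &\sim \mathcal{N}(0, \Sigma_a), \qquad
b_j \sim \mathcal{N}(0, \Sigma_b), \qquad
e_{ij} \sim \mathcal{N}(0, \sigma^2),
\end{align*}
% with a_i, b_j and e_{ij} mutually independent.
```

Under this sketch, the random-intercept setting corresponds to the special case in which the first entry of x_ij is 1 and only the first entries of a_i and b_j are allowed to be nonzero, so that the model reduces to Y_ij = x_ij^T beta + a_i + b_j + e_ij with scalar a_i and b_j.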

