The glmnet package in R is widely used for lasso estimation because of its computational efficiency. Despite its popularity, glmnet occasionally yields solutions that deviate substantially from the true solutions because the algorithm's default configuration is not appropriate for every problem. Tuning the configuration can improve the accuracy of the obtained solutions, but such improvements typically increase computation time, creating a tradeoff between accuracy and computational efficiency. A systematic approach is therefore required to determine an appropriate configuration. To address this need, we propose a unified data-driven framework that optimizes the configuration by balancing solution-path accuracy against computational cost. Specifically, we generate a large-scale training dataset by measuring the accuracy and computation time of glmnet under varied data characteristics and configurations. Using this dataset, we train neural networks that predict accuracy and computation time from the data characteristics and the configuration. For a new dataset, the proposed framework uses the trained networks to explore the configuration space and derive a Pareto front representing the tradeoff between accuracy and computational cost. This front enables automatic selection of the configuration that maximizes accuracy under a user-specified time constraint. The proposed method is implemented in the R package glmnetconf, available at https://github.com/Shuhei-Muroya/glmnetconf.git.
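The selection step described above can be illustrated with a minimal sketch: given candidate configurations with predicted accuracy and runtime, keep only the Pareto-optimal ones and pick the most accurate configuration that fits a time budget. The function names, candidate labels, and numbers below are purely illustrative assumptions, not part of glmnetconf.

```python
# Hypothetical sketch of Pareto-front selection over predicted
# (accuracy, time) pairs; higher accuracy and lower time are better.

def pareto_front(candidates):
    """Return the candidates not dominated by any other candidate.

    Each candidate is a tuple (config_name, accuracy, time). Candidate c
    is dominated if some other candidate is at least as good in both
    objectives and strictly better in at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[1] >= c[1] and o[2] <= c[2] and (o[1] > c[1] or o[2] < c[2])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def select_config(candidates, time_budget):
    """Most accurate Pareto-optimal configuration within the time budget."""
    feasible = [c for c in pareto_front(candidates) if c[2] <= time_budget]
    return max(feasible, key=lambda c: c[1]) if feasible else None

# Illustrative candidates: (name, predicted accuracy, predicted seconds).
candidates = [
    ("default",  0.90, 1.0),
    ("tight",    0.99, 8.0),
    ("medium",   0.96, 3.0),
    ("wasteful", 0.95, 5.0),  # dominated by "medium": less accurate, slower
]

best = select_config(candidates, time_budget=4.0)
# "medium": the most accurate non-dominated option that runs within 4.0 s.
```

In the actual framework, the accuracy and time values would come from the trained neural networks rather than being fixed numbers, but the front construction and constrained selection follow the same logic.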