The application of kernel-based Machine Learning (ML) techniques to discrete choice modelling using large datasets often faces challenges due to memory requirements and the considerable number of parameters involved in these models. This complexity hampers the efficient training of large-scale models. This paper addresses these scalability problems by introducing the Nystr\"om approximation for Kernel Logistic Regression (KLR) on large datasets. The study begins with a theoretical analysis in which: i) the set of KLR solutions is characterised, ii) an upper bound on the solution of KLR with the Nystr\"om approximation is provided, and iii) a specialisation of the optimisation algorithms to Nystr\"om KLR is described. The Nystr\"om KLR model is then validated computationally. Four landmark selection methods are tested, including basic uniform sampling, a k-means sampling strategy, and two non-uniform methods grounded in leverage scores. The performance of these strategies is evaluated on large-scale transport mode choice datasets and compared with traditional methods such as Multinomial Logit (MNL) and contemporary ML techniques. The study also assesses the efficiency of various optimisation techniques for the proposed Nystr\"om KLR model, examining the performance of gradient descent, Momentum, Adam, and L-BFGS-B on these datasets. Among these strategies, the k-means Nystr\"om KLR approach emerges as a successful solution for applying KLR to large datasets, particularly when combined with the L-BFGS-B and Adam optimisation methods. The results highlight the ability of this strategy to handle datasets exceeding 200,000 observations while maintaining robust performance.
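The k-means Nystr\"om KLR strategy described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the toy data, RBF kernel, bandwidth, and landmark count are all assumptions made for demonstration. K-means centroids serve as the landmarks, the Nystr\"om feature map reduces the kernel matrix from $n \times n$ to $n \times m$, and a linear logistic regression on the mapped features (scikit-learn's default solver is a variant of L-BFGS, echoing the L-BFGS-B choice in the paper) approximates full KLR.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy binary-choice data (the paper uses transport mode
# choice datasets; this stand-in is purely illustrative).
n, d, m = 1000, 5, 50          # m = number of Nystrom landmarks, m << n
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# k-means landmark selection: cluster centroids act as landmarks Z.
Z = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X).cluster_centers_

# Nystrom feature map: Phi = K_nm @ K_mm^{-1/2}, so Phi @ Phi.T ~ K_nn.
K_mm = rbf(Z, Z)
K_nm = rbf(X, Z)
U, s, _ = np.linalg.svd(K_mm)  # K_mm is symmetric PSD
K_mm_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
Phi = K_nm @ K_mm_inv_sqrt     # (n, m): linear model on Phi ~ KLR

clf = LogisticRegression(max_iter=1000).fit(Phi, y)
acc = clf.score(Phi, y)
```

Only the $m \times m$ landmark kernel is ever inverted, so memory and parameter count scale with $m$ rather than $n$, which is what makes the approach viable beyond 200,000 observations.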