Modern machine learning algorithms, especially deep-learning-based techniques, typically require careful hyperparameter tuning to achieve the best performance. Despite the surge of interest in practical techniques such as Bayesian optimization and random-search-based approaches for automating this laborious and compute-intensive task, the fundamental learning-theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Motivated by this gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data-driven setting. We assume that we are given a series of deep learning tasks and must tune hyperparameters to perform well on average over the distribution of tasks. A major difficulty is that the utility, viewed as a function of the hyperparameter, is highly volatile; furthermore, it is given only implicitly by an optimization problem over the model parameters. This is unlike previous work in data-driven design, where one can typically model the algorithmic behavior explicitly as a function of the hyperparameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as the hyperparameter varies; our analysis relies on subtle tools from differential and algebraic geometry and constrained optimization. This characterization can be used to show that the learning-theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for two concrete applications: tuning a hyperparameter that interpolates neural activation functions, and setting the kernel parameter in graph neural networks.
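To make the data-driven setting concrete, the following is a minimal sketch (not the paper's method) of tuning an activation-interpolation hyperparameter by grid search over sampled tasks. The blend `alpha * relu(x) + (1 - alpha) * x`, the fixed one-layer network, and the synthetic task distribution are all illustrative assumptions; the point is only the shape of the problem: a utility defined per task, averaged over a task distribution, and optimized over the hyperparameter.

```python
import numpy as np

# Hypothetical interpolated activation: alpha blends ReLU with the identity
# (alpha = 1 recovers ReLU, alpha = 0 the linear map). This is one simple
# instance of an activation-interpolation hyperparameter.
def act(x, alpha):
    return alpha * np.maximum(x, 0.0) + (1.0 - alpha) * x

def task_loss(alpha, X, y, W):
    """Squared loss of a fixed one-layer net on one task (illustrative only)."""
    preds = act(X @ W, alpha).sum(axis=1)
    return float(np.mean((preds - y) ** 2))

rng = np.random.default_rng(0)
tasks = []
for _ in range(20):  # sample a small synthetic "distribution over tasks"
    X = rng.normal(size=(50, 5))
    W = rng.normal(size=(5, 3))
    y = act(X @ W, 0.7).sum(axis=1)  # labels generated with alpha = 0.7
    tasks.append((X, y, W))

# Data-driven tuning: pick the alpha minimizing the average loss over tasks.
grid = np.linspace(0.0, 1.0, 101)
avg = [np.mean([task_loss(a, X, y, W) for X, y, W in tasks]) for a in grid]
best = grid[int(np.argmin(avg))]
print(best)
```

Even in this toy version, the per-task loss is only piecewise smooth in `alpha` (the ReLU kinks move as `alpha` changes), which hints at the discontinuity and oscillation structure the abstract's analysis must control.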