We provide the first generalization error analysis for black-box learning through derivative-free optimization. Assuming a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, which updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded, possibly nonconvex losses, we present the first generalization bounds for ZoSS. These bounds coincide with those for SGD and, rather surprisingly, are independent of $d$, $K$ and the batch size $m$ under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and batch size $m=1$, we additionally show that both the generalization error and the learning rate are independent of $d$ and $K$ and remain essentially the same as for SGD, even with only two function evaluations. Our results extend and consistently recover established results for SGD in prior work, on both generalization bounds and the corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.
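To make the update rule concrete, the following is a minimal sketch of a zeroth-order step that uses $K+1$ loss evaluations per example query: one at the current iterate and $K$ at perturbed points. The abstract does not specify the estimator, so the Gaussian perturbation directions, the smoothing parameter `mu`, and the names `zo_gradient` and `zoss` are assumptions made for illustration. They are not necessarily the exact construction analyzed in the paper.

```python
import numpy as np


def zo_gradient(loss, w, K, mu, rng):
    """Estimate a descent direction at `w` from K+1 evaluations of `loss`.

    Assumption: Gaussian directions and forward differences with step `mu`.
    """
    base = loss(w)  # one evaluation at the current iterate
    g = np.zeros_like(w)
    for _ in range(K):  # K evaluations at randomly perturbed iterates
        u = rng.standard_normal(w.shape)
        g += (loss(w + mu * u) - base) / mu * u
    return g / K


def zoss(loss_on_example, data, w0, lr, K=1, mu=1e-3, steps=1000, seed=0):
    """Run the sketch with batch size m = 1: one example is queried per step."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        z = data[rng.integers(len(data))]  # sample one example uniformly
        g = zo_gradient(lambda v: loss_on_example(v, z), w, K, mu, rng)
        w = w - lr * g  # SGD-style step with the gradient-free estimate
    return w
```

With `K=1` each step uses exactly two function evaluations, which is the smallest case mentioned above. Larger `K` averages more directions per step at a proportionally higher evaluation cost.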