We study the performance of stochastic first-order methods for finding saddle points of convex-concave functions. A notorious challenge faced by such methods is that the gradients can grow arbitrarily large during optimization, which may result in instability and divergence. In this paper, we propose a simple and effective regularization technique that stabilizes the iterates and yields meaningful performance guarantees even if the domain and the gradient noise scale linearly with the size of the iterates (and are thus potentially unbounded). Besides providing a set of general results, we also apply our algorithm to a specific problem in reinforcement learning, where it leads to performance guarantees for finding near-optimal policies in an average-reward MDP without prior knowledge of the bias span.