As sample sizes grow, scalability has become a central concern in the development of Markov chain Monte Carlo (MCMC) methods. One general approach to this problem, exemplified by the popular stochastic gradient Langevin dynamics (SGLD) algorithm, is to use a small random subsample of the data at every time step. This paper, building on recent work such as \cite{nagapetyan2017true,JohndrowJamesE2020NFLf}, shows that this approach often fails: while decreasing the subsample size speeds up each MCMC step, for typical datasets this speedup is offset by a matching decrease in accuracy. This result complements \cite{nagapetyan2017true}, which reached the same conclusion but analyzed only specific upper bounds on the error rather than the error itself, and \cite{JohndrowJamesE2020NFLf}, which did not cover nonreversible algorithms and left room for logarithmic improvements.
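For concreteness, the following is a minimal Python sketch of one SGLD update, which replaces the full-data likelihood gradient with a rescaled random-minibatch estimate before adding Gaussian noise; this is the subsampling scheme the abstract refers to. The function and parameter names (\verb|sgld_step|, \verb|grad_log_lik|, etc.) are illustrative assumptions, not code from this paper.
\begin{verbatim}
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik,
              step_size, batch_size, rng):
    """One SGLD update: a minibatch estimate of the log-posterior
    gradient plus Gaussian noise with variance equal to the step size."""
    N = len(data)
    # Random subsample of the data, drawn fresh at every time step.
    idx = rng.choice(N, size=batch_size, replace=False)
    # Rescaling by N / batch_size makes the gradient estimate unbiased.
    grad = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(theta, data[i]) for i in idx)
    noise = rng.normal(scale=np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * grad + noise

# Toy usage: sample the posterior mean of Gaussian data under a N(0,1) prior.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=10_000)
theta = 0.0
for _ in range(1_000):
    theta = sgld_step(theta, data,
                      grad_log_prior=lambda t: -t,      # grad of log N(0,1)
                      grad_log_lik=lambda t, x: x - t,  # grad of log N(x; t, 1)
                      step_size=1e-4, batch_size=100, rng=rng)
\end{verbatim}
Each step here touches only \verb|batch_size| of the $N$ data points; the abstract's claim is that, for typical datasets, the resulting per-step speedup is offset by a matching loss of accuracy.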