From Understanding Genetic Drift to a Smart-Restart Mechanism for Estimation-of-Distribution Algorithms

Source: arXiv. Accepted for publication in the Journal of Machine Learning Research. Extended version of our GECCO 2020 paper. This article supersedes arXiv:2004.07141.

Estimation-of-distribution algorithms (EDAs) are optimization algorithms that learn a distribution on the search space from which good solutions can be sampled easily. A key parameter of most EDAs is the sample size (population size). If the population size is too small, the update of the probabilistic model builds on few samples, leading to the undesired effect of genetic drift. Population sizes that are too large avoid genetic drift but slow down the optimization process. Building on a recent quantitative analysis of how the population size leads to genetic drift, we design a smart-restart mechanism for EDAs. By stopping runs when the risk of genetic drift is high, it automatically runs the EDA in good parameter regimes. Via a mathematical runtime analysis, we prove a general performance guarantee for this smart-restart scheme. In particular, this shows that in many situations where the optimal (problem-specific) parameter values are known, the restart scheme automatically finds these, leading to asymptotically optimal performance. We also conduct an extensive experimental analysis. On four classic benchmark problems, we clearly observe the critical influence of the population size on the performance, and we find that the smart-restart scheme leads to a performance close to the one obtainable with optimal parameter values. Our results also show that previous theory-based suggestions for the optimal population size can be far from the optimal values, leading to a performance clearly inferior to the one obtained via the smart-restart scheme. We also conduct experiments with PBIL (cross-entropy algorithm) on two combinatorial optimization problems from the literature, the max-cut problem and the bipartition problem. Again, we observe that the smart-restart mechanism finds much better values for the population size than those suggested in the literature, leading to much better performance.
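To make the restart idea concrete, here is a minimal sketch of how such a scheme could look. It wraps a compact genetic algorithm (cGA) on the OneMax benchmark. The choice of cGA and OneMax, the quadratic iteration budget `budget_factor * mu**2`, the doubling of the population size after each failed run, and all names and default constants (`mu0`, `update_factor`, `budget_factor`, `max_runs`) are assumptions made for illustration; the abstract does not specify them.

```python
import random


def onemax(x):
    """Illustrative benchmark: number of one-bits in x."""
    return sum(x)


def cga_run(n, mu, max_iters, rng, fitness=onemax):
    """One cGA run with (hypothetical) population size mu.

    Returns (solved, iterations_used). The run is cut off after
    max_iters iterations, the point at which we assume the risk
    of genetic drift has become too high.
    """
    lo, hi = 1.0 / n, 1.0 - 1.0 / n      # frequency borders
    p = [0.5] * n                        # probabilistic model
    for t in range(1, max_iters + 1):
        x = [1 if rng.random() < pi else 0 for pi in p]
        y = [1 if rng.random() < pi else 0 for pi in p]
        fx, fy = fitness(x), fitness(y)
        if fx < fy:                      # make x the better sample
            x, y, fx = y, x, fy
        if fx == n:                      # OneMax optimum sampled
            return True, t
        for i in range(n):               # shift model towards x
            if x[i] != y[i]:
                step = (x[i] - y[i]) / mu
                p[i] = min(hi, max(lo, p[i] + step))
    return False, max_iters


def smart_restart_cga(n, mu0=2, update_factor=2, budget_factor=8,
                      max_runs=12, seed=0):
    """Restart cGA with growing mu until the optimum is found.

    Each run gets budget_factor * mu**2 iterations (an assumed
    drift-based budget); on failure, mu is multiplied by
    update_factor. Returns (successful_mu or None, total_iterations).
    """
    rng = random.Random(seed)
    total, mu = 0, mu0
    for _ in range(max_runs):
        budget = int(budget_factor * mu * mu)
        solved, used = cga_run(n, mu, budget, rng)
        total += used
        if solved:
            return mu, total
        mu *= update_factor
    return None, total


if __name__ == "__main__":
    mu, iters = smart_restart_cga(n=50, seed=1)
    print(f"succeeded with mu={mu} after {iters} iterations in total")
```

The design point this sketch tries to capture is that no population size has to be supplied by the user: runs with too small a value are abandoned after a budget that grows with the population size, so the total cost is dominated by the first run whose population size is large enough.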
