Informed down-sampling (IDS) is known to improve performance in symbolic regression when combined with various selection strategies, especially tournament selection. However, recent work found that the gains from IDS are not consistent across all problems. Our analysis reveals that IDS performs worse on problems containing outliers. IDS systematically favors including outliers in its subsets, which pushes genetic programming (GP) towards solutions that overfit to those outliers. To address this, we introduce ROIDS (Robust Outlier-Aware Informed Down-Sampling), which excludes potential outliers from the sampling process of IDS. ROIDS retains the advantages of IDS without overfitting to outliers and remains competitive across a wide range of benchmark problems. Our experiments confirm this: ROIDS shows the desired behavior on all studied benchmark problems. It consistently outperforms IDS on synthetic problems with added outliers as well as on a wide range of complex real-world problems, surpassing IDS on over 80% of the real-world benchmark problems. Moreover, compared to all studied baseline approaches, ROIDS achieves the best average rank across all tested benchmark problems. This robust behavior makes ROIDS a reliable down-sampling method for selection in symbolic regression, especially when the data set may contain outliers.
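To make the idea of excluding potential outliers before informed down-sampling concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not say how ROIDS detects potential outliers, so the sketch assumes a simple IQR rule on the target values. It also approximates IDS by a farthest-first traversal over per-case error profiles, which is one common formulation. All function names and parameters (`iqr_outlier_mask`, `farthest_first_subset`, `roids_like_subset`, `k`) are hypothetical.

```python
import numpy as np


def iqr_outlier_mask(y, k=1.5):
    """Flag targets outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumed detector for illustration; the source does not specify one.
    """
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y < q1 - k * iqr) | (y > q3 + k * iqr)


def farthest_first_subset(errors, candidates, size, rng):
    """Pick up to `size` case indices from `candidates` by farthest-first traversal.

    errors: array of shape (n_individuals, n_cases) with per-case errors.
    Each case is described by its column of errors across the population.
    """
    candidates = np.asarray(candidates)
    size = min(size, len(candidates))
    if size <= 0:
        return candidates[:0]
    profiles = errors[:, candidates].T  # one row per candidate case
    first = int(rng.integers(len(candidates)))
    chosen = [first]
    # Distance from every candidate to its nearest already-chosen case.
    min_dist = np.linalg.norm(profiles - profiles[first], axis=1)
    min_dist[first] = -np.inf  # never pick the same case twice
    while len(chosen) < size:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_nxt = np.linalg.norm(profiles - profiles[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_nxt)
        min_dist[nxt] = -np.inf
    return candidates[chosen]


def roids_like_subset(errors, y, size, rng, k=1.5):
    """Drop flagged cases first, then down-sample the remaining ones."""
    keep = np.flatnonzero(~iqr_outlier_mask(y, k))
    return farthest_first_subset(errors, keep, size, rng)


# Toy data: 20 ordinary cases plus one injected outlier at index 20.
rng = np.random.default_rng(0)
y = np.concatenate([np.linspace(0.0, 1.0, 20), [50.0]])
errors = rng.random((10, y.size))
errors[:, 20] += 100.0  # every individual fits the outlier badly

# Plain farthest-first sampling over all cases picks up the outlier,
# because its error profile is far from every other case.
plain = farthest_first_subset(errors, np.arange(y.size), 5, rng)

# Filtering first removes case 20 from the candidate pool.
subset = roids_like_subset(errors, y, size=5, rng=rng)
```

In this toy setup the outlier's error profile lies far from all other cases, so distance-based sampling selects it almost immediately. That mirrors the behavior the abstract attributes to IDS. Filtering the candidate pool beforehand avoids this while leaving the diversity-driven selection unchanged for the remaining cases.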