Symbolic regression (SR) aims to discover explicit mathematical expressions that explain observed data and is widely used in domains where interpretability is essential. Because interpretability requires expressions to reflect meaningful regularities, SR is sensitive to observations that deviate from the dominant relationship. Such irregular observations, or outliers, are common in real-world data and can hinder SR from identifying underlying regularities. Robust regression mitigates this by downweighting observations with large residuals. However, deciding which observations should be treated as outliers is often ambiguous and depends on user interpretation and domain knowledge, a perspective largely overlooked in existing SR studies. This motivates approaches that present multiple candidate expressions, allowing users to examine different residual patterns and choose expressions consistent with their expertise. We propose diversified residual symbolic regression (DRSR), which builds on the Quality-Diversity paradigm to achieve high predictive accuracy while promoting diversity in residual patterns. DRSR collects multiple expressions that fit the data well but differ in how their residuals are distributed, enabling post-search selection aligned with domain knowledge. On a synthetic mixture dataset, DRSR produces more diverse expressions than conventional SR while capturing multiple underlying relationships. On a real-world astronomical dataset, DRSR discovers multiple expressions consistent with known physical relationships.
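To make the notion of a residual pattern concrete, the following is a minimal sketch and not the DRSR algorithm itself. It assumes a hypothetical two-component mixture (half the points follow y = 2x, half follow y = x^2), two hand-written candidate expressions, and standard Huber weights as the downweighting rule; the binary "downweighted or not" vector is an illustrative stand-in for whatever residual descriptor DRSR actually uses.

```python
import numpy as np

def huber_weights(residuals, delta=1.0):
    """Standard Huber weights: 1 where |r| <= delta, delta/|r| beyond it."""
    abs_r = np.abs(residuals)
    return np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))

# Hypothetical mixture (an assumption for illustration):
# even-indexed points follow y = 2x, odd-indexed points follow y = x**2.
x = np.linspace(3.0, 6.0, 10)
from_linear = np.arange(x.size) % 2 == 0
y = np.where(from_linear, 2.0 * x, x ** 2)

# Two hand-written candidate expressions, each exact on one component.
candidates = {
    "2*x": lambda t: 2.0 * t,
    "x**2": lambda t: t ** 2,
}

# Illustrative residual pattern: True where a point is downweighted.
patterns = {}
for name, f in candidates.items():
    weights = huber_weights(y - f(x), delta=0.5)
    patterns[name] = weights < 1.0
    print(name, patterns[name].astype(int))
```

Each candidate fits one component exactly and treats the other component as outliers, so the two patterns are complementary. Neither is wrong without domain knowledge about which relationship is of interest, which is the ambiguity that motivates presenting both expressions to the user.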