In Artificial Intelligence (AI) and Machine Learning (ML), a common objective is to approximate an unknown target function $y=f(\mathbf{x})$ from a limited set of instances $S=\{(\mathbf{x}^{(i)},y^{(i)})\}$, where $\mathbf{x}^{(i)} \in D$ and $D$ is the domain of interest. We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that approximates the target function well for new instances $\mathbf{x}$. The model's generalization ability is therefore evaluated on a separate set $T=\{\mathbf{x}^{(j)}\} \subset D$, where $T \neq S$ and frequently $T \cap S = \emptyset$, to assess its performance beyond the training set. However, certain applications require accurate approximation not only within the original domain $D$ but also in an extended domain $D'$ that contains $D$. This is particularly relevant when designing new structures, where minimizing approximation error is crucial. For example, when developing new materials through data-driven approaches, an AI/ML system can serve as a surrogate function that guides the design process, and the learned model can then be used to help design new laboratory experiments. In this paper, we propose a method for multivariate regression based on iterative fitting of a continued fraction, incorporating additive spline models. We compare the performance of our method with established techniques, including AdaBoost, Kernel Ridge, Linear Regression, Lasso Lars, Linear Support Vector Regression, Multi-Layer Perceptrons, Random Forests, Stochastic Gradient Descent, and XGBoost. To evaluate these methods, we focus on an important problem in the field: predicting the critical temperature of superconductors from their physical-chemical characteristics.
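To make the phrase "a continued fraction incorporating additive models" more concrete, the following is a minimal sketch of how a truncated continued fraction of the generic form $g_0(\mathbf{x}) + h_0(\mathbf{x}) / (g_1(\mathbf{x}) + h_1(\mathbf{x}) / (\dots))$ could be evaluated. The abstract does not specify the exact model form or the fitting procedure, so everything here is an assumption for illustration: the helper names `additive_term` and `eval_continued_fraction` are hypothetical, the coefficients are made up, and plain linear per-feature terms stand in for the additive spline models.

```python
import numpy as np


def additive_term(coefs, intercept=0.0):
    """Hypothetical additive term g(x) = intercept + sum_j coefs[j] * x_j.

    A linear function of each feature stands in for the per-feature
    splines mentioned in the abstract (an assumption for illustration).
    """
    coefs = np.asarray(coefs, dtype=float)

    def g(X):
        return intercept + np.asarray(X, dtype=float) @ coefs

    return g


def eval_continued_fraction(X, g_terms, h_terms):
    """Evaluate g0 + h0 / (g1 + h1 / (... + h_{d-1} / g_d)) for each row of X.

    The fraction is evaluated from the innermost term outward.
    """
    if len(g_terms) != len(h_terms) + 1:
        raise ValueError("need exactly one more g term than h terms")
    value = g_terms[-1](X)
    for g, h in zip(reversed(g_terms[:-1]), reversed(h_terms)):
        value = g(X) + h(X) / value
    return value


# Two instances with two features each (illustrative values only).
X = np.array([[1.0, 2.0], [0.0, 4.0]])
g0 = additive_term([1.0, 1.0])        # g0(x) = x1 + x2
h0 = additive_term([0.0, 0.0], 1.0)   # h0(x) = 1
g1 = additive_term([1.0, 0.0], 1.0)   # g1(x) = 1 + x1

pred = eval_continued_fraction(X, [g0, g1], [h0])  # depth-1 fraction
```

With a single `g` term and no `h` terms the function reduces to a plain additive model, which is one way to see deeper fractions as successive refinements of a simpler fit. A practical implementation would also need to guard against denominators close to zero, which this sketch does not do.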