Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models

Gregory Beurier, Robin Reiter, Camille Noûs, Lauriane Rouan, Denis Cornet

From arXiv: 17 pages, 8 figures; supplementary material (39 pages, 4 figures) included. Extended preprint version of a companion study prepared as a concise journal article (same results, different framing and scope). Code and artifacts: https://github.com/GBeurier/nirs4all-aom

Preprocessing screening is often the most expensive part of a near-infrared spectroscopy calibration workflow. It works because smoothing, derivatives, detrending and related filters change the spectral directions seen by partial least squares (PLS) or Ridge regression, but a full external search repeatedly refits nearly the same linear model. This paper studies the case where that search can be collapsed into one calibration step. For a strictly linear preprocessing operator A acting on row spectra as XA^T, the transformed PLS cross-covariance satisfies (XA^T)^T Y = A X^T Y, and Ridge regression depends on the operator-induced kernel X A^T A X^T. These identities let a finite operator bank be screened inside the model while retaining original-wavelength coefficients, and the same identity extends to cheaply evaluated linear operator chains. Sample-adaptive or fitted corrections such as SNV, MSC, EMSC and ASLS are not strictly linear; we prove the boundary and keep them as fold-local branches. The cohort has 61 regression and 17 classification rows, with a strict paired regression denominator of N=32 for the eight paper variants. There, AOM-PLS reaches median RMSEP ratios of 0.991/0.990 (simple) and 0.985/1.002 (best) against PLS-default/PLS-HPO, and AOM-Ridge reaches 0.974/0.984 (simple) and 0.918/0.966 (best) against Ridge-default/Ridge-HPO. The operator-adaptive classifier AOM-PLS-DA improves balanced accuracy by a median 0.159 on N=13 datasets (12/13 wins). The practical result is the runtime gap: PLS-HPO takes a median 710.81 s per run, whereas AOM-PLS takes 1.18 to 1.63 s, that is, 436 to 602 times less PLS fitting time. Linear operator-adaptive calibration thus gives prediction quality comparable to exhaustive preprocessing screening, with orders-of-magnitude less fitting time for PLS.
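
The two identities in the abstract can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the implementation from the nirs4all-aom repository: the operator constructors (`first_difference`, `moving_average`), the helper `ridge_original_coefficients`, the random data and the penalty `lam` are all hypothetical, and intercepts and centering are omitted for brevity. It shows that X^T y can be computed once and reused for every operator in the bank, that kernel Ridge on X A^T A X^T yields coefficients on the original wavelengths, and that SNV fails a basic linearity test.

```python
import numpy as np


def first_difference(p):
    """Hypothetical linear operator (p x p): x_i - x_{i-1}; row 0 keeps x_0."""
    return np.eye(p) - np.eye(p, k=-1)


def moving_average(p, width=5):
    """Hypothetical linear operator (p x p): boxcar mean, truncated at the edges."""
    A = np.zeros((p, p))
    half = width // 2
    for i in range(p):
        lo, hi = max(0, i - half), min(p, i + half + 1)
        A[i, lo:hi] = 1.0 / (hi - lo)
    return A


def ridge_original_coefficients(X, y, A, lam):
    """Ridge on Z = X A^T through the kernel K = X A^T A X^T (no intercept).

    Returns beta on the original wavelengths, so that X_new @ beta equals
    the prediction of the Ridge model fitted on the transformed spectra.
    """
    K = X @ A.T @ A @ X.T
    alpha = np.linalg.solve(K + lam * np.eye(X.shape[0]), y)
    return A.T @ A @ X.T @ alpha


def snv(x):
    """Standard normal variate: per-spectrum centering and scaling."""
    return (x - x.mean()) / x.std()


rng = np.random.default_rng(0)
n, p, lam = 30, 50, 1.0          # assumed sizes and penalty, for illustration only
X = rng.normal(size=(n, p))      # synthetic "spectra", one per row
y = rng.normal(size=n)
X_new = rng.normal(size=(5, p))

bank = {
    "identity": np.eye(p),
    "first_difference": first_difference(p),
    "moving_average": moving_average(p),
    "chain": first_difference(p) @ moving_average(p),  # linear chains stay linear
}

xty = X.T @ y  # computed once, shared by every operator in the bank
gaps = {}
for name, A in bank.items():
    Z = X @ A.T
    # PLS identity: (X A^T)^T y == A X^T y
    gap_pls = np.max(np.abs(Z.T @ y - A @ xty))
    # Ridge: explicit fit on Z versus kernel fit mapped back to original wavelengths
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    beta = ridge_original_coefficients(X, y, A, lam)
    gap_ridge = np.max(np.abs((X_new @ A.T) @ w - X_new @ beta))
    gaps[name] = (gap_pls, gap_ridge)
    print(f"{name:>16}: PLS gap {gap_pls:.1e}, Ridge gap {gap_ridge:.1e}")

# SNV is sample-adaptive: scaling the input does not scale the output.
snv_is_linear = np.allclose(snv(2.0 * X[0]), 2.0 * snv(X[0]))
print("SNV behaves linearly:", snv_is_linear)
```

Both gaps should sit at floating-point round-off level for every operator, including the chained one, while the SNV check should report that homogeneity fails. That failure is consistent with keeping such corrections outside the strictly linear operator bank.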
