High-dimensional spectral data, routinely generated in dairy production, are used to predict a range of traits in milk products. Partial least squares regression (PLSR) is ubiquitous in these prediction tasks. However, PLSR is not typically viewed as arising from statistical inference on a probabilistic model, and parameter uncertainty is rarely quantified. Additionally, PLSR does not lend itself easily to model-based modifications, coherent prediction intervals are not readily available, and the choice of the latent-space dimension, $\mathtt{Q}$, can be subjective and sensitive to data size. We introduce a Bayesian latent-variable model that emulates the desirable properties of PLSR while accounting for parameter uncertainty. A nonparametric shrinkage prior removes the need to choose $\mathtt{Q}$. The flexibility of the proposed Bayesian partial least squares regression (BPLSR) framework is demonstrated through sparsity modifications and multivariate response prediction. The BPLSR framework is applied in two motivating settings: 1) trait prediction from mid-infrared spectral analyses of milk samples, and 2) milk pH prediction from surface-enhanced Raman spectral data. The prediction performance of BPLSR at least matches that of PLSR. Additionally, its correctly calibrated prediction intervals provide richer, more informative inference for stakeholders in dairy production.