Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations

In biotechnology, Raman spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) for measuring cell densities as well as substrate and product concentrations. Because it records the vibrational modes of molecules, it provides this information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice for inferring the variables of interest from the spectra. However, biological processes are known for their complexity, and here convolutional neural networks (CNNs) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions or the presence of additional substances. On the other hand, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra to generate additional data points from a given dataset that have statistically independent labels, so that a network trained on such data exhibits low correlations between its predictions. We show that training a CNN on these generated data points improves performance on datasets whose annotations do not exhibit the same correlations as the dataset used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations during the experiments.
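The abstract does not spell out how the spectra are combined, so the following is only an illustrative sketch of the general idea and should not be read as the authors' procedure. It assumes that spectra are strictly additive (a weighted sum of spectra carries the same weighted sum of labels), that target labels can be drawn independently and uniformly within each label's observed range, and that minimum-norm mixing weights from a pseudoinverse are acceptable. The helper name `augment_decorrelated`, the three Gaussian "pure component" peaks and all numeric settings are hypothetical and chosen purely for the toy example.

```python
import numpy as np


def augment_decorrelated(spectra, labels, n_new, rng):
    """Hypothetical helper: mix existing spectra so the new labels are independent.

    spectra: (n_samples, n_channels), labels: (n_samples, n_labels).
    Assumes additivity: a weighted sum of spectra has the weighted sum of labels.
    """
    # Draw each target label independently within its observed range.
    lo, hi = labels.min(axis=0), labels.max(axis=0)
    targets = rng.uniform(lo, hi, size=(n_new, labels.shape[1]))
    # Minimum-norm weights W with W @ labels == targets (needs full column rank).
    # The weights may be negative, i.e. spectra are also subtracted.
    weights = targets @ np.linalg.pinv(labels)
    return weights @ spectra, weights @ labels


rng = np.random.default_rng(0)
n = 200

# Toy batch data with strongly correlated labels (assumed, not from the paper).
biomass = rng.uniform(0.0, 10.0, n)
substrate = 10.0 - biomass + rng.normal(0.0, 0.3, n)
pha = 0.5 * biomass + rng.normal(0.0, 0.3, n)
labels = np.column_stack([substrate, biomass, pha])

# Assumed pure-component spectra: one Gaussian peak per analyte.
axis = np.arange(300)
centers = np.array([80.0, 150.0, 220.0])
pure = np.exp(-0.5 * ((axis[None, :] - centers[:, None]) / 10.0) ** 2)
spectra = labels @ pure + rng.normal(0.0, 0.01, (n, axis.size))

new_spectra, new_labels = augment_decorrelated(spectra, labels, 2000, rng)

# Label correlation matrices before and after augmentation.
print(np.corrcoef(labels, rowvar=False).round(2))
print(np.corrcoef(new_labels, rowvar=False).round(2))
```

In this sketch the original labels are almost perfectly correlated, while the generated labels are independent by construction. Two caveats apply: negative mixing weights have no physical counterpart, and the more strongly the original labels are correlated, the larger the weights become and the more measurement noise is amplified in the generated spectra.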
