Despite the recent success of machine learning algorithms, most models struggle with more complex tasks that require interaction between different sources of information, such as multimodal input data and logical time sequences. The biological brain, by contrast, is highly adept at this, automatically managing and integrating such streams of information. In this context, this work draws inspiration from recent discoveries in brain cortical circuits to propose a more biologically plausible self-supervised machine learning approach. The approach combines multimodal information using intra-layer modulations together with Canonical Correlation Analysis, and adds a memory mechanism to keep track of temporal data; the overall approach is termed Canonical Cortical Graph Neural Networks. It is shown to outperform recent state-of-the-art models in terms of clean audio reconstruction and energy efficiency on a benchmark audio-visual speech dataset. The enhanced performance is demonstrated through a reduced and smoother neuron firing rate distribution, suggesting that the proposed model is amenable to speech enhancement in future audio-visual hearing aid devices.