Measuring Robustness of Speech Recognition from MEG Signals Under Distribution Shift

This study investigates robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark from the 2025 PNPL competition. We compare residual convolutional neural networks (CNNs), an STFT-based CNN, and a CNN-Transformer hybrid, and examine the effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation. Across our in-house implementations, preprocessing and data-configuration choices matter more than additional architectural complexity, and instance normalization emerges as the most influential modification for generalization. The strongest of our own models, a CNN with group averaging, label balancing, repeated grouping, and instance normalization, achieves 60.95% F1-macro on the test split, compared with 39.53% for the plain CNN baseline. Most of our models without instance normalization, however, show substantial validation-to-test degradation, indicating that distribution shift induced by differing normalization statistics is a major obstacle to generalization in our experiments. By contrast, MEGConformer achieves 64.09% F1-macro on both validation and test, and saliency-map analysis is qualitatively consistent with this contrast: weaker models exhibit more concentrated or repetitive phoneme-sensitive patterns across splits, whereas MEGConformer's saliency appears more distributed. Overall, the results suggest that improving the reliability of non-invasive phoneme decoding will likely require better handling of normalization-related distribution shift, as well as progress on the challenge of single-trial decoding.
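To make the normalization point concrete, the following is a minimal sketch of per-sample instance normalization for MEG windows. It is not the authors' implementation: the function name `instance_normalize`, the `(batch, channels, time)` layout, and the `eps` value are assumptions made for illustration only.

```python
import numpy as np


def instance_normalize(x, eps=1e-6):
    """Standardize each channel of each sample with that sample's own statistics.

    x: array of shape (batch, channels, time). This layout is an assumption
    for illustration, not something specified in the abstract.
    """
    # Mean and standard deviation are computed over the time axis only,
    # so no statistic is shared across samples, recordings, or splits.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


# Illustrative check: a split whose signals carry a different offset and
# gain (a simple stand-in for a normalization-related shift) maps to
# nearly the same normalized values as the original split.
rng = np.random.default_rng(0)
val_like = rng.standard_normal((8, 306, 125))  # illustrative shape
test_like = 3.0 * val_like + 5.0               # shifted and rescaled copy

max_gap = np.abs(
    instance_normalize(val_like) - instance_normalize(test_like)
).max()
```

Because the statistics come from each window itself, a constant offset or gain difference between splits is removed before the model sees the data. Normalizing with statistics estimated on another split would leave such a mismatch in place.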
