This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones and compared embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.