Understanding which information is encoded in deep models of spoken and written language has been the focus of much research in recent years, as it is crucial for debugging and improving these architectures. Most previous work has focused on probing for speaker characteristics and for acoustic and phonological information in models of spoken language, and for syntactic information in models of written language. Here we focus on the encoding of syntax in several self-supervised and visually grounded models of spoken language. We employ two complementary probing methods, combined with baselines and reference representations, to quantify the degree to which syntactic structure is encoded in the activations of the target models. We show that syntax is captured most prominently in the middle layers of the networks, and more explicitly in models with more parameters.