Numerous studies have shown that neural language models can learn various linguistic properties without direct supervision. This work takes an initial step towards the less-studied question of how neural models discover linguistic properties of words, such as gender, together with the rules governing their usage. We propose using an artificial corpus generated by a probabilistic context-free grammar (PCFG) based on French to precisely control the gender distribution in the training data and to determine under which conditions a model correctly captures gender information or, on the contrary, appears gender-biased.
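To make the idea of a gender-controlled artificial corpus concrete, the following is a minimal sketch of sampling from a toy French-like PCFG. It is not the grammar used in this work: the lexicon, the rules `S -> NP V NP` and `NP -> Det N (Adj)`, and the names `p_fem`, `p_adj`, `sample_np`, `sample_sentence`, and `generate_corpus` are all assumptions made for illustration. The single parameter `p_fem` sets the probability that a noun phrase is feminine, which is one simple way to control the gender distribution of the training data.

```python
import random

# Toy lexicon, hypothetical and chosen only for illustration.
# Each gender has its own determiners, nouns, and (post-nominal) adjectives,
# so agreement inside a noun phrase holds by construction.
LEXICON = {
    "m": {"Det": ["le", "un"], "N": ["chat", "livre"], "Adj": ["vert", "noir"]},
    "f": {"Det": ["la", "une"], "N": ["table", "maison"], "Adj": ["verte", "noire"]},
}
VERBS = ["voit", "cherche"]


def sample_np(rng, p_fem, p_adj=0.5):
    """NP -> Det_g N_g (Adj_g); gender g is feminine with probability p_fem."""
    g = "f" if rng.random() < p_fem else "m"
    words = [rng.choice(LEXICON[g]["Det"]), rng.choice(LEXICON[g]["N"])]
    # The optional adjective is added with probability p_adj.
    if rng.random() < p_adj:
        words.append(rng.choice(LEXICON[g]["Adj"]))
    return words, g


def sample_sentence(rng, p_fem):
    """S -> NP V NP. Returns the tokens and the genders of both noun phrases."""
    subj, g_subj = sample_np(rng, p_fem)
    obj, g_obj = sample_np(rng, p_fem)
    return subj + [rng.choice(VERBS)] + obj, (g_subj, g_obj)


def generate_corpus(n_sentences, p_fem, seed=0):
    """Sample n_sentences sentences with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return [sample_sentence(rng, p_fem) for _ in range(n_sentences)]


if __name__ == "__main__":
    # Example: a corpus in which roughly 20% of noun phrases are feminine.
    for tokens, genders in generate_corpus(5, p_fem=0.2, seed=1):
        print(" ".join(tokens), genders)
```

Under these assumptions, varying `p_fem` between 0 and 1 yields corpora ranging from all-masculine to all-feminine noun phrases, while the agreement rule itself stays fixed. A model trained on such corpora could then be probed for whether it tracks agreement or merely defaults to the majority gender.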