We study attribute control in language models (LMs) through the method of Causal Average Treatment Effect (Causal ATE). Existing methods for attribute control in LMs check for co-occurrence between the words in a sentence and the attribute of interest, and control for those words. However, a spurious correlation between a word and the attribute in the training dataset can cause models to hallucinate the presence of the attribute when they encounter the spurious correlate during inference. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Specifically, we ground our study in the problem of toxicity mitigation, where a significant challenge is the inadvertent bias towards protected groups that often emerges after detoxification. We show that the Causal ATE metric resolves this unintended bias, and we rigorously prove our claim. We provide experimental validation for our claims and release our code (anonymously) here: https://github.com/causalate-mitigates-bias/causal-ate-mitigates-bias.
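To make the contrast between co-occurrence and a perturbation-based treatment effect concrete, here is a minimal sketch. It is not the authors' implementation: the function names (`cooccurrence`, `causal_ate`), the toy scorer `toy_score`, the example sentences, and the fixed replacement lists are all hypothetical and chosen only for illustration. The sketch assumes the effect of a word is estimated as the average drop in an attribute score when that word is swapped for alternative words.

```python
from statistics import mean

def cooccurrence(word, sentences, labels):
    """Fraction of sentences containing `word` that carry the attribute."""
    hits = [label for s, label in zip(sentences, labels) if word in s.split()]
    return mean(hits) if hits else 0.0

def causal_ate(word, sentences, score, replacements):
    """Average change in attribute score when `word` is replaced.

    Assumption: the counterfactual is built by swapping `word` for each
    alternative in `replacements` and averaging the resulting scores.
    """
    effects = []
    for sentence in sentences:
        tokens = sentence.split()
        if word not in tokens:
            continue
        original = score(tokens)
        counterfactuals = [
            score([alt if t == word else t for t in tokens])
            for alt in replacements
        ]
        effects.append(original - mean(counterfactuals))
    return mean(effects) if effects else 0.0

# Hypothetical toy scorer: a sentence is toxic iff it contains an insult.
INSULTS = {"stupid"}

def toy_score(tokens):
    return 1.0 if any(t in INSULTS for t in tokens) else 0.0

sentences = [
    "those immigrants are stupid",
    "immigrants are welcome here",
    "you are stupid",
]
labels = [toy_score(s.split()) for s in sentences]

# The group word co-occurs with toxicity in half of its sentences ...
cooc_group = cooccurrence("immigrants", sentences, labels)
# ... but replacing it never changes the score, so its estimated effect is zero.
ate_group = causal_ate("immigrants", sentences, toy_score, ["people", "neighbors"])
# Replacing the insult removes the toxicity, so its estimated effect is one.
ate_insult = causal_ate("stupid", sentences, toy_score, ["kind", "tall"])
```

In this toy setting a co-occurrence score would penalize the group word, while the perturbation-based estimate assigns it no effect. A realistic setup would presumably replace the toy scorer with a trained attribute classifier and draw replacements from a principled source rather than a hand-written list.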