Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. Here we focus on how fairness issues impact automatically generated test content, which can be subject to stringent requirements to ensure that the test measures only what it is intended to measure. Specifically, we review test content generated for a large-scale standardized English proficiency test, with the goal of identifying content that pertains only to a certain subset of the test population as well as content that has the potential to be upsetting or distracting to some test takers. Issues like these could inadvertently impact a test taker's score and thus should be avoided. This kind of content does not reflect the more commonly acknowledged biases, making it challenging to detect even for modern models that contain safeguards. We build a dataset of 601 generated texts annotated for fairness and explore a variety of methods for classification: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of 0.79 on our held-out test set, while much smaller BERT- and topic-based models have competitive performance on out-of-domain data.