Quantifying LLM Biases Across Instruction Boundary in Mixed Question Forms

Datasets annotated by Large Language Models (LLMs) are now widely used, but large-scale annotations in low-quality datasets often exhibit biases. For example, Multiple-Choice Question (MCQ) datasets commonly assume exactly one correct option per question, yet some questions may have no correct option or several; likewise, true-or-false questions are supposed to be labeled either True or False, yet the text can include unsolvable elements that should instead be labeled Unknown. Problems arise when low-quality datasets with such mixed question forms cannot be identified. We refer to these exceptional label forms as Sparse Labels, and an LLM's ability to recognize datasets that contain a mixture of Sparse Labels is important. Because users may not know the condition of a dataset, their instructions can be biased. To study how different instruction settings affect LLMs' identification of Sparse Label mixtures, we introduce the concept of Instruction Boundary, which systematically evaluates the instruction settings that lead to biases. We propose BiasDetector, a diagnostic benchmark that systematically evaluates LLMs on datasets with mixed question forms under Instruction Boundary settings. Experiments show that users' instructions induce large biases on our benchmark, highlighting the need for LLM developers to recognize not only the risk that biased LLM annotation produces Sparse Label mixtures, but also the problems that users' instructions create when identifying them. Code, datasets, and detailed implementations are available at https://github.com/ZpLing/Instruction-Boundary.
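
To make the Sparse Label forms described above concrete, the following is a minimal sketch of how an item's label form might be categorized and how the share of Sparse Labels in a mixed dataset might be measured. The item schema (a question type paired with a gold label), the function names `label_form` and `sparse_label_summary`, and the form names are assumptions chosen for illustration; they are not the BiasDetector data format or API.

```python
from collections import Counter


def label_form(question_type, gold):
    """Categorize one item's label form (hypothetical schema).

    For "mcq", `gold` is a set of correct option letters.
    For "tf", `gold` is one of "True", "False", or "Unknown".
    """
    if question_type == "mcq":
        if len(gold) == 1:
            return "standard"          # exactly one correct option
        if len(gold) == 0:
            return "sparse:none"       # no correct option
        return "sparse:multiple"       # several correct options
    if question_type == "tf":
        if gold in ("True", "False"):
            return "standard"
        if gold == "Unknown":
            return "sparse:unknown"    # unsolvable from the given text
    raise ValueError(f"unsupported item: {question_type!r}, {gold!r}")


def sparse_label_summary(items):
    """Return (counts per label form, fraction of items with Sparse Labels)."""
    counts = Counter(label_form(qtype, gold) for qtype, gold in items)
    total = sum(counts.values())
    sparse = sum(n for form, n in counts.items() if form.startswith("sparse"))
    return counts, (sparse / total if total else 0.0)


if __name__ == "__main__":
    # Illustrative items only; not drawn from the benchmark.
    items = [
        ("mcq", {"A"}),
        ("mcq", set()),
        ("mcq", {"B", "C"}),
        ("tf", "True"),
        ("tf", "Unknown"),
    ]
    counts, fraction = sparse_label_summary(items)
    print(dict(counts), fraction)
```

In this sketch, a dataset whose sparse fraction is nonzero is one with mixed question forms, so an instruction that assumes exactly one correct option, or only True and False, would not fit every item.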
