Test suites assess natural language processing models' performance on specific functionalities: cases of interest involving model robustness, fairness, or particular linguistic capabilities. This paper introduces specification instructions: text descriptions that specify fine-grained, task-specific behaviors. For each functionality in a suite, we generate an instruction that describes it. We combine these specification instructions into specification-augmented prompts, which we feed to language models pre-trained on natural instruction data. We also conduct experiments to measure whether optimizing for some functionalities negatively impacts functionalities not covered by the specification set. Our analyses across four tasks and models of diverse sizes and families show that smaller models struggle to follow specification instructions, whereas larger models (more than roughly 3B parameters) can benefit from specifications and, surprisingly, even generalize certain desirable behaviors across functionalities.
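To make the prompt-construction step concrete, the sketch below shows one plausible way to assemble a specification-augmented prompt as described above: one instruction per functionality, concatenated with the base task instruction and the test input. This is a minimal illustration, not the authors' released code; the function name, prompt layout, and example specifications are all hypothetical.

```python
# Minimal sketch (assumptions, not the paper's implementation) of building a
# specification-augmented prompt from per-functionality instructions.

from typing import Mapping


def build_specification_prompt(
    task_instruction: str,
    specifications: Mapping[str, str],
    model_input: str,
) -> str:
    """Combine specification instructions with the task instruction and input."""
    spec_lines = "\n".join(f"- {text}" for text in specifications.values())
    return (
        f"{task_instruction}\n"
        f"Follow these behavioral specifications:\n"
        f"{spec_lines}\n\n"
        f"Input: {model_input}\n"
        f"Answer:"
    )


# Hypothetical usage for a sentiment-classification suite:
specs = {
    "negation": "A negated positive statement expresses negative sentiment.",
    "robustness": "Ignore typos and irrelevant punctuation in the input.",
}
prompt = build_specification_prompt(
    "Classify the sentiment of the input as positive or negative.",
    specs,
    "The movie wasn't good at all.",
)
print(prompt)
```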