Recent research has shown that language models exploit `artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a byproduct of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).