Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks rather than truly learning them, which inflates measured model performance. In pursuit of better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain-, model-, task-, and metric-agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
