The Science of Data Collection: Insights from Surveys can Improve Machine Learning Models

Whether future AI models make the world safer or less safe for humans rests in part on our ability to efficiently collect accurate data from people about what they want the models to do. However, collecting high-quality data is difficult, and most AI/ML researchers are not trained in data collection methods. The growing emphasis on data-centric AI highlights the potential of data to enhance model performance. It also reveals an opportunity to gain insights from survey methodology, the science of collecting high-quality survey data. In this position paper, we summarize lessons from the survey methodology literature and discuss how they can improve the quality of training and feedback data, which in turn improves model performance. Based on the cognitive response process model, we formulate specific hypotheses about the aspects of label collection that may impact training data quality. We also suggest collaborative research ideas into how possible biases in data collection can be mitigated, making models more accurate and human-centric.
