Whether future AI models make the world safer or less safe for humans rests in part on our ability to efficiently collect accurate data from people about what they want the models to do. However, collecting high-quality data is difficult, and most AI/ML researchers are not trained in data collection methods. The growing emphasis on data-centric AI highlights the potential of data to enhance model performance. It also reveals an opportunity to gain insights from survey methodology, the science of collecting high-quality survey data. In this position paper, we summarize lessons from the survey methodology literature and discuss how they can improve the quality of training and feedback data, which in turn can improve model performance. Based on the cognitive response process model, we formulate specific hypotheses about the aspects of label collection that may impact training data quality. We also suggest ideas for collaborative research on how possible biases in data collection can be mitigated, making models more accurate and human-centric.