Whether future AI models are fair, trustworthy, and aligned with the public's interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher-quality training data leads to better-performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest ideas for collaborative research on how biases in data collection can be mitigated, making models more accurate and human-centric.