Evaluating machine learning models on human-labeled validation data can be expensive and time-consuming. Autoevaluation uses AI-labeled synthetic data to reduce the number of human annotations this requires. We propose efficient and statistically principled autoevaluation algorithms that improve sample efficiency while remaining unbiased. In experiments with GPT-4, these algorithms increase the effective human-labeled sample size by up to 50%.
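The abstract does not spell out the estimators, so the following is only an illustrative sketch of how AI labels might be combined with a small human-labeled set without introducing bias. It assumes a control-variate-style correction: the average AI label on a large unlabeled pool is adjusted by the average human-minus-AI discrepancy on the small labeled set. The function name `autoeval_mean`, its arguments, and the weight `lam` are hypothetical and not taken from the source.

```python
import numpy as np


def autoeval_mean(human_labels, ai_on_labeled, ai_on_unlabeled, lam=1.0):
    """Estimate a mean metric (e.g. accuracy) from few human and many AI labels.

    Hypothetical sketch, not the authors' algorithm.
    human_labels    : human scores on the small labeled set
    ai_on_labeled   : AI scores on the same labeled examples
    ai_on_unlabeled : AI scores on a large pool without human labels
    lam             : fixed weight on the AI labels (lam=0 ignores them)
    """
    y = np.asarray(human_labels, dtype=float)
    f_lab = np.asarray(ai_on_labeled, dtype=float)
    f_unl = np.asarray(ai_on_unlabeled, dtype=float)
    if y.shape != f_lab.shape:
        raise ValueError("human and AI labels must cover the same examples")
    # AI-only estimate on the large pool, plus a correction measured on the
    # labeled set. For a fixed lam, and with both sets drawn from the same
    # distribution, the AI terms cancel in expectation, so the estimate
    # stays unbiased even when the AI labels are systematically off.
    return lam * f_unl.mean() + (y - lam * f_lab).mean()


# Toy usage with made-up scores (1 = model output judged correct).
human = [1, 0, 1, 1]
ai_lab = [1, 0, 0, 1]
ai_unl = [1, 1, 0, 0, 1, 1, 0, 1]
estimate = autoeval_mean(human, ai_lab, ai_unl)
baseline = autoeval_mean(human, ai_lab, ai_unl, lam=0.0)  # human labels only
```

In this sketch, the variance reduction relative to the human-only baseline depends on how well the AI scores correlate with the human scores; with uninformative AI labels, a weight near zero would be the safer choice.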