Evaluating machine learning models on human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can reduce the number of human annotations required, a process called autoevaluation. We propose efficient and statistically principled autoevaluation algorithms that improve sample efficiency while remaining unbiased. In experiments with GPT-4, these algorithms increase the effective human-labeled sample size by up to 50%.
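The abstract does not spell out the algorithms, so the following is only an illustrative sketch of one standard way to combine a small human-labeled set with a large AI-labeled pool without introducing bias: a difference (control-variate) estimator. The function name `debiased_accuracy` and its arguments are hypothetical, and the sketch assumes the labeled and unlabeled items are drawn from the same distribution.

```python
from statistics import fmean


def debiased_accuracy(human_scores, ai_scores_labeled, ai_scores_pool):
    """Estimate mean human-judged accuracy from few human labels plus many AI labels.

    human_scores:      human correctness scores (e.g. 0/1) on the small labeled set
    ai_scores_labeled: AI judge scores on those same items, in the same order
    ai_scores_pool:    AI judge scores on a large pool with no human labels

    Illustrative assumption: both sets are samples from the same distribution.
    """
    if len(human_scores) != len(ai_scores_labeled):
        raise ValueError("human and AI scores on the labeled set must be paired")
    if not human_scores or not ai_scores_pool:
        raise ValueError("both the labeled set and the pool must be non-empty")

    # Mean AI score on the large pool: low variance, but biased if the AI judge is off.
    synthetic_mean = fmean(ai_scores_pool)

    # Average human-minus-AI gap on the labeled items: an estimate of that bias.
    correction = fmean([h - a for h, a in zip(human_scores, ai_scores_labeled)])

    # The AI terms cancel in expectation, so the sum targets the human-judged mean.
    return synthetic_mean + correction


if __name__ == "__main__":
    estimate = debiased_accuracy(
        human_scores=[1, 1, 0, 0],
        ai_scores_labeled=[1, 1, 1, 0],
        ai_scores_pool=[1, 1, 1, 0],
    )
    print(estimate)
```

In expectation the two AI-score means cancel, leaving the mean of the human scores, so the estimate stays unbiased however inaccurate the AI judge is. The variance gain depends on how closely the AI scores track the human ones.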