Many highly selective universities use a holistic review process for admission, in which all aspects of an application, including protected attributes (e.g., race, gender), grades, essays, and recommendation letters, are considered in order to compose an excellent and diverse class. In this study, we empirically evaluate how influential protected attributes are for predicting admission decisions using a machine learning (ML) model, and to what extent textual information (e.g., personal essays, teacher recommendations) can substitute for the loss of protected attributes in the model. Using data from 14,915 applicants to the undergraduate admission office of a selective U.S. institution in the 2022-2023 cycle, we find that excluding protected attributes from the ML model substantially reduces admission-prediction performance. Including textual information via both a TF-IDF representation and a latent Dirichlet allocation (LDA) model partially restores model performance, but does not appear to provide a full substitute for admitting a similarly diverse class. In particular, while the text helps with gender diversity, the proportion of underrepresented minority (URM) applicants is severely impacted by the exclusion of protected attributes, and the inclusion of new attributes generated from the textual information does not recover this performance loss.
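As a rough illustration of the TF-IDF text representation mentioned in the abstract, the following minimal sketch computes TF-IDF weights for a few toy essays. It is not the study's pipeline: the function name `tfidf`, the whitespace tokenization, the smoothed IDF formula, and the example essays are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and a smoothed IDF,
    idf = ln((1 + N) / (1 + df)) + 1.
    The study's exact weighting is not stated, so this choice is an assumption.
    """
    # Naive lowercase + whitespace tokenization (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Toy essays, invented for illustration only.
essays = [
    "robotics club captain and robotics mentor",
    "debate club captain",
    "volunteer tutor and debate coach",
]
vectors = tfidf(essays)
```

Terms that are frequent in one essay but rare across the corpus (here, "robotics") receive higher weights than terms shared by several essays (here, "club"). Per-applicant vectors of this kind could then be appended to the non-text features of a prediction model.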