Auditing Gender Analyzers on Text Data

AI models have become extremely popular and accessible to the general public. However, they remain under continuous scrutiny because of their demonstrable biases against various sections of society, such as people of color and non-binary people. In this study, we audit three existing gender analyzers (uClassify, Readable, and HackerFactor) for biases against non-binary individuals. These tools are designed to predict only the cisgender binary labels, which leads to discrimination against non-binary members of society. We curate two datasets, Reddit comments (660k) and Tumblr posts (2.05M), and our experimental evaluation shows that the tools are highly inaccurate, with an overall accuracy of ~50% on all platforms. Predictions for non-binary comments on all platforms are mostly female, thus propagating the societal bias that non-binary individuals are effeminate. To address this, we fine-tune a BERT multi-label classifier on the two datasets in multiple combinations and observe an overall performance of ~77% in the most realistically deployable setting and a surprisingly higher performance of 90% for the non-binary class. We also audit ChatGPT using zero-shot prompts on a small dataset (due to high pricing) and observe an average accuracy of 58% for Reddit and Tumblr combined (with overall better results for Reddit). Thus, we show that existing systems, including highly advanced ones like ChatGPT, are biased and need better audits and moderation, and that such societal biases can be addressed and alleviated through simple off-the-shelf models like BERT trained on more gender-inclusive datasets.

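As an illustration of the kind of audit summarized in the abstract, the following minimal sketch computes overall accuracy, per-class accuracy, and the distribution of predicted labels for each true class. It is not the study's actual evaluation code: the function name `audit_predictions`, the label strings, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def audit_predictions(true_labels, predicted_labels):
    """Summarize how a classifier's predictions break down per true class.

    Hypothetical helper for illustration; not taken from the study.
    Returns (overall_accuracy, per_class_accuracy, prediction_distribution).
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one example is required")

    # Count predictions separately for each true class.
    counts_by_class = defaultdict(Counter)
    for truth, prediction in zip(true_labels, predicted_labels):
        counts_by_class[truth][prediction] += 1

    correct = sum(counts[truth] for truth, counts in counts_by_class.items())
    overall_accuracy = correct / len(true_labels)

    # Fraction of each true class that was labeled correctly.
    per_class_accuracy = {
        truth: counts[truth] / sum(counts.values())
        for truth, counts in counts_by_class.items()
    }

    # For each true class, the share of examples assigned to each predicted label.
    prediction_distribution = {
        truth: {label: n / sum(counts.values()) for label, n in counts.items()}
        for truth, counts in counts_by_class.items()
    }
    return overall_accuracy, per_class_accuracy, prediction_distribution


# Toy data (assumed, not from the datasets): a tool that only emits binary labels.
toy_true = ["female", "male", "non-binary", "non-binary", "male", "female"]
toy_pred = ["female", "female", "female", "male", "male", "female"]

overall, per_class, distribution = audit_predictions(toy_true, toy_pred)
print(f"overall accuracy: {overall:.2f}")
print(f"per-class accuracy: {per_class}")
print(f"predicted labels for non-binary text: {distribution['non-binary']}")
```

Under these assumptions, a tool that can only emit binary labels necessarily scores zero on the non-binary class, so the predicted-label distribution for that class is the more informative quantity when checking which binary label such text is pushed toward.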