For a variety of reasons, such as privacy, data in the wild often lacks the grouping information required for identifying minorities. At the same time, machine learning models are only as good as the data they are trained on and, hence, may underperform for under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists, who find themselves in an unknown-unknown situation: not only do they lack access to the grouping attributes, but they also do not know which groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, in which we find vectors in the attribute space that reveal potential groups that are both under-represented and under-performing. Technically, we propose a geometric transformation of the data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing this solution to higher dimensions suffers from the curse of dimensionality. Therefore, for such cases we propose a solution based on smart exploration of the search space. We conduct comprehensive experiments on real-world and synthetic datasets alongside the theoretical analysis. Our experimental results demonstrate the effectiveness of the proposed solutions in mining unknown, under-represented, and under-performing minorities.
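To make the idea of "vectors in the attribute space that reveal potential groups" concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper. It assumes two-dimensional data, a per-point model loss, and a brute-force sweep over evenly spaced directions; the dual-space transformation and hyperplane arrangement are not implemented here. All names (`unit_directions_2d`, `tail_group`, `mine_candidate`, `tail_size`) and the toy data are hypothetical.

```python
import math

def unit_directions_2d(n_angles):
    """Evenly spaced unit vectors on the circle (an assumed, naive search grid)."""
    return [(math.cos(2 * math.pi * i / n_angles),
             math.sin(2 * math.pi * i / n_angles))
            for i in range(n_angles)]

def tail_group(points, direction, tail_size):
    """Indices of the tail_size points with the largest projection on direction."""
    scores = [sum(p_j * d_j for p_j, d_j in zip(p, direction)) for p in points]
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:tail_size]

def mine_candidate(points, losses, tail_size, n_angles=360):
    """Return (direction, tail indices, mean tail loss) maximizing mean tail loss.

    A small tail_size stands in for "under-represented"; a high mean loss
    stands in for "under-performing". Both are simplifying assumptions.
    """
    best = None
    for d in unit_directions_2d(n_angles):
        idx = tail_group(points, d, tail_size)
        mean_loss = sum(losses[i] for i in idx) / tail_size
        if best is None or mean_loss > best[2]:
            best = (d, idx, mean_loss)
    return best

# Toy data (fabricated purely for illustration): 81 majority points on a grid
# in [-1, 1]^2 with low loss, and 5 minority points far away with high loss.
points = [(0.25 * x, 0.25 * y) for x in range(-4, 5) for y in range(-4, 5)]
losses = [0.1] * len(points)
points += [(3.0 + 0.1 * k, 3.0 - 0.1 * k) for k in range(5)]
losses += [0.9] * 5

direction, idx, mean_loss = mine_candidate(points, losses, tail_size=5)
print(sorted(idx), round(mean_loss, 3))
```

On this toy data the tail along the selected direction consists of exactly the five high-loss points. The sweep costs one sort per direction, which is only practical in very low dimensions; this is the kind of cost that motivates a more efficient low-dimensional algorithm and a guided exploration in higher dimensions.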