Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups

For a variety of reasons, such as privacy, data in the wild often lacks the grouping information required for identifying minorities. At the same time, machine learning models are known to be only as good as the data they are trained on and, hence, may underperform for under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists, who find themselves in an unknown-unknown situation: they not only lack access to the grouping attributes but also do not know which groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically, we propose a geometric transformation of the data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to higher dimensions suffers from the curse of dimensionality. Therefore, for such cases we propose a solution based on smart exploration of the search space. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. Our experimental results demonstrate the effectiveness of the proposed solutions in mining unknown, under-represented, and under-performing minorities.
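
To make the idea of "vectors in the attribute space that reveal potential groups" more concrete, the sketch below shows one simple way such a search could look in two dimensions. It is not the dual-space, hyperplane-arrangement algorithm described in the abstract: it swaps in a brute-force sweep over evenly spaced angles. The names `tail_error` and `sweep_directions`, the `tail_fraction` parameter, and the use of a per-point 0/1 error as the performance signal are all assumptions made for illustration.

```python
import math


def tail_error(points, errors, direction, tail_fraction):
    """Return (mean error, indices) of the points projecting farthest along `direction`.

    Assumption: `points` are 2D tuples and `errors` are per-point model errors
    (e.g. 1 for a misclassified point, 0 otherwise).
    """
    # Size of the candidate group: a fixed fraction of the data (hypothetical choice).
    k = max(1, round(len(points) * tail_fraction))
    # Rank points by their projection onto the direction vector, largest first.
    order = sorted(
        range(len(points)),
        key=lambda i: points[i][0] * direction[0] + points[i][1] * direction[1],
        reverse=True,
    )
    tail = order[:k]
    return sum(errors[i] for i in tail) / k, tail


def sweep_directions(points, errors, tail_fraction=0.1, n_angles=360):
    """Brute-force search over unit vectors for the worst-performing tail.

    Illustrative stand-in only: evenly spaced angles replace the exact
    enumeration that an arrangement of hyperplanes would provide.
    """
    best = None
    for step in range(n_angles):
        theta = 2 * math.pi * step / n_angles
        direction = (math.cos(theta), math.sin(theta))
        err, tail = tail_error(points, errors, direction, tail_fraction)
        # Keep the first direction that attains the highest tail error.
        if best is None or err > best[0]:
            best = (err, direction, tail)
    return best
```

Calling `sweep_directions(points, errors)` returns the highest tail error found, the direction that produced it, and the indices of the points in that tail. Those points form a small (hence under-represented) candidate group on which the model performs poorly. A fixed angular grid can miss directions that fall between samples, which is one reason an exact enumeration of distinct orderings would be preferable in low dimensions.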

