Complexity analysis and practical resolution of the data classification problem with private characteristics

In this work we analyze the following problem: given the probability distribution of a population, question an unknown individual who is representative of that distribution so that our uncertainty about certain characteristics is significantly reduced, while our uncertainty about others, deemed private or sensitive, is not. The goal is thus to extract information relevant to a legitimate purpose while preserving the privacy of individuals, which is crucial for enabling non-intrusive selection processes in several areas. For instance, it is essential in the design of non-discriminatory personnel selection, promotion, and layoff processes in companies and institutions; in the retrieval of customer information that is relevant to the service provided by a company (and no more); and in certifications that do not reveal sensitive industrial information irrelevant to the certification itself. Interactive questioning processes are constructed for this purpose, which requires generalizing the notion of decision trees to account for the amount of desired and undesired information retrieved along each branch of the plan. Our findings are both theoretical and practical: on the one hand, we prove that the problem is NP-complete by a reduction from the Set Cover problem; on the other hand, given this intractability, we provide heuristic methods that find reasonable solutions in affordable time. In particular, a greedy algorithm and two genetic algorithms are presented. Our experiments indicate that the best results are obtained using a genetic algorithm reinforced with a greedy strategy.
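
To make the trade-off between desired and undesired information concrete, the following Python sketch scores a candidate question by the expected entropy reduction it yields about a target characteristic, minus a penalty for the reduction it yields about a private one. It is only an illustration under stated assumptions: the function names, the toy population, the attribute names, and the scoring rule (a difference weighted by `penalty`) are all hypothetical and are not taken from the paper. It also selects a single question rather than building a full questioning plan.

```python
from collections import defaultdict
from math import log2


def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(population, attr):
    """Marginal distribution of one attribute over (individual, probability) pairs."""
    out = defaultdict(float)
    for person, p in population:
        out[person[attr]] += p
    return dict(out)


def expected_gain(population, question, attr):
    """Expected reduction in entropy about `attr` after hearing the answer to `question`."""
    total = sum(p for _, p in population)
    prior = marginal([(x, p / total) for x, p in population], attr)
    # Group individuals by the answer they would give.
    groups = defaultdict(list)
    for person, p in population:
        groups[person[question]].append((person, p))
    posterior = 0.0
    for members in groups.values():
        weight = sum(p for _, p in members)
        cond = marginal([(x, p / weight) for x, p in members], attr)
        posterior += (weight / total) * entropy(cond)
    return entropy(prior) - posterior


def greedy_question(population, questions, target, private, penalty=1.0):
    """Pick the question with the best (assumed) score: target gain minus penalized private gain."""
    def score(q):
        return (expected_gain(population, q, target)
                - penalty * expected_gain(population, q, private))
    return max(questions, key=score)


# Hypothetical toy population: four equally likely individuals.
population = [
    ({"skill": "high", "group": "a", "certified": "yes", "club": "x"}, 0.25),
    ({"skill": "high", "group": "b", "certified": "yes", "club": "y"}, 0.25),
    ({"skill": "low", "group": "a", "certified": "no", "club": "x"}, 0.25),
    ({"skill": "low", "group": "b", "certified": "no", "club": "y"}, 0.25),
]

best = greedy_question(population, ["certified", "club"], target="skill", private="group")
```

In this toy population, asking about `certified` fully determines `skill` while revealing nothing about `group`, whereas `club` does the opposite, so the sketch prefers `certified`. A real questioning plan would repeat such a choice at every node of the generalized decision tree, conditioning on the answers given so far.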
