A new algorithm for Subgroup Set Discovery based on Information Gain

Pattern discovery is a machine learning technique that aims to find sets of items, subsequences, or substructures that occur in a dataset more frequently than a manually set threshold. It helps to identify recurring patterns or relationships within the data, from which insights and knowledge can be extracted. In this work, we propose Information Gained Subgroup Discovery (IGSD), a new Subgroup Discovery (SD) algorithm for pattern discovery that combines Information Gain (IG) and Odds Ratio (OR) as multiple criteria for pattern selection. The algorithm addresses several limitations of state-of-the-art SD algorithms: the need to fine-tune key parameters for each dataset, the use of a single, manually set pattern search criterion, the use of non-overlapping data structures for subgroup space exploration, and the inability to search for patterns while fixing some relevant dataset variables. We compare the performance of IGSD with two state-of-the-art SD algorithms, FSSD and SSD++, on eleven datasets. For the performance evaluation, we also propose complementing standard SD measures with IG, OR, and p-value. The results show that, for all datasets considered, FSSD and SSD++ provide less reliable patterns and smaller pattern sets than IGSD. Additionally, IGSD provides better OR values than FSSD and SSD++, indicating a stronger dependence between patterns and targets. Moreover, the patterns obtained for one of the datasets have been validated by a group of domain experts, and the patterns provided by IGSD agree better with the experts than those obtained by FSSD and SSD++. These results demonstrate the suitability of IGSD as a method for pattern discovery and suggest that including non-standard SD metrics allows a better evaluation of discovered patterns.
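The two measures named above, Information Gain and Odds Ratio, can both be computed from the 2x2 contingency table of a pattern against a binary target. The following is a minimal sketch of those textbook definitions only, not of the IGSD search procedure itself. The function names, the argument layout (tp, fp, fn, tn), and the 0.5 continuity correction are assumptions made for illustration and are not taken from the paper.

```python
import math


def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary target with the given class counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(tp, fp, fn, tn):
    """Information gain of splitting the rows by pattern membership.

    tp, fp: rows covered by the pattern with positive / negative target.
    fn, tn: rows not covered by the pattern with positive / negative target.
    """
    n = tp + fp + fn + tn
    if n == 0:
        return 0.0
    covered, uncovered = tp + fp, fn + tn
    before = entropy(tp + fn, fp + tn)
    after = (covered / n) * entropy(tp, fp) + (uncovered / n) * entropy(fn, tn)
    return before - after


def odds_ratio(tp, fp, fn, tn, correction=0.5):
    """Odds ratio of the same 2x2 table.

    Assumption: a continuity correction is added to every cell so that
    empty cells do not cause a division by zero. Pass correction=0 for
    the uncorrected ratio (tp * tn) / (fp * fn).
    """
    a, b, c, d = (x + correction for x in (tp, fp, fn, tn))
    return (a * d) / (b * c)
```

Under these definitions, a pattern that covers exactly the positive rows of a balanced dataset, for example `information_gain(5, 0, 0, 5)`, reaches the maximum gain of 1 bit, while a pattern that is independent of the target has a gain of 0 and an uncorrected odds ratio of 1.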
