Outlier detection is an important data mining tool, and it becomes particularly challenging when the data are nominal. Flagging observations as outlying first requires a well-defined notion of nominal outlyingness. This paper presents such a definition and introduces a general framework for quantifying the outlyingness of nominal data. The framework draws on ideas from the association rule mining literature and can be used to compute scores that indicate how outlying a nominal observation is. Methods for determining the hyperparameter values involved are presented, and the concepts of variable contributions and outlyingness depth are introduced to enhance the interpretability of the results. An implementation of the framework is tested on five real-world data sets and the key findings are outlined. The ideas presented can serve as a tool for assessing the degree to which an observation differs from the rest of the data, under the assumption that sequences of nominal levels have been generated from a Multinomial distribution with varying event probabilities.
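The abstract does not state the scoring formula, so the following is only a toy sketch of the general idea of scoring nominal observations through infrequent combinations of levels, in the spirit of association rule mining. The function name `outlyingness_scores`, the parameters `max_len` and `min_support`, and the particular weighting are all hypothetical choices made for illustration and should not be read as the framework proposed in the paper.

```python
from collections import Counter
from itertools import combinations


def outlyingness_scores(rows, max_len=2, min_support=0.2):
    """Toy score: sum penalties for infrequent level combinations in each row.

    rows        -- list of equal-length tuples of nominal levels
    max_len     -- largest combination size considered (assumed hyperparameter)
    min_support -- support below which a combination counts as infrequent
                   (assumed hyperparameter)
    """
    n = len(rows)
    n_vars = len(rows[0])

    # Count how often each combination of (variable index, level) pairs occurs.
    counts = Counter()
    for row in rows:
        items = list(enumerate(row))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1

    scores, contributions = [], []
    for row in rows:
        items = list(enumerate(row))
        score = 0.0
        contrib = [0.0] * n_vars
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                support = counts[combo] / n
                if support < min_support:
                    # Rarer and shorter combinations are penalised more.
                    inc = (min_support - support) / (min_support * k)
                    score += inc
                    # Split the penalty evenly over the variables involved,
                    # so the contributions of a row sum to its score.
                    for var, _ in combo:
                        contrib[var] += inc / k
        scores.append(score)
        contributions.append(contrib)
    return scores, contributions


# Nine identical observations and one that uses rare levels in both variables.
rows = [("a", "x")] * 9 + [("b", "y")]
scores, contributions = outlyingness_scores(rows)
print(scores)
print(contributions[-1])
```

In this sketch the per-variable contributions are a crude stand-in for the variable contributions mentioned in the abstract: they show which variables drive a high score, but how the paper actually defines them, or outlyingness depth, is not specified here.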