Outlier detection is an important tool for researchers in a wide range of fields. From banking and marketing to the social sciences and healthcare, outlier detection techniques are useful for identifying subjects that exhibit different and sometimes peculiar behaviours. When the data set available to the researcher consists of both discrete and continuous variables, outlier detection presents unprecedented challenges. In this paper we propose a novel method that detects outlying observations in mixed-type data while reducing the amount of user interaction required, since such interaction introduces subjectivity that can lead to misleading findings. The methodology is assessed through a series of simulations on data sets with varying characteristics and achieves very good performance. Our method detects the majority of outliers while minimising the number of non-outlying observations that are falsely flagged. The ideas and techniques outlined in the paper can be used either as a pre-processing step or in tandem with other data mining and machine learning algorithms to develop novel approaches to challenging research problems.