Outlier detection is an important tool for researchers in a wide range of fields. From banking and marketing to the social sciences and healthcare, outlier detection techniques are useful for identifying subjects that exhibit different and sometimes peculiar behaviours. When the data set available to the researcher consists of both discrete and continuous variables, outlier detection presents unprecedented challenges. In this paper we propose a novel method that detects outlying observations in mixed-type data, while reducing the required user interaction and providing general guidelines for selecting suitable hyperparameter values. The methodology is assessed through a series of simulations on data sets with varying characteristics and performs very well. Our method detects the majority of outliers while minimising the number of non-outlying observations falsely flagged as outliers. The ideas and techniques outlined in the paper can be used either as a pre-processing step or in tandem with other data mining and machine learning algorithms to develop novel approaches to challenging research problems.