While there is a rich literature on methods that are robust to contamination in continuously distributed data, contamination in categorical data has been largely overlooked. This is regrettable because many datasets are categorical and often suffer from contamination. Examples include inattentive responding and bot responses in questionnaires, as well as zero-inflated count data. We propose a novel class of contamination-robust estimators of models for categorical data, coined $C$-estimators (``$C$'' for categorical). We show that the countable and possibly finite sample space of categorical data results in non-standard theoretical properties. Notably, in contrast to classic robustness theory, $C$-estimators can be simultaneously robust \textit{and} fully efficient at the postulated model. In addition, a certain particularly robust specification fails to be asymptotically Gaussian at the postulated model, but is asymptotically Gaussian in the presence of contamination. We furthermore propose a diagnostic test to identify categorical outliers and demonstrate the enhanced robustness of $C$-estimators in a simulation study.