Machine learning (ML) solutions are prevalent in many applications. However, many challenges exist in making these solutions business-grade. One such challenge is maintaining the error rate of the underlying ML models at an acceptably low level. Typically, the true relationship between the feature inputs and the target feature to be predicted is uncertain, and hence statistical in nature. The approach we propose is to separate the observations that are most likely to be predicted incorrectly into 'attention sets'. These sets can directly aid model diagnosis and improvement, and can be used to decide on alternative courses of action for these problematic observations. We present several algorithms ('strategies') for determining optimal rules to separate these observations. In particular, we prefer strategies that use feature-based slicing because they are human-interpretable, model-agnostic, and require minimal supplementary inputs or knowledge. In addition, we show that these strategies outperform several common baselines, such as selecting observations with prediction confidence below a threshold. To evaluate strategies, we introduce metrics that measure various desired qualities, such as performance, stability, and generalizability to unseen data; the strategies are evaluated on several publicly available datasets. We use TOPSIS, a Multiple Criteria Decision Making method, to aggregate these metrics into a single quality score for each strategy, so that the strategies can be compared.
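The aggregation step can be illustrated with a minimal sketch of textbook TOPSIS. The abstract does not state which weights, normalisation, or metric set the authors use, so everything below is an assumption for illustration: the function name `topsis`, the equal weights, and the two made-up metrics (a performance score where larger is better, and an instability score where smaller is better).

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each row of `matrix`.

    matrix  : list of rows, one per strategy; columns are metric values
    weights : one non-negative weight per metric (rescaled to sum to 1)
    benefit : one bool per metric; True if larger is better, False otherwise
    """
    n_cols = len(weights)
    total_w = sum(weights)

    # Vector-normalise each column so metrics on different scales are comparable.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
        for j in range(n_cols)
    ]
    weighted = [
        [weights[j] / total_w * row[j] / norms[j] for j in range(n_cols)]
        for row in matrix
    ]

    # Ideal and anti-ideal points, taken column by column.
    cols = list(zip(*weighted))
    best = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(cols, benefit)]

    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # Euclidean distance to the ideal point
        d_worst = math.dist(row, worst)  # Euclidean distance to the anti-ideal point
        denom = d_best + d_worst
        # If all strategies are identical, both distances are 0; return a neutral 0.5.
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


# Hypothetical metric values for three strategies (not taken from the paper):
# column 0 = performance (benefit), column 1 = instability (cost).
metrics = [
    [0.9, 0.1],  # strategy A
    [0.5, 0.5],  # strategy B
    [0.1, 0.9],  # strategy C
]
scores = topsis(metrics, weights=[1.0, 1.0], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A score closer to 1 means the strategy lies nearer the ideal point and farther from the anti-ideal point across all metrics, which is what allows several metrics to be collapsed into one comparable number per strategy.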