Missing data can pose a challenge for machine learning (ML) modeling. Current approaches fall into two categories, feature imputation and label prediction, and focus primarily on handling missing data to enhance ML performance. Because these approaches rely on the observed data to estimate the missing values, they share three main shortcomings: different missing data mechanisms require different imputation methods, the methods depend heavily on assumptions about the data distribution, and imputation can introduce bias. This study proposes a Contrastive Learning (CL) framework to model observed data with missing values, in which the ML model learns the similarity between an incomplete sample and its complete counterpart and the dissimilarity between that sample and other samples. The proposed approach demonstrates the advantages of CL without requiring any imputation. To enhance interpretability, we introduce CIVis, a visual analytics system that incorporates interpretable techniques to visualize the learning process and diagnose the model status. Users can leverage their domain knowledge through interactive sampling to identify negative and positive pairs in CL. The output of CIVis is an optimized model that takes specified features and makes predictions for downstream tasks. We provide two usage scenarios, one in regression and one in classification, and conduct quantitative experiments, expert interviews, and a qualitative user study to demonstrate the effectiveness of our approach. In short, this study contributes a practical solution for ML modeling in the presence of missing data that achieves high predictive accuracy and model interpretability.
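The contrastive objective described above, which pulls an incomplete sample toward its complete counterpart and pushes it away from other samples, can be illustrated with a minimal sketch. The abstract does not specify the loss, the encoder, or how missing entries are encoded, so everything below is an assumption: an InfoNCE-style loss is swapped in as a common CL objective, `encode` is a hypothetical toy linear encoder, and `make_incomplete` is a hypothetical helper that represents missing entries with a zero placeholder plus an observed-entry indicator. None of these names come from CIVis.

```python
import numpy as np


def make_incomplete(x, missing_rate, rng):
    """Hide random entries of x (hypothetical helper, not from the paper).

    Returns the zero-filled values concatenated with a 0/1 observed-entry
    indicator, so the encoder can tell "missing" apart from a true zero.
    """
    observed = rng.random(x.shape) >= missing_rate
    values = np.where(observed, x, 0.0)
    return np.concatenate([values, observed.astype(x.dtype)], axis=1)


def encode(inputs, weights, eps=1e-12):
    """Toy linear encoder with L2-normalized outputs (assumed stand-in
    for a trained network)."""
    z = inputs @ weights
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)


def info_nce_loss(z_incomplete, z_complete, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated loss).

    Row i of z_incomplete is the anchor, row i of z_complete is its
    positive, and every other row of z_complete acts as a negative.
    """
    logits = z_incomplete @ z_complete.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Usage sketch: 8 samples with 5 features, roughly 30% of entries hidden.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
weights = rng.normal(size=(10, 4))  # 5 values + 5 indicators -> 4 dimensions

x_complete = make_incomplete(x, 0.0, rng)    # nothing hidden
x_incomplete = make_incomplete(x, 0.3, rng)  # some entries hidden
loss = info_nce_loss(encode(x_incomplete, weights), encode(x_complete, weights))
```

In a real training loop the encoder weights would be updated to reduce this loss. The sketch only evaluates the loss once, for a fixed random encoder, to show how the positive and negative pairs enter the objective.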