Learning and decision-making in domains with a naturally high noise-to-signal ratio, such as finance or healthcare, are often challenging, and the stakes are very high. In this paper, we study the problem of learning and acting under a general noisy generative process. In this problem, a significant proportion of the data distribution consists of uninformative samples with high label noise, while part of the data carries useful information, characterized by low label noise. This dichotomy is present during both training and inference, so uninformative data must be handled properly at both stages. We propose a novel approach to learning under these conditions via a loss inspired by selective learning theory. A model that minimizes this loss is guaranteed to make near-optimal decisions by distinguishing informative data from uninformative data and then making predictions. Building on these theoretical guarantees, we describe an iterative algorithm that jointly optimizes a predictor and a selector, and we evaluate its empirical performance in a variety of settings.
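To make the predictor/selector idea concrete, the following is a minimal sketch of an alternating scheme on a toy problem. It is not the loss or the algorithm proposed in the paper, which the abstract does not specify. The data process (`make_data`), the threshold-style selector, the `loss_budget` parameter, and all function names are assumptions introduced purely for illustration. The sketch alternates between fitting a least-squares predictor on the currently accepted samples and refitting a selector so that it accepts the region where the predictor's loss is low.

```python
import numpy as np


def make_data(n, rng):
    """Toy process (assumed): x > 0 is informative, x <= 0 is pure label noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    informative = x > 0.0
    clean = 2.0 * x + rng.normal(0.0, 0.1, size=n)   # low label noise
    noisy = rng.normal(0.0, 3.0, size=n)             # label unrelated to x
    return x, np.where(informative, clean, noisy)


def fit_predictor(x, y, mask):
    """Least-squares line fitted only on the samples the selector accepts."""
    design = np.column_stack([x[mask], np.ones(int(mask.sum()))])
    coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
    return coef  # (slope, intercept)


def predict(coef, x):
    return coef[0] * x + coef[1]


def fit_selector(x, losses, loss_budget, grid):
    """Pick the rule 'accept if x > t' that best matches the low-loss samples."""
    low_loss = losses < loss_budget
    errors = [np.mean((x > t) != low_loss) for t in grid]
    return float(grid[int(np.argmin(errors))])


def run_demo(seed=0, n=2000, rounds=5, loss_budget=1.0):
    rng = np.random.default_rng(seed)
    x, y = make_data(n, rng)
    x_test, y_test = make_data(n, rng)
    grid = np.linspace(-1.0, 1.0, 201)

    threshold = -1.0  # initial selector accepts essentially every sample
    for _ in range(rounds):
        # Step 1: update the predictor on the currently accepted samples.
        coef = fit_predictor(x, y, x > threshold)
        # Step 2: update the selector from the predictor's per-sample losses.
        losses = (y - predict(coef, x)) ** 2
        candidate = fit_selector(x, losses, loss_budget, grid)
        if np.sum(x > candidate) < 2:  # keep enough samples to fit a line
            break
        threshold = candidate
    coef = fit_predictor(x, y, x > threshold)

    # At inference time the selector decides where the model acts at all.
    accept = x_test > threshold
    test_losses = (y_test - predict(coef, x_test)) ** 2
    return {
        "threshold": threshold,
        "slope": float(coef[0]),
        "coverage": float(accept.mean()),
        "selective_risk": float(test_losses[accept].mean()),
        "full_risk": float(test_losses.mean()),
    }


result = run_demo(seed=0)
print(result)
```

In this toy setting the selector is applied at test time as well as during training, mirroring the point that uninformative data must be handled at both stages. Comparing `selective_risk` with `full_risk` shows the effect of abstaining on the region the selector rejects.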