Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
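To make the PU setting described above concrete, the following is a minimal sketch of how a fully labeled binary dataset can be turned into a PU task: a random subset of the positives is kept as the labeled set, and every remaining row (positive or negative) goes into the unlabeled pool. The abstract does not specify how the semi-synthetic benchmarks were constructed, so the uniform sampling of labeled positives used here is an assumption, and the name `make_pu_split` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def make_pu_split(labels, n_labeled_pos, seed=0):
    """Split a fully labeled binary dataset into PU form (illustrative only).

    labels: sequence of 0/1 ground-truth labels.
    Returns (labeled_pos_idx, unlabeled_idx, class_prior), where class_prior
    is the fraction of positives left in the unlabeled pool.

    Assumption: labeled positives are sampled uniformly at random from all
    positives; the source does not state which sampling scheme it uses.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    if not 0 < n_labeled_pos < len(pos_idx):
        raise ValueError("n_labeled_pos must leave some positives unlabeled")

    # Keep a random subset of the positives as the labeled set.
    labeled_pos = sorted(rng.sample(pos_idx, n_labeled_pos))
    labeled_set = set(labeled_pos)

    # Everything else is unlabeled: the remaining positives plus all negatives.
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    class_prior = sum(labels[i] for i in unlabeled) / len(unlabeled)
    return labeled_pos, unlabeled, class_prior


# Toy example: 40 positives and 60 negatives, 10 positives revealed as labeled.
labels = [1] * 40 + [0] * 60
labeled_pos, unlabeled, class_prior = make_pu_split(labels, n_labeled_pos=10)
print(len(labeled_pos), len(unlabeled), round(class_prior, 3))  # 10 90 0.333
```

In this toy split the unlabeled pool holds 30 hidden positives among 90 rows. A PU classifier sees only the features of the two index sets and must score the unlabeled rows without access to `labels`.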