Self-supervised learning (SSL) has become the de facto training paradigm for large models, in which pre-training is followed by supervised fine-tuning on domain-specific data and labels. Hypothesizing that SSL models learn more generic, and hence less biased, representations, this study explores the impact of pre-training and fine-tuning strategies on fairness (i.e., performing equally across demographic groups). Motivated by human-centric applications on real-world time-series data, we interpret inductive biases at the model, layer, and metric levels by systematically comparing SSL models to their supervised counterparts. Our findings demonstrate that SSL can achieve performance on par with supervised methods while significantly enhancing fairness, with up to a 27% increase in fairness for only a 1% loss in performance. Ultimately, this work underscores SSL's potential in human-centric computing, particularly in high-stakes, data-scarce application domains such as healthcare.