Machine Learning-Based Pre-Test Risk Stratification for PCR-Confirmed Chlamydia Using Patient-Reported Data and Urine Biomarkers

Early identification of individuals at elevated risk of Chlamydia trachomatis infection could allow molecular testing to be targeted more efficiently in resource-aware screening. We evaluate the feasibility of pre-test risk stratification (PTRS) using machine-learning models trained on routinely available, non-invasive clinical data. A curated dataset of 93 urine samples with PCR reference labels was analyzed using three feature groups: patient-reported history and symptoms, urine biomarkers from standard urinalysis, and the combination of the two. Five supervised classifiers were evaluated using stratified 5-fold cross-validation with out-of-fold probability estimates. Performance was assessed using the area under the receiver operating characteristic curve (AUC) and threshold-dependent metrics, with uncertainty quantified via bootstrap confidence intervals. Models using only patient-reported data showed moderate discrimination (AUC up to 0.72). Urine biomarker-based models reached a slightly lower peak AUC but showed more consistent performance, with ensemble methods yielding the strongest results. Combining the feature groups marginally increased the peak AUC and reduced performance variability across models. These findings indicate that urine biomarkers provide a reliable predictive signal for PTRS that complements patient-reported information, and that integrating the two feature groups improves robustness. This work supports incorporating non-invasive, routinely available information into screening workflows, including decentralized or home-based PCR contexts, to prioritize testing.

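The evaluation described in the abstract (AUC computed on out-of-fold probabilities, with bootstrap confidence intervals) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `auc` and `bootstrap_auc_ci`, the percentile bootstrap variant, the number of resamples, and the toy labels and scores are all assumptions made for illustration, and the out-of-fold probabilities are taken as already produced by a cross-validated classifier.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap interval (hypothetical helper).

    Samples are resampled with replacement; resamples that contain only
    one class are skipped because AUC is undefined for them.
    """
    point = auc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper


# Toy out-of-fold labels and probabilities (made up, not study data).
oof_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
oof_scores = [0.81, 0.32, 0.45, 0.51, 0.12, 0.66, 0.40, 0.27, 0.38, 0.58]

point, lower, upper = bootstrap_auc_ci(oof_labels, oof_scores)
print(f"AUC = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With a sample as small as the 93 cases reported here, intervals of this kind are typically wide, which is why the abstract reports uncertainty alongside the point estimates.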