This study develops an interpretable machine learning framework to forecast three startup outcomes: funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Each outcome is evaluated at its own horizon: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles (Random Forest, XGBoost, LightGBM, and CatBoost) are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, respectively, providing transparent and reproducible rankings for innovation finance.
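The leakage guard and the inverse-prevalence weighting described in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the row layout, the `funding_rounds` feature, the helper names, and the toy values are all assumptions made for illustration, and standardization stands in for whatever preprocessing the study actually fits. Only the 2010-2019 development window and the idea of weighting by inverse class prevalence come from the text.

```python
from statistics import mean, pstdev

DEV_END_YEAR = 2019  # development window 2010-2019, as stated in the abstract


def fit_scaler(rows, feature):
    """Estimate mean and standard deviation from development-window rows only."""
    values = [r[feature] for r in rows if r["year"] <= DEV_END_YEAR]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma


def apply_scaler(rows, feature, mu, sigma):
    """Standardize every row with the frozen development-window statistics."""
    return [(r[feature] - mu) / sigma for r in rows]


def inverse_prevalence_weights(labels):
    """Weight each observation by 1 / (share of its class in the sample)."""
    n = len(labels)
    prevalence = {c: labels.count(c) / n for c in set(labels)}
    return [1.0 / prevalence[y] for y in labels]


# Toy firm-quarter rows (hypothetical values, not taken from the study's data).
rows = [
    {"year": 2012, "funding_rounds": 1.0, "exit": 0},
    {"year": 2015, "funding_rounds": 3.0, "exit": 0},
    {"year": 2018, "funding_rounds": 2.0, "exit": 0},
    {"year": 2019, "funding_rounds": 6.0, "exit": 1},
    {"year": 2021, "funding_rounds": 9.0, "exit": 0},  # later cohort
]

# Statistics come from 2010-2019 only; the 2021 row is transformed, never fitted.
mu, sigma = fit_scaler(rows, "funding_rounds")
scaled = apply_scaler(rows, "funding_rounds", mu, sigma)

# Weights are computed on development-window labels only.
dev_labels = [r["exit"] for r in rows if r["year"] <= DEV_END_YEAR]
weights = inverse_prevalence_weights(dev_labels)
```

With these weights, each class contributes the same total weight to a weighted loss, so the rare positive class (here, exits) is not swamped by the majority class. The later-cohort row never influences `mu` or `sigma`, which is the sense in which the preprocessing is "applied without change".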