Survey data often contain many features but comparatively few examples. Machine learning models trained to predict outcomes from such data can overfit and generalize poorly. One remedy is feature selection, which seeks an optimal subset of features to learn from. A relatively unexplored source of information for feature selection is the textual names of features, which may indicate semantically which features are relevant to a target outcome. Language models (LMs) can evaluate the relationship between feature names and target names to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance of STS both when used to select features directly and when used within the minimal-redundancy-maximal-relevance (mRMR) algorithm. STS is evaluated as a feature selection metric on preliminary survey data collected as part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can yield higher-performing models than traditional feature selection algorithms.
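The two uses of STS described above, direct ranking and ranking inside mRMR, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the study: the feature names, the three-dimensional toy vectors (stand-ins for embeddings that a language model would produce from feature and target names), the helper names `sts_top_k` and `sts_mrmr`, and the choice of the difference form of mRMR (relevance minus mean redundancy) with STS used for both terms. The abstract does not say which mRMR terms STS replaces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sts_top_k(feature_vecs, target_vec, k):
    """Direct selection: keep the k features whose names score highest against the target."""
    scores = {name: cosine(vec, target_vec) for name, vec in feature_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def sts_mrmr(feature_vecs, target_vec, k):
    """Greedy mRMR (difference form), assuming STS supplies both relevance and redundancy."""
    relevance = {name: cosine(vec, target_vec) for name, vec in feature_vecs.items()}
    selected = []
    remaining = list(feature_vecs)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            # Mean similarity to the features already chosen.
            redundancy = sum(
                cosine(feature_vecs[name], feature_vecs[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy embeddings; a real pipeline would embed the names with an LM.
TARGET = [1.0, 0.0, 0.0]  # stands in for "persistent post-surgical pain"
FEATURES = {
    "pain_intensity_now": [0.9, 0.1, 0.0],
    "pain_intensity_last_week": [0.85, 0.15, 0.0],
    "sleep_quality": [0.5, 0.0, 0.8],
    "favorite_color": [0.0, 1.0, 0.1],
}

top_k = sts_top_k(FEATURES, TARGET, k=2)
mrmr = sts_mrmr(FEATURES, TARGET, k=2)
print("direct STS:", top_k)
print("STS + mRMR:", mrmr)
```

With these toy vectors, direct ranking keeps both pain-intensity items because they are the two most similar to the target, whereas the mRMR variant keeps one of them and then prefers `sleep_quality`, since the second pain item is almost redundant with the first. This is only meant to show the mechanical difference between the two strategies, not to reproduce the reported results.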