Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection

Survey data can contain a large number of features while having comparatively few examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and generalize poorly. One remedy for this issue is feature selection, which attempts to select an optimal subset of features to learn from. A relatively unexplored source of information in the feature selection process is the textual names of the features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance of using STS to select features directly and within the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance of STS as a feature selection metric is evaluated on preliminary survey data collected as part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can yield higher-performing models than features selected with traditional feature selection algorithms.
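The two selection modes described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which LM produces the STS scores or how STS enters mRMR, so the code below assumes that relevance is the similarity between a feature name and the target name, and that redundancy is the mean similarity between a candidate's name and the names already selected. The `toy_similarity` function is a bag-of-words cosine used only as a stand-in for an LM-based STS score, and the feature and target names are hypothetical.

```python
import math
from collections import Counter


def toy_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity.

    ASSUMPTION: a stand-in for an LM-based STS score, used only so the
    sketch runs without downloading a model.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def select_top_k(feature_names, target_name, k, sim=toy_similarity):
    """Direct selection: keep the k names most similar to the target name."""
    ranked = sorted(feature_names, key=lambda f: sim(f, target_name), reverse=True)
    return ranked[:k]


def select_mrmr(feature_names, target_name, k, sim=toy_similarity):
    """Greedy mRMR-style selection driven by name similarity.

    ASSUMPTION: score = relevance - redundancy, where redundancy is the mean
    similarity to the names already selected (a difference-form criterion).
    """
    relevance = {f: sim(f, target_name) for f in feature_names}
    selected, remaining = [], list(feature_names)
    while remaining and len(selected) < k:

        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(sim(f, s) for s in selected) / len(selected)
            return relevance[f] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical survey item names and target name, for illustration only.
features = [
    "pain intensity after surgery",
    "pain intensity before surgery",
    "hours of sleep",
    "persistent headache",
]
target = "persistent pain after surgery"

top_k = select_top_k(features, target, k=2)
mrmr = select_mrmr(features, target, k=2)
print("direct STS:", top_k)
print("STS in mRMR:", mrmr)
```

With these toy names, direct selection keeps the two pain-intensity items, while the mRMR variant penalizes the second one for overlapping with the first and picks a less redundant name instead. To use a real LM, any function that maps two strings to a similarity score could be passed as `sim`.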

