Summary: Accurate phenotype prediction from genomic sequences is a highly coveted task in biological and medical research. While machine learning holds the key to accurate prediction in a variety of fields, the complexity of biological data can render many methodologies inapplicable. We introduce BioKlustering, a user-friendly, open-source, and publicly available web app for unsupervised and semi-supervised learning, specialized for cases when sequence alignment and/or experimental phenotyping of all classes are not possible. Among its main advantages, BioKlustering 1) allows for maximally imbalanced settings of partially observed labels, including cases when only one class is observed, which most semi-supervised methods currently do not support; 2) takes unaligned sequences as input and thus allows learning for widely diverse sequences that are impossible to align, such as those of viruses and bacteria; 3) is easy to use for anyone, including those with little or no programming expertise; and 4) works well with small sample sizes. Availability and Implementation: BioKlustering (https://bioklustering.wid.wisc.edu) is a freely available web app implemented with Django, a Python-based framework, and supports all major browsers. The web app requires no installation, and its source code is publicly available (https://github.com/solislemuslab/bioklustering).