We show that protein sequences can be treated as sentences in natural language processing and parsed, using the existing Quantum Natural Language Processing framework, into parameterized quantum circuits with a modest number of qubits, which can be trained to solve a variety of protein-related machine-learning problems. We classify proteins by their subcellular location, a pivotal task in bioinformatics that is key to understanding biological processes and disease mechanisms. Leveraging quantum-enhanced processing capabilities, we demonstrate that Quantum Tensor Networks (QTNs) can effectively handle the complexity and diversity of protein sequences. We present a detailed methodology that adapts QTN architectures to the nuanced requirements of protein data, supported by comprehensive experimental results. We demonstrate two distinct QTNs, inspired by classical recurrent neural networks (RNNs) and convolutional neural networks (CNNs), to solve this binary classification task. Our best-performing quantum model achieves 94% accuracy, comparable to a classical model that uses ESM2 protein language model embeddings. Notably, the ESM2 model is extremely large, containing 8 million parameters even in its smallest configuration, whereas our best quantum model requires only about 800 parameters. These hybrid models exhibit promising performance, showcasing their potential to compete with classical models of similar complexity.