Prediction of pathogen-host interactions with protein sequence embeddings using deep learning
Date
2023
Publisher
Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2023.
Abstract
Infections caused by pathogens are a significant problem worldwide. Determining protein interactions between pathogens and hosts is critical to understanding infection mechanisms and developing prevention and treatment strategies. Wet-lab experiments to identify protein interactions are expensive and time-consuming, so computational approaches have been proposed as a promising complementary solution. While 3D structures of proteins contain helpful information about protein function, advances in sequencing technology have made 1D protein sequences widely available, and they are often preferred because they can be processed with less computational power. The main goal of this thesis is to develop a sequence-based approach for predicting pathogen-host protein interactions, based on the hypothesis that protein sequences can be viewed as sentences and can therefore be decomposed into chunks, which we refer to as protein words. We first adapt the Byte Pair Encoding (BPE) tokenization method from natural language processing to protein sequences and then apply a graph-based approach using the Metapath2Vec algorithm to learn representations of sequences. The results show that incorporating a word-based representation of proteins improves the performance of the graph-based approach. In addition, two other embedding methods derived from text representation models, SeqVec and ProtBERT, are evaluated for predicting pathogen-host protein interactions. The results on three virus-host protein interaction datasets show that the sequence-based protein representation approaches are promising and achieve performance comparable to state-of-the-art methods.
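To illustrate the idea of protein words, the following is a minimal sketch of Byte Pair Encoding applied to amino-acid sequences. It is not the thesis implementation: the function names (learn_bpe, merge_pair, tokenize), the toy sequences, the number of merges, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn an ordered list of merges from a corpus of protein sequences."""
    # Start with every sequence split into single residues.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def tokenize(sequence, merges):
    """Split a sequence into protein words by replaying the learned merges."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy usage with made-up sequences (not real proteins).
toy_sequences = ["MKVLAAMKVL", "MKVLGG"]
toy_merges = learn_bpe(toy_sequences, num_merges=3)
print(tokenize("MKVLA", toy_merges))
```

In a pipeline like the one described, the resulting words could serve as additional nodes or features for a graph-based embedding method. How the thesis wires them into Metapath2Vec is not specified here.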