Prediction of pathogen-host interactions with protein sequence embeddings using deep learning

dc.contributor: Graduate program in Computer Engineering.
dc.contributor.advisor: Özgür, Arzucan.
dc.contributor.author: Oğuzoğlu, Büşra.
dc.date.accessioned: 2025-04-14T12:09:53Z
dc.date.available: 2025-04-14T12:09:53Z
dc.date.issued: 2023
dc.description.abstract: Infections caused by pathogens are a significant problem around the world. Determining protein interactions between pathogens and hosts is critical to understanding infection mechanisms and developing prevention and treatment strategies. Wet-lab experiments to identify protein interactions are expensive and time-consuming. Therefore, computational approaches have been proposed as a promising complementary solution. While 3D structures of proteins contain helpful information about protein functions, advances in sequencing technology have made 1D protein sequences widely available, and they are often used because they can be processed with less computational power. The main goal of this thesis is to develop a sequence-based approach for predicting pathogen-host protein interactions, based on the hypothesis that protein sequences can be viewed as sentences and can therefore be decomposed into chunks, which we refer to as protein words. We first adapt the Byte Pair Encoding (BPE) tokenization method from the field of natural language processing to protein sequences and then apply a graph-based approach using the Metapath2Vec algorithm to learn representations of sequences. The results show that incorporating a word-based representation of proteins improves the performance of the graph-based approach. In addition, two other methods for learning text representations, SeqVec and ProtBERT, are evaluated for predicting pathogen-host protein interactions. The results on three virus-host protein interaction datasets show that the sequence-based protein representation approaches are promising and achieve performance comparable to state-of-the-art methods.
dc.format.pages: xiv, 72 leaves
dc.identifier.other: Graduate program in Computer Engineering. TKL 2023 U68 PhD (Thes TR 2023 L43
dc.identifier.uri: https://hdl.handle.net/20.500.14908/21508
dc.publisher: Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2023.
dc.subject.lcsh: Deep learning (Machine learning)
dc.subject.lcsh: Host-parasite relationships.
dc.title: Prediction of pathogen-host interactions with protein sequence embeddings using deep learning
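
The abstract's idea of decomposing protein sequences into "protein words" with Byte Pair Encoding can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`learn_bpe`, `merge_pair`, `tokenize`), the tie-breaking rule, and the toy sequences are assumptions made purely for illustration, and a real vocabulary would be learned from a large protein corpus with many more merges.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn an ordered list of BPE merges from amino-acid sequences (toy version)."""
    # Start from single residues, the analogue of characters in text.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Split a sequence into protein words by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy corpus, not real training data.
merges = learn_bpe(["MKTMKT", "MKTA"], num_merges=2)
print(tokenize("MKTAMKT", merges))  # ['MKT', 'A', 'MKT']
```

The resulting tokens play the role of words in a sentence; according to the abstract, the thesis then feeds such word-level units into a graph-based representation learner (Metapath2Vec), which this sketch does not attempt to reproduce.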

Files

Original bundle

Name: b2795661.038392.001.PDF
Size: 621.06 KB
Format: Adobe Portable Document Format