Predicting intracellular functions of proteins from amino acid sequences using language processing methods

dc.contributor: Graduate Program in Computer Engineering.
dc.contributor.advisor: Özgür, Arzucan.
dc.contributor.author: Çaldır, Bedirhan.
dc.date.accessioned: 2023-10-15T06:48:30Z
dc.date.available: 2023-10-15T06:48:30Z
dc.date.issued: 2022
dc.description.abstract: Rapidly increasing computational power and sequencing technologies, which are at the peak of their development, enable the use of advanced, high-throughput algorithms to predict the intracellular functions of proteins, one of the most important problems in computational biology. The functions of proteins emerge primarily through their three-dimensional folded structures. When these structures are interpreted as graphs, the application of graph neural networks leads to promising results. However, these approaches are limited because the three-dimensional folded structures are not yet known for most proteins. The amino acid sequences of proteins have properties similar to natural languages, and large amounts of sequence data are available, which suggests that these sequences can be processed using natural language processing (NLP) methods. In this thesis, two different NLP methods are adapted to the problem of protein function prediction, assuming that the protein sequence data contain necessary and sufficient information to predict both three-dimensional folded structure and intracellular function: (i) a bidirectional Transformer (BERT) model and (ii) a heterogeneous Graph Convolutional Network (GCN) model. The results show that it is more advantageous to treat the proteins as graphs. The GCN model performs better than the BERT model and achieves performance close to that of the state-of-the-art model that uses three-dimensional folding information. In addition, we find that tokenizing the sequences, instead of using individual amino acids as tokens, improves performance.
dc.format.pages: xiii, 79 leaves
dc.identifier.other: CMPE 2022 C35
dc.identifier.uri: https://hdl.handle.net/20.500.14908/19700
dc.publisher: Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022.
dc.subject.lcsh: Proteins.
dc.subject.lcsh: Amino acid sequence.
dc.subject.lcsh: Computational biology.
dc.title: Predicting intracellular functions of proteins from amino acid sequences using language processing methods
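
The abstract contrasts using individual amino acids as tokens with tokenizing the sequences into larger units. The record does not say which tokenizer the thesis uses, so the following is only a minimal sketch of the general idea, assuming overlapping k-mers as a stand-in for the multi-residue tokens. The function names, the example sequence, and the default values of k and stride are all hypothetical and are not taken from the thesis.

```python
from typing import List


def residue_tokens(seq: str) -> List[str]:
    """Baseline: every amino acid (one letter) is its own token."""
    return list(seq)


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Illustrative alternative: slide a window of k residues over the sequence.

    This is an assumed stand-in for "tokenizing the sequences"; the thesis
    may use a different (for example, learned) tokenizer.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Windows that would run past the end of the sequence are dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    sequence = "MKTAYIAKQR"  # hypothetical 10-residue example sequence
    print(residue_tokens(sequence))
    print(kmer_tokens(sequence, k=3))
```

With per-residue tokens the vocabulary is limited to the amino acid alphabet, while multi-residue tokens give a larger vocabulary in which each token carries some local context. Whether this accounts for the improvement reported in the abstract is not stated in this record.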

Files

Original bundle

Name: b2778283.037650.001.PDF
Size: 2.69 MB
Format: Adobe Portable Document Format

Name: b2778283.037651.001.zip
Size: 283.71 MB
Format: Unknown data format
