Machine learning based language models on nucleotide sequences of human genes

dc.contributor: Graduate Program in Computer Engineering.
dc.contributor.advisor: Özgür, Arzucan.
dc.contributor.author: İhtiyar, Musa Nuri.
dc.date.accessioned: 2025-04-14T12:09:51Z
dc.date.available: 2025-04-14T12:09:51Z
dc.date.issued: 2023
dc.description.abstract: The use of computers in different fields of science has provided tremendous benefits. This is expected to become more common as the speed of computers and the amount of data available for different kinds of scientific problems increase. This study focuses on genomics, one of the most exciting areas of science. We applied several techniques to obtain models of the nucleotide sequences of human genes, so that a model can learn the general pattern in these sequences and predict how likely it is that an unseen sequence is a human gene. The models can even generate new nucleotide sequences. All of the methods used are examples of machine learning, where programs are designed to learn a specific task from data rather than being explicitly programmed with what to do at each step. Traditional approaches such as N-grams and more recent deep learning-based techniques such as recurrent neural networks and transformer-based language models are used. In addition to the classical metrics, the strength of the methods is measured using a real-world task from the field of genomics. Finally, the results show an interesting comparison of how all these models perform on a task that is inherently different from classical natural language processing tasks, and how simple models like N-grams can sometimes be as good as, if not better than, more sophisticated techniques such as transformers for solving certain types of problems. Furthermore, the results show the importance of evaluating the obtained models on real-life tasks: the transformer model was superior to the N-gram model according to perplexity, although it performed worse on the real-world task.
dc.format.pages: xii, 52 leaves
dc.identifier.other: Graduate Program in Computer Engineering. TKL 2023 U68 PhD (Thes POLS 2023 E66
dc.identifier.uri: https://hdl.handle.net/20.500.14908/21493
dc.publisher: Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2023.
dc.subject.lcsh: Nucleotide sequence.
dc.subject.lcsh: Machine learning.
dc.title: Machine learning based language models on nucleotide sequences of human genes

Files

Original bundle

Name: b2795766.038434.001.PDF
Size: 810.5 KB
Format: Adobe Portable Document Format