A Bayesian approach to the clustering problem with application to gene expression analysis

Date

2016.

Publisher

Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2016.

Abstract

This thesis investigates methods for extracting information from gene expression time series data. These time series provide indirect measurements of the underlying biological mechanisms, so their analysis depends heavily on statistical modelling techniques. One particularly popular approach is to cluster genes by the similarity of their expression profiles. For scientific data analysis, however, clustering requires a rigorous methodology, and Bayesian nonparametrics provides a promising framework. In this context, two novel models were developed: the Infinite Multiway Mixture (IMM), which extends the standard infinite mixture model, and the Infinite Mixture of Piecewise Linear Sequences (IMPLS), which assumes a specific structure for its mixture components, tailored to gene expression time series. In the Bayesian paradigm, the key object for gene analysis is the posterior distribution over partitionings, given the model and the observed data. However, a posterior distribution over partitionings is a highly complicated object. Here, we apply Markov Chain Monte Carlo (MCMC) inference to obtain a sample from the posterior distribution of gene partitionings, and cluster genes with a heuristic algorithm. We also develop an alternative, novel approach to analysing distributions over partitions, which we call entropy agglomeration (EA). We demonstrate the use of EA in a clustering experiment on a literary text, Ulysses by James Joyce. In our bioinformatics application CLUSTERnGO (CnG), the relevance of the resulting clusters is evaluated by applying standard multiple hypothesis testing to compare them against previous biological knowledge encoded in terms of a Gene Ontology. The complete workflow of CnG consists of a four-phase pipeline (Configuration, Inference, Clustering, Evaluation).
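
To make the step from an MCMC sample of partitionings to a single clustering more concrete, the following is a minimal sketch of one common heuristic: estimate how often each pair of items shares a cluster across the sampled partitions, then link pairs whose co-clustering frequency exceeds a threshold. This is an illustrative assumption, not the heuristic algorithm or the entropy agglomeration method used in the thesis; the function names `coclustering_frequencies` and `threshold_clusters`, the label-list representation of a partition, and the default threshold of 0.5 are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def coclustering_frequencies(samples):
    """Estimate pairwise co-clustering frequencies from sampled partitions.

    `samples` is a list of partitions; each partition is a list holding one
    cluster label per item (hypothetical representation, assumed here).
    Returns a dict mapping (i, j) with i < j to the fraction of samples in
    which items i and j share a label. Pairs that never co-occur are omitted.
    """
    n_items = len(samples[0])
    counts = defaultdict(int)
    for labels in samples:
        for i, j in combinations(range(n_items), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


def threshold_clusters(freqs, n_items, threshold=0.5):
    """Group items by linking every pair whose frequency reaches `threshold`.

    Uses a small union-find structure, so clusters are the connected
    components of the thresholded co-clustering graph.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), freq in freqs.items():
        if freq >= threshold:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for item in range(n_items):
        groups[find(item)].append(item)
    return sorted(groups.values())
```

In use, `samples` would hold the label vectors drawn by the sampler after burn-in, and `threshold_clusters(coclustering_frequencies(samples), n_items)` would return the resulting groups of item indices. Connected components are only one way to summarise the pairwise frequencies; a stricter rule, such as requiring every pair within a cluster to exceed the threshold, would give smaller clusters.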
