Ph.D. Theses
Browsing Ph.D. Theses by Author "Arslan, Levent M."
Cross-lingual voice conversion (Thesis (Ph.D.), Bogazici University, Institute for Graduate Studies in Science and Engineering, 2007)
Türk, Oytun; Arslan, Levent M.
Cross-lingual voice conversion refers to the automatic transformation of a source speaker’s voice to a target speaker’s voice in a language that the target speaker cannot speak. It involves a set of statistical analysis, pattern recognition, machine learning, and signal processing techniques. This study focuses on the problems related to cross-lingual voice conversion by discussing open research questions, presenting new methods, and performing comparisons with state-of-the-art techniques. In the training stage, a phonetic hidden Markov model based automatic segmentation and alignment method is developed for cross-lingual applications; it supports both text-independent and text-dependent modes. The vocal tract transformation function is estimated in more detail using weighted speech frame mapping. By adjusting the weights, similarity to the target voice and output quality can be balanced depending on the requirements of the cross-lingual voice conversion application. A context-matching algorithm is developed to reduce one-to-many mapping problems and enable nonparallel training. Another set of improvements is proposed for prosody transformation, including stylistic modeling and transformation of pitch and speaking rate. A high-quality cross-lingual voice conversion database is designed for the evaluation of the proposed methods. The database consists of recordings from bilingual speakers of American English and Turkish. It is employed in objective and subjective evaluations, and in case studies for testing new ideas in cross-lingual voice conversion.

Telephone-based text-dependent speaker verification (Thesis (Ph.D.), Bogazici University, Institute for Graduate Studies in Science and Engineering, 2011)
Büyük, Osman; Arslan, Levent M.
In this thesis, we investigate model selection and channel variability issues in telephone-based text-dependent speaker verification applications. Due to the lack of an appropriate database for the task, we collected two multi-channel speaker recognition databases, which are referred to as text-dependent variable text (TDVT-D) and text-dependent single utterance (TDSU-D). TDVT-D consists of digit strings and short utterances in Turkish, and TDSU-D contains a single Turkish phrase. In the TDVT-D, Gaussian mixture model (GMM) and hidden Markov model (HMM) based methods are compared using several authentication utterances, enrollment scenarios, and enrollment-authentication channel conditions. In the experiments, we employ a rank-based decision making procedure. In the second set of experiments, we investigate three channel compensation techniques together with cepstral mean subtraction (CMS): i) LTAS filtering, ii) MLLR transformation, and iii) handset-dependent rank-based decision making (Hrank). All three methods require prior knowledge of the employed channel type. We recognize the channels with channel GMMs trained for each condition. In this section, we also analyze the influence of channel detection errors on the verification performance. In the TDSU-D, phonetic HMM, sentence HMM, and GMM based methods are compared for the single utterance task. In order to compensate for channel mismatch conditions, we implement test normalization (T-norm), zero normalization (Z-norm), and combined (i.e., TZ-norm and ZT-norm) score normalization techniques. We also propose a novel combination procedure referred to as C-norm. Additionally, we benefit from prior knowledge of the handset-channel type in order to improve the verification performance. A cohort-based channel detection method is introduced in addition to the classical GMM-based method. After the score normalization section, the feature-domain spectral mean division (SMD) method is presented as an alternative to the well-known CMS. In the last set of experiments, prosodic (energy, pitch, duration) and spectral features are combined in the sentence HMM framework.
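
As an illustration of the score normalization techniques named in the abstract above, the following is a minimal Python sketch of Z-norm and T-norm in their commonly described textbook form. It is not taken from the thesis. The function names, argument names, and example scores are hypothetical, and the thesis-specific C-norm procedure is not attempted because the abstract does not define it.

```python
from statistics import mean, pstdev


def _standardize(raw_score, reference_scores):
    """Shift and scale a score by the mean and standard deviation
    of a set of reference scores (hypothetical helper)."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (raw_score - mu) / sigma


def z_norm(raw_score, impostor_scores_for_model):
    """Z-norm: the reference scores come from scoring the claimed
    speaker's model against impostor utterances (computed offline)."""
    return _standardize(raw_score, impostor_scores_for_model)


def t_norm(raw_score, cohort_scores_for_utterance):
    """T-norm: the reference scores come from scoring the test
    utterance against a cohort of other speakers' models."""
    return _standardize(raw_score, cohort_scores_for_utterance)


# Made-up scores, for illustration only.
z_score = z_norm(3.0, [-1.0, 1.0])
t_score = t_norm(3.0, [1.0, 3.0])
```

Both functions apply the same standardization; they differ only in where the reference statistics come from, which is why Z-norm statistics can be precomputed per speaker model while T-norm statistics depend on each test utterance.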