Ph.D. Theses
Browsing Ph.D. Theses by Subject "Automatic speech recognition."

Keyword search for low resource languages
(Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2018.) Gündoğdu, M. Batuhan; Saraçlar, Murat
Retrieval of spoken content is a key endeavor, not only for finding the speech parts of interest, but also for automated and facilitated speech mining towards better automatic speech recognition (ASR) systems. In particular, keyword search (KWS) systems aim to address these goals by locating the specific parts of speech where a user-provided keyword is uttered. The most intuitive and convenient method for keyword search is to obtain text transcriptions from speech using ASR systems, and then conduct a text-based search on this ASR output. However, for low resource languages, for which the available labeled speech training data is not sufficient, reliable ASR systems cannot be built and KWS systems that depend on them will fail. Furthermore, if the keyword of interest is not within the vocabulary of the ASR system, it can never be found in the word-level transcriptions. In this thesis, we address the above-mentioned issues of KWS for low resource languages. We aim to build a KWS system using a completely different approach, with ideas inspired by the similarity search techniques of query-by-example retrieval tasks. For this, we utilize a subsequence dynamic time warping-based search, after artificially modeling "pseudo examples" for text queries. Furthermore, we investigate joint learning of these query representations and a proper distance metric for use in dynamic time warping. We show that the proposed KWS system outperforms state-of-the-art KWS techniques for the retrieval of out-of-vocabulary terms, and provides significant improvements when combined with the conventional ASR-based KWS system due to its heterogeneity.

Single-channel speech-music separation for robust ASR with mixture of NMF models
(Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.) Demir, Cemil; Saraçlar, Murat; Cemgil, Ali Taylan
In this dissertation, we analyze the single-channel speech-music separation problem for automatic speech recognition (ASR). The motivation of the study is to increase the performance of ASR systems by decreasing the effect of background music. We describe a single-channel speech-music separation method based on a mixture of nonnegative matrix factorization (NMF) models. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog, and it is modeled by a scaled conditional mixture model representing the jingle. The speech signal is modeled by an NMF model that is estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models, which correspond respectively to the Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures. Our experiments show that the proposed mixture model outperforms a standard NMF method in both speech-music separation and ASR tasks. Moreover, we extend the mixture of NMF based single-channel speech-music separation method so that it incorporates prior speech information to enhance the separation performance. Finally, we propose to use sub-word NMF-based speech models for the separation of speech and music signals. It is demonstrated that this strategy improves recognition accuracy compared to using a general speech model.

Statistical and discriminative language modeling for Turkish large vocabulary continuous speech recognition
(Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2009.) Arısoy, Ebru; Saraçlar, Murat
Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words, which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have mostly been handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated, and they yield substantial improvements over word language models. Our novel approach, inspired by dynamic vocabulary adaptation, mostly recovers the errors caused by over-generation and further improves the accuracy of sub-lexical units. Additionally, discriminative language models (DLMs) with linguistically and statistically motivated features are utilized. DLMs outperform the conventional approaches, partly due to the improved parameter estimates with discriminative training and partly due to integrating the complex language characteristics of Turkish into language modeling. The significance of this dissertation lies in being a comparative study of several sub-lexical units on the same LVCSR system, addressing the over-generation problem of sub-lexical units, and extending sub-lexical-based generative language modeling of Turkish to discriminative language modeling. These approaches can be easily extended to other morphologically rich languages that suffer from similar problems.

Supervised, semi-supervised and unsupervised methods in discriminative language modeling for automatic speech recognition
(Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2016.) Dikici, Erinç; Saraçlar, Murat
Discriminative language modeling aims to reduce error rates by rescoring the output of an automatic speech recognition (ASR) system. Discriminative language model (DLM) training conventionally follows a supervised approach, using acoustic recordings together with their manual transcriptions (references) as training examples, and recognition performance improves with an increasing amount of such matched data. In this thesis, we investigate the case where matched data for DLM training is limited or not available at all, and explore methods to improve ASR accuracy by incorporating unmatched acoustic and text data that come from separate sources. For semi-supervised training, we utilize weighted finite-state transducer and machine translation based confusion models to generate artificial hypotheses in addition to the real ASR hypotheses. For unsupervised training, we explore target output selection methods to replace the missing reference. We handle discriminative language modeling both as a structured prediction and a reranking problem, and employ variants of the perceptron, MIRA and SVM algorithms adapted for both problems. We propose several hypothesis sampling approaches to decrease the complexity of the algorithms and to increase the diversity of artificial hypotheses. We obtain significant improvements over baseline ASR accuracy even when there is no transcribed acoustic data available to train the DLM.

Telephone-based text-dependent speaker verification
(Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2011.) Büyük, Osman; Arslan, Levent M.
In this thesis, we investigate model selection and channel variability issues in telephone-based text-dependent speaker verification applications. Due to the lack of an appropriate database for the task, we collected two multi-channel speaker recognition databases, which are referred to as text-dependent variable text (TDVT-D) and text-dependent single utterance (TDSU-D). TDVT-D consists of digit strings and short utterances in Turkish, and TDSU-D contains a single Turkish phrase. On TDVT-D, Gaussian mixture model (GMM) and hidden Markov model (HMM) based methods are compared using several authentication utterances, enrollment scenarios and enrollment-authentication channel conditions. In the experiments, we employ a rank-based decision making procedure. In the second set of experiments, we investigate three channel compensation techniques together with cepstral mean subtraction (CMS): i) LTAS filtering, ii) MLLR transformation, and iii) handset-dependent rank-based decision making (Hrank). All three methods require prior knowledge of the employed channel type. We recognize the channels with channel GMMs trained for each condition. In this section, we also analyze the influence of channel detection errors on the verification performance. On TDSU-D, phonetic HMM, sentence HMM and GMM based methods are compared for the single utterance task. In order to compensate for channel mismatch conditions, we implement test normalization (T-norm), zero normalization (Z-norm) and combined (i.e., TZ-norm and ZT-norm) score normalization techniques. We also propose a novel combination procedure referred to as C-norm. Additionally, we benefit from prior knowledge of the handset-channel type in order to improve the verification performance. A cohort-based channel detection method is introduced in addition to the classical GMM-based method. After the score normalization section, the feature domain spectral mean division (SMD) method is presented as an alternative to the well-known CMS. In the last set of experiments, prosodic (energy, pitch, duration) and spectral features are combined in the sentence HMM framework.