Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation

Date

2014.

Publisher

Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.

Abstract

The training data size is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, and decoding speed, as well as the system's overall success. One of the challenges in developing SMT systems for languages with fewer resources is the limited size of the available training data. In this thesis, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method first ranks the out-of-domain sentences using a language modeling approach and then adds sentences to the training set using the vocabulary saturation filter technique. We evaluated our approach on the English-Turkish language pair and obtained promising results: performance improvements of up to +0.8 BLEU points for English-Turkish translation are achieved. We also compared our results with translation model combination approaches and with the best English-Turkish translation systems, and reported the improvements. Moreover, we implemented our system with dependency-based language modeling in addition to n-gram-based language modeling and reported comparable results.
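The two-step selection described in the abstract (rank by a language model, then filter by vocabulary saturation) can be illustrated with a minimal sketch. It is not the thesis implementation, and everything in it is an assumption made for illustration: the function names, the add-one smoothed unigram model standing in for the n-gram and dependency-based language models, the whitespace tokenization, the source-side-only scoring, and the threshold value. The filter keeps a sentence pair only if it contains at least one token that has been seen fewer than `threshold` times among the pairs already kept.

```python
import math
from collections import Counter


def train_unigram_lm(in_domain_sentences):
    """Build an add-one smoothed unigram LM (a simple stand-in for a real LM)."""
    counts = Counter(tok for s in in_domain_sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens

    def log_prob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return log_prob


def cross_entropy(sentence, log_prob):
    """Average negative log-probability per token; lower means closer to in-domain."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")
    return -sum(log_prob(t) for t in tokens) / len(tokens)


def rank_by_lm(pairs, log_prob):
    """Sort (source, target) pairs by source-side cross-entropy, best first."""
    return sorted(pairs, key=lambda pair: cross_entropy(pair[0], log_prob))


def vocabulary_saturation_filter(ranked_pairs, threshold=1):
    """Keep a pair only if it still contributes an under-represented token.

    Assumption: counts are taken over source-side unigrams and are updated
    only for the pairs that are kept.
    """
    seen = Counter()
    kept = []
    for source, target in ranked_pairs:
        tokens = source.split()
        if any(seen[t] < threshold for t in tokens):
            kept.append((source, target))
            seen.update(tokens)
    return kept
```

Because the filter walks the sentences in ranked order, the sentences judged closest to the in-domain data get the first chance to cover each vocabulary item, and later sentences that add no new vocabulary are dropped. Raising `threshold` keeps more data at the cost of more redundancy.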
