Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation

Aydın, Burak.

Files

b1792103.021704.001.PDF (296.71 KB)

Date

2014.

Authors

Aydın, Burak.

Publisher

Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.

Abstract

The size of the training data is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, and decoding speed, as well as the system's overall success. One of the challenges in developing SMT systems for languages with fewer resources is the limited size of the available training data. In this thesis, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method first ranks the out-of-domain sentences using a language modeling approach, and then adds sentences to the training set using the vocabulary saturation filter technique. We evaluated our approach on the English-Turkish language pair and obtained promising results: performance improvements of up to +0.8 BLEU points for English-Turkish translation are achieved. We also compared our results with translation model combination approaches and with the best English-Turkish translation systems, and reported the improvements. Moreover, we implemented our system with dependency-based language modeling in addition to n-gram based language modeling and reported comparable results.
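The two-step selection described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the add-one-smoothed unigram language model, the per-token cross-entropy ranking, the unigram-level saturation check, and all function names and example sentences are assumptions made for illustration only. The abstract does not specify which language model score or n-gram order was used.

```python
import math
from collections import Counter


def train_unigram_lm(in_domain_sentences):
    """Return a scoring function: per-token cross-entropy under an
    add-one-smoothed unigram model (an assumed stand-in for the LM)."""
    counts = Counter(tok for s in in_domain_sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def cross_entropy(sentence):
        toks = sentence.split()
        if not toks:
            return float("inf")
        logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in toks)
        return -logp / len(toks)

    return cross_entropy


def rank_by_lm(pairs, score):
    """Sort (source, target) pairs so the most in-domain-like source comes first."""
    return sorted(pairs, key=lambda pair: score(pair[0]))


def vocabulary_saturation_filter(pairs, threshold=1):
    """Keep a pair only if its source side has a token seen fewer than
    `threshold` times among the pairs kept so far (unigram variant)."""
    seen = Counter()
    kept = []
    for src, tgt in pairs:
        toks = set(src.split())
        if any(seen[t] < threshold for t in toks):
            kept.append((src, tgt))
            seen.update(toks)  # counts grow only for sentences that are kept
    return kept


# Hypothetical toy data, invented for this illustration.
in_domain = ["the patient takes the medicine", "the doctor sees the patient"]
out_of_domain = [
    ("the stock market fell", "borsa düştü"),
    ("the patient sees the doctor", "hasta doktoru görür"),
    ("the patient sees the doctor", "hasta doktoru görür"),
    ("the doctor takes the medicine", "doktor ilacı alır"),
]

score = train_unigram_lm(in_domain)
ranked = rank_by_lm(out_of_domain, score)
kept = vocabulary_saturation_filter(ranked, threshold=1)
```

Ranking first means the saturation filter meets the most in-domain-like sentences while the vocabulary is still unsaturated, so those are the ones that survive. Later sentences are kept only if they contribute tokens not yet covered.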

URI

https://hdl.handle.net/20.500.14908/12273

Collections

M.S. Theses
