Revisiting image captioning structures based on CNN and RNN, and improving the performance using modified decoders with residual connections


Date

2023

Publisher

Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2023.

Abstract

In this thesis, the image captioning structure consisting of a Convolutional Neural Network (CNN) as the encoder and a Recurrent Neural Network (RNN) as the decoder is revisited by comparing and evaluating the effects of different image feature extractors, different RNN cells, different types of word embeddings, and the addition of residual connections between the RNN cells. The well-known “Show, Attend and Tell” model is modified by adding residual connections between the RNN cells, together with further modifications on both the encoder and the decoder side, which improve the performance of the model on the image captioning task. Furthermore, models are trained with three different pre-trained word embeddings and their benefits are explored. With the best model, improvements of 34 BLEU-4 points and 15 SPICE points are achieved over the base model. The effect of training the best model on images transformed into the frequency domain, rather than images represented in the spatial domain, is also investigated, and it is concluded that this approach does not enhance the performance of the model. The results of the experiments demonstrate the effectiveness of the proposed modifications and provide insights into the potential of residual connections.
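
To make the idea of a residual connection between decoder RNN cells concrete, the sketch below shows one possible arrangement in PyTorch, assuming two stacked LSTM cells where the output of the first cell is added to the output of the second before word prediction. The class name `ResidualLSTMDecoder`, the dimensions, and the exact placement of the skip connection are illustrative assumptions and are not taken from the thesis; the attention mechanism of “Show, Attend and Tell” is omitted for brevity.

```python
import torch
import torch.nn as nn


class ResidualLSTMDecoder(nn.Module):
    """Illustrative two-layer LSTM decoder with a residual (skip)
    connection from the first cell's output to the second cell's output.
    This is a sketch of the general idea, not the thesis's exact model."""

    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell1 = nn.LSTMCell(embed_dim, hidden_dim)
        self.cell2 = nn.LSTMCell(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, states1, states2):
        # token: (batch,) word indices for the current time step
        x = self.embed(token)                # (batch, embed_dim)
        h1, c1 = self.cell1(x, states1)      # first LSTM cell
        h2, c2 = self.cell2(h1, states2)     # second LSTM cell, fed by the first
        h_out = h1 + h2                      # residual connection between the cells
        logits = self.fc(h_out)              # scores over the vocabulary
        return logits, (h1, c1), (h2, c2)


if __name__ == "__main__":
    # Hypothetical sizes for a quick shape check; real values would come
    # from the chosen encoder, embedding, and dataset vocabulary.
    decoder = ResidualLSTMDecoder(embed_dim=256, hidden_dim=512, vocab_size=10000)
    batch = 4
    zeros = torch.zeros(batch, 512)
    states1 = (zeros.clone(), zeros.clone())
    states2 = (zeros.clone(), zeros.clone())
    tokens = torch.zeros(batch, dtype=torch.long)   # e.g. a <start> token index
    logits, states1, states2 = decoder(tokens, states1, states2)
    print(logits.shape)                             # torch.Size([4, 10000])
```

Adding the two hidden states requires the stacked cells to share the same hidden dimension; in an encoder–decoder captioner the initial states would typically be derived from the CNN image features rather than zero-initialized as in this toy example.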
