Revisiting image captioning structures based on CNN and RNN, and improving the performance using modified decoders with residual connections

Saraçoğlu, Sinan.

dc.contributor	Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor	Anarım, Emin.
dc.contributor.author	Saraçoğlu, Sinan.
dc.date.accessioned	2025-04-14T12:25:50Z
dc.date.available	2025-04-14T12:25:50Z
dc.date.issued	2023
dc.description.abstract	In this thesis, the image captioning structure consisting of a Convolutional Neural Network (CNN) as the encoder and a Recurrent Neural Network (RNN) as the decoder is revisited by comparing and evaluating the effects of different image feature extractors, different RNN cells, different types of word embeddings, and the inclusion of residual connections between the RNN cells. The well-known “Show, Attend and Tell” model is modified by adding residual connections between the RNN cells, along with further modifications on both the encoder and the decoder side, which improves the performance of the model on the image captioning task. Furthermore, models are trained with three different pre-trained word embeddings, and their benefits are explored. With the best model, improvements of 34 BLEU-4 points and 15 SPICE points are achieved compared with the base model. The effects of training the best model on images transformed into the frequency domain, rather than images represented in the spatial domain, are also investigated, and it is concluded that this approach cannot enhance the performance of the model. The experimental results demonstrate the effectiveness of the proposed modifications and provide insights into the potential of residual connections.
dc.format.pages	xiii, 74 leaves
dc.identifier.other	Graduate Program in Electrical and Electronic Engineering. TKL 2023 U68 PhD (Thes SOC 2023 T74
dc.identifier.uri	https://hdl.handle.net/20.500.14908/21539
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2023.
dc.subject.lcsh	Convolutions (Mathematics)
dc.subject.lcsh	Recurrent sequences (Mathematics)
dc.title	Revisiting image captioning structures based on CNN and RNN, and improving the performance using modified decoders with residual connections

Files

Original bundle

Name: b2795548.038345.001.PDF
Size: 3.32 MB
Format: Adobe Portable Document Format

Collections

M.S. Theses