Deep Image Captioning System Using an Attention Two-Layer LSTM Network
Abstract
Image captioning is the algorithmic task of automatically generating one or more natural language sentences that describe an input image. It is a cross-modal challenge: the semantic information present in an image must be expressed as natural-sounding text, and because of the enormous gap between human visual understanding and its corresponding natural language description, most existing techniques suffer from poor semantic matching between images and generated captions. Generating a caption from an image is therefore challenging, as it requires bridging human vision with computer vision. With the advent of deep learning, many real-time applications across different modalities, including image captioning, have achieved considerable success. This work presents an image captioning model that combines pre-trained CNN models for feature extraction with custom LSTM layers for caption generation. Several CNN backbones, namely ResNet-152, ResNet-50, VGG16, and Inception V3, were used to extract features from the images; ResNet-152 performed best among the architectures compared. ResNet-152 acts as the encoder block that extracts features from the image, while the LSTM acts as the decoder unit that generates the words describing it. After the caption generation phase, the effectiveness of our methodology was evaluated using several metrics, including BLEU-n and ROUGE-L. As a result, our method provides the user with a descriptive caption for the input image and achieves improved performance compared to existing state-of-the-art methods.
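The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch of the encoder-decoder idea it outlines, assuming a frozen ResNet-152 backbone, additive (Bahdanau-style) attention over the spatial feature grid, and a two-layer LSTM decoder; the layer sizes, vocabulary handling, and attention variant are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of a ResNet-152 encoder + attention two-layer LSTM decoder.
# Requires torchvision >= 0.13 for the weights API; dimensions are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """Extract a spatial feature grid from a pre-trained ResNet-152."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the average pool and FC head; keep the convolutional backbone.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # use the pre-trained features as-is

    def forward(self, images):                   # (B, 3, 224, 224)
        feats = self.backbone(images)            # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)


class AttentionDecoder(nn.Module):
    """Two-layer LSTM decoder with additive attention over image regions."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions, state=None):
        # feats: (B, 49, 2048); captions: (B, T) ground-truth token ids
        outputs = []
        h = torch.zeros(feats.size(0), self.lstm.hidden_size, device=feats.device)
        for t in range(captions.size(1)):
            # Attention weights over the 49 spatial locations.
            scores = self.attn(torch.cat(
                [feats, h.unsqueeze(1).expand(-1, feats.size(1), -1)], dim=-1))
            alpha = torch.softmax(scores, dim=1)          # (B, 49, 1)
            context = (alpha * feats).sum(dim=1)          # (B, 2048)
            word = self.embed(captions[:, t])             # (B, embed_dim)
            step_in = torch.cat([word, context], dim=-1).unsqueeze(1)
            out, state = self.lstm(step_in, state)
            h = out[:, -1, :]                             # top-layer hidden state
            outputs.append(self.fc(h))                    # word scores for step t
        return torch.stack(outputs, dim=1)                # (B, T, vocab_size)
```

Under these assumptions, a training step would pass a batch of images through `Encoder`, feed the resulting features and the ground-truth caption tokens to `AttentionDecoder`, and minimize cross-entropy between the predicted scores and the next-word targets; at inference, generated words are fed back in place of the ground-truth tokens.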