Image Captioning using Transfer Learning
Author(s):
Yash Mittal, Meerut Institute of Engineering and Technology; Kushagra Chauhan, Meerut Institute of Engineering and Technology; Sunil Kumar, Meerut Institute of Engineering and Technology
Keywords:
Computer Vision, NLTK, RNN, CNN, Semantics, Dictionary
Abstract |
In recent years, rapid evolution has been seen across various fields of science: advances have moved from simple mathematical equations to computer algorithms that solve complex real-world problems. Computer vision is one such area, developed for processing images and tackling tasks such as image classification, object detection, and image captioning based on deep neural networks. Computer vision applies artificial intelligence techniques and algorithms to analyse an image and derive meaningful information or conclusions from it. With developments in deep learning and natural language processing, we can train a model to predict one or two sentences describing any image. The prediction is based on the objects present, their semantics, and other important factors that cannot be overlooked while generating the caption. The generated caption should be grammatically correct and composed of dictionary words; for this we use the Natural Language Toolkit (NLTK) together with a dictionary of the words from which sentences are generated. Although generating image captions is a complicated and difficult task, we achieve it using deep neural networks based on CNN-RNN and CNN-CNN architectures combined with transfer learning. We maintain a grammar under which a generated sentence contains at most 35-36 words. Humans can describe a scene using patterns learned in parallel while learning to distinguish things, naming objects by combining all of these factors. In this paper we focus on this type of learning; because images are two-dimensional, we map the image space to the most probable words that can occur in the image and learn this mapping by training the model.
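To make the pipeline the abstract describes concrete, the following is a minimal, hypothetical sketch of the encoder-decoder idea: a frozen, pre-trained CNN (the transfer-learning component, stood in here by a stub function) produces an image feature vector, and a decoder predicts the caption word by word up to the 35-36 word limit. The vocabulary, dimensions, and the randomly initialised decoder weights are illustrative placeholders, not the authors' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real system would build this from the training captions.
vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]
V = len(vocab)
feat_dim = 8  # stand-in for e.g. a 2048-d feature from a pre-trained CNN

def cnn_features(image):
    """Stand-in for a frozen, pre-trained CNN encoder (transfer learning):
    in practice this would be a model such as InceptionV3 or ResNet with
    its classification head removed, and its weights kept fixed."""
    return rng.standard_normal(feat_dim)

# Tiny trainable decoder head: (image feature ++ previous word) -> next word.
# Randomly initialised here; training would fit these weights on caption data.
W = rng.standard_normal((feat_dim + V, V))

def greedy_caption(image, max_len=36):
    """Greedy word-by-word decoding, capped at the paper's 35-36 word limit."""
    feats = cnn_features(image)
    words = ["<start>"]
    for _ in range(max_len):
        onehot = np.zeros(V)
        onehot[vocab.index(words[-1])] = 1.0
        logits = np.concatenate([feats, onehot]) @ W
        nxt = vocab[int(np.argmax(logits))]  # pick the most probable word
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words[1:])

caption = greedy_caption(None)
```

With an untrained decoder the output words are meaningless; the point of the sketch is the control flow — frozen encoder, learned head, greedy decoding under a length cap — onto which a real CNN-RNN model and NLTK-based grammar/dictionary checks would be attached.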
Other Details |
Paper ID: IJSRDV11I20111 | Published in: Volume 11, Issue 2 | Publication Date: 01/05/2023 | Page(s): 195-201