Image Captioning using Transfer Learning
Author(s):
Yash Mittal, Meerut Institute of Engineering and Technology; Kushagra Chauhan, Meerut Institute of Engineering and Technology; Sunil Kumar, Meerut Institute of Engineering and Technology
Keywords:
Computer Vision, NLTK, RNN, CNN, Semantics, Dictionary
Abstract |
In recent years, rapid evolution has been seen across various fields of science: advances have moved from simple mathematical equations to computer algorithms that solve complex real-world problems. Computer vision is one such area, developed for processing images and tackling tasks such as image classification, object detection, and image captioning based on deep neural networks. Computer vision applies artificial intelligence techniques and algorithms to analyse an image and derive meaningful information or conclusions from it. With developments in deep learning and natural language processing, we can train a model to predict one or two sentences describing any image. The prediction is based on the objects present, their semantics, and other important factors that cannot be overlooked while generating the caption. The generated caption should be grammatically correct and composed of dictionary words; for this we use the Natural Language Toolkit (NLTK) together with a dictionary of the words from which sentences are generated. Although generating image captions is a complicated and difficult task, we achieve it using deep neural networks based on CNN-RNN and CNN-CNN architectures combined with transfer learning. We maintain a grammar under which a generated sentence contains at most 35-36 words. Humans can describe a scene using patterns learned in parallel while learning to distinguish things, naming objects by combining all of these factors. In this paper we focus on this type of learning; because images are two-dimensional, we map the image space to the most probable words that can occur in the image and learn this mapping by training the model.
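To make the pipeline the abstract describes concrete, the following is a minimal, hypothetical sketch of the encoder-decoder idea: a frozen, pre-trained CNN (the transfer-learning component, stood in here by a stub function) produces an image feature vector, and a decoder predicts the caption word by word up to the 35-36 word limit. The vocabulary, dimensions, and the randomly initialised decoder weights are illustrative placeholders, not the authors' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real system would build this from the training captions.
vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]
V = len(vocab)
feat_dim = 8  # stand-in for e.g. a 2048-d feature from a pre-trained CNN

def cnn_features(image):
    """Stand-in for a frozen, pre-trained CNN encoder (transfer learning):
    in practice this would be a model such as InceptionV3 or ResNet with
    its classification head removed, and its weights kept fixed."""
    return rng.standard_normal(feat_dim)

# Tiny trainable decoder head: (image feature ++ previous word) -> next word.
# Randomly initialised here; training would fit these weights on caption data.
W = rng.standard_normal((feat_dim + V, V))

def greedy_caption(image, max_len=36):
    """Greedy word-by-word decoding, capped at the paper's 35-36 word limit."""
    feats = cnn_features(image)
    words = ["<start>"]
    for _ in range(max_len):
        onehot = np.zeros(V)
        onehot[vocab.index(words[-1])] = 1.0
        logits = np.concatenate([feats, onehot]) @ W
        nxt = vocab[int(np.argmax(logits))]  # pick the most probable word
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words[1:])

caption = greedy_caption(None)
```

With an untrained decoder the output words are meaningless; the point of the sketch is the control flow — frozen encoder, learned head, greedy decoding under a length cap — onto which a real CNN-RNN model and NLTK-based grammar/dictionary checks would be attached.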
Other Details |
Paper ID: IJSRDV11I20111 | Published in: Volume 11, Issue 2 | Publication Date: 01/05/2023 | Page(s): 195-201