
(2) Khan Raqib Mahmud

(3) Abul Kalam Al Azad

(4) Md Shahabub Alam

(5) Anif Minhaz Khan

*corresponding author
Abstract
Automated image-to-text generation is a computationally challenging computer vision task that requires sufficient comprehension of both the syntactic and semantic meaning of an image to generate a meaningful description. Until recently, it has been studied only to a limited extent because of the lack of visual-descriptor datasets and of functional models that capture the intrinsic complexity of image features. In this study, a novel dataset called Bangla Natural Language Image to Text (BNLIT) was constructed by generating Bangla textual descriptors from visual input; it incorporates 100 annotated classes. A deep neural network-based image captioning model is proposed to generate image descriptions. The model employs a Convolutional Neural Network (CNN) to classify the whole dataset, while a Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) capture the sequential semantic representation of text-based sentences and generate pertinent descriptions based on the modular complexities of an image. When tested on the new dataset, the model achieves a significant improvement in performance on the image semantic retrieval task. For this experiment, we implemented a hybrid image captioning model that achieved remarkable results on the new self-built dataset; the task itself is new in the Bangladeshi context. In brief, the model provides benchmark precision in reconstructing characteristic Bangla syntax, and the study presents a comprehensive numerical analysis of the model's results on the dataset.
Keywords
convolutional neural network; hybrid recurrent neural network; long short-term memory; bi-directional RNN; natural language descriptors
DOI: https://doi.org/10.26555/ijain.v6i2.499

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)