Open Access Journal Article

Machine Learning-Based Approaches for Image Captioning

by Daniel Harris *
* Author to whom correspondence should be addressed.
TASC 2023, 5(1), 36; https://doi.org/10.69610/j.tasc.20230216
Received: 5 January 2023 / Accepted: 26 January 2023 / Published Online: 16 February 2023

Abstract

The field of computer vision has witnessed significant advances with the advent of machine learning techniques. Among these, image captioning stands out as a challenging task that involves generating textual descriptions of images. This paper presents a comprehensive overview of machine learning-based approaches to image captioning. We discuss the evolution of captioning techniques from traditional methods to modern deep learning models, and we examine the challenges these models face, such as the variability of image content, the diversity of language, and the need for contextual understanding. We also explore the role of models pre-trained on large datasets such as ImageNet in improving captioning performance. Furthermore, we analyze the impact of different architectures, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers, on the quality of generated captions. Finally, the paper discusses potential applications of machine learning-based image captioning in domains such as accessibility, content creation, and information retrieval. We aim to provide a foundational understanding of the current state of the art in image captioning and to identify directions for future research.
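The encoder-decoder pattern surveyed above, in which a CNN encodes the image into a feature vector and an RNN decodes that vector into words, can be sketched minimally as follows. This is an illustrative toy, not any specific system from the paper: the vocabulary, dimensions, and randomly initialized weight matrices are all assumptions standing in for a trained model, and greedy argmax decoding stands in for the beam search real captioners often use.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]  # toy vocabulary (assumption)
D = 8  # hidden/feature dimensionality (assumption)

# Random parameters standing in for a trained captioning model.
W_img = rng.normal(size=(D, D))           # projects CNN image features to the initial RNN state
W_h = rng.normal(size=(D, D))             # recurrent (state-to-state) weights
W_e = rng.normal(size=(len(VOCAB), D))    # token embedding table
W_out = rng.normal(size=(D, len(VOCAB)))  # maps RNN state to vocabulary logits

def caption(image_features, max_len=10):
    """Greedy decoding: the image features initialize the RNN state,
    then the most likely token is emitted at each step until <end>."""
    h = np.tanh(W_img @ image_features)    # "encoder" output seeds the decoder
    token = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_e[token])  # simple Elman-style recurrence
        token = int(np.argmax(h @ W_out))  # pick the highest-scoring next token
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

print(caption(rng.normal(size=D)))
```

With trained weights, the same loop produces fluent captions; attention-based and transformer decoders replace the single recurrence with per-step weighting over spatial image features.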


Copyright: © 2023 by Harris. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (Creative Commons Attribution 4.0 International License). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Share and Cite

ACS Style
Harris, D. Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing, 2023, 5, 36. https://doi.org/10.69610/j.tasc.20230216
AMA Style
Harris D. Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing. 2023;5(1):36. https://doi.org/10.69610/j.tasc.20230216
Chicago/Turabian Style
Harris, Daniel. 2023. "Machine Learning-Based Approaches for Image Captioning." Transactions on Applied Soft Computing 5, no. 1: 36. https://doi.org/10.69610/j.tasc.20230216
APA Style
Harris, D. (2023). Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing, 5(1), 36. https://doi.org/10.69610/j.tasc.20230216
