Machine learning has driven significant advances in computer vision. Among them, image captioning stands out as a challenging task: generating a textual description of an image. This paper presents an overview of machine learning-based approaches to image captioning. We trace the evolution of captioning techniques from traditional methods to modern deep learning models and examine the challenges these models face, such as the variability of image content, the diversity of language, and the need for contextual understanding. We also explore how models pre-trained on large datasets such as ImageNet improve captioning performance. Furthermore, we analyze how different architectures, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers, affect the quality of generated captions. Finally, we discuss potential applications of machine learning-based image captioning in domains such as accessibility, content creation, and information retrieval. We aim to provide a foundational understanding of the current state of the art in image captioning and to identify directions for future research.
Harris, D. Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing, 2023, 5, 36. https://doi.org/10.69610/j.tasc.20230216
AMA Style
Harris D. Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing; 2023, 5(1):36. https://doi.org/10.69610/j.tasc.20230216
Chicago/Turabian Style
Harris, Daniel. 2023. "Machine Learning-Based Approaches for Image Captioning." Transactions on Applied Soft Computing 5, no. 1: 36. https://doi.org/10.69610/j.tasc.20230216
APA Style
Harris, D. (2023). Machine Learning-Based Approaches for Image Captioning. Transactions on Applied Soft Computing, 5(1), 36. https://doi.org/10.69610/j.tasc.20230216