NORMA eResearch @NCI Library

Eyes Through Words: Providing Independence to Visually Impaired with Photo-to-Text Technology

-, Sonia (2023) Eyes Through Words: Providing Independence to Visually Impaired with Photo-to-Text Technology. Masters thesis, Dublin, National College of Ireland.

PDF (Master of Science): Download (2MB)
PDF (Configuration manual): Download (5MB)

Abstract

Automatically captioning images is a difficult problem that spans both computer vision and natural language processing. The task has attracted recent attention because of its importance in applications such as self-driving cars, assistance for visually impaired people, and improved detection of malicious activity. This research focuses on captioning photographs of cricket matches. Approximately one thousand pictures of cricketers were gathered and labelled with an average of two captions each; because the images were collected and the labels generated by hand, the dataset is itself a contribution of this work. The resulting cricketer image-captioning model was compared against one trained on the popular Flickr dataset, which consists of 8,000 photographs with five captions each. The study used a Long Short-Term Memory (LSTM) recurrent neural network (RNN) to process textual captions, and a pre-trained InceptionV3 model, with its classification layer removed, to extract image features. The model's architecture comprises an image feature extractor, a sequence processor, and a decoder. Predicted captions were evaluated against the reference captions using BLEU scores; both BLEU-1 (uni-gram) and BLEU-2 (bi-gram) scores informed model selection. Implementation issues such as caption loading and caption prediction for new photographs are discussed at length. The cricket captioning model performed well on cricketer-related photographs, where it outperformed the Flickr-trained model, but it struggled when applied to unrelated images. In contrast, the Flickr-trained model performed well across a wide range of domains because it was built on a more comprehensive dataset.
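The abstract's three-part architecture (image feature extractor, sequence processor, decoder) can be sketched as a Keras "merge" model. This is a minimal illustration, not the thesis's actual code: the 2048-dimension input assumes InceptionV3's pooled features, and `vocab_size`, `max_len`, and the 256-unit layer widths are assumed, illustrative values.

```python
# Sketch of a merge-style captioning model: image features and a caption
# prefix are encoded separately, merged, and decoded into the next word.
# Assumes 2048-d InceptionV3 pooled features; other sizes are illustrative.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34  # assumed values

# Image feature extractor branch: project pooled CNN features to 256-d.
img_in = Input(shape=(2048,))
img_feat = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Sequence processor branch: embed the caption prefix, run it through an LSTM.
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_feat = LSTM(256)(Dropout(0.5)(seq_emb))

# Decoder: merge both branches and predict a distribution over the vocabulary.
merged = Dense(256, activation="relu")(add([img_feat, seq_feat]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time such a model is applied autoregressively: starting from a start token, the predicted word is appended to the prefix and fed back in until an end token or `max_len` is reached.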
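The BLEU-1 and BLEU-2 evaluation described above can be illustrated with a small, dependency-free implementation. This is a simplified sketch (modified n-gram precision, geometric mean, shortest-reference brevity penalty), not the exact scorer used in the thesis; libraries such as NLTK provide production versions.

```python
# Simplified BLEU: clipped n-gram precision up to max_n, combined by a
# geometric mean and scaled by a brevity penalty. Tokens are plain strings.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count at its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=2):
    """BLEU-1 for max_n=1 (uni-grams), BLEU-2 for max_n=2 (bi-grams)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    r = min(len(ref) for ref in references)  # shortest reference, for simplicity
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_avg)
```

For example, a predicted caption identical to its reference scores 1.0, while a partial overlap such as `"the batsman plays a shot"` against `"a batsman plays a cover drive"` scores between 0 and 1 on BLEU-1.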

Item Type: Thesis (Masters)
Supervisors: Milosavljevic, Vladimir (Email: UNSPECIFIED)
Uncontrolled Keywords: Automatic Caption Generation; computer vision; natural language processing; visual assistants; sports-related captioning; cricket images; recurrent neural network; BLEU (Bilingual Evaluation Understudy) scores; model selection; attention-based modules; hyper-parameter tuning for large CNN models; image feature vectors
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence > Computer vision
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence > Computer vision
R Medicine > Diseases > Disabilities
G Geography. Anthropology. Recreation > GV Recreation Leisure > Sports
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Tamara Malone
Date Deposited: 07 Nov 2024 16:41
Last Modified: 07 Nov 2024 16:41
URI: https://norma.ncirl.ie/id/eprint/7165
