NORMA eResearch @NCI Library

Summarizing Newspaper Articles using Optical Character Recognition and Natural Language Processing

Tomar, Shashank Sanjay (2022) Summarizing Newspaper Articles using Optical Character Recognition and Natural Language Processing. Masters thesis, Dublin, National College of Ireland.



Newspapers are no longer the primary source of information in today's journalism market. In recent years, readers have shifted to more accessible digital sources, such as social media platforms and messaging applications. Because newspapers are so comprehensive, sifting through all the information they contain is laborious. The key goal of this study is to build an end-to-end solution suite that lets readers listen to an audio file containing a summary of the articles in a newspaper instead of reading them. It attempts to resolve several long-standing challenges in newspaper digitisation, such as handling complex page layouts and lengthy articles, and so gives readers a quick and reliable way to consume news. Using an unannotated open-source dataset of scanned newspaper pages, a Mask R-CNN model was trained to segment the individual articles on a page. Each article was then passed through a second Mask R-CNN stage to identify its text columns. After the column images were segmented, Tesseract (OCR) was used to extract the text, which was then cleaned and spell-checked using the Microsoft Bing API. To summarise the cleaned OCR text, a BERT NLP model was trained on a second open-source dataset (CNN-DailyMail). While training on only one-fifth of the data used in previous studies, the study produced an effective image segmentation model with a validation Mask R-CNN bounding-box loss of 0.187 and mask loss of 0.189, and extracted text from articles with a confidence score of 82.79. Text and audio summaries with a ROUGE-l score of 25.78 and a ROUGE-2 score of 18.21 were also produced.
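For readers unfamiliar with the summary metric quoted above, the following is a minimal sketch of how a ROUGE-N score can be computed from clipped n-gram overlap between a generated summary and a reference summary. It is illustrative only: the abstract does not say which ROUGE implementation or variant (precision, recall or F1) the thesis used, and the function names `ngrams` and `rouge_n`, the lower-casing, and the whitespace tokenisation are all assumptions made here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=2):
    """Return (precision, recall, f1) for ROUGE-N.

    Assumption: text is lower-cased and split on whitespace; real
    evaluations usually apply a proper tokenizer and often stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # which is the "clipped" overlap used by ROUGE.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n(generated, reference, n=2))
```

In this toy pair, three of the five bigrams in each sentence are shared, so precision, recall and F1 all come out at 0.6. Published scores such as those in the abstract are typically averaged over a whole test set and reported on a 0 to 100 scale.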

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Tamara Malone
Date Deposited: 14 Mar 2023 11:42
Last Modified: 14 Mar 2023 11:42
