NORMA eResearch @NCI Library

Detecting and Classifying Post-OCR Errors using Contrastive Self-Supervised Learning

-, Gokulkrishna (2025) Detecting and Classifying Post-OCR Errors using Contrastive Self-Supervised Learning. Masters thesis, Dublin, National College of Ireland.


Abstract

This research addresses the challenge of detecting and classifying errors in the output of Optical Character Recognition (OCR). It introduces an approach based on contrastive self-supervised learning that integrates both textual and visual information. Using BERT for text embeddings and ResNet for image feature extraction, the system identifies and categorizes OCR error types. A fundamental step of the project is generating synthetic errors from real-world datasets such as CORD and FUNSD, which enables a supervised learning framework for error classification. The model classifies each word into one of several types: missing words, correct words, spelling mistakes, and wrong words, based on calculated similarity scores. Combining a classification loss with a contrastive loss improves the model's ability to learn differences between embeddings. In evaluation, the model achieved an accuracy of 73% on FUNSD and 74% on CORD, and performed particularly well at detecting missing words. Although the system shows good results on structured documents, there is room to improve its handling of fine-grained spelling mistakes and complicated layouts. The work contributes to robust OCR post-correction and is a step toward more trustworthy text digitization.
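
The synthetic error generation step mentioned in the abstract could look roughly like the sketch below. This is only an illustration under stated assumptions, not the thesis's implementation: the function names `misspell` and `corrupt`, the `error_rate` parameter, and the specific corruption rules (one substituted character for a spelling mistake, a random vocabulary word for a wrong word, an empty string for a missing word) are all hypothetical. Only the four class names come from the abstract.

```python
import random
import string

# The four classes are taken from the abstract; everything else is an assumption.
LABELS = ("correct_word", "missing_word", "spelling_mistake", "wrong_word")


def misspell(word, rng):
    """Replace one character with a different lowercase letter (hypothetical rule)."""
    i = rng.randrange(len(word))
    choices = [c for c in string.ascii_lowercase if c != word[i].lower()]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def corrupt(tokens, vocabulary, error_rate=0.3, seed=0):
    """Return one (ground_truth, ocr_like, label) triple per ground-truth token."""
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    samples = []
    for token in tokens:
        if rng.random() >= error_rate:
            samples.append((token, token, "correct_word"))
            continue
        kind = rng.choice(LABELS[1:])
        if kind == "missing_word":
            # The OCR output lost the word entirely.
            samples.append((token, "", kind))
        elif kind == "spelling_mistake":
            samples.append((token, misspell(token, rng), kind))
        else:
            # Wrong word: substitute a different word from the vocabulary.
            others = [w for w in vocabulary if w != token]
            if others:
                samples.append((token, rng.choice(others), kind))
            else:
                # No alternative word available, so leave the token untouched.
                samples.append((token, token, "correct_word"))
    return samples


if __name__ == "__main__":
    words = "total amount due on receipt".split()
    for truth, noisy, label in corrupt(words, vocabulary=words, error_rate=0.5):
        print(f"{truth!r:12} -> {noisy!r:12} {label}")
```

Because every corrupted token keeps its ground-truth counterpart and a label, pairs produced this way can serve both as supervised classification targets and as similar or dissimilar pairs for a contrastive objective.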

Item Type: Thesis (Masters)
Supervisors: Razzaq, Abdul
Subjects: Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence > Computer vision
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence > Computer vision
Divisions: School of Computing > Master of Science in Artificial Intelligence
Depositing User: Ciara O'Brien
Date Deposited: 28 May 2026 11:17
Last Modified: 28 May 2026 11:35
URI: https://norma.ncirl.ie/id/eprint/9308
