-, Gokulkrishna (2025) Detecting and Classifying Post-OCR Errors using Contrastive Self-Supervised Learning. Masters thesis, Dublin, National College of Ireland.
PDF (Master of Science), 1MB
PDF (Configuration Manual), 667kB
Abstract
This research addresses the challenge of detecting and classifying errors in Optical Character Recognition (OCR) output. The project introduces an approach that leverages contrastive self-supervised learning and integrates both textual and visual information. Using BERT for text embeddings and ResNet for image feature extraction, the system identifies and categorizes OCR error types. A fundamental step is generating synthetic errors from real-world datasets such as CORD and FUNSD, which enables a supervised learning framework for error classification. The model assigns each word to one of several types (missing word, correct word, spelling mistake, or wrong word) based on calculated similarity scores. Combining a classification loss with a contrastive loss improves the model's ability to learn the differences between embeddings. In evaluation, the model achieved an accuracy of 73% on FUNSD and 74% on CORD, performing especially well at detecting missing words. Although the system demonstrates good results on structured documents, there is room to improve its handling of fine-grained spelling mistakes and complicated layouts. The work contributes to robust OCR post-correction and is a step toward more trustworthy text digitization.
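The combined objective described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: plain Python lists stand in for the BERT and ResNet embeddings, and the function names, the `margin` and `weight` values, the label ordering, and the choice to treat only "correct word" pairs as matching are all assumptions made for illustration.

```python
import math

# Label set taken from the abstract; the ordering is an assumption.
ERROR_TYPES = ["correct word", "missing word", "spelling mistake", "wrong word"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def contrastive_loss(text_emb, image_emb, is_match, margin=0.5):
    """Margin-based pairwise contrastive loss on cosine distance.

    Matching pairs are pulled together; mismatched pairs are pushed
    apart until their distance reaches `margin` (hypothetical value).
    """
    distance = 1.0 - cosine_similarity(text_emb, image_emb)
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def cross_entropy(logits, label_index):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label_index]


def combined_loss(text_emb, image_emb, logits, label_index, weight=1.0):
    """Classification loss plus a weighted contrastive term.

    Assumption: only the "correct word" class counts as a matching
    text/image pair; every error class is treated as a mismatch.
    """
    is_match = ERROR_TYPES[label_index] == "correct word"
    return cross_entropy(logits, label_index) + weight * contrastive_loss(
        text_emb, image_emb, is_match
    )
```

In this sketch, a call such as `combined_loss(text_vec, image_vec, logits, label_index)` returns a single scalar per word. The classification term rewards the right error type, while the contrastive term shapes the distance between the text and image embeddings.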
| Item Type: | Thesis (Masters) |
|---|---|
| Supervisors: | Razzaq, Abdul (email: UNSPECIFIED) |
| Subjects: | Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence; P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing; Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence > Computer vision; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence > Computer vision |
| Divisions: | School of Computing > Master of Science in Artificial Intelligence |
| Depositing User: | Ciara O'Brien |
| Date Deposited: | 28 May 2026 11:17 |
| Last Modified: | 28 May 2026 11:35 |
| URI: | https://norma.ncirl.ie/id/eprint/9308 |