NORMA eResearch @NCI Library

Safeguarding Sensitive Data - Detection in Unstructured Text Using Cutting-Edge Transformer Architectures

Rai, Animesh Kumar (2024) Safeguarding Sensitive Data - Detection in Unstructured Text Using Cutting-Edge Transformer Architectures. Masters thesis, Dublin, National College of Ireland.


Abstract

Detecting personally identifiable information (PII) in unstructured text enables organizations to protect privacy and meet legal data protection requirements under GDPR, HIPAA and CCPA. Conventional prescriptive methods often fail on the complexities of unstructured data, which calls for more advanced approaches. This research focused on using transformer-based models, including DeBERTa, RoBERTa, DistilBERT and Longformer, to enhance named entity recognition (NER) methods for identifying PII. The analysis used the ‘Learning Agency Lab - PII Data Detection’ dataset available on Kaggle. The models were trained to detect different forms of PII, including but not limited to names, email addresses and phone numbers. Among these models, DeBERTa showed the best performance, with an F1-score of 0.91, indicating high precision and recall across all classes. Longformer was promising for long texts because of its ability to maintain context, while RoBERTa offered a reasonable balance between speed and accuracy. However, for certain rare PII types, including emails and identification numbers, all the models struggled to reach the intended performance regardless of the level of dataset balancing and augmentation. Hyperparameter tuning and dropout regularization were among the techniques that further improved the models, increasing generalization and reducing overfitting. Despite limitations, namely class imbalance and the inherent sparsity of certain PII types, the findings underline the potential of transformer-based models. Future research may explore better data augmentation techniques, boosting models with other methods, and domain-specific pretraining approaches. The findings of this research are valuable for academic and industrial purposes in building large-scale, efficient PP systems.
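
As a rough illustration of the F1-score reported in the abstract, the sketch below computes micro-averaged precision, recall and F1 over token-level labels. It is a minimal example under stated assumptions, not the thesis's evaluation code: the function name `token_f1`, the BIO-style label strings and the sample sequences are all hypothetical, and the thesis may have averaged scores differently.

```python
from typing import Dict, Sequence


def token_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over tokens that carry a PII label.

    Assumption: labels are compared token by token, and the `outside` label
    marks tokens that are not PII. This is an illustrative scoring scheme only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    # True positive: a PII token predicted with exactly the right label.
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    # False positive: a PII label predicted where the gold label differs.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PII token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical label sequences, invented for illustration only.
gold = ["B-NAME", "I-NAME", "O", "B-EMAIL", "O"]
pred = ["B-NAME", "I-NAME", "O", "O", "B-PHONE"]
scores = token_f1(gold, pred)
print(scores)
```

In this toy case the missed email counts as a false negative and the spurious phone label as a false positive, which shows how a rare class with few examples can pull the score down even when common classes such as names are predicted correctly.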

Item Type: Thesis (Masters)
Supervisors: Jameel Syed, Muslim (Email: UNSPECIFIED)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QA Mathematics > Computer software > Computer Security
T Technology > T Technology (General) > Information Technology > Computer software > Computer Security
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Master of Science in Artificial Intelligence
Depositing User: Ciara O'Brien
Date Deposited: 20 Jun 2025 10:01
Last Modified: 20 Jun 2025 10:01
URI: https://norma.ncirl.ie/id/eprint/7962
