NORMA eResearch @NCI Library

Classification of PII and Non PII files using Machine learning (NER) for Data loss Prevention

Mohite, Shivraj Prithviraj (2023) Classification of PII and Non PII files using Machine learning (NER) for Data loss Prevention. Masters thesis, Dublin, National College of Ireland.


Abstract

Data breaches have become common as organizations produce and process growing volumes of sensitive data. Much of this sensitive data is Personally Identifiable Information (PII): information which, when leaked, can point to a particular individual and put that person's privacy at risk. To detect PII and prevent its exfiltration, organizations use Data Loss Prevention (DLP) systems. The PII may belong to their employees or to their clients, and to protect it organizations purchase costly DLP tools that classify and detect PII files using keyword-based classification, regular expressions and similar techniques. These traditional legacy DLP systems are not very accurate. To improve the quality of detection, this thesis introduces a machine learning based approach that uses Named Entity Recognition (NER), a Natural Language Processing (NLP) method that extracts information from text. A pre-trained BERT model fine-tuned for NER is used to classify PII and Non PII files with an accuracy of more than 92%. This approach can be more effective at detecting PII files than traditional legacy DLP approaches.
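
The file-level decision described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the label set `PII_LABELS`, the thresholds `min_score` and `min_entities`, and the helper names `classify_text` and `toy_ner` are all hypothetical. The sketch assumes the NER step returns entities as dictionaries with `entity_group` and `score` keys, which is the shape produced by a Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`. A regex stand-in replaces the fine-tuned BERT model so that the example runs offline.

```python
import re
from typing import Callable, Dict, List

Entity = Dict[str, object]

# Entity labels treated as PII. This set is an assumption for illustration;
# the labels used in the thesis are not stated in the abstract.
PII_LABELS = {"PER", "LOC", "EMAIL"}


def classify_text(
    text: str,
    ner: Callable[[str], List[Entity]],
    min_score: float = 0.8,
    min_entities: int = 1,
) -> str:
    """Label a file's text as "PII" or "Non PII" from NER predictions.

    `ner` is any callable returning dicts with "entity_group" and "score"
    keys. Both thresholds are hypothetical tuning parameters.
    """
    hits = [
        e for e in ner(text)
        if e["entity_group"] in PII_LABELS and e["score"] >= min_score
    ]
    return "PII" if len(hits) >= min_entities else "Non PII"


def toy_ner(text: str) -> List[Entity]:
    """Offline stand-in for a real NER model: finds e-mail addresses only."""
    entities: List[Entity] = []
    for m in re.finditer(r"[\w.+-]+@[\w-]+\.[\w.]+", text):
        entities.append({
            "entity_group": "EMAIL",
            "score": 1.0,
            "word": m.group(),
            "start": m.start(),
            "end": m.end(),
        })
    return entities


if __name__ == "__main__":
    print(classify_text("Contact jane.doe@example.com for details.", toy_ner))
    print(classify_text("The quarterly report is attached.", toy_ner))
```

To try this with a real model, `toy_ner` could be swapped for a `transformers` pipeline created with `pipeline("token-classification", model=..., aggregation_strategy="simple")`, with `PII_LABELS` adjusted to whatever labels the chosen model emits.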

Item Type: Thesis (Masters)
Supervisors:
Name: Vangujar, Apurva
Email: UNSPECIFIED
Uncontrolled Keywords: NER: Named entity recognition; NLP: Natural language processing; BERT model; machine learning; classifier; python; transformers; spaCy; PII; Non PII
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Q Science > QA Mathematics > Computer software > Computer Security
T Technology > T Technology (General) > Information Technology > Computer software > Computer Security
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Master of Science in Cyber Security
Depositing User: Tamara Malone
Date Deposited: 24 Oct 2024 14:39
Last Modified: 24 Oct 2024 14:39
URI: https://norma.ncirl.ie/id/eprint/7133
