Mohite, Shivraj Prithviraj (2023) Classification of PII and Non PII files using Machine learning (NER) for Data loss Prevention. Masters thesis, Dublin, National College of Ireland.
Abstract
Data breaches have become common as organizations produce and process growing volumes of sensitive data. Such data is called PII, or personally identifiable information: if leaked, it can point to a particular individual and put that person's privacy at risk. Organizations therefore use data loss prevention (DLP) systems to detect PII, which may belong to their employees or their clients, and prevent its exfiltration. To protect this data they purchase expensive DLP tools that classify and detect PII files using keyword-based classification, regular expressions, and similar techniques, but traditional legacy DLP systems are not very accurate. To improve the quality of detection, this thesis introduces a machine learning based approach that uses Named Entity Recognition (NER), an NLP method that extracts information from text. A pre-trained BERT model, fine-tuned for NER, is used to classify PII and Non PII files with an accuracy of more than 92%. This approach can be more effective at detecting PII files than traditional legacy DLP approaches.
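To make the file-level decision concrete, the following is a minimal Python sketch of the legacy regular-expression approach the abstract contrasts against. It is not the thesis's implementation: the pattern set, the names `regex_detector` and `classify_text`, and the `min_entities` threshold are all assumptions made for illustration. The detector is passed in as a function, so an NER-based detector that returns the same `(label, text)` pairs could in principle be substituted for the regex one.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (label, matched text)

# Hypothetical patterns for illustration only; real DLP rule sets are much larger.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def regex_detector(text: str) -> List[Entity]:
    """Return every pattern match in the text as a (label, matched text) pair."""
    found: List[Entity] = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


def classify_text(
    text: str,
    detector: Callable[[str], List[Entity]] = regex_detector,
    min_entities: int = 1,  # assumed threshold, not taken from the thesis
) -> str:
    """Label the text 'PII' if the detector finds at least min_entities entities."""
    return "PII" if len(detector(text)) >= min_entities else "Non PII"
```

For example, `classify_text("Contact jane.doe@example.com")` returns `"PII"`, while text with no matching pattern is labelled `"Non PII"`. A pattern-based detector like this misses entities such as personal names that follow no fixed format, which is the kind of gap an NER model is meant to close.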