Verma, Suman (2023) Detection of Phishing in Mobile Instant Messaging using Natural Language Processing and Machine Learning. Masters thesis, Dublin, National College of Ireland.
Abstract
Advances in mobile technology have made real-time communication easy, but at the cost of a wider attack surface for phishing. Detecting phishing in instant messages is a matter of concern and active research because instant messaging is widely used for personal, professional, and business purposes. Since the inception of phishing, cyber attackers have gradually shifted their modus operandi from worms, viruses, and malicious links to the calculated use of language that invokes fear, urgency, or reward in instant messages aimed at mobile users. Phishing detection in e-mail and SMS using advanced technologies has been researched continuously, but detection of phishing in instant messages remains neglected. The widespread use of instant messengers by individuals of all ages, including the most susceptible groups such as the elderly and the young, necessitates security features for phishing detection and message filtering. This research aims to detect phishing in mobile instant messages by analysing the language of each message with Natural Language Processing and building a classifier that detects keywords pointing towards phishing. Indicators of phishing cannot be limited to direct questions or commands addressed to users, because the language of a message can be adapted to the context and the emotional state of users during a real-time conversation. The SMS Phishing dataset from Mendeley Data, intended for machine learning and pattern recognition, was employed in our research because the keywords in the dataset and the machine learning technique were pertinent to our study. The dataset was pre-processed before the classifier was trained. To compare vectorisation methods for feature extraction, three techniques, namely Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TFIDF), and Word2vec, were applied to the pre-processed data.
Three classification models (Random Forest, Logistic Regression, and Gaussian Naïve Bayes) were trained on the dataset to classify messages as phishing or legitimate. Our tests showed that using TFIDF for vectorisation and balancing the data with random oversampling increased classifier performance. Among the three models, the Random Forest classifier classified messages as phishing or not phishing with an accuracy of 99.2% on the dataset. With a dataset devoted to instant messages, the Word2vec method of vectorisation, which reached an accuracy of 95.2% with the Random Forest classifier, might improve further. A dataset dedicated to instant messaging is needed, one that captures contextual relationships between sentences, variations in the linguistic structure used for phishing, and pretexting for phishing. Proactive detection of phishing in instant messages will play a pivotal role in helping a large fraction of society and organisations safeguard both their applications and their valuable customers.
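The two steps the abstract credits with improving performance, TFIDF vectorisation and random oversampling, can be illustrated with a minimal sketch in plain Python. This is not the thesis implementation: the whitespace tokeniser, the smoothed IDF formula, the function names `tokenize`, `tfidf_vectors`, and `random_oversample`, and the four toy messages are all assumptions made for illustration, and the classifier stage is omitted.

```python
import math
import random
from collections import Counter


def tokenize(message):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return message.lower().split()


def tfidf_vectors(messages):
    """Return the sorted vocabulary and one TF-IDF weight list per message."""
    docs = [tokenize(m) for m in messages]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: the number of messages that contain each term.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant) so every term gets a finite weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty messages
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Invented toy messages, not taken from the thesis dataset.
messages = [
    "your account is locked verify now",
    "see you at lunch tomorrow",
    "are we still on for the meeting",
    "thanks for the lift home",
]
labels = ["phishing", "legitimate", "legitimate", "legitimate"]

vocab, vectors = tfidf_vectors(messages)
balanced_rows, balanced_labels = random_oversample(vectors, labels)
print(Counter(balanced_labels))
```

In this toy run the single phishing row is duplicated until both classes hold three rows. In a real experiment, oversampling would normally be applied to the training split only, so that duplicated messages do not leak into the test set and inflate the measured accuracy.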