NORMA eResearch @NCI Library

Knowledge Discovery in Healthcare databases: feature selection in diabetes classification

Nicolenco, Svetlana (2016) Knowledge Discovery in Healthcare databases: feature selection in diabetes classification. Masters thesis, Dublin, National College of Ireland.

Artificial Intelligence may enhance and complement Medical Intelligence to deliver the healthcare of the 21st century. Medical databases accumulate vast amounts of data, holding potential remedies to many diseases. Data mining adds new dimensions and opportunities to the existing statistical approaches in the medical domain. A better set of predictors is needed for the diagnosis and classification of medical conditions, and that is where feature selection becomes indispensable. The National Health and Nutrition Examination Survey (NHANES) data was utilised in this research, with demographic data, details of laboratory tests and food components data accessed for knowledge discovery. Diabetes is one of the main causes of disability and death worldwide, and one of the disorders whose causes are poorly understood. This was the main motivation for exploring the data from a data scientist's perspective. Blood glucose level (serum glucose) was selected as the target feature, as it is the main factor in the identification of diabetes. Algorithms with feature selection may work with predictors that were never considered before but could help improve accuracy. Within the R environment, the "gbm", "glmnet", "svmRadial" and "nnet" models were applied for feature selection in this research. The features selected by the different algorithms were compared using the Receiver Operating Characteristic (ROC), sensitivity, specificity and kappa. The combined analysis revealed important features for predicting glucose level: blood osmolality, blood sodium and blood phosphorus level. Predictive accuracy was validated on a test set using ROC curves, reaching almost 87% with all modelling techniques. The Pima Indian Diabetes data set was chosen as a reference for the proposed model; the same methods attained an accuracy of 76% on it.
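
As an illustration of the evaluation measures named in the abstract, the following minimal Python sketch computes sensitivity, specificity, accuracy and Cohen's kappa from binary labels. It is not taken from the thesis, which worked in R; the function name `binary_metrics` and the toy labels are assumptions made purely for this example.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Cohen's kappa for 0/1 labels.

    Illustrative helper (hypothetical name, not from the thesis).
    Assumes both classes occur in y_true and chance agreement is below 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    n = tp + tn + fp + fn

    sensitivity = tp / (tp + fn)  # share of actual positives detected
    specificity = tn / (tn + fp)  # share of actual negatives detected
    accuracy = (tp + tn) / n      # observed agreement

    # Agreement expected by chance, from the marginal totals.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Toy labels invented for this example; they are not NHANES or Pima data.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
toy_result = binary_metrics(toy_true, toy_pred)
```

Kappa is reported alongside accuracy because it discounts the agreement a classifier would reach by chance, which matters when one class is much more common than the other.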

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
R Medicine > Healthcare Industry
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Caoimhe Ní Mhaicín
Date Deposited: 03 Dec 2016 14:22
Last Modified: 03 Dec 2016 14:22
