NORMA eResearch @NCI Library

Smart Synthetic Data as Solution to the Limitations of Conventional Anonymization Means in Big Data

Wang, Bingwei (2022) Smart Synthetic Data as Solution to the Limitations of Conventional Anonymization Means in Big Data. Masters thesis, Dublin, National College of Ireland.

PDF (Master of Science), 1MB
PDF (Configuration manual), 3MB

Abstract

Ideally, data anonymization as a means of protecting privacy and promoting data security should remove all personally identifiable information while preserving the information needed to use the data without invading privacy. In practice, however, anonymized data neither guarantees privacy nor retains the key useful information. As a tool, it carries several risks and limitations, especially in Big Data applications, where absolute privacy protection has to be traded off against actual data utility. Smart synthetic data can offer predictive power similar to or better than that of real data while being free of the privacy problems present in the original data. In this study, a conditional Generative Adversarial Network (cGAN) is used to generate anonymous synthetic data, and the generated data is checked to see whether its characteristics are close to those of the real data. The results indicate that cGAN can generate synthetic data that eliminates the risk of privacy and confidentiality violations in the use of big data while making the data shareable and hence maximizing its value. The synthetic data generated by cGAN is not only similar in distribution to the original data but also close to real data in machine learning performance. In the final results, the smart synthetic data generated by the method used in this paper showed improvements of 0.93%, 0.39% and 1.6% respectively in the three machine learning algorithms, and accuracy was sometimes improved by more than 5.0% after optimization in the isolation forest algorithm. The results show that data synthesized by cGAN can replace anonymized data to protect user privacy.
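
The abstract does not say which measure was used to judge whether the synthetic distribution is close to the original. As a minimal illustration only, and not the thesis's method, the sketch below compares one numeric column using a two-sample Kolmogorov-Smirnov statistic written in plain Python. The function name `ks_statistic` and the Gaussian test columns are assumptions made for this example; no data from the thesis is used.

```python
import bisect
import random


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 means identical empirical distributions, 1.0 means
    the samples do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    max_gap = 0.0
    # The gap between two step functions can only change at a sample point,
    # so checking every observed value is enough to find the maximum.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, x) / n
        cdf_synth = bisect.bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical stand-ins for one numeric column (illustrative values only).
rng = random.Random(0)
real = [rng.gauss(50.0, 10.0) for _ in range(2000)]
good_synthetic = [rng.gauss(50.0, 10.0) for _ in range(2000)]  # same distribution
poor_synthetic = [rng.gauss(65.0, 10.0) for _ in range(2000)]  # shifted mean

ks_good = ks_statistic(real, good_synthetic)
ks_poor = ks_statistic(real, poor_synthetic)
print(f"KS (well-matched synthetic): {ks_good:.3f}")
print(f"KS (shifted synthetic):      {ks_poor:.3f}")
```

A smaller statistic means the synthetic column tracks the real one more closely. A per-column check like this says nothing about relationships between columns or about privacy, so it would complement, not replace, the machine learning comparison described in the abstract.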

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Tamara Malone
Date Deposited: 14 Mar 2023 15:12
Last Modified: 14 Mar 2023 15:12
URI: https://norma.ncirl.ie/id/eprint/6340
