González-Cebrián, Alba, Ciolacu, Iulian, Bradford, Michael, Dobre, Ciprian and González-Vélez, Horacio (2024) Data Drift for Automatic FAIR-compliant Dataset Versioning in Large Repositories. In: 2024 IEEE 20th International Conference on e-Science (e-Science). IEEE, Osaka, Japan, pp. 1-10. ISBN 979-8-3503-6562-7
Abstract
Construed as a shift in the distribution or structure of data over time, data drift can adversely affect the performance of machine learning models and data-driven decisions. This study examines two data drift metrics, denoted as $d_{E,\mathrm{PCA}}$ and $d_{E,\mathrm{AE}}$, that are derived from the reconstruction error of two unsupervised ML models: Principal Component Analysis (PCA) and Autoencoders (AE). To investigate the robustness of these metrics, we have systematically accessed time-series datasets from the European Data Portal. Our experiments have examined data versioning through three basic events: creation, update, and deletion. The results are summarised and aggregated for all datasets, and unsupervised analysis based on Robust PCA and AE has been performed to examine patterns within the impact of dataset characteristics on data drift detection and computational efficiency. Our results indicate that both metrics aligned closely in performance with new records, suggesting consistent drift detection under normal conditions with FAIR compliance. However, high-dimensional datasets posed challenges for both PCA and AE models. Update events revealed discrepancies between the two metrics, suggesting that non-linear shifts affected AE-based metrics more than PCA-based ones. Deletion events demonstrated the resilience of these metrics against data loss, but also revealed variability in the reliability of the PCA model. In short, data drift metrics derived from PCA and AE can be effective but are sensitive to certain dataset characteristics.
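To illustrate the general idea of a reconstruction error-based drift signal mentioned in the abstract, the following is a minimal sketch and not the paper's definition of $d_{E,\mathrm{PCA}}$. It assumes a one-component PCA fitted by power iteration on a reference batch, and it compares the mean squared reconstruction error of later batches against that reference. The function names (`fit_pca_1d`, `mean_reconstruction_error`, `make_batch`), the synthetic data, and the comparison by raw error are all hypothetical choices made for illustration.

```python
import math
import random


def fit_pca_1d(rows, iters=200):
    """Fit a one-component PCA by power iteration.

    Returns (mean, direction). Assumes the covariance matrix is non-zero
    and the starting vector is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def mean_reconstruction_error(rows, mean, direction):
    """Mean squared residual after projecting rows onto the fitted component."""
    total = 0.0
    for r in rows:
        c = [x - m for x, m in zip(r, mean)]
        score = sum(ci * vi for ci, vi in zip(c, direction))
        residual = [ci - score * vi for ci, vi in zip(c, direction)]
        total += sum(e * e for e in residual)
    return total / len(rows)


def make_batch(rng, n, slope):
    """Synthetic two-column batch: y = slope * x + small Gaussian noise."""
    batch = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        batch.append([x, slope * x + rng.gauss(0.0, 0.1)])
    return batch


rng = random.Random(0)
reference = make_batch(rng, 500, slope=2.0)   # original dataset version
same = make_batch(rng, 500, slope=2.0)        # new records, same structure
drifted = make_batch(rng, 500, slope=-2.0)    # new records, changed structure

mean, direction = fit_pca_1d(reference)
reference_error = mean_reconstruction_error(reference, mean, direction)
same_error = mean_reconstruction_error(same, mean, direction)
drifted_error = mean_reconstruction_error(drifted, mean, direction)

print("reference:", reference_error)
print("same structure:", same_error)
print("drifted structure:", drifted_error)
```

In this sketch, a batch drawn from the same process reconstructs about as well as the reference, while a batch whose column relationship has changed reconstructs much worse. An autoencoder-based variant would follow the same pattern with a non-linear model in place of the PCA projection.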
Item Type: | Book Section |
---|---|
Additional Information: | © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
Uncontrolled Keywords: | Data drift; Machine Learning; FAIR principles; Principal Component Analysis; Autoencoders; Data versioning; Time series; Dataset Versioning |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources; Q Science > QA Mathematics > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
Divisions: | School of Computing > Staff Research and Publications |
Depositing User: | Tamara Malone |
Date Deposited: | 23 Sep 2024 11:21 |
Last Modified: | 23 Sep 2024 11:21 |
URI: | https://norma.ncirl.ie/id/eprint/7065 |