González-Cebrián, Alba, Bradford, Michael, Chis, Adriana E. and González-Vélez, Horacio (2024) Standardised Versioning of Datasets: a FAIR–compliant Proposal. Scientific Data, 11. pp. 1-15. ISSN 2052-4463
Abstract
This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature (“major.minor.patch”) and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (dP, dE,PCA, and dE,AE) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the dE,PCA metric, which combines PCA models with splines. It is computationally efficient, with values below 50 for new dataset batches and values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables, whereas information loss is handled efficiently, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
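As a rough illustration of how a drift value could drive a “major.minor.patch” label, the sketch below bumps a version string from a drift score. It is not the mapping defined in the paper: the function names (`parse_version`, `bump_version`), the assumption that the score lies between 0 and 100, and the `minor_at` threshold of 50 are all hypothetical choices made for this example. Only the association of a value of 100 with a major update comes from the abstract.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_version(
    version: str,
    drift: float,
    major_at: float = 100.0,  # the abstract associates values of 100 with major updates
    minor_at: float = 50.0,   # hypothetical threshold, chosen only for illustration
) -> str:
    """Return the next version label for a drift score assumed to lie in [0, 100]."""
    if not 0.0 <= drift <= 100.0:
        raise ValueError("drift is assumed to lie in the range [0, 100]")
    major, minor, patch = parse_version(version)
    if drift >= major_at:
        # Large drift: start a new major version and reset the lower fields.
        return f"{major + 1}.0.0"
    if drift >= minor_at:
        # Moderate drift: bump the minor field and reset the patch field.
        return f"{major}.{minor + 1}.0"
    # Small drift: record the change as a patch.
    return f"{major}.{minor}.{patch + 1}"
```

For example, under these assumed thresholds `bump_version("1.2.3", 100.0)` gives `"2.0.0"`, while `bump_version("1.2.3", 0.5)` gives `"1.2.4"`.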
Item Type: | Article |
---|---|
Additional Information: | Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
Uncontrolled Keywords: | Research data; Research management; Technology |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Q Science > QA Mathematics > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
Divisions: | School of Computing > Staff Research and Publications |
Depositing User: | Tamara Malone |
Date Deposited: | 09 Apr 2024 13:40 |
Last Modified: | 18 Dec 2024 11:35 |
URI: | https://norma.ncirl.ie/id/eprint/6975 |