NORMA eResearch @NCI Library

Standardised Versioning of Datasets: a FAIR–compliant Proposal

González-Cebrián, Alba, Bradford, Michael, Chis, Adriana E. and González-Vélez, Horacio (2024) Standardised Versioning of Datasets: a FAIR–compliant Proposal. Scientific Data, 11. pp. 1-15. ISSN 2052-4463

Official URL: https://doi.org/10.1038/s41597-024-03153-y

Abstract

This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature (“major.minor.patch”) and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_{E,PCA}, and d_{E,AE}) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_{E,PCA} metric, combining PCA models with splines. It exhibits efficient computational time, with values below 50 for new dataset batches and values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables while efficiently handling information loss, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
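The “major.minor.patch” nomenclature mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the class name DatasetVersion, its methods, and the rule that a higher-level bump resets the lower fields are assumptions borrowed from software semantic versioning. Which dataset event or drift value should trigger which level is deliberately left to the caller, since the abstract does not spell out that mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical 'major.minor.patch' label for a dataset (illustrative only)."""
    major: int = 1
    minor: int = 0
    patch: int = 0

    @classmethod
    def parse(cls, text: str) -> "DatasetVersion":
        # Expect exactly three dot-separated integers, e.g. "1.4.2".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, level: str) -> "DatasetVersion":
        # Assumption: as in software semantic versioning, bumping a field
        # resets every field to its right to zero.
        if level == "major":
            return DatasetVersion(self.major + 1, 0, 0)
        if level == "minor":
            return DatasetVersion(self.major, self.minor + 1, 0)
        if level == "patch":
            return DatasetVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown level: {level!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

For instance, DatasetVersion.parse("1.4.2").bump("minor") yields a version that prints as 1.5.0, and bumping "major" on the same input yields 2.0.0.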

Item Type: Article
Additional Information: Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Uncontrolled Keywords: Research data; Research management; Technology
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QA Mathematics > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Staff Research and Publications
Depositing User: Tamara Malone
Date Deposited: 09 Apr 2024 13:40
Last Modified: 18 Dec 2024 11:35
URI: https://norma.ncirl.ie/id/eprint/6975
