NORMA eResearch @NCI Library

Automatic versioning of time series datasets: a FAIR algorithmic approach

González-Cebrián, Alba, McGuinness, Luke A., Bradford, Michael, Chis, Adriana E. and González-Vélez, Horacio (2022) Automatic versioning of time series datasets: a FAIR algorithmic approach. In: 2022 IEEE 18th International Conference on e-Science (e-Science). IEEE, pp. 204-213.

Official URL: https://doi.org/10.1109/eScience55777.2022.00034

Abstract

As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails keeping track of each version of a given dataset, from its original to its latest version. However, standard terms for determining and recording versioning information in the metadata of a given dataset are still ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for automatic versioning of time series datasets, based on parameters from two dimensionality reduction approaches, namely Principal Component Analysis and Autoencoders. Specifically, we systematically detect and measure similarities (information distances) between datasets via dimensionality reduction, encode them as different versions, and then automatically generate provenance metadata via a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results show that the proposed version similarity metrics are robust $(s^{(0,1)}=1)$ to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g. converting absolute values to a percentage of a second variable) yield markedly lower similarity values $(s^{(0,1)} < 0.75)$. Our code and datasets are openly available to enable reproducibility.
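
The idea of comparing two dataset versions through dimensionality reduction parameters can be illustrated with a minimal sketch. This is not the similarity metric defined in the paper: it is an assumed stand-in that compares the subspaces spanned by the leading PCA loadings of two versions, using the mean squared cosine of their principal angles. The function names (`pca_loadings`, `subspace_similarity`), the number of components `k`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_loadings(X, k):
    """Return the top-k PCA loading vectors of X as columns, shape (n_features, k)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid division by zero
    scaled = centered / std
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return vt[:k].T


def subspace_similarity(X_a, X_b, k=2):
    """Illustrative similarity in [0, 1] between two versions with the same columns.

    Assumption: similarity is the mean squared cosine of the principal angles
    between the two k-dimensional PCA subspaces (1 means identical subspaces).
    """
    P_a = pca_loadings(X_a, k)
    P_b = pca_loadings(X_b, k)
    # Singular values of P_a^T P_b are the cosines of the principal angles.
    cosines = np.linalg.svd(P_a.T @ P_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.mean(cosines ** 2))


# Synthetic example: five variables driven by two latent factors plus noise.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0], [0.2, -0.9], [0.5, 0.5]])
version_0 = rng.normal(size=(200, 2)) @ mixing.T + 0.05 * rng.normal(size=(200, 5))

# Hypothetical new version: 60% of the rows removed at random.
kept_rows = rng.choice(200, size=80, replace=False)
version_1 = version_0[kept_rows]

print(subspace_similarity(version_0, version_0))
print(subspace_similarity(version_0, version_1))
```

Under this assumed measure, a version compared with itself scores 1, and removing rows leaves the score high as long as the remaining rows preserve the correlation structure. Whether it reproduces the thresholds reported in the abstract is not something this sketch establishes.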

Item Type: Book Section
Additional Information: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4450 Databases
Q Science > QA Mathematics > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science > Computer Systems > Information Storage and Retrieval Systems
Divisions: School of Computing > Staff Research and Publications
Depositing User: Tamara Malone
Date Deposited: 04 Jan 2023 15:16
Last Modified: 05 Jan 2023 10:23
URI: https://norma.ncirl.ie/id/eprint/6059
