González-Cebrián, Alba, McGuinness, Luke A., Bradford, Michael, Chis, Adriana E. and González-Vélez, Horacio (2022) Automatic versioning of time series datasets: a FAIR algorithmic approach. In: 2022 IEEE 18th International Conference on e-Science (e-Science). IEEE, pp. 204-213.
Abstract
As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails tracking every version of a dataset, from the original to the latest. However, standard terms for determining and recording versioning information in a dataset's metadata remain ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for the automatic versioning of time series datasets, based on parameters from two dimensionality reduction approaches, namely Principal Component Analysis and Autoencoders. That is to say, we systematically detect and measure similarities (information distances) in datasets via dimensionality reduction, encode them as different versions, and then automatically generate provenance metadata via a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results show that the proposed version similarity metrics are robust $(s^{(0,1)}=1)$ to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g. converting absolute values to a percentage of a second variable) yield minimal similarity values $(s^{(0,1)} < 0.75)$. Our code and datasets are openly available to enable reproducibility.
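To make the idea concrete, the following is a minimal sketch of a PCA-based similarity between two versions of a dataset, comparing the principal-component loadings fitted to each version. The function name `pca_similarity` and the specific comparison (mean absolute cosine similarity of corresponding loading vectors) are illustrative assumptions for this sketch; the paper defines its own metric $s^{(0,1)}$, which is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_similarity(X_old, X_new, n_components=2):
    """Illustrative similarity in [0, 1] between two dataset versions,
    based on agreement of their principal-component loadings.
    NOTE: an assumed stand-in, not the paper's exact s^(0,1) metric."""
    pca_old = PCA(n_components=n_components).fit(X_old)
    pca_new = PCA(n_components=n_components).fit(X_new)
    # Cosine similarity between corresponding loading vectors; the
    # absolute value is taken because the sign of a PC is arbitrary.
    sims = [abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(pca_old.components_, pca_new.components_)]
    return float(np.mean(sims))

# Synthetic example: a small cell-wise perturbation should leave the
# dominant correlation structure, and hence the similarity, high.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_perturbed = X.copy()
X_perturbed[:60] += rng.normal(scale=0.1, size=(60, 5))
```

Under this kind of metric, identical data yields a similarity of 1, while transformations that reshape the correlation structure (such as the row-wise transformations mentioned in the abstract) drive the score down.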