Antarex HPC Fault Dataset

The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. We had to acquire our own dataset because commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.
To acquire the data, we executed benchmark applications and, at the same time, injected faults into the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. The dataset covers a wide range of faults: hardware faults, misconfiguration faults, and performance anomalies caused by interference from other processes. Fault injection was performed with the FINJ fault injection tool, developed by the authors.
The dataset contains two types of data. The first is a series of CSV files, each containing a set of system performance metrics sampled through the LDMS HPC monitoring framework. The second is a set of log files detailing the status of the system (i.e., the currently running benchmark applications or injected fault programs) at each time point in the dataset. This structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it can be reproduced on a real HPC system in an online way.
The dataset is divided into two parts: the first includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks, covering 32 days of data acquisition and 20 GB of uncompressed data.
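The two data types described above (metric CSV files and system status logs) can be combined to obtain labelled samples for fault detection. The following is only a minimal sketch of that idea, not the authors' processing code: the column names (`timestamp`, `cpu_user`, `mem_free`, `start`, `duration`, `task`, `is_fault`), the file layouts, and the sample values are all hypothetical and chosen for illustration. The actual formats are documented in the reference papers and in the FINJ repository.

```python
import csv
import io

# Hypothetical metric samples; column names are assumptions, not the real LDMS schema.
METRICS_CSV = """timestamp,cpu_user,mem_free
100,0.10,8000
101,0.95,7900
102,0.97,7800
103,0.12,8000
"""

# Hypothetical status log: one row per executed task (benchmark or fault program).
STATUS_LOG = """start,duration,task,is_fault
100,4,benchmark_a,0
101,2,fault_x,1
"""


def load_intervals(text):
    """Parse the status log into (start, end, task, is_fault) tuples."""
    intervals = []
    for row in csv.DictReader(io.StringIO(text)):
        start = float(row["start"])
        end = start + float(row["duration"])
        intervals.append((start, end, row["task"], row["is_fault"] == "1"))
    return intervals


def label_samples(metrics_text, intervals):
    """Label each metric sample with the fault active at its timestamp, else 'healthy'."""
    labeled = []
    for row in csv.DictReader(io.StringIO(metrics_text)):
        t = float(row["timestamp"])
        # A fault is active when the sample falls in its half-open interval [start, end).
        active = [task for start, end, task, is_fault in intervals
                  if is_fault and start <= t < end]
        labeled.append((t, active[0] if active else "healthy"))
    return labeled


intervals = load_intervals(STATUS_LOG)
labels = label_samples(METRICS_CSV, intervals)
for t, label in labels:
    print(t, label)
```

With real data, the inline strings would be replaced by the downloaded files, and the interval lookup would likely need a sorted index rather than a linear scan, given the size of the dataset.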
For a detailed analysis of the structure and features of the Antarex dataset, please refer to the research paper "Online Fault Classification in HPC Systems through Machine Learning" by Netti et al. Additional details can be found in the research paper "FINJ: A Fault Injection Tool for HPC Systems" by Netti et al., whereas all source code can be found in the GitHub repository of the FINJ tool. When using this dataset, please cite the two reference papers above as follows:
Netti A., Kiziltan Z., Babaoglu O., Sîrbu A., Bartolini A., Borghesi A. (2019) FINJ: A Fault Injection Tool for HPC Systems. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham
Netti A., Kiziltan Z., Babaoglu O., Sîrbu A., Bartolini A., Borghesi A. (2019) Online Fault Classification in HPC Systems through Machine Learning. arXiv:1810.11208

Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field Value
PID https://www.doi.org/10.5281/zenodo.1453949
PID https://www.doi.org/10.5281/zenodo.2551207
PID https://www.doi.org/10.5281/zenodo.2553224
PID https://www.doi.org/10.5281/zenodo.1453948
URL https://zenodo.org/record/2553224
URL http://dx.doi.org/10.5281/zenodo.2553224
URL https://figshare.com/articles/Antarex_HPC_Fault_Dataset/7194806
URL http://dx.doi.org/10.5281/zenodo.2551207
URL https://zenodo.org/record/2551207
URL https://figshare.com/articles/Antarex_HPC_Fault_Dataset/7653929
URL https://figshare.com/articles/Antarex_HPC_Fault_Dataset/7642046
URL http://dx.doi.org/10.5281/zenodo.1453949
URL https://zenodo.org/record/1453949
URL http://dx.doi.org/10.5281/zenodo.1453948
Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field Value
Access Right Open Access
Attribution

Description: Authorships and contributors

Field Value
Author Netti, Alessio
Author Kiziltan, Zeynep
Author Babaoglu, Ozalp
Author Sirbu, Alina
Author Bartolini, Andrea
Author Borghesi, Andrea
Contributor Alessio Netti
Contributor Zeynep Kiziltan
Contributor Ozalp Babaoglu
Contributor Alina Sirbu
Contributor Andrea Bartolini
Contributor Andrea Borghesi
Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field Value
Collected From Zenodo; Datacite; figshare
Hosted By Zenodo; figshare
Publication Date 2018-10-10
Publisher Zenodo
Additional Info
Field Value
Language English
Resource Type Dataset
system:type dataset
Management Info
Field Value
Source https://science-innovation-policy.openaire.eu/search/dataset?datasetId=dedup_wf_001::5191bbc4207d9991b523d63435ebd267
Author jsonws_user
Version 1.0
Last Updated 14 January 2021, 13:03 (CET)
Created 14 January 2021, 13:03 (CET)