A dataset to facilitate automated workflow analysis

Datasets that provide a ground truth for quantifying the efficacy of automated algorithms are rare because manually annotating observations, although highly valuable, is time-consuming and expensive. Such datasets exist for niche problems in developed fields such as Natural Language Processing (NLP) and Business Process Mining (BPM); however, it is difficult to find a suitable dataset for use cases that span multiple fields, such as the one described in this study. The lack of established ground truth maps between cyberspace and the human-interpretable, persona-driven tasks that occur therein is one of the principal barriers preventing reliable, automated situation awareness of dynamically evolving events and the consequences of loss due to cybersecurity breaches. Automated workflow analysis (the machine-learning-assisted identification of templates of repeated tasks) is the likely missing link between semantic descriptions of mission goals and observable events in cyberspace. We summarize our efforts to establish a ground truth for an email dataset pertaining to the operation of an open-source software project. The ground truth defines semantic labels for each email and the arrangement of emails within sequences that describe actions observed in the dataset. Identified sequences are then used to define template workflows that describe the possible tasks undertaken for a project and their business process model. We present the overall purpose of the dataset, the methodology for establishing a ground truth, and lessons learned from the effort. Finally, we report on the proposed use of the dataset for the workflow discovery problem and its effect on system accuracy.

Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field Value
PID https://www.doi.org/10.1371/journal.pone.0211486
PID pmc:PMC6366754
PID pmid:30730921
URL http://dx.doi.org/10.1371/journal.pone.0211486
URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0211486
URL https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0211486&type=printable
URL https://doi.org/10.1371/journal.pone.0211486
URL https://doaj.org/toc/1932-6203
URL http://europepmc.org/articles/PMC6366754
URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6366754
URL http://dx.plos.org/10.1371/journal.pone.0211486
URL https://academic.microsoft.com/#/detail/2911409138
Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field Value
Access Right Open Access
Attribution

Description: Authorships and contributors

Field Value
Author Allard, Tony
Author Alvino, Paul, 0000-0002-5550-7541
Author Shing, Leslie, 0000-0003-3677-6698
Author Wollaber, Allan, 0000-0001-5997-9610
Author Yuen, Joseph
Contributor Olier, Ivan
Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field Value
Collected From PubMed Central; ORCID; Datacite; UnpayWall; DOAJ-Articles; Crossref; Microsoft Academic Graph
Hosted By Europe PubMed Central; PLoS ONE
Publication Date 2019-02-07
Publisher Public Library of Science (PLoS)
Additional Info
Field Value
Language UNKNOWN
Resource Type Other literature type; Article
keyword Q
keyword R
keyword General Biochemistry, Genetics and Molecular Biology
system:type publication
Management Info
Field Value
Source https://science-innovation-policy.openaire.eu/search/publication?articleId=dedup_wf_001::f4d23361cc3675fbb72ecdc2f2419225
Author jsonws_user
Last Updated 22 December 2020, 17:49 (CET)
Created 22 December 2020, 17:49 (CET)