Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models

Background: Prediction models are used in clinical research to develop rules that accurately predict patient outcomes from some of the patients' characteristics. They are a valuable tool in the decision-making process of clinicians and health policy makers, as they make it possible to estimate the probability that patients have or will develop a disease, will respond to a treatment, or will experience a recurrence of their disease. Interest in prediction models has been growing in the biomedical community in the last few years. The data used to develop prediction models are often class-imbalanced, as only a few patients experience the event (and therefore belong to the minority class).

Results: Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include undersampling, in which only a fraction of the majority class samples is retained in the analysis, and oversampling, in which new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed.

Conclusions: We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided. We identify some results from the biomedical literature where incorrect cross-validation was performed and where we expect that the performance of oversampling techniques was heavily overestimated.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0784-9) contains supplementary material, which is available to authorized users.
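The pitfall described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or experimental setup: the pure-noise data, the 1-nearest-neighbour classifier, random oversampling by duplication, and all function names are assumptions chosen only to make the effect visible with the Python standard library. Because the features carry no information about the class, any high minority-class accuracy is an artifact of the validation scheme.

```python
import random

def make_noise_data(n_major, n_minor, n_features, rng):
    """Pure-noise features: no classifier can truly beat chance here."""
    n = n_major + n_minor
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [0] * n_major + [1] * n_minor  # label 1 is the minority class
    return X, y

def oversample(X, y, rng):
    """Random oversampling: duplicate minority samples until classes balance."""
    minority = [i for i, label in enumerate(y) if label == 1]
    n_extra = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def predict_1nn(X_train, y_train, x):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return y_train[nearest]

def cv_minority_accuracy(X, y, k, rng, oversample_inside):
    """k-fold CV estimate of accuracy on minority-class test samples."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    hits = total = 0
    for fold in range(k):
        test_idx = idx[fold::k]
        test = set(test_idx)
        X_tr = [X[i] for i in idx if i not in test]
        y_tr = [y[i] for i in idx if i not in test]
        if oversample_inside:
            # Correct: sampling touches the training fold only.
            X_tr, y_tr = oversample(X_tr, y_tr, rng)
        for i in test_idx:
            if y[i] == 1:
                total += 1
                hits += predict_1nn(X_tr, y_tr, X[i]) == 1
    return hits / total

rng = random.Random(0)
X, y = make_noise_data(n_major=80, n_minor=20, n_features=5, rng=rng)

# Incorrect: oversample the whole data set first, then cross-validate.
# Copies of a test sample can end up in the training folds.
X_os, y_os = oversample(X, y, rng)
acc_incorrect = cv_minority_accuracy(X_os, y_os, 5, rng, oversample_inside=False)

# Correct: split first, oversample inside each training fold.
acc_correct = cv_minority_accuracy(X, y, 5, rng, oversample_inside=True)

print("minority accuracy, oversampling before CV:", acc_incorrect)
print("minority accuracy, oversampling inside CV:", acc_correct)
```

In this sketch the first estimate is inflated because duplicated minority samples land in both the training and test folds, so the nearest neighbour of a test sample is often its own copy. Applying the sampling step inside each training fold keeps the held-out samples independent of the training data.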

Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field Value
PID https://www.doi.org/10.1186/s12859-015-0784-9
PID pmc:PMC4634915
PID pmid:26537827
URL https://academic.microsoft.com/#/detail/2137195266
URL http://dx.doi.org/10.1186/s12859-015-0784-9
URL https://core.ac.uk/display/81089795
URL https://dx.doi.org/10.1186/s12859-015-0784-9
URL https://link.springer.com/article/10.1186/s12859-015-0784-9
URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634915/
URL http://europepmc.org/articles/PMC4634915
URL https://dblp.uni-trier.de/db/journals/bmcbi/bmcbi16.html#BlagusL15a
URL http://link.springer.com/content/pdf/10.1186/s12859-015-0784-9.pdf
URL http://link.springer.com/article/10.1186/s12859-015-0784-9/fulltext.html
URL https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-015-0784-9
URL https://doi.org/10.1186%2Fs12859-015-0784-9
URL https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0784-9
URL https://paperity.org/p/74509186/joint-use-of-over-and-under-sampling-techniques-and-cross-validation-for-the-development
Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field Value
Access Right Open Access
Attribution

Description: Authorships and contributors

Field Value
Author Lara Lusa, 0000-0002-8981-2421
Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field Value
Collected From Europe PubMed Central; PubMed Central; ORCID; UnpayWall; Datacite; Crossref; Microsoft Academic Graph; CORE (RIOXX-UK Aggregator)
Hosted By Europe PubMed Central; SpringerOpen; BMC Bioinformatics
Journal BMC Bioinformatics, 16
Publication Date 2015-11-04
Publisher Springer Science and Business Media LLC
Additional Info
Field Value
Language Undetermined
Resource Type Article; UNKNOWN
system:type publication
Management Info
Field Value
Source https://science-innovation-policy.openaire.eu/search/publication?articleId=dedup_wf_001::36d52da6be96460c53db1b2c26455497
Author jsonws_user
Last Updated 25 December 2020, 16:07 (CET)
Created 25 December 2020, 16:07 (CET)