TY - JOUR
T1 - Genomic Machine Learning Meta-regression
T2 - Insights on Associations of Study Features With Reported Model Performance
AU - Barnett, Eric J.
AU - Onete, Daniel G.
AU - Salekin, Asif
AU - Faraone, Stephen V.
N1 - Publisher Copyright: © 2004-2012 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
AB - Many studies have been conducted with the goal of correctly predicting diagnostic status of a disorder using the combination of genomic data and machine learning. It is often hard to judge which components of a study led to better results and whether better reported results represent a true improvement or an uncorrected bias inflating performance. We extracted information about the methods used and other differentiating features in genomic machine learning models. We used these features in linear regressions predicting model performance. We tested for univariate and multivariate associations as well as interactions between features. Of the models reviewed, 46% used feature selection methods that can lead to data leakage. Across our models, the number of hyperparameter optimizations reported, data leakage due to feature selection, model type, and modeling an autoimmune disorder were significantly associated with an increase in reported model performance. We found a significant, negative interaction between data leakage and training size. Our results suggest that methods susceptible to data leakage are prevalent among genomic machine learning research, resulting in inflated reported performance. Best practice guidelines that promote the avoidance and recognition of data leakage may help the field avoid biased results.
KW - Machine learning
KW - artificial intelligence
KW - biology and genetics
KW - computer applications
KW - computing methodologies
KW - learning
KW - life and medical sciences
UR - http://www.scopus.com/inward/record.url?scp=85181805030&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85181805030&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2023.3343808
DO - 10.1109/TCBB.2023.3343808
M3 - Article
C2 - 38109236
AN - SCOPUS:85181805030
SN - 1545-5963
VL - 21
SP - 169
EP - 177
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 1
ER -