TY - GEN
T1 - Predicting the Usage of Scientific Datasets Based on Article, Author, Institution, and Journal Bibliometrics
AU - Acuna, Daniel E.
AU - Yi, Zijun
AU - Liang, Lizhen
AU - Zhuang, Han
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Scientific datasets are increasingly crucial for knowledge accumulation and reproducibility, making it essential to understand how they are used. Although usage information is hard to obtain, features from the publications that describe a dataset can provide clues. This article associates dataset downloads with the authors’ h-index, institutional prestige, journal ranking, and the references used in the publication that first introduces them. Tens of thousands of datasets and associated publications from figshare.com are used in our analysis. We found that a gradient boosting model achieved the highest performance against linear regression, random forests, and artificial neural networks. Our interpretation results suggest that journal ranking is highly predictive of usage while the author’s institutional prestige and h-index are less critical. In addition, we found that publications with a long but focused body of references are associated with more dataset downloads. We also show that prediction performance decays rapidly the farther we estimate downloads into the future. Finally, we discuss the implications of our work for reproducibility and data policies.
KW - Bibliometrics
KW - Dataset usage
KW - Prediction
KW - Prestige
KW - Science of science
UR - http://www.scopus.com/inward/record.url?scp=85126262870&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126262870&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-96957-8_5
DO - 10.1007/978-3-030-96957-8_5
M3 - Conference contribution
AN - SCOPUS:85126262870
SN - 9783030969561
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 42
EP - 52
BT - Information for a Better World
A2 - Smits, Malte
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th International Conference on Information for a Better World: Shaping the Global Future, iConference 2022
Y2 - 28 February 2022 through 4 March 2022
ER -