Predicting the Usage of Scientific Datasets Based on Article, Author, Institution, and Journal Bibliometrics

Daniel E. Acuna, Zijun Yi, Lizhen Liang, Han Zhuang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Scientific datasets are increasingly crucial for knowledge accumulation and reproducibility, making it essential to understand how they are used. Although usage information is hard to obtain, features from the publications that describe a dataset can provide clues. This article associates dataset downloads with the authors’ h-index, institutional prestige, journal ranking, and the references used in the publication that first introduces them. Tens of thousands of datasets and associated publications from figshare.com are used in our analysis. We found that a gradient boosting model achieved the highest performance against linear regression, random forests, and artificial neural networks. Our interpretation results suggest that journal ranking is highly predictive of usage while the author’s institutional prestige and h-index are less critical. In addition, we found that publications with a long but focused body of references are associated with more dataset downloads. We also show that prediction performance decays rapidly the farther we estimate downloads into the future. Finally, we discuss the implications of our work for reproducibility and data policies.

Original language: English (US)
Title of host publication: Information for a Better World
Subtitle of host publication: Shaping the Global Future - 17th International Conference, iConference 2022, Proceedings
Editors: Malte Smits
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 42-52
Number of pages: 11
ISBN (Print): 9783030969561
State: Published - 2022
Externally published: Yes
Event: 17th International Conference on Information for a Better World: Shaping the Global Future, iConference 2022 - Virtual, Online
Duration: Feb 28 2022 → Mar 4 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13192 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Information for a Better World: Shaping the Global Future, iConference 2022
City: Virtual, Online
Period: 2/28/22 → 3/4/22

Keywords

  • Bibliometrics
  • Dataset usage
  • Prediction
  • Prestige
  • Science of science

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)
