TY - GEN
T1 - Analyzing a Data Science Online Practitioner Community
T2 - 2022 IEEE International Conference on Big Data, Big Data 2022
AU - Tacheva, Jasmina
AU - Lahiri, Sucheta
AU - Saltz, Jeffrey
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - The overarching goal of this research was to gain an understanding of what the data science Reddit online community discussed before, during, and after COVID-19. We used a publicly available Reddit API to harvest the r/datascience subreddit first-level post data. We then performed manual annotation to explore the taxonomy of trends and themes discussed by practitioners in the Reddit data science community. Then, we augmented the manually annotated data using a BERT model with topic modeling. In short, the key discussion themes, in order of frequency, were: Education, Jobs, Methods (of data science), Hardware and data collection, Data visualization, and Quality. The Quality theme includes discussions on bias, transparency, and fairness. Hence, a key finding was that there were very few discussions on data science project quality, especially on minimizing the risk of machine learning bias. As discussions on bias are not yet common, data science teams should proactively identify and address potential questions and concerns that might arise in data science projects, especially the need to increase the team's focus on potential bias and fairness.
KW - CoP
KW - Data Science
KW - Online Communities
KW - Project Management
KW - community of practice
UR - http://www.scopus.com/inward/record.url?scp=85147903251&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147903251&partnerID=8YFLogxK
U2 - 10.1109/BigData55660.2022.10020600
DO - 10.1109/BigData55660.2022.10020600
M3 - Conference contribution
AN - SCOPUS:85147903251
T3 - Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022
SP - 2673
EP - 2681
BT - Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022
A2 - Tsumoto, Shusaku
A2 - Ohsawa, Yukio
A2 - Chen, Lei
A2 - Van den Poel, Dirk
A2 - Hu, Xiaohua
A2 - Motomura, Yoichi
A2 - Takagi, Takuya
A2 - Wu, Lingfei
A2 - Xie, Ying
A2 - Abe, Akihiro
A2 - Raghavan, Vijay
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 December 2022 through 20 December 2022
ER -