TY - GEN
T1 - Guidelines for online network crawling
T2 - 10th ACM Conference on Web Science, WebSci 2018
AU - Areekijseree, Katchaguy
AU - Laishram, Ricky
AU - Soundarajan, Sucheta
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/5/15
Y1 - 2018/5/15
AB - Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.
KW - Complex networks
KW - Experiments
KW - Network crawling
KW - Network sampling
KW - Online sampling algorithm
UR - http://www.scopus.com/inward/record.url?scp=85049393457&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049393457&partnerID=8YFLogxK
U2 - 10.1145/3201064.3201066
DO - 10.1145/3201064.3201066
M3 - Conference contribution
AN - SCOPUS:85049393457
T3 - WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science
SP - 57
EP - 66
BT - WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science
PB - Association for Computing Machinery, Inc
Y2 - 27 May 2018 through 30 May 2018
ER -