Guidelines for online network crawling: A study of data collection approaches and network properties

Katchaguy Areekijseree, Ricky Laishram, Sucheta Soundarajan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.

Original language: English (US)
Title of host publication: WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science
Publisher: Association for Computing Machinery, Inc
Pages: 57-66
Number of pages: 10
ISBN (Electronic): 9781450355636
DOI: https://doi.org/10.1145/3201064.3201066
State: Published - May 15 2018
Event: 10th ACM Conference on Web Science, WebSci 2018 - Amsterdam, Netherlands
Duration: May 27 2018 to May 30 2018

Other

Other: 10th ACM Conference on Web Science, WebSci 2018
Country: Netherlands
City: Amsterdam
Period: 5/27/18 to 5/30/18

Fingerprint

Application programming interfaces (API)
Sampling
Bandwidth
Experiments

Keywords

  • Complex networks
  • Experiments
  • Network crawling
  • Network sampling
  • Online sampling algorithm

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Areekijseree, K., Laishram, R., & Soundarajan, S. (2018). Guidelines for online network crawling: A study of data collection approaches and network properties. In WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science (pp. 57-66). Association for Computing Machinery, Inc. https://doi.org/10.1145/3201064.3201066

Guidelines for online network crawling: A study of data collection approaches and network properties. / Areekijseree, Katchaguy; Laishram, Ricky; Soundarajan, Sucheta.

WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science. Association for Computing Machinery, Inc, 2018. p. 57-66.

Areekijseree, K, Laishram, R & Soundarajan, S 2018, Guidelines for online network crawling: A study of data collection approaches and network properties. in WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science. Association for Computing Machinery, Inc, pp. 57-66, 10th ACM Conference on Web Science, WebSci 2018, Amsterdam, Netherlands, 5/27/18. https://doi.org/10.1145/3201064.3201066
Areekijseree K, Laishram R, Soundarajan S. Guidelines for online network crawling: A study of data collection approaches and network properties. In WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science. Association for Computing Machinery, Inc. 2018. p. 57-66 https://doi.org/10.1145/3201064.3201066
Areekijseree, Katchaguy; Laishram, Ricky; Soundarajan, Sucheta. / Guidelines for online network crawling: A study of data collection approaches and network properties. WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science. Association for Computing Machinery, Inc, 2018. pp. 57-66
@inproceedings{1549908497b949e7840bd1236627cb07,
title = "Guidelines for online network crawling: A study of data collection approaches and network properties",
abstract = "Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.",
keywords = "Complex networks, Experiments, Network crawling, Network sampling, Online sampling algorithm",
author = "Katchaguy Areekijseree and Ricky Laishram and Sucheta Soundarajan",
year = "2018",
month = "5",
day = "15",
doi = "10.1145/3201064.3201066",
language = "English (US)",
pages = "57--66",
booktitle = "WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science",
publisher = "Association for Computing Machinery, Inc",
}

TY - GEN

T1 - Guidelines for online network crawling

T2 - A study of data collection approaches and network properties

AU - Areekijseree, Katchaguy

AU - Laishram, Ricky

AU - Soundarajan, Sucheta

PY - 2018/5/15

Y1 - 2018/5/15

N2 - Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.

KW - Complex networks

KW - Experiments

KW - Network crawling

KW - Network sampling

KW - Online sampling algorithm

UR - http://www.scopus.com/inward/record.url?scp=85049393457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049393457&partnerID=8YFLogxK

U2 - 10.1145/3201064.3201066

DO - 10.1145/3201064.3201066

M3 - Conference contribution

AN - SCOPUS:85049393457

SP - 57

EP - 66

BT - WebSci 2018 - Proceedings of the 10th ACM Conference on Web Science

PB - Association for Computing Machinery, Inc

ER -