Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling

Ricky Laishram, Katchaguy Areekijseree, Sucheta Soundarajan

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Citations (Scopus)

Abstract

When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node either reveals all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the 'Friends' and 'Follower' connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network, with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms with three real networks, and saw large improvements vs. baseline sampling algorithms: With a budget of 2000, PMD found 15% to 170.2% more nodes than the closest baseline algorithm for the different datasets considered.

Original languageEnglish (US)
Title of host publication2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages940-945
Number of pages6
ISBN (Electronic)9781538627846
DOIs
StatePublished - Nov 20 2017
Event2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017 - Atlanta, United States
Duration: May 1 2017May 4 2017

Other

Other2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017
CountryUnited States
CityAtlanta
Period5/1/175/4/17

Fingerprint

Directed Network
Coverage
Maximise
Sampling
Vertex of a graph
Baseline
Query
Computer Networks
Computer networks
Undirected Graph
Scenarios
Necessary

ASJC Scopus subject areas

  • Hardware and Architecture
  • Control and Optimization
  • Artificial Intelligence
  • Computer Networks and Communications

Cite this

Laishram, R., Areekijseree, K., & Soundarajan, S. (2017). Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling. In 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017 (pp. 940-945). [8116502] Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/INFCOMW.2017.8116502

Predicted max degree sampling : Sampling in directed networks to maximize node coverage through crawling. / Laishram, Ricky; Areekijseree, Katchaguy; Soundarajan, Sucheta.

2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 940-945 8116502.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Laishram, R, Areekijseree, K & Soundarajan, S 2017, Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling. in 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017., 8116502, Institute of Electrical and Electronics Engineers Inc., pp. 940-945, 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017, Atlanta, United States, 5/1/17. https://doi.org/10.1109/INFCOMW.2017.8116502
Laishram R, Areekijseree K, Soundarajan S. Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling. In 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 940-945. 8116502 https://doi.org/10.1109/INFCOMW.2017.8116502
Laishram, Ricky ; Areekijseree, Katchaguy ; Soundarajan, Sucheta. / Predicted max degree sampling : Sampling in directed networks to maximize node coverage through crawling. 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 940-945
@inproceedings{a6a56258e1b345388b1b7ec9fbaf4885,
title = "Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling",
abstract = "When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node either reveals all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the 'Friends' and 'Follower' connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network, with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms with three real networks, and saw large improvements vs. baseline sampling algorithms: With a budget of 2000, PMD found 15{\%} to 170.2{\%} more nodes than the closest baseline algorithm for the different datasets considered.",
author = "Ricky Laishram and Katchaguy Areekijseree and Sucheta Soundarajan",
year = "2017",
month = "11",
day = "20",
doi = "10.1109/INFCOMW.2017.8116502",
language = "English (US)",
pages = "940--945",
booktitle = "2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Predicted max degree sampling

T2 - Sampling in directed networks to maximize node coverage through crawling

AU - Laishram, Ricky

AU - Areekijseree, Katchaguy

AU - Soundarajan, Sucheta

PY - 2017/11/20

Y1 - 2017/11/20

N2 - When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node either reveals all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the 'Friends' and 'Follower' connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network, with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms with three real networks, and saw large improvements vs. baseline sampling algorithms: With a budget of 2000, PMD found 15% to 170.2% more nodes than the closest baseline algorithm for the different datasets considered.

AB - When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node either reveals all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the 'Friends' and 'Follower' connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network, with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms with three real networks, and saw large improvements vs. baseline sampling algorithms: With a budget of 2000, PMD found 15% to 170.2% more nodes than the closest baseline algorithm for the different datasets considered.

UR - http://www.scopus.com/inward/record.url?scp=85041291557&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041291557&partnerID=8YFLogxK

U2 - 10.1109/INFCOMW.2017.8116502

DO - 10.1109/INFCOMW.2017.8116502

M3 - Conference contribution

AN - SCOPUS:85041291557

SP - 940

EP - 945

BT - 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -