Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud

Kisung Lee, Ling Liu, Yuzhe Tang, Qi Zhang, Yang Zhou

Research output: Contribution to journal, Conference article, peer-review

35 Scopus citations

Abstract

Big data business can leverage and benefit from the Clouds, the most optimized, shared, automated, and virtualized computing infrastructures. One of the important challenges in processing big data in the Clouds is how to effectively partition the big data to ensure efficient distributed processing of the data. In this paper, we present a Scalable and yet customizable data PArtitioning framework, called SPA, for distributed processing of big RDF graph data. We choose big RDF datasets as the focus of our investigation for two reasons. First, the Linking Open Data cloud has put forward a good number of big RDF datasets with tens of billions of triples and hundreds of millions of links. Second, such huge RDF graphs can easily overwhelm any single server due to its limited memory and CPU capacity, and can exceed the processing capacity of many conventional data processing software systems. Our data partitioning framework has two unique features. First, we introduce a suite of vertex-centric data partitioning building blocks to allow efficient and yet customizable partitioning of large heterogeneous RDF graph data. By efficient, we mean that the SPA data partitions can support fast processing of big data of different sizes and complexity. By customizable, we mean that the SPA partitions are adaptive to different query types. Second, we propose a selection of scalable techniques to distribute the building block partitions across a cluster of compute nodes in a manner that minimizes inter-node communication cost by localizing most of the queries on distributed partitions. We evaluate our data partitioning framework and algorithms through extensive experiments using both benchmark and real datasets. Our experimental results show that the SPA data partitioning framework is not only efficient for partitioning and distributing big RDF datasets of diverse sizes and structures but also effective for processing big data queries of different types and complexity.

Original language: English (US)
Article number: 6676711
Pages (from-to): 327-334
Number of pages: 8
Journal: IEEE International Conference on Cloud Computing, CLOUD
State: Published - 2013
Externally published: Yes
Event: 2013 IEEE 6th International Conference on Cloud Computing, CLOUD 2013 - Santa Clara, CA, United States
Duration: Jun 27 2013 - Jul 2 2013

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Software
