Optimizing data transfers for improved performance on shared GPUs using reinforcement learning

Ryan S. Luley, Qinru Qiu

Research output: Chapter in Book/Entry/Poem > Conference contribution

2 Scopus citations

Abstract

Optimizing resource utilization is a critical issue in cloud and cluster-based computing systems. In such systems, computing resources often consist of one or more GPU devices, and much research has already been conducted on means for maximizing the use of compute resources through shared execution strategies. However, one of the most severe resource constraints in these scenarios is the data transfer channel between the host (i.e., CPU) and the device (i.e., GPU). Data transfer contention has been shown to have a significant impact on performance, yet methods for mitigating such contention have not been thoroughly studied. The techniques that have been examined make assumptions that limit their effectiveness in the general case. In this paper, we introduce a heuristic that selectively aggregates transfers in order to maximize system performance by optimizing the use of transfer channel bandwidth. We compare this heuristic to the traditional first-come-first-served approach and apply Monte Carlo reinforcement learning to find an optimal policy for message aggregation. Finally, we evaluate the performance of Monte Carlo reinforcement learning with an arbitrarily initialized policy. We demonstrate its effectiveness in learning an optimal data transfer policy without detailed system characterization, which will enable a general, adaptable solution for resource management of future systems.
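
To make the idea of learning an aggregation policy concrete, the following is a minimal sketch of every-visit Monte Carlo control on a toy queue model. It is not the authors' implementation: the cost model, the constants (`LATENCY`, `MAX_QUEUE`, `ARRIVAL_P`, `HORIZON`), and the function names `step` and `mc_control` are all assumptions chosen for illustration. The agent sees how many transfers are pending and decides whether to wait or to flush them as one aggregated transfer, starting from an arbitrarily initialized policy.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
LATENCY = 3.0      # fixed overhead paid once per (aggregated) transfer
MAX_QUEUE = 5      # queue cap; a full queue is always flushed
ARRIVAL_P = 0.5    # chance that a new transfer request arrives each step
HORIZON = 20       # steps per simulated episode


def step(queue, action, rng):
    """Toy dynamics: action 0 = wait, 1 = flush. Returns (reward, next_queue)."""
    if action == 1 or queue >= MAX_QUEUE:
        reward, queue = -LATENCY, 0      # one aggregated transfer, overhead paid once
    else:
        reward = -float(queue)           # every pending request is delayed one step
    if rng.random() < ARRIVAL_P:
        queue = min(queue + 1, MAX_QUEUE)
    return reward, queue


def mc_control(episodes=5000, epsilon=0.1, seed=0):
    """Every-visit Monte Carlo control with an epsilon-greedy policy."""
    rng = random.Random(seed)
    n_states = MAX_QUEUE + 1
    q = [[0.0, 0.0] for _ in range(n_states)]       # action-value estimates
    counts = [[0, 0] for _ in range(n_states)]      # visit counts per (state, action)
    policy = [rng.randrange(2) for _ in range(n_states)]  # arbitrary initial policy

    for _ in range(episodes):
        queue, trajectory = 0, []
        for _t in range(HORIZON):
            # Follow the current policy, exploring with probability epsilon.
            action = policy[queue] if rng.random() > epsilon else rng.randrange(2)
            reward, next_queue = step(queue, action, rng)
            trajectory.append((queue, action, reward))
            queue = next_queue

        ret = 0.0
        for state, action, reward in reversed(trajectory):
            ret += reward                            # undiscounted return from this step
            counts[state][action] += 1
            q[state][action] += (ret - q[state][action]) / counts[state][action]

        for s in range(n_states):                    # greedy policy improvement
            policy[s] = 0 if q[s][0] >= q[s][1] else 1
    return policy, q


policy, q_values = mc_control()
print("learned action per queue length (0 = wait, 1 = flush):", policy)
```

The learner never reads `LATENCY` or `ARRIVAL_P` directly; it only observes sampled returns, which mirrors the abstract's point about learning without detailed system characterization. A real system would replace `step` with measured transfer timings.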

Original language: English (US)
Title of host publication: Proceedings - 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 378-381
Number of pages: 4
ISBN (Electronic): 9781538658154
State: Published - Jul 13 2018
Event: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2018 - Washington, United States
Duration: May 1 2018 to May 4 2018

Publication series

Name: Proceedings - 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2018

Other

Other: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2018
Country/Territory: United States
City: Washington
Period: 5/1/18 to 5/4/18

Keywords

  • Concurrent kernel execution
  • Data transfer
  • GPGPU
  • Reinforcement learning
  • Resource contention

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
