Doctoral Dissertations

Failure analysis and reliability-aware resource allocation of parallel applications in High Performance Computing systems

Narasimha Raju Gottumukkala, Louisiana Tech University

Date of Award

Spring 2008

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Analysis and Modeling

First Advisor

Chokchai Leangsuksun

Abstract

The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications naturally becomes a major challenge because a single node failure interrupts the entire application, and the reliability of job completion decreases as the number of nodes increases. Accurate reliability knowledge of an HPC system enables runtime systems, such as resource managers, and applications to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users.

This dissertation makes three major contributions to reliability evaluation and resource management in HPC systems. First, we study the failure properties of HPC systems and observe that the Times To Failure (TTFs) of individual compute nodes follow a distribution with a time-varying failure rate, such as the Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time-varying failure rates. Based on the reliability of the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluate them on actual parallel workloads and failure data of an HPC system. Our observations indicate that applying a reliability function based on time-varying failure rates, combined with some heuristics, reduces the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we study the effect of the number of nodes on reliability and propose a reliability-aware optimal k-node allocation algorithm for large-scale parallel applications. Our simulation results for the optimal k-node algorithm indicate that choosing the number of nodes for large-scale parallel applications based on the reliability of compute nodes can reduce the overall completion time and waste time, where k may be smaller than the total number of nodes in the system.
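As a rough illustration of why reliability falls as more nodes are allocated, the sketch below treats a job on k independent nodes as a series system whose nodes each have a Weibull-distributed time to failure. It is not the dissertation's model or algorithm: the function names, the shape and scale values, and the 10-hour horizon are all hypothetical, chosen only to show the calculation.

```python
import math

def node_reliability(t, shape, scale):
    """Weibull survival function: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def system_reliability(t, nodes):
    """Probability that every node in `nodes` survives to time t.

    Assumes independent node failures, so the job (which fails if any
    node fails) has reliability equal to the product of node reliabilities.
    `nodes` is a list of (shape, scale) pairs.
    """
    r = 1.0
    for shape, scale in nodes:
        r *= node_reliability(t, shape, scale)
    return r

# Hypothetical parameters: shape < 1 gives a failure rate that decreases over time.
nodes = [(0.7, 1000.0)] * 64

for k in (1, 8, 64):
    print(k, system_reliability(10.0, nodes[:k]))
```

With identical nodes, the product reduces to the single-node reliability raised to the power k, so the printed values shrink as k grows.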

Recommended Citation

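Gottumukkala, Narasimha Raju, "Failure analysis and reliability-aware resource allocation of parallel applications in High Performance Computing systems" (2008). Dissertation. 510.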
https://digitalcommons.latech.edu/dissertations/510

Included in

Computer Sciences Commons
