Date of Award

Spring 2008

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Analysis and Modeling

First Advisor

Chokchai Leangsuksun

Abstract

The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large scale parallel applications naturally becomes a major challenge because a single node failure interrupts the entire application, and the reliability of a job completion decreases with increasing the number of nodes. Accurate reliability knowledge of a HPC system enables runtime systems such as resource management and applications to minimize performance loss due to random failures while also providing better Quality Of Service (QOS) for computational users.

This dissertation makes three major contributions for reliability evaluation and resource management in HPC systems. First we study the failure properties of HPC systems and observe that Times To Failure (TTF's) of individual compute nodes follow a time-varying failure rate based distribution like Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time varying failure rates. Based on the reliability of the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluated them on actual parallel workloads and failure data of a HPC system. Our observations indicate that applying time varying failure rate-based reliability function combined with some heuristics reduce the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we also study the effect of reliability with respect to the number of nodes and propose reliability-aware optimal k node allocation algorithm for large scale parallel applications. Our simulation results of comparing the optimal k node algorithm indicate that choosing the number of nodes for large scale parallel applications based on the reliability of compute nodes can reduce the overall completion time and waste time when the k may be smaller than the total number of nodes in the system.

Share

COinS