Date of Award

Spring 2007

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Analysis and Modeling

First Advisor

Chokchai Leangsuksun

Abstract

Demand for computational power to solve large, challenging problems has steadily driven up the physical size of High Performance Computing (HPC) systems. As a system grows, it requires more and more components (processors, memory, disks, switches, power supplies, and so on), and maintaining the reliability of such large-scale systems becomes a challenge. To minimize the performance loss caused by unexpected failures, fault-tolerance mechanisms are vital for sustaining computational power in such environments. Checkpoint/restart is a common fault-tolerance technique that has been widely applied in single-computer systems. However, checkpointing in a large-scale HPC environment is much more challenging because of complexity, coordination, and timing issues. In this dissertation, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme addresses the fault-tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can handle only Poisson failures and a constant checkpoint interval, our model supports varying checkpoint intervals and different failure distributions. In addition, the approach considers optimality with respect to both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.

Recommended Citation

Liu, Yudan, "Reliability-aware optimal checkpoint/restart model in high performance computing" (2007). Dissertation. 524.
https://digitalcommons.latech.edu/dissertations/524

Included in

Computer Sciences Commons

Doctoral Dissertations

Reliability-aware optimal checkpoint/restart model in high performance computing
