Keywords:
Fault-tolerant computing
Computer Science
Abstract:
In-situ scientific workflows, i.e., executing entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, running in-situ scientific workflows on extreme-scale computing systems presents fault tolerance challenges that significantly affect the correctness and performance of workflows. First, in-situ scientific workflows require sharing and moving data between coupled applications through data staging. As data volumes and generation rates keep growing, traditional data resilience approaches such as n-way replication and erasure codes become cost-prohibitive, and data staging requires a more scalable and efficient approach to data resilience. Second, increasing scale is also expected to raise the rate of silent data corruption errors, which will impact both the correctness and performance of applications. Moreover, this impact is amplified in in-situ workflows because of the dataflow between the component applications of the workflow. Third, since coupled applications in workflows frequently interact and exchange large amounts of data, simply applying state-of-the-art fault tolerance techniques such as checkpoint/restart to individual application components cannot guarantee data consistency of the workflow after failure recovery. Furthermore, naively applying these fault tolerance techniques to entire workflows limits the diversity of resilience approaches available to application components and ultimately incurs significant latency, storage overhead, and performance degradation. This thesis addresses these challenges related to data resilience and fault tolerance for in-situ scientific workflows, and makes the following contributions. This thesis first presents CoREC, a scalable, resilient in-memory data staging runti