Keywords:
Computer science
Abstract:
In this work, we introduce PFault, a general methodology for analyzing the reliability of high-performance parallel file systems. PFault emulates the failure state of each storage device in a parallel file system based on a set of well-defined fault models, and enables systematic examination of system behavior under faults. To demonstrate the effectiveness of PFault, we apply it to two representative high-performance parallel file systems: Lustre and BeeGFS. Our analysis reveals multiple cases where these widely used parallel file systems are unable to function properly under faults, even after running the default checking and repairing utilities (i.e., LFSCK and BeeGFS-fsck). For example, LFSCK, the online consistency checker of Lustre, may itself hang or trigger the rebooting of nodes. Even after running LFSCK, subsequent workloads may still hang. Moreover, we find a potential security vulnerability in Lustre where reading a file after faults may retrieve the content of another, unrelated file. While running LFSCK may fix the vulnerability, it leads to a resource leak problem where a portion of Lustre’s internal namespace and storage space becomes unusable. To address the resource leak problem, we design and implement a simple tool called LeakCK, which detects leaked internal data files based on their reachability from user files. Similar reliability issues are also observed in BeeGFS and its checker, BeeGFS-fsck. In BeeGFS, PFault triggers less fatal problems but exposes a number of faults comparable to Lustre. BeeGFS-fsck still cannot handle most of the faults. In particular, when a power outage causes a network failure, BeeGFS-fsck itself may hang or abort, and ultimately fails to reconnect to the BeeGFS servers. On the other hand, by studying multiple versions of Lustre and comparing their behaviors, we verify that the latest version of Lustre has made noticeable improvement in terms of failure