Research
My research interests span computer systems, including distributed (training) systems, OS, and storage, with a special focus on fault tolerance. I am currently working on systematic solutions to tackle the reliability and efficiency bottlenecks inherent in training large foundation models.
|
One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Ruiming Lu,
Yunchi Lu,
Yuxuan Jiang,
Guangtao Xue,
Peng Huang
NSDI 2025
[PDF]
[Slides]
[Software]
Coverage
[UMich CSE]
[Tech XPlore]
|
Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Ruiming Lu*,
Erci Xu*,
Yiming Zhang,
Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu (*Co-first)
FAST 2023 (🏆 Best Paper Award, Invited to Appear in USENIX ;login:, Fast-tracked to ToS)
[PDF]
[Slides]
[Video]
[Dataset]
Coverage
[AliCloud]
[CitiNews]
|
NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow
Ruiming Lu*,
Erci Xu*,
Yiming Zhang,
Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Minglu Li, Jiesheng Wu (*Co-first)
ATC 2022
[PDF]
[Slides]
[Video]
[Dataset]
Coverage
[ChinaSys]
[Shanghai Computer Association - Storage]
|
- Artifact Evaluation Committee, FAST 2024, 2025
- Artifact Evaluation Committee, EuroSys 2024
- Artifact Evaluation Committee, SOSP 2023