Ruiming Lu

> Building reliable LLM training infrastructure at Tencent Hunyuan

I received my Ph.D. from Shanghai Jiao Tong University, advised by Prof. Guangtao Xue and Prof. Minglu Li. In 2023-2024, I spent a wonderful year as a visiting Ph.D. student at OrderLab, University of Michigan, Ann Arbor, hosted by Prof. Ryan Huang. I visited the Systems Research Group at MSR Asia during Fall 2024, hosted by Dr. Jilong Xue. I received my B.S. in Electrical and Computer Engineering in 2020 from the UM-SJTU Joint Institute.

Email  /  CV  /  Google Scholar  /  GitHub


Research

My research interests span computer systems, including distributed (training) systems, OS, and storage, with a special focus on fault tolerance. I am currently working on systematic solutions to tackle the reliability and efficiency bottlenecks inherent in training large foundation models.

Selected Publications (See full publication list)

One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Ruiming Lu, Yunchi Lu, Yuxuan Jiang, Guangtao Xue, Peng Huang
NSDI 2025   [PDF]   [Slides]   [Software]
Coverage   [UMich CSE]   [Tech XPlore]

Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Ruiming Lu*, Erci Xu*, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu (*Co-first)
FAST 2023   (🏆 Best Paper Award, Invited to Appear in USENIX ;login:, Fast-tracked to ToS)
[PDF]   [Slides]   [Video]   [Dataset]
Coverage   [AliCloud]   [CitiNews]

NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow
Ruiming Lu*, Erci Xu*, Yiming Zhang, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Minglu Li, Jiesheng Wu (*Co-first)
ATC 2022   [PDF]   [Slides]   [Video]   [Dataset]
Coverage   [ChinaSys]   [Shanghai Computer Association - Storage]

Professional Service

  • Artifact Evaluation Committee, FAST 2024, 2025
  • Artifact Evaluation Committee, EuroSys 2024
  • Artifact Evaluation Committee, SOSP 2023

Template credits to jonbarron. Last modified: Dec 12th, 2024.