Large Transparent Checkpointing Process Completed
John Simpson | December 14, 2016
A team of researchers at Northeastern University (NU) led by Jiajun Cao, PhD candidate in the College of Computer and Information Science (CCIS), has completed what they believe to be the largest known instance of transparent checkpointing.
Transparent checkpointing allows computer scientists and engineers working on large projects to save and reopen programs without modifying any code. This assures researchers operating across hundreds or thousands of computers that their work will be safe in case of a computer failure. For example, with transparent checkpointing software, meteorologists can process and analyze billions of pieces of weather data without the fear that a computer crash could erase that work.
[Image: Transparent checkpointing assures researchers operating across potentially thousands of computers that their work will be safe in the event of a crash. Image credit: Pixabay.]
“The idea of checkpointing is that one can take a running computation, automatically stop it in the middle and save the state of everything to a file on disk,” says Gene Cooperman, professor at CCIS and Cao’s advisor. “Then you can copy that file to another computer or keep it on the same one. When you restart, the program continues running from where it left off.”
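The save-and-resume cycle Cooperman describes can be illustrated with a small sketch. Note that this is application-level checkpointing, where the program saves its own state by hand. It is not the transparent approach the NU team used, which captures a running program without any change to its code. The function names (`save_checkpoint`, `load_checkpoint`, `run_sum`) and the `stop_after` parameter are hypothetical, chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save the state to disk. Write to a temporary file first, then rename,
    so a crash during the write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, stop_after=None):
    """Sum the integers 0..n-1, checkpointing after every step.

    stop_after (hypothetical, for demonstration) simulates a crash by
    returning None after that many steps in the current run.
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    steps = 0
    while state["i"] < n:
        if stop_after is not None and steps == stop_after:
            return None  # simulated interruption
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
        steps += 1
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "sum.ckpt")
    run_sum(10, ckpt, stop_after=4)   # first run is interrupted
    print(run_sum(10, ckpt))          # second run picks up where it left off
```

In this sketch the program has to know which variables make up its state. A transparent checkpointing system removes that burden by saving the state of the whole process from the outside.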
The significance of the NU researchers' demonstration lies in the massive amount of data that was saved in a short period of time. MVAPICH, software developed at the Ohio State University that implements the Message Passing Interface (MPI), was used to run the High Performance Conjugate Gradients (HPCG) program for linear algebra in parallel over 32,768 central processing unit (CPU) cores on 2,048 computers. The run used a total of 38 terabytes of memory and was checkpointed in 10 minutes and 53 seconds.
A second program, Nanoscale Molecular Dynamics, was run in parallel over 16,368 CPU cores on 1,024 computers, using a total memory of 10 terabytes. It was checkpointed in 2 minutes and 38 seconds. According to the researchers, checkpointing this much data in under 11 minutes is a breakthrough for scientists who are usually restricted to running, modifying, and saving their programs within a 24-hour timeframe.
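To put the reported figures in perspective, the short calculation below estimates the aggregate rate at which state was saved in each run. It is a back-of-the-envelope sketch under two assumptions that the article does not confirm: that a terabyte means 1,000 gigabytes, and that the amount written to disk equals the total memory reported. The function names are hypothetical.

```python
def checkpoint_rate_gb_per_s(total_tb, minutes, seconds):
    """Aggregate checkpoint rate in GB/s, assuming 1 TB = 1,000 GB."""
    elapsed_s = minutes * 60 + seconds
    return total_tb * 1000 / elapsed_s


def memory_per_computer_gb(total_tb, computers):
    """Average memory per computer in GB, assuming 1 TB = 1,000 GB."""
    return total_tb * 1000 / computers


if __name__ == "__main__":
    # Figures as reported in the article.
    runs = {
        "High Performance Conjugate Gradients": (38, 10, 53, 2048),
        "Nanoscale Molecular Dynamics": (10, 2, 38, 1024),
    }
    for name, (tb, mins, secs, computers) in runs.items():
        rate = checkpoint_rate_gb_per_s(tb, mins, secs)
        per_computer = memory_per_computer_gb(tb, computers)
        print(f"{name}: {rate:.1f} GB/s aggregate, "
              f"{per_computer:.1f} GB per computer")
```

Under these assumptions, both runs saved state at roughly 60 gigabytes per second in aggregate, with the first run holding about 19 gigabytes per computer and the second about 10.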
“These results show how the Extended Collaborative Support Services from the National Science Foundation-supported Extreme Science and Engineering Discovery Environment can help scientists and developers improve the scalability and efficiency of their code on high-performance computing clusters,” says Jérôme Vienne, research associate at the Texas Advanced Computing Center, where the processes were carried out on the Stampede supercomputer.