
Introduction
Hello. We are the five MLPerf HPC members from the ICT Systems Laboratory of Fujitsu Limited. In November 2021, at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'21), the supercomputer Fugaku, jointly developed by RIKEN and Fujitsu, held the number one spot in four supercomputing rankings (TOP500, HPCG, HPL-AI, and Graph500) for the fourth consecutive term. At the same conference, we also took first place in the MLPerf™ HPC benchmark, which measures the performance of the actual deep learning (DL) training process. In this blog, we discuss the challenges of training CosmoFlow, one of the MLPerf HPC applications, using more than half of the entire Fugaku system, and how we achieved the world's best result.
- Introduction
- What is MLPerf HPC? (Shirahata)
- What is CosmoFlow? (Tabuchi)
- What is Fugaku? (Tabuchi)
  - Processor
  - Interconnect
  - Storage
- Performance Tunings
  - DL frameworks and library (Yamazaki)
    - TensorFlow + oneDNN for aarch64
    - Mesh TensorFlow
    - Weak scaling
  - Synchronization and scheduling (Tabuchi)
    - Inter-job synchronization
    - Placing multiple jobs
  - Data staging (Kasagi)
- Result (Shirahata)
- Conclusion (Tabaru, Shirahata, Kasagi, Tabuchi, Yamazaki)