High Performance Computing (HPC) systems have grown increasingly complex and now have a significant impact on the economy and society. Their high energy consumption, however, is a critical concern amid the current environmental and energy crises. It is therefore crucial to develop strategies that optimize the management of HPC systems, preserving top-tier performance while improving energy efficiency. One such strategy is to predict whether a job will fail before it runs on the system, so that resource allocation and scheduling can be adjusted in advance. This paper focuses on job failure prediction at submit time, using machine learning algorithms combined with Natural Language Processing (NLP) tools to represent jobs. The approach is also designed to work online on a real system. The study uses a dataset from an HPC center in Italy, and the authors report that the experimental results are promising.
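To make the idea of online, submit-time prediction more concrete, here is a minimal sketch in Python. It is not the paper's method: the job fields (user, command, nodes), the bag-of-words text representation, and the small logistic regression model are all assumptions chosen for illustration, standing in for the NLP tools and machine learning algorithms the abstract mentions. The point is only the workflow: turn the information known at submission into text, predict before the job runs, then update the model once the real outcome is known.

```python
import math
import re


def job_to_text(job):
    """Flatten submit-time fields (hypothetical names) into one string."""
    return " ".join(f"{key}={job[key]}" for key in sorted(job))


def tokenize(text):
    """Split the job description into lowercase tokens such as 'user=u1'."""
    return re.findall(r"[a-z0-9_=.\-/]+", text.lower())


class OnlineFailurePredictor:
    """Logistic regression over binary bag-of-words features, updated per job.

    An illustrative stand-in, not the model used in the paper.
    """

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # token -> weight
        self.bias = 0.0

    def predict_proba(self, job):
        """Estimated probability that the job fails, using submit-time data only."""
        tokens = set(tokenize(job_to_text(job)))
        score = self.bias + sum(self.weights.get(t, 0.0) for t in tokens)
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, job, failed):
        """One gradient step once the job has finished (failed is 0 or 1)."""
        error = failed - self.predict_proba(job)
        step = self.learning_rate * error
        for token in set(tokenize(job_to_text(job))):
            self.weights[token] = self.weights.get(token, 0.0) + step
        self.bias += step


# Toy job stream in submission order; the fields and labels are made up.
healthy_job = {"user": "u1", "command": "run_sim.sh", "nodes": 2}
failing_job = {"user": "u2", "command": "broken_job.sh", "nodes": 64}
history = [(healthy_job, 0), (failing_job, 1)] * 20

predictor = OnlineFailurePredictor()
for job, failed in history:
    risk = predictor.predict_proba(job)  # prediction made at submit time
    predictor.update(job, failed)        # model refreshed after the job ends

print(f"failure risk, healthy job: {predictor.predict_proba(healthy_job):.2f}")
print(f"failure risk, failing job: {predictor.predict_proba(failing_job):.2f}")
```

In a real deployment the prediction would feed the scheduler before the job starts, and the update would run only after the job's exit state is recorded, so the model never sees information that was unavailable at submit time.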
Predicting Online Job Failures in an HPC System (arXiv:2308.15481v1 [cs.DC])
by instadatahelp | Aug 31, 2023 | AI Blogs