In 1902, Willis Carrier invented modern air conditioning, revolutionizing the way we control indoor environments. Today, Carrier products are trusted to create comfortable spaces, keep food safe, and transport medical supplies. At Carrier, reliability and minimal equipment downtime are priorities, especially as extreme temperatures become more common due to climate change.

Historically, we have relied on threshold-based systems to detect equipment issues. To go beyond that and predict faults before they occur, we collaborated with the Amazon Machine Learning Solutions Lab to develop a custom machine learning model that analyzes historical sensor data and accurately predicts equipment failures. With these predictions, we can notify HVAC dealers in advance so they can schedule inspections and prevent unit downtime. The solution is scalable and can be applied to other modeling tasks: we used AWS Glue for data processing and Amazon SageMaker for feature engineering and for building a scalable deep learning model.

Our main goal was to reduce downtime by predicting equipment failures and notifying dealers, and we faced three challenges: data scalability, model scalability, and model precision. We have over 50 TB of historical data, and we expect it to grow rapidly as more units are connected to the cloud. Our modeling approach therefore needed to scale across thousands of units and deliver predictions precise enough to avoid unnecessary maintenance inspections.

We partnered with the Amazon ML Solutions Lab for a 14-week development effort that produced two primary components: a data processing module built with AWS Glue and a model training interface managed through SageMaker. The data processing module compresses and summarizes the sensor data to reduce its size and complexity, with AWS Glue handling parallel data preprocessing and feature extraction. SageMaker is used for training, tuning, and evaluating the model.
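To make the summarization idea concrete, here is a minimal sketch of how raw sensor traces from a single run cycle might be collapsed into a handful of summary features. The sensor name, the choice of statistics, and the function name are illustrative assumptions, not the actual Carrier pipeline (which runs at scale on AWS Glue):

```python
from statistics import mean

def summarize_cycle(readings):
    """Collapse the raw readings from one equipment run cycle into a
    small set of summary features. `readings` maps a sensor name to
    its list of raw values captured during the cycle.
    Sensor and feature names here are hypothetical examples."""
    features = {}
    for sensor, values in readings.items():
        features[f"{sensor}_mean"] = mean(values)
        features[f"{sensor}_min"] = min(values)
        features[f"{sensor}_max"] = max(values)
    return features

# Example: one cooling cycle's (made-up) discharge-pressure trace
cycle = {"discharge_pressure": [210.0, 240.0, 250.0, 248.0]}
print(summarize_cycle(cycle))
# → {'discharge_pressure_mean': 237.0, 'discharge_pressure_min': 210.0,
#    'discharge_pressure_max': 250.0}
```

Applied across 90 sensors and all cycles in a day, this kind of per-cycle aggregation is what turns millions of raw data points into a compact daily summary.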
Each HVAC unit generates data from 90 sensors, resulting in millions of data points per unit per day. We compress this stream into cycle features that retain the important information about the equipment's behavior while drastically reducing its size. Using AWS Glue to process the data and summarize unit behavior, we reduced the dataset from millions to around 1,200 data points per unit per day. Amazon SageMaker Processing was then used to calculate features and label the data.

We formulated the problem as a binary classification task: predict whether a unit will experience an equipment fault within the next 60 days. We focused on the summer months, when HVAC systems operate under extreme conditions.

Our model uses a transformer architecture, which handles temporal data effectively. The model encodes the features of the previous 128 equipment cycles and passes the result to a multi-layered perceptron classifier. We incorporated a weighted negative log-likelihood loss to prioritize precision and avoid false alarms.

Training involved addressing data imbalance and randomly sampling cycles so that each unit is equally represented. Training was performed on a GPU-accelerated SageMaker instance, and the model achieved its best results after 180 training epochs, with an area under the ROC curve of 81%.

Although the model is trained at the cycle level, evaluation is done at the unit level: we count multiple true positive detections from the same unit as a single true positive to avoid overcounting.
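The cycle-to-unit aggregation can be sketched as follows. This is a simplified illustration, not the production evaluation code: the function name and data layout are assumptions, and labels are treated as constant per unit for clarity. The key point from the text is preserved: a unit with many flagged cycles still counts only once.

```python
def unit_level_metrics(predictions, labels):
    """Aggregate cycle-level fault predictions to the unit level.
    A unit counts as a single positive prediction if ANY of its cycles
    is flagged, no matter how many cycles fired. `predictions` and
    `labels` map a unit ID to a list of 0/1 values per cycle."""
    tp = fp = fn = tn = 0
    for unit, cycle_preds in predictions.items():
        predicted = any(cycle_preds)  # many detections -> one
        actual = any(labels[unit])
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision}

# Toy example: u1 fires on three cycles but counts as one true positive
preds = {"u1": [0, 1, 1, 1], "u2": [0, 0, 0, 0], "u3": [1, 0, 0, 0]}
labels = {"u1": [1, 1, 1, 1], "u2": [1, 1, 1, 1], "u3": [0, 0, 0, 0]}
print(unit_level_metrics(preds, labels))
# → {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 0, 'precision': 0.5}
```

Evaluating this way keeps the metrics aligned with what matters operationally: whether a dealer is dispatched to a unit, not how many individual cycles triggered the alert.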