Radiology reports are detailed documents that interpret the results of radiological imaging. Typically, radiologists review and interpret the images, and then summarize the key findings. This summary, known as the impression, helps clinicians and patients focus on the information that matters most for clinical decision-making. However, writing an effective impression takes far more effort than simply restating the findings. The process is time-consuming, laborious, and prone to errors. It takes doctors years of training to develop expertise in writing concise and informative radiology report summaries, which underscores the need for automation.

Automatically summarizing report findings is therefore valuable for radiology reporting. It helps translate reports into easily understandable language, relieving patients of the burden of reading lengthy and complex reports. To address this problem, we propose the use of generative AI, a type of AI that can generate new content and ideas, including conversations, stories, images, videos, and music. Generative AI relies on machine learning models, specifically large pre-trained models called foundation models (FMs). Recent advancements in machine learning, particularly the transformer-based neural network architecture, have led to the development of models with billions of parameters.

Our proposed solution fine-tunes pre-trained large language models (LLMs) to generate summaries from the findings in radiology reports. In this post, we demonstrate a strategy for fine-tuning publicly available LLMs for the task of radiology report summarization using AWS services. LLMs have shown impressive capabilities in natural language understanding and generation, and they serve as adaptable models for various domains and tasks. Starting from a pre-trained model offers significant benefits: it reduces computation cost and carbon footprint, and it removes the need to train a model from scratch.

Our solution uses the FLAN-T5 XL FM through Amazon SageMaker JumpStart, an ML hub that provides algorithms, models, and ML solutions. We illustrate the implementation of this solution using a notebook in Amazon SageMaker Studio. Fine-tuning a pre-trained model involves further training on specific data to improve performance on a different but related task. In this case, we fine-tune the FLAN-T5 XL model, an enhanced version of the T5 (Text-to-Text Transfer Transformer) family of general-purpose LLMs. T5 reformulates natural language processing (NLP) tasks into a unified text-to-text format, unlike BERT-style models, which can only output class labels or spans of the input.
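To make the text-to-text idea concrete, the following minimal sketch expresses three different tasks as plain pairs of input and target strings. The helper name `as_text_to_text` is hypothetical, the task prefixes follow the style used in the T5 paper, and the radiology sentence is made up for illustration; this is not the prompt format used by the solution.

```python
def as_text_to_text(task_prefix, source, target):
    """Express any task as an (input text, target text) pair.

    Hypothetical helper for illustration only: the task is named in a
    plain-text prefix, and the expected answer is also plain text.
    """
    return {"input": f"{task_prefix}: {source}", "target": target}


examples = [
    # Translation: the target is the translated sentence.
    as_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Classification: the class label is emitted as text, not as a class index.
    as_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
    # Summarization (made-up findings text, not taken from any dataset).
    as_text_to_text(
        "summarize",
        "The lungs are clear. There is no pleural effusion or pneumothorax.",
        "No acute cardiopulmonary process.",
    ),
]

if __name__ == "__main__":
    for example in examples:
        print(example["input"], "->", example["target"])
```

Because every task shares the same string-in, string-out interface, the same model and training objective can be reused for summarization without adding a task-specific output layer.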

To fine-tune the model, we use a dataset of 91,544 free-text radiology reports obtained from the MIMIC-CXR dataset. The solution focuses on fine-tuning the model to generate impression sections based on the findings section in radiology reports. The findings section provides detailed examination diagnoses and results, while the impression section summarizes the most significant findings and interpretations, including assessments of significance and potential diagnoses based on observed abnormalities.
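As a rough illustration of how findings and impression pairs might be pulled out of a free-text report, the sketch below splits a report on upper-case section headers. It assumes headers such as `FINDINGS:` and `IMPRESSION:` appear at the start of a line, which will not hold for every report; the function names and the `text` and `summary` field names are placeholders rather than the format used in the solution's repository.

```python
import json
import re

# Assumption: a section starts with an upper-case header followed by a colon
# at the beginning of a line, for example "FINDINGS:" or "IMPRESSION:".
HEADER_PATTERN = re.compile(r"^[ \t]*([A-Z][A-Z ]*[A-Z])[ \t]*:", re.MULTILINE)


def extract_sections(report):
    """Map each lower-cased section header to its whitespace-normalized body."""
    sections = {}
    matches = list(HEADER_PATTERN.finditer(report))
    for i, match in enumerate(matches):
        # A section body runs until the next header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[match.group(1).lower()] = " ".join(body.split())
    return sections


def to_training_record(report):
    """Return a findings/impression pair, or None if either section is missing."""
    sections = extract_sections(report)
    findings = sections.get("findings")
    impression = sections.get("impression")
    if not findings or not impression:
        return None
    # Hypothetical field names; adapt them to whatever the training job expects.
    return {"text": findings, "summary": impression}


if __name__ == "__main__":
    # Made-up report text for demonstration, not taken from MIMIC-CXR.
    sample = (
        "EXAMINATION: CHEST\n\n"
        "FINDINGS:\nThe lungs are clear. No pleural effusion\nor pneumothorax.\n\n"
        "IMPRESSION:\nNo acute cardiopulmonary process.\n"
    )
    print(json.dumps(to_training_record(sample)))
```

Reports that lack either section return `None`, so they can be filtered out before the remaining records are written one JSON object per line.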

After fine-tuning the model using SageMaker JumpStart, we evaluate the results using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which is commonly used to evaluate summarization. ROUGE1 measures the overlap of unigrams (individual words) between the model’s output and the reference summaries. ROUGE2 measures the overlap of bigrams (two-word phrases). ROUGEL is a sentence-level metric based on the longest common subsequence (LCS) between two pieces of text, ignoring newlines. ROUGELsum is a summary-level metric that treats newlines as sentence boundaries.
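The unigram, bigram, and LCS definitions above can be sketched in a few lines of plain Python. This is a simplified illustration, not the scoring code used in the solution: it assumes lower-cased whitespace tokenization, reports only the F1 variant of each score, and omits ROUGELsum. Standard ROUGE packages apply their own tokenization and optional stemming, so their numbers can differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(overlap, prediction_total, reference_total):
    """Combine precision and recall into F1; return 0.0 when nothing overlaps."""
    if overlap == 0 or prediction_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / prediction_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """F1 over n-grams shared by prediction and reference (clipped counts)."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # multiset intersection
    return f1_score(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1_score(lcs_length(pred, ref), len(pred), len(ref))


if __name__ == "__main__":
    reference = "no acute cardiopulmonary process"
    prediction = "no acute process"
    print(round(rouge_n(prediction, reference, 1), 3))  # 0.857
    print(round(rouge_n(prediction, reference, 2), 3))  # 0.4
    print(round(rouge_l(prediction, reference), 3))     # 0.857
```

In this made-up example, the prediction drops one word from the reference. All three of its unigrams appear in the reference (precision 1.0, recall 0.75), but only one of its two bigrams does, which is why the bigram score falls much further than the unigram and LCS scores.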

The overall solution architecture consists of a model development environment in SageMaker Studio, model deployment with a SageMaker endpoint, and a reporting dashboard using Amazon QuickSight. We walk through the steps to set up the development environment, give an overview of the radiology report datasets used for fine-tuning and evaluation, demonstrate fine-tuning the FLAN-T5 XL model with SageMaker JumpStart, and compare the results of the pre-trained and fine-tuned models.

To get started, you need an AWS account with access to SageMaker Studio. You also need to create an S3 bucket to host the training and evaluation datasets. The training instance type used in this solution is ml.p3.16xlarge, which requires a service quota increase. Access to the MIMIC-CXR dataset requires user registration and completion of a credentialing process.

Once the development environment is set up, you can proceed with cleaning the report data, fine-tuning the model, running inference, evaluating the models, and comparing the results. The solution is available in the Generating Radiology Report Impression using generative AI with Large Language Model on AWS GitHub repository.