Extracting information from crash narratives using text analysis is a common practice in traffic safety research. With recent advances in large language models (LLMs), it is important to understand how well popular LLM interfaces perform in classifying or extracting information from crash narratives. To explore this, our study utilized three of the most popular publicly available LLM interfaces: ChatGPT, Bard, and GPT-4. We investigated their usefulness and limitations in extracting information and answering queries about crashes using 100 crash narratives from Iowa and Kansas.

During our investigation, we assessed the capabilities and limitations of these interfaces and compared their responses to the queries. We asked five questions about each crash narrative (a sketch of how such queries could be posed programmatically follows the list):

1) Who is at fault?
2) What is the manner of collision?
3) Did the crash occur in a work zone?
4) Did the crash involve pedestrians?
5) What is the sequence of harmful events in the crash?
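The following is a minimal sketch of how these questions could be posed to a model programmatically. The study itself queried the public chat interfaces, so the OpenAI Python client, the model name, and the prompt wording below are illustrative assumptions, not the study's procedure.

```python
# Hypothetical sketch: posing the five study questions via the OpenAI Python
# client. The client usage, model name, and prompt wording are assumptions
# for illustration; the study used the public chat interfaces.
from openai import OpenAI

QUESTIONS = [
    "Who is at fault?",
    "What is the manner of collision?",
    "Did the crash occur in a work zone?",
    "Did the crash involve pedestrians?",
    "What is the sequence of harmful events in the crash?",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(narrative: str, question: str) -> str:
    """Return the model's answer to one question about one narrative."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You answer questions about traffic crash narratives."},
            {"role": "user",
             "content": f"Narrative: {narrative}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```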

For questions 1 through 4, the overall similarity among the LLMs was 70%, 35%, 96%, and 89%, respectively. Agreement was high for direct questions requiring binary responses (work-zone and pedestrian involvement) but markedly lower for the more open-ended questions of fault and manner of collision.
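One simple way to compute such per-question agreement is the fraction of narratives on which all three models return the same normalized answer. The sketch below uses this aggregation rule, which is an assumption for illustration rather than the study's exact procedure.

```python
# Illustrative sketch: per-question agreement as the fraction of narratives
# on which all models give the same normalized answer. The aggregation rule
# is an assumption, not necessarily the study's exact procedure.
def overall_similarity(answers: dict[str, list[str]]) -> float:
    """answers maps model name -> one answer per narrative, in the same order."""
    models = list(answers)
    n = len(answers[models[0]])
    agree = sum(
        1 for i in range(n)
        if len({answers[m][i].strip().lower() for m in models}) == 1
    )
    return agree / n

# Example with binary work-zone answers for four narratives:
responses = {
    "chatgpt": ["no", "no", "yes", "no"],
    "bard":    ["no", "no", "yes", "yes"],
    "gpt-4":   ["no", "no", "yes", "no"],
}
print(overall_similarity(responses))  # 0.75
```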

To compare the responses to question 5, we analyzed network diagrams and centrality measures. The network diagrams generated from the three LLMs' responses were not always similar, although the models sometimes identified the same influential events, i.e., events with high in-degree, out-degree, and betweenness centrality.
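A sketch of how such event sequences can be turned into directed graphs and compared via centrality measures is shown below, using the networkx library. The event names and sequences are invented examples, not data from the study.

```python
# Illustrative sketch: build a directed graph from harmful-event sequences
# and compute the centrality measures used in the comparison. Event names
# below are invented examples, not data from the study.
import networkx as nx

# Each list is one reported sequence: "event A was followed by event B".
sequences = [
    ["ran off road", "struck ditch", "overturned"],
    ["ran off road", "struck culvert"],
    ["struck ditch", "struck culvert"],
]

G = nx.DiGraph()
for seq in sequences:
    nx.add_path(G, seq)  # adds directed edges along each sequence

in_deg = dict(G.in_degree())
out_deg = dict(G.out_degree())
betweenness = nx.betweenness_centrality(G)

# High out-degree flags common initiating events; high betweenness flags
# events that link otherwise separate chains.
for event in G.nodes:
    print(f"{event}: in={in_deg[event]} out={out_deg[event]} "
          f"betweenness={betweenness[event]:.3f}")
```

Comparing these per-event centrality values across the three models' graphs indicates whether they single out the same initiating and linking events, even when the diagrams differ in detail.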

Based on our study, we suggest using multiple models to extract reliable information from crash narratives. Additionally, caution should be exercised when using these interfaces to obtain crucial safety-related information.