Diagnostic performance of a large language model-based artificial intelligence in in-hospital calls to the ICU: a prospective monocentric observational pilot study
Background and study objective(s)
Diagnosis in the intensive care unit (ICU) is often complex due to uncertainty and cognitive biases (1). A systematic review indicates that 18% of diagnostic errors in the ICU resulted in harm and that these errors contributed to 7% of total deaths (2). Such errors can stem from multiple sources, including physician-related factors, patient-specific traits, and the conditions under which the diagnosis is made. ChatGPT, a language model developed with deep learning, has shown impressive performance in medicine. Diagnosis in the ICU presents an opportunity to evaluate ChatGPT-4's accuracy and reliability. This study aims to compare the diagnostic accuracy of ChatGPT with that of the ICU on-call team.
Materials and methods
This was a prospective, observational, single-center study conducted in the surgical intensive care unit of a University Hospital Center. The research was submitted to the ethics committee ("Comité d'Ethique pour la Recherche en Anesthésie-Réanimation") and approved on November 30, 2024, under Institutional Review Board reference number IRB 00010254-2024-101. Included calls were those in which the ICU team was consulted for the management of a patient with unexplained organ failure. Patients had to be adults aged 18 or older and accessible for clinical examination. Based on the clinical, imaging, and biological data available at the time of the call, anonymized clinical vignettes were retrospectively submitted to ChatGPT for diagnostic evaluation. Predictions from the on-call team and from a senior intensivist at a level 1 trauma center were also collected. Each physician, along with ChatGPT, independently provided up to five diagnostic hypotheses without knowing the predictions of the others, the patient's management, or the clinical outcome. The intensivist and ChatGPT had access to the same set of information. The primary outcome was the comparison of ChatGPT's diagnoses with those of the on-call team. Performance in predicting secondary outcomes (diagnostic certainty, ICU admission, and therapeutic interventions initiated within the first 6 hours) was also assessed.
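The vignette-to-hypotheses step described above could be sketched as follows. This is a minimal illustrative sketch only: the prompt wording, the hypothesis limit handling, and the helper names (`build_vignette_prompt`, `parse_hypotheses`) are assumptions for illustration, not the study's actual protocol or prompt.

```python
# Hypothetical sketch of submitting an anonymized clinical vignette to an LLM
# and collecting up to five diagnostic hypotheses, as in the protocol above.
# The prompt text and parsing rules are illustrative assumptions.

def build_vignette_prompt(vignette: str, max_hypotheses: int = 5) -> str:
    """Wrap an anonymized clinical vignette in a diagnostic instruction."""
    return (
        "You are assisting with diagnosis in a surgical ICU.\n"
        f"Based only on the data below, list up to {max_hypotheses} "
        "diagnostic hypotheses, one per line, most likely first.\n\n"
        f"{vignette}"
    )

def parse_hypotheses(reply: str, max_hypotheses: int = 5) -> list[str]:
    """Keep at most `max_hypotheses` non-empty lines, dropping list numbering."""
    lines = [line.strip(" -0123456789.").strip() for line in reply.splitlines()]
    return [line for line in lines if line][:max_hypotheses]
```

In such a design the same prompt would be shown to the model once per call, mirroring the single blinded prediction collected from each physician.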
Results & Discussion
We report here the preliminary results of the pilot study. As of 01/31/2025, 20 patients had been included. The reasons for the calls involved four types of organ failure: 35% hemodynamic, 30% respiratory, 22% neurological, and 13% infectious. Across 12 surgical and 11 medical hospital departments, calls mainly originated from two departments, neurosurgery and orthopaedics, which accounted for 8 of the 20 calls. Most calls concerned male patients over 75 years old. Their main history consisted of cardiovascular risk factors (high blood pressure and atrial fibrillation) and recent surgery, less than a month before the call. In 64% of cases, the on-call team made at least one diagnostic hypothesis (out of five), with a confidence level ranging from 4 to 10 out of 10. Seven of the 20 patients died, 2 of whom had been admitted to an intensive care unit.
Conclusion
Inclusion in this study is ongoing; its results will serve as a preliminary basis for conducting a multi-centre study on the same subject.
Authors
Chloé PAMART (1), Jean-Denis MOYER (2), Sean PANIZZI (1), Swann ARCHIMÈDE (1), Damiano CERASUOLO (3), Clément GAKUBA (4) - (1) Resident in Anaesthesia and Intensive Care, Caen, France, (2) Hospital Practitioner in Anaesthesia and Intensive Care, Caen, France, (3) University Hospital Practitioner, Methodologist and Biostatistician, Caen, France, (4) Hospital Practitioner, Caen, France