Abstract
ChatGPT-3.5 and the Polish thoracic surgery specialty examination: a performance evaluation
- Students’ Scientific Association of Computer Analysis and Artificial Intelligence at the Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland
- Department of Biophysics, Faculty of Medical Sciences in Zabrze, Medical University of Silesia in Katowice, Poland
- Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland
Introduction
The rapid development of artificial intelligence (AI) in recent years has created new opportunities for its application in medicine. This raises questions about the reliability and limitations of AI.
Aim
The aim of the present study was to evaluate the effectiveness of the ChatGPT-3.5 language model in solving the test component of the National Specialist Examination (PES) in the field of thoracic surgery.
Material and methods
A total of 120 test questions from the 2015 PES examination were analyzed. They were grouped according to subject matter, clinical character, and cognitive requirements. Each question was submitted to the model five times, in independent sessions. The following statistical tests were applied: χ², Kruskal-Wallis, Mann-Whitney, and Spearman’s rank correlation. The consistency of the answers was assessed using Fleiss’ κ coefficient.
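The consistency measure named above can be illustrated with a minimal sketch in which each of the five sessions is treated as one "rater" and each question as one rated item. The abstract does not describe the analysis code, so the function name `fleiss_kappa`, the data layout, and the sample answers below are assumptions made purely for illustration, not the authors' implementation or data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical ratings.

    Assumed layout (hypothetical): ratings[i][s] is the answer option chosen
    for question i in session s. Every item needs the same number of ratings.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two ratings per item")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        pair_agreement = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pair_agreement / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items            # mean observed agreement
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Made-up answers for three questions across five sessions (not study data).
sample_answers = [
    ["A", "A", "A", "B", "A"],
    ["C", "D", "C", "C", "E"],
    ["B", "B", "B", "B", "B"],
]
print(round(fleiss_kappa(sample_answers), 3))
```

Applied to the full set of 120 questions with five answers each, a function of this kind would return a single coefficient, with 1 indicating identical answers in every session and values near 0 indicating agreement no better than chance.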
Results
The AI tool achieved a score of 42.2% correct answers, with the passing threshold set at 60%. A statistically significant difference was found between clinical and non-clinical questions (p = 0.041). Correct answers were characterized by a higher confidence coefficient (p < 0.001). No correlation was observed between confidence and psychometric indicators. The response consistency was assessed as moderate (κ = 0.341).
Conclusions
The result obtained by ChatGPT-3.5 is equivalent to a failing score on the examination. The confidence of responses correlated with their correctness, whereas limitations in clinical knowledge and consistency indicate the need for caution when using this model to assess specialized knowledge.
Keywords
thoracic surgery, artificial intelligence, ChatGPT, specialist examination
