Kardiochirurgia i Torakochirurgia Polska / Polish Journal of Thoracic and Cardiovascular Surgery
eISSN: 1897-4252
ISSN: 1731-5530
3/2025
vol. 22
 
Original paper

ChatGPT-3.5 and the Polish thoracic surgery specialty examination: a performance evaluation

Adam Mitręga¹, Dominika Kaczyńska¹, Mikołaj Magiera¹, Natalia Denisiewicz¹, Michał Bielówka², Anna Kożuch¹, Miłosz Korbaś¹, Aleksandra Gaweł¹, Jakub Kufel³
  1. Students’ Scientific Association of Computer Analysis and Artificial Intelligence at the Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland
  2. Department of Biophysics, Faculty of Medical Sciences in Zabrze, Medical University of Silesia in Katowice, Poland
  3. Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland
Kardiochirurgia i Torakochirurgia Polska 2025; 22 (3): 169-173
Online publish date: 2025/10/29
Introduction
The rapid development of artificial intelligence (AI) in recent years has opened new opportunities for its application in medicine. At the same time, it raises questions about the reliability and limitations of AI.

Aim
The aim of the present study was to evaluate the effectiveness of the ChatGPT-3.5 language model in solving the test component of the National Specialist Examination (PES) in the field of thoracic surgery.

Material and methods
A total of 120 test questions from the 2015 PES examination were analyzed. They were grouped according to subject matter, clinical character, and cognitive requirements. In independent sessions, each question was submitted five times. The following statistical tests were applied: χ², Kruskal-Wallis, Mann-Whitney U, and Spearman's rank correlation. The consistency of the answers was assessed using Fleiss' κ coefficient.
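
As an illustration of this protocol, the sketch below submits one question in five independent sessions and runs a χ² comparison of correctness against question type. It is a minimal sketch under stated assumptions, not the authors' pipeline: the model identifier, prompt wording, and contingency counts are hypothetical.

```python
# Illustrative sketch only: the model name, prompt wording, and the counts in
# the contingency table are assumptions, not the authors' pipeline or data.
from openai import OpenAI
from scipy.stats import chi2_contingency

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_five_times(question: str) -> list[str]:
    """Submit one multiple-choice question in five independent sessions."""
    answers = []
    for _ in range(5):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # nearest public stand-in for ChatGPT-3.5
            messages=[{"role": "user",
                       "content": question + "\nAnswer with a single letter A-E."}],
        )
        answers.append(resp.choices[0].message.content.strip()[0].upper())
    return answers

# chi-squared test of correctness vs. question type (hypothetical counts):
# rows = clinical / non-clinical, columns = correct / incorrect
chi2, p, dof, expected = chi2_contingency([[30, 42], [21, 27]])
```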

Results
The AI tool achieved a score of 42.2% correct answers, with the passing threshold set at 60%. A statistically significant difference was found between clinical and non-clinical questions (p = 0.041). Correct answers were characterized by a higher confidence coefficient (p < 0.001). No correlation was observed between confidence and psychometric indicators. The response consistency was assessed as moderate (κ = 0.341).
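
For readers unfamiliar with Fleiss' κ, the sketch below shows how agreement across the five repeated answers per question can be scored with statsmodels. The answer matrix is randomly generated for illustration, not the study's data.

```python
# Hypothetical data: 120 questions x 5 repetitions, answers coded 0-4 (A-E).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
answers = rng.integers(0, 5, size=(120, 5))  # stand-in for the recorded letters

# aggregate_raters turns (subjects x raters) codes into per-category counts
table, _ = aggregate_raters(answers)
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
# random answers give kappa near 0; the study reports moderate agreement (0.341)
```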

Conclusions
The result obtained by ChatGPT-3.5 is equivalent to a failing score on the examination. The confidence of responses correlated with their correctness, whereas limitations in clinical knowledge and consistency indicate the need for caution when using this model to assess specialized knowledge.

Keywords:

thoracic surgery, artificial intelligence, ChatGPT, specialist examination
