ChatGPT-3.5 and the Polish thoracic surgery specialty examination: a performance evaluation

Adam Mitręga; Dominika Kaczyńska; Mikołaj Magiera; Natalia Denisiewicz; Michał Bielówka; Anna Kożuch; Miłosz Korbaś; Aleksandra Gaweł; Jakub Kufel

doi:10.5114/kitp.2025.154986

eISSN: 1897-4252
ISSN: 1731-5530

Kardiochirurgia i Torakochirurgia Polska/Polish Journal of Thoracic and Cardiovascular Surgery


3/2025
vol. 22

Original paper

Adam Mitręga¹, Dominika Kaczyńska¹, Mikołaj Magiera¹, Natalia Denisiewicz¹, Michał Bielówka², Anna Kożuch¹, Miłosz Korbaś¹, Aleksandra Gaweł¹, Jakub Kufel³

¹ Students’ Scientific Association of Computer Analysis and Artificial Intelligence at the Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland
² Department of Biophysics, Faculty of Medical Sciences in Zabrze, Medical University of Silesia in Katowice, Poland
³ Department of Radiology and Nuclear Medicine, Medical University of Silesia, Katowice, Poland

Kardiochirurgia i Torakochirurgia Polska 2025; 22 (3): 169-173

DOI: https://doi.org/10.5114/kitp.2025.154986

Online publish date: 2025/10/29


Introduction
The rapid development of artificial intelligence (AI) in recent years has created new opportunities for its application in medicine, which in turn raises questions about its reliability and limitations.

Aim
The aim of the present study was to evaluate the effectiveness of the ChatGPT-3.5 language model in solving the test component of the National Specialist Examination (PES) in the field of thoracic surgery.

Material and methods
A total of 120 test questions from the 2015 PES examination were analyzed. They were grouped according to subject matter, clinical character, and cognitive requirements. Each question was submitted five times in independent sessions. The following statistical tests were applied: χ², Kruskal-Wallis, Mann-Whitney, and Spearman’s rank correlation. The consistency of the answers was assessed using Fleiss’ κ coefficient.
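As an illustration of the consistency measure named above, the following is a minimal sketch of Fleiss’ κ for repeated sessions, with each session treated as one "rater" and each question as one "subject". It is not the authors’ analysis code: the function name `fleiss_kappa`, the use of answer letters A to E as categories, and the sample answers are all assumptions made for this example, and the abstract does not state whether agreement was computed over answer options or over correctness.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated the same number of times.

    ratings: list of lists; each inner list holds the category chosen for one
    item (here: one exam question) in each repeated session.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = observed_sum / n_items  # mean observed agreement
    total = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers to four questions across five sessions (not study data).
sessions = [
    ["A", "A", "A", "A", "A"],
    ["B", "B", "C", "B", "B"],
    ["D", "A", "D", "C", "D"],
    ["E", "E", "E", "E", "E"],
]
print(round(fleiss_kappa(sessions), 3))
```

A value of 1 indicates identical answers in every session, while values near 0 indicate agreement no better than chance.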

Results
ChatGPT-3.5 achieved 42.2% correct answers, below the passing threshold of 60%. A statistically significant difference was found between clinical and non-clinical questions (p = 0.041). Correct answers were characterized by a higher confidence coefficient (p < 0.001). No correlation was observed between confidence and psychometric indicators. The response consistency was assessed as moderate (κ = 0.341).

Conclusions
The result obtained by ChatGPT-3.5 is equivalent to a failing score on the examination. The confidence of responses correlated with their correctness, whereas limitations in clinical knowledge and consistency indicate the need for caution when using this model to assess specialized knowledge.

keywords:

thoracic surgery, artificial intelligence, ChatGPT, specialist examination
