Alergologia Polska - Polish Journal of Allergology
eISSN: 2391-6052
ISSN: 2353-3854
4/2025
vol. 12
 
Original article

Performance analysis of large language models on a specialized medical examination: allergy and clinical immunology

Betül Dumanoğlu 1, Pamir Çerçi 2, Özge Can Bostan 3, Ümit M. Şahiner 4
  1. Department of Allergy and Immunology, Agri Training and Research Hospital, Agri, Turkey
  2. Department of Allergy and Immunology, Eskisehir City Hospital, Eskisehir, Turkey
  3. Department of Allergy and Immunology, Canakkale Mehmet Akif Ersoy State Hospital, Canakkale, Turkey
  4. Department of Pediatric Allergy and Asthma, Faculty of Medicine, Hacettepe University, Ankara, Turkey
Alergologia Polska – Polish Journal of Allergology 2025; 12, 4: 271–277
Online publication date: 2025/11/13

Introduction
Advances in artificial intelligence (AI) have expanded the potential of large language models (LLMs) in medical education and assessment. Although several studies have evaluated LLM performance on medical examinations, this is the first study to assess their performance in the field of allergy and clinical immunology.

Aim
Our study aims to evaluate the performance of five different large language models and compare them with human participants on an allergy and clinical immunology examination.

Material and methods
In this comparative, cross-sectional study, the performances of five LLMs (ChatGPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Llama 3.1 405B, and ChatGPT o1-preview) and 58 expert physician candidates were evaluated on the Turkish National Allergy and Clinical Immunology Examination. Each participant answered 100 multiple-choice questions. The questions were classified by medical topic (e.g., Allergic Diseases, Immunology, Therapeutic and Diagnostic Approaches) and by cognitive level according to Bloom's taxonomy.

Results
ChatGPT o1-preview demonstrated the highest performance with an accuracy rate of 90%, significantly outperforming the other LLMs and the human participants (p < 0.01). The accuracy rates of the other LLMs were 81% for Claude 3.5 Sonnet, 76% for ChatGPT-4o, 70% for Llama 3.1 405B, and 68% for Gemini 1.5 Pro. The average accuracy rate of the human participants was 56%. In the Bloom's taxonomy analysis, all LLMs except ChatGPT o1-preview performed worst at the "Application" level. By topic, "Allergic Diseases" was the category with the lowest success rate across all LLMs.

Conclusions
On this examination, AI outperformed human experts, indicating its significant potential in medicine. However, given the limitations of LLMs at the "Application" level and the importance of human expertise, AI should be used as a supportive tool in the medical field, with further research required to clarify both its capabilities and limitations.



© 2025 Termedia Sp. z o.o.