INTRODUCTION
Resistance training is a key element in many sports and in physical fitness, promoting muscle hypertrophy and strength development [1]. Designing resistance training programs is a nuanced process, requiring expertise in, for example, exercise physiology [2, 3], biomechanics [4, 5] and training science [6, 7, 8]. Athletes lacking this knowledge are prone to designing flawed strength training plans, which can result in underperformance or even health issues. This underscores the need for guidance in creating individualized strength training programs. The emergence of artificial intelligence (AI), and more specifically of Large Language Models (LLMs), has the potential to assist inexperienced athletes by providing them with well-designed strength training plans. LLMs such as GPT-4 and Google Gemini have been trained on an extensive corpus of text and enable human-like conversational interactions in various applications by providing responses to user input [9, 10]. LLMs hold promise for providing assistance across various disciplines, including medicine [11, 12, 13], health promotion [14, 15], and the design of endurance [16] or resistance training plans [17]. However, their limitations and imperfections are also evident.
While AI shows potential in medicine, such as in administrative tasks and decision aids, significant limitations exist in accuracy, coherence, and transparency, raising ethical concerns [10, 11]. For example, ChatGPT, used as a psychiatric provider for imaginary patients, delivered appropriate advice for simple cases but deteriorated in quality with complex scenarios, potentially leading to dangerous outcomes [12]. In nutrition, ChatGPT can offer general dietary advice but often fails to account for specific health conditions and may not adhere to evidence-based guidelines. Additionally, in sports science, ChatGPT correctly calculated only 1 out of 4 sample sizes, with inconsistent results upon repeated prompts [13]. These limitations pose risks, particularly in health-related fields, where inaccuracies can lead to harmful outcomes.
In a sports context, experts rating ChatGPT-generated running plans against 22 criteria judged them suboptimal, although the plans improved with more detailed input. Similarly, Washif et al. assessed 12-week strength training programs generated by GPT-3.5 and GPT-4 for intermediate and advanced lifters [17]. Although the stated goal was strength development, the AI-generated plans included “high volume” hypertrophy blocks that did not align with this primary goal [17]. While the training variables for strength and hypertrophy overlap, optimizing muscle hypertrophy may require approaches distinct from those used when pure strength development is the focus [8, 18].
While such research has improved our understanding of LLMs’ capability to provide recommendations for training plans, it is currently unknown whether the recommendations of contemporary, publicly available LLMs are in line with recent scientific evidence as rated by coaching experts. To address this research gap, our study primarily aimed to investigate and compare muscle hypertrophy-focused resistance training plans generated by Google Gemini and GPT-4, as assessed by coaching experts based on evidence-based criteria. Our secondary goal was to determine whether the generated training plans are reproducible when the same prompts are used multiple times.
MATERIALS AND METHODS
General Design
To evaluate the hypertrophy-related resistance training programs generated by GPT-4 and Google Gemini, we based our analytical approach on existing literature from the fields of exercise and medical science [16, 19, 20, 21], adapting it to the goal and setting of our research. Specifically, we i) defined criteria of relevance for hypertrophy-related training plans, ii) established input information for publicly available LLMs, iii) generated hypertrophy-related training plans using the defined input information, and iv) involved coaching experts in the field of hypertrophy to evaluate the generated training plans based on the previously defined criteria. We specifically aimed to compare training plan quality in three ways: 1) between GPT-4 and Google Gemini, 2) with little versus detailed prompt input within each LLM, and 3) with the same prompt (both little and detailed input) repeated within the same LLM.
Definition of criteria of relevance for hypertrophy-related training plans
So far, there is no general consensus on quality criteria for hypertrophy-related training parameters. Thus, we defined criteria of relevance for our specific case after consulting with experts in hypertrophy-oriented resistance exercise and reviewing the related scientific literature. The derived aspects of relevance for the design of hypertrophy-related training plans are:
Screening for individuals at increased risk for adverse exercise-related events, such as those related to cardiovascular, pulmonary, metabolic, and other diseases [22].
Definition of a goal [18].
Definition of a reliable and valid testing procedure to assess initial performance status. This procedure should inform individual training variables (e.g., years of resistance training experience, body composition, previous training volume, training weights) and define training effects, including performance, physiological, subjective, biomechanical or cognitive measures [18, 22, 23].
Application of training principles, including the principle of specificity (e.g., exercises selected to achieve a specific goal), the principle of progressive overload (e.g., increasing intensity, load, repetitions, or volume over time), the principle of variation (e.g., changing exercises, repetition ranges, or training intensities over time), and the principle of recovery (e.g., ensuring adequate rest between training days or between sessions training the same muscle group) [18, 24].
Definition of basic strength training aspects including, but not limited to, exercise selection, exercise order, and exercise technique (e.g., regarding safety aspects), as well as training variables like frequency, intensity, and volume [6, 18, 25, 26, 27, 33].
In addition to general training-related aspects, advanced aspects may be considered when prescribing (evidence-based) training plans, such as:
Use of advanced exercise methods, such as manipulating movement speed, range of motion, or kinematics, as well as time under tension and the set endpoint (e.g., ratings of perceived exertion [RPE], repetitions in reserve [RIR], proximity to failure) [26, 27, 28].
Use of advanced or unconventional training methods (e.g., drop sets, rest-pause training, or pre-exhaustion) or of specialized equipment (e.g., blood flow restriction bandages) [6, 18].
Application of advanced recovery strategies (e.g., heat therapy, cooling, sleep) [18].
Application of nutritional aspects (e.g., micro-/macronutrient intake, hydration) [18, 29].
Definition of information input into publicly available LLMs
For our study, we selected GPT-4 (accessed via Microsoft Copilot) and Google Gemini (1.0 Pro), both accessed on February 15, 2024. These LLMs have rarely been investigated, but since they are freely available to the public, they are likely to be widely used in various everyday use cases.
Because of their chatbot nature, LLMs will encounter diverse inputs from individuals seeking hypertrophy-related training plans. We therefore developed two input scenarios based on factors such as prior knowledge and personal experience. The prompts are reported below exactly as entered into the LLMs: one containing little information (prompt 1) and one containing detailed information (prompt 2), the latter accompanied by an additional training plan providing information about previous training habits (Table 1).
“Please provide me with a resistance training plan to increase muscle hypertrophy.”
“Please provide me with a resistance training plan to increase muscle hypertrophy over the next 16 weeks. I am a 25-year-old male and have been doing resistance training 4 times a week for the past 8 years. Previous resistance training sessions have lasted 90 minutes. I have access to free weights and machines, both of which I would like to use. I also have training equipment such as belts, straps, bandages that I can use, and I have a body composition scale for monitoring purposes. My body weight is 80 kg with 12% body fat. I am 180 cm tall. I would like to increase the frequency to 5–6 times a week. I want to increase my total muscle mass as much as possible, although I am at an advanced level. I want to emphasize my arms as they are proportionally smaller than the rest of my body. I like to train with 3 seconds long eccentric actions, while the concentric action is explosive. My one-rep maximums in the squat, bench press, and deadlift are 200 kg, 140 kg, and 230 kg respectively. Overall, I want to incorporate advanced training strategies such as drop sets because I enjoy them. I also want to focus on my nutrition and recovery for muscle hypertrophy.”
TABLE 1
Previous resistance training plan provided in prompt 2 (inserted February 15, 2024)
Two authors (TH and LM) independently inserted each prompt into each LLM on the same day (February 15, 2024) to investigate the reliability of each LLM. This resulted in a total of 8 weekly training plans generated by the LLMs. Among them were two plans created by Google Gemini using little information about a fictitious person, provided by two different researchers (referred to as GGL1 and GGL2). Another two plans were generated by Google Gemini based on detailed information about the same fictitious person (referred to as GGD1 and GGD2). A similar approach was used with GPT-4 (accessed via Microsoft Copilot), producing two plans with little information (GPT-4L1 and GPT-4L2) and two more with detailed information (GPT-4D1 and GPT-4D2). The conversations with both LLMs are available in the Appendix (S-Table 1–8).
Coaching experts
The evaluation of the Google Gemini- and GPT-4-derived training plans followed the procedure of previously published studies [16, 19, 20, 21]. Experienced coaches evaluated the provided resistance training plans, focusing on key aspects essential for effective training plan design, as outlined in Table 2. Each aspect was rated on a 1–5 Likert scale. To be eligible to evaluate the training plans, each coach was required to have at least a bachelor’s degree in sport science and 3 years of coaching experience in strength and conditioning or resistance training. The study was approved by the Ethics Committee of the Faculty of Exercise Science and Training at the University of Würzburg (EV2023/7-2609) and conducted in accordance with the Declaration of Helsinki. All coaches gave informed consent to participate in the study.
TABLE 2
Relevant aspects when designing a training plan and the corresponding rating scale used to evaluate the Google Gemini- and GPT-4-generated training plans
Statistical analysis
As previously performed [16], we calculated descriptive statistics (i.e., median, range) for the Likert scores of all rated items for each question. We tested for normal distribution using the Shapiro-Wilk test. Since the majority of our variables were not normally distributed, we performed a Friedman ANOVA with Bonferroni correction. The significance level was set at p < 0.05. Fleiss’ Kappa was calculated to assess inter-rater reliability [30]. SPSS (IBM, version 28.0.1.1) was used for all statistical analyses.
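All analyses were run in SPSS; purely for illustration, the sketch below shows how a comparable pipeline (Shapiro-Wilk normality checks, a Friedman test across related plan ratings, Bonferroni-corrected pairwise post-hoc comparisons, and Fleiss’ Kappa) could be assembled in Python with SciPy and statsmodels. The rating arrays and plan labels are hypothetical placeholders (not our data), and the use of Wilcoxon signed-rank tests for the post-hoc step is an assumption standing in for the SPSS post-hoc routine.

```python
# Minimal, hedged sketch of the statistical pipeline described above.
# All data below are synthetic placeholders, not the study's ratings.
import numpy as np
from itertools import combinations
from scipy.stats import shapiro, friedmanchisquare, wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical Likert ratings: 12 raters x 4 training plans for one item (1-5 scale).
ratings = rng.integers(1, 6, size=(12, 4))
plans = ["GGL1", "GGL2", "GPT-4L1", "GPT-4L2"]

# Shapiro-Wilk test for normality, per plan.
for name, col in zip(plans, ratings.T):
    w, p = shapiro(col)
    print(f"Shapiro-Wilk {name}: W={w:.3f}, p={p:.3f}")

# Friedman test across the related (repeated-measures) plan ratings.
stat, p_friedman = friedmanchisquare(*ratings.T)
print(f"Friedman: chi2={stat:.3f}, p={p_friedman:.3f}")

# Bonferroni-corrected pairwise post-hoc comparisons (Wilcoxon signed-rank,
# assumed here as the post-hoc procedure).
pairs = list(combinations(range(len(plans)), 2))
alpha_adj = 0.05 / len(pairs)
for i, j in pairs:
    try:
        _, p_pair = wilcoxon(ratings[:, i], ratings[:, j])
    except ValueError:  # all paired differences are zero -> test undefined
        p_pair = 1.0
    verdict = "significant" if p_pair < alpha_adj else "n.s."
    print(f"{plans[i]} vs {plans[j]}: p={p_pair:.3f} ({verdict} at alpha={alpha_adj:.4f})")

# Fleiss' Kappa for one plan: 28 rated items (rows) x 12 raters (columns).
item_ratings = rng.integers(1, 6, size=(28, 12))
counts, _ = aggregate_raters(item_ratings)  # items x rating-category counts
print(f"Fleiss' Kappa: {fleiss_kappa(counts, method='fleiss'):.3f}")
```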
RESULTS
A total of 12 coaching experts (age range: 23–49 years; 4 with a PhD, 5 with a Master’s degree, and 3 with a Bachelor’s degree in Sport Science) with 11.3 ± 5.7 years of coaching experience in resistance training participated in our study.
For Google Gemini, Fleiss’ Kappa was 0.188 for GGL1, 0.100 for GGL2, 0.139 for GGD1, and 0.121 for GGD2. For GPT-4, Fleiss’ Kappa was 0.046 for GPT-4L1, 0.216 for GPT-4L2, 0.140 for GPT-4D1, and 0.1785 for GPT-4D2. Likert scale charts of each training plan are illustrated in the supplementary material (S-Figure 1–8).
Reproducibility of LLM output following the same prompt input
Descriptive statistics and results of significance testing for reproducibility between the same prompts within the LLMs are presented in Table 3.
TABLE 3
Descriptive analysis (median and range) and results of the significance testing of AI repeatability, comparing different training plans generated by Google Gemini and GPT-4. Likert-scale ratings ranged from 1 (“bad”) to 5 (“good”), with 0 indicating “not applicable”.
[i] GGL1 = Google Gemini, little information, first try; GGL2 = Google Gemini, little information, second try; GPT-4L1 = GPT-4, little information, first try; GPT-4L2 = GPT-4, little information, second try; GGD1 = Google Gemini, detailed information, first try; GGD2 = Google Gemini, detailed information, second try; GPT-4D1 = GPT-4, detailed information, first try; GPT-4D2 = GPT-4, detailed information, second try.
Differences between Google Gemini and GPT-4
Descriptive statistics and results of significance testing between Google Gemini and GPT-4 with different input information are presented in Table 4.
TABLE 4
Descriptive analysis (median and range) and results of the significance testing of different AIs, comparing different training plans generated by Google Gemini and GPT-4. Likert-scale ratings ranged from 1 (“bad”) to 5 (“good”), with 0 indicating “not applicable”.
[i] GGL1 = Google Gemini, little information, first try; GGL2 = Google Gemini, little information, second try; GPT-4L1 = GPT-4, little information, first try; GPT-4L2 = GPT-4, little information, second try; GGD1 = Google Gemini, detailed information, first try; GGD2 = Google Gemini, detailed information, second try; GPT-4D1 = GPT-4, detailed information, first try; GPT-4D2 = GPT-4, detailed information, second try.
Differences in prompt information density (little information versus detailed information)
Descriptive statistics and results of significance testing comparing little versus detailed prompt input within Google Gemini and GPT-4 are presented in Table 5. All other statistical comparisons across LLMs and prompt information densities that do not correspond to the primary comparisons presented here (e.g., GGL1 versus GPT-4D2) can be found in the Appendix (S-Table 9).
TABLE 5
Descriptive analysis (median and range) and results of the significance testing of different prompts, comparing different training plans generated by Google Gemini and GPT-4. Likert-scale ratings ranged from 1 (“bad”) to 5 (“good”), with 0 indicating “not applicable”.
[i] GGL1 = Google Gemini, little information, first try; GGL2 = Google Gemini, little information, second try; GPT-4L1 = GPT-4, little information, first try; GPT-4L2 = GPT-4, little information, second try; GGD1 = Google Gemini, detailed information, first try; GGD2 = Google Gemini, detailed information, second try; GPT-4D1 = GPT-4, detailed information, first try; GPT-4D2 = GPT-4, detailed information, second try.
DISCUSSION
Our study aimed to investigate the quality of resistance training plans focusing on muscle hypertrophy generated by Google Gemini and GPT-4 (accessed via Microsoft Copilot), and whether such training plans can be generated reproducibly when identical prompts are provided multiple times. We report here that when hypertrophy-focused training plans are repeatedly generated by the same LLM (i.e., Google Gemini or GPT-4) using the same prompts, the resulting plans consistently maintain a comparable level of quality as assessed by coaching experts. Moreover, the quality of muscle hypertrophy-related training plans generated by GPT-4 was rated higher compared to Google Gemini, irrespective of the level of input information provided. Notably, the quality of muscle hypertrophy-related training plans increased with more detailed information input.
Reproducibility of LLMs
When provided with identical prompts, Google Gemini and GPT-4 generated muscle hypertrophy-related training plans that were rated similarly on the 5-point Likert scale across 27 of 28 items. The only exception was the “set endpoint” item: GPT-4D1 had a median rating of 5, whereas GPT-4D2 had a median rating of 0. The set endpoint was identified within the previous resistance training program in GPT-4D1 but not in GPT-4D2. Therefore, it is recommended that users request any missing information by submitting a follow-up prompt (i.e., check-backs) if the initial prompt proves insufficient [16].
Despite being rated similarly in quality by the coaching experts, the muscle hypertrophy-related training plans differed in their exercise prescriptions and variables (see S-Table 1–8). Athletes and coaches must therefore verify that the recommended exercises are feasible for the individual athlete and can be performed with the available equipment. To the best of our knowledge, this research is the first to assess the reproducibility of the output quality of publicly available LLMs, such as Google Gemini and GPT-4. Consequently, we cannot compare our results to existing literature. However, a recent study investigated ChatGPT’s use as a sample-size calculator for study design development and found that when the same prompt was reused, ChatGPT produced a completely different output [13]. This partially mirrors our findings: the generated plans differed in content, although their quality as rated by the coaching experts was similar. As LLMs are rapidly evolving, we encourage further research into the reproducibility of recommendations provided by LLMs.
Differences in quality of LLM-generated muscle hypertrophy-related training plans
Our results show that the quality of hypertrophy-related training plans generated by GPT-4 was rated higher compared to Google Gemini (regardless of the level of input information provided) and that for both LLMs, the quality of generated hypertrophy-related training plans increased with more detailed input information.
Our second prompt with little information (“Please provide me with a resistance training plan to increase muscle hypertrophy”; S-Table 2) inserted into Google Gemini did not result in a resistance training plan. Instead, Gemini responded with general principles of resistance training that require further input to generate an appropriate training plan. Although this is not always the case, as shown by our first attempt in Gemini (S-Table 1), it seems necessary to provide sufficient information to the LLM. Furthermore, providing little information often resulted in training prescriptions that were missing (e.g., training intensity) or illogical. For example, entering the same prompt into GPT-4 resulted in a recommendation to train each muscle group at least twice a week; in the training plan itself, however, muscle groups were trained only once a week, indicating inconsistency within the LLM.
Our findings are in line with the available literature, which indicates that training plans improve with more input information but are still not rated as optimal [12, 14, 15, 16, 17, 31].
Washif et al. assessed GPT-3.5 and GPT-4’s ability to create resistance training plans for intermediate and advanced athletes and found that some programming variables and recommendations were insufficient (e.g., exercise selection, exercise tempo, contemporary practices) [17]. The selection of exercises was evaluated as moderately sufficient for promoting strength development and hypertrophy, and the authors identified discrepancies in the prescribed exercise tempo (e.g., 2 seconds eccentric phase/0 seconds pause/2 seconds concentric phase [2/0/2]), noting that it was inconsistently applied and did not fully align with contemporary research recommendations, which suggest a medium-paced eccentric phase and a rapid concentric phase (e.g., 3–4/0/1) [32]. Furthermore, time-efficient techniques for promoting strength gains or muscle hypertrophy (e.g., drop sets) were omitted, and an overemphasis on muscle damage as a mechanism for hypertrophy was noted as another limitation [17]. These limitations suggest that while GPT-3.5 and GPT-4 can generate training plans, they may not always align with specific goals. This is consistent with our findings, in which time under tension, an important variable for hypertrophy [26], was often omitted. These shortcomings indicate the need for further refinement of these LLMs, such as GPT or Google Gemini, and underline that caution is warranted in their use.
Other studies have also noted the imperfections of LLMs [12, 14, 15, 31]. For instance, Zaleski et al. used a mixed-methods approach to evaluate the comprehensiveness and accuracy of ChatGPT’s exercise recommendations from open-text queries and found that AI-generated advice was 41.2% comprehensive and 90.7% accurate compared to gold-standard exercise guidelines [31]. Dergaa et al. evaluated GPT-4’s ability to generate exercise prescriptions for five hypothetical patient profiles and concluded that while AI-generated plans offer a safe starting point, they are inadequate for optimizing long-term fitness and health [14]. In addition, previous studies of ChatGPT’s ability to act as a psychiatric provider [12] or nutrition consultant [15] yielded similar results. ChatGPT can provide appropriate information in less complex scenarios. However, as complexity and patient vulnerability increase, its tailored recommendations become inadequate and sometimes dangerous [12].
Therefore, exercise professionals should provide LLMs with detailed information, carefully review LLM-generated recommendations for muscle hypertrophy-related training regimens, and not blindly implement them into practice, given the risk of missing or incomplete information in the output.
Our research shows that GPT-4 accessed via Microsoft Copilot received higher ratings for exercise selection, exercise order, training intensity, repetition range, training volume, rest periods, and set endpoint compared to Google Gemini. Similarly, it has been reported that GPT-4 outperforms previous versions of GPT (i.e., GPT-3.5) on variables such as ‘sets and repetitions’ and ‘rest intervals’ [17]. Although the existing literature is limited, it might be argued that GPT-4 currently outperforms both its previous versions and Google Gemini in providing recommendations for strength training plans.
Strengths, limitations and future research
A strength of our study is that we compared two different LLMs, provided them with different input information densities, and, for the first time, assessed the reproducibility of the quality of recommendations provided by LLMs.
Although some LLMs provide better-quality resistance training plans than others, caution should be taken and their plans should not be implemented blindly. It should be noted that while a high-quality hypertrophy-related training plan is important for athletes, other aspects such as explaining the training program to the athlete, frequent training plan adjustments, and the athlete-coach relationship are crucial in the training process. Additionally, athletes may lack evidence-based resistance training knowledge and may be unable to evaluate and adjust inappropriate training recommendations from LLMs. Consequently, a coach is essential in the training process of athletes and cannot currently be replaced by LLMs and their recommendations for muscle hypertrophy, although LLMs can provide a baseline for training recommendations [14, 16, 17]. Since the quality of LLM recommendations depends on the quality of the input, and given their widespread use and increasing availability, it seems prudent for athletes and coaches to be educated about the use, potential, and limitations of these forms of AI in order to use them safely.
Our study is not without limitations. Firstly, our research is limited to the versions of GPT-4 and Google Gemini available on February 15, 2024. Because publicly available LLMs are evolving rapidly, new models are continuously being developed. Therefore, future LLMs may be capable of providing high-quality, reference-based, hypertrophy-related resistance training plans. However, in agreement with our work, previous studies have highlighted that LLMs (specifically ChatGPT) can be used as a tool for creating initial, context-dependent frameworks in medicine [12], health promotion [14], and exercise science [16, 17]. These frameworks still require the expertise of human specialists to tailor them to individual scenarios, ensuring that users are not put at risk by relying solely on LLMs. As LLM versions continue to evolve, we also highlight the challenge of comparing specific outcomes, such as the quality of training prescriptions, across studies that utilize different versions of LLMs.
This is because updates and changes to the algorithms in newer versions of LLMs could significantly affect their performance and the quality of their outputs. Consequently, we emphasize that studies involving LLMs are relevant only to the specific versions being investigated at the time. We suggest the development of a regulatory framework in sport and exercise science to address the proper use and application of LLMs, as well as methods for comparing their outputs within the field. This framework would help sport practitioners understand how to effectively integrate LLMs into exercise science and apply them appropriately in practice.
It is important to note that the versions of Google Gemini and GPT-4 used in this study provided references when generating hypertrophy-related resistance training plans. Caution should be taken regarding the quality and existence of these references, as LLMs can fabricate references [34].
We reported low Fleiss’ Kappa values (0.046 to 0.216), indicating low inter-rater reliability. This is consistent with previous work [16] and occurred despite the coaching experts being well-educated and experienced in the field of exercise science. However, the influence of certain training parameters (e.g., resistance training volume [35, 36]) or novel resistance training trends (e.g., stretch-mediated hypertrophy [37]) on muscle hypertrophy has not been fully elucidated. For example, although research suggests that a resistance training volume of at least 10 sets per muscle group is efficient for maximizing muscle hypertrophy [36], the threshold at which a certain number of sets per week no longer induces “more” muscle hypertrophy is unclear [35]. Thus, coaching experts may have different perspectives on the importance of training aspects related to muscle hypertrophy.
We encourage future studies to investigate different and newer versions of LLMs, with particular attention to comparing LLM-generated resistance training plans with training plans traditionally designed by certified coaches. Furthermore, it should be stressed that research on training plans generated for female compared with male individuals is very scarce and would open up new research opportunities in the field of artificial intelligence. Although previous research with ChatGPT has shown that prompts describing female versus male individuals lead to similar strength training recommendations [17], it is unclear whether this is consistent for other sports or training regimens in different LLMs.
CONCLUSIONS
Our findings indicate that AI technology (in this case GPT-4 and Google Gemini) can repeatedly generate muscle hypertrophy-related training plans of similar quality when identical prompts are used within the same LLM. We found that the quality of these training plans improves with more detailed prompt information. Notably, GPT-4 outperformed Google Gemini in quality, regardless of the level of input detail. These findings underscore the importance of providing detailed information to LLMs for optimal outcomes. Moreover, the LLMs did not always provide sufficient training prescriptions, highlighting the importance of human expertise and experience to manually customize LLM-derived training plans. If LLMs are to be used safely in practice to take advantage of their potential benefits in training plan generation, sport professionals need to know what information to enter into LLMs and should carefully check the provided training plans.