Artificial intelligence–assisted statistical analysis and statistical review: evidence (2023–2025) and implications for internal medicine

Ordak, Michal

Review articles

Michal Ordak
Department of Pharmacotherapy and Pharmaceutical Care, Faculty of Pharmacy, Medical University of Warsaw, Warszawa, Poland

DOI: 10.20452/pamw.17243

Published online: February 27, 2026.

Key words: artificial intelligence, internal medicine, statistical analysis
CC BY 4.0

Abstract

Clinical research published in internal medicine journals relies heavily on statistical analysis and quantitative inference, making the quality of statistical reporting and statistical peer review central to the credibility of this literature. Despite long‑standing methodological recommendations, the quality of statistical analyses and reporting in medical journals remains suboptimal, and the proportion of manuscripts undergoing formal statistical review has not improved over recent decades. At the same time, generative artificial intelligence (AI) tools have been increasingly adopted in biomedical research, raising expectations that they may support statistical analysis and elements of the peer review process. This narrative review synthesizes evidence published between 2023 and 2025 on the use of AI‑assisted tools in statistical analysis and statistical review within medical research. The reviewed studies show that large language models can support selected tasks, including generation of analytical code, reproduction of simple statistical procedures, preliminary selection of statistical tests, and detection of certain formal statistical errors. However, AI performance is highly variable and frequently limited by incomplete consideration of statistical assumptions and reduced reliability in complex analytical scenarios. Current generative AI tools should not be regarded as fully autonomous instruments for statistical analysis or statistical peer review. Their effective use depends on statistical expertise, independent validation, and contextual judgment by human users. The review discusses implications for statistical practice and statistical review in internal medicine, a research setting characterized by heterogeneous observational data, multimorbidity, and frequent use of nonrandomized study designs and pragmatic clinical trials.

Introduction

Clinical studies published in journals of internal medicine rely largely on statistical analyses and quantitative inference, which makes the quality of statistical reporting and statistical peer review a key determinant of the credibility of this literature. Unfortunately, the quality of statistical reporting in medical journals remains low. Despite publication of methodological recommendations, the proportion of journals using specialist statistical peer review has not changed over the past 2 decades.¹ Difficulties in recruiting qualified reviewers, impractical approaches to teaching biostatistics, and failure to implement existing recommendations contribute to the fact that only a small proportion of accepted manuscripts contain statistical analyses that are assessed as correctly performed. Analyses of reporting practices have shown that even after the publication of editorial recommendations, no improvement in the quality of statistical reporting has been observed.² For example, evaluations of articles published in radiology journals indexed in the Science Citation Index showed that almost all analyzed papers contained at least 1 error, most often related to the summarization of data and reporting of P values. Importantly, the frequency of these errors was not associated with the journal impact factor, indicating that the problem also affects highly ranked specialist journals.³ A survey study among members of the World Association of Medical Editors showed that, according to 40% of the respondents, only 31%–50% of the manuscripts accepted for publication are statistically correct, and most respondents estimated the frequency of statistical peer review at 1%–10% of the submitted manuscripts.⁴ Taken together, these observations indicate that despite the availability of recommendations and increasing awareness of the problem, the quality of statistical analyses and reporting in the medical literature remains a substantial systemic challenge.

In parallel with the persistent problems related to the quality of statistical analyses in the medical literature, tools based on generative artificial intelligence (AI) are beginning to play an increasingly important role in clinical research and diagnostics. A systematic review and meta‑analysis published in 2025 in NPJ Digital Medicine showed that generative AI models achieve moderate diagnostic accuracy and, overall, do not differ significantly in effectiveness from physicians, particularly those without extensive clinical experience. At the same time, the significantly poorer performance of AI, as compared with clinical experts, indicates that the use of these tools in medicine and scientific research should be considered a support rather than a replacement for expert judgment, with due consideration of their limitations and potential consequences for research integrity.⁵ Similar conclusions arise from a systematic review and meta‑analysis of studies evaluating the performance of ChatGPT in medical licensing examinations, which demonstrated substantial variability in results depending on the model version, the type of questions, and the linguistic and examination context. Although the most recent versions of the model achieved high mean accuracy and, in many cases, obtained results comparable to or higher than those of medical students, the authors emphasize that instability of performance and susceptibility to contextual factors limit its safe application in formal medical education.⁶ An example of such limitations is provided by studies in internal medicine in which ChatGPT did not meet the passing criteria of the Polish specialization examination in internal medicine.⁷

The growing interest in the use of AI‑based tools in biomedical research further complicates the problem of the quality of statistical analyses and the credibility of scientific conclusions. Automation of method selection, data modeling, and result interpretation may lead to both improvements in analytical efficiency and the reinforcement or masking of methodological errors, particularly in the absence of appropriate expert oversight. The literature increasingly highlights that the use of AI and machine learning intensifies existing problems related to overinterpretation of statistical significance, limited reproducibility of results, and insufficient reporting of analytical assumptions. Consequently, there is a growing need for a critical assessment of the role of AI not only as an analytical tool but also as a factor influencing research integrity, reporting practices, and the scientific peer review process.⁸ Initiatives using AI tools, including large language models (LLMs), to screen scientific manuscripts are being reported with increasing frequency, accompanied by warnings from researchers regarding the risks associated with such approaches.⁹ At the same time, attention is drawn to the fact that the rapidly expanding use of generative AI in the creation and editing of medical publications may in the short term substantially change the way scientific literature is produced, raising serious concerns about transparency, authorship accountability, and the quality and integrity of the resulting content.¹⁰

The most recent article published in December 2025 in Journal of Korean Medical Science presents the role of AI in the detection of statistical errors as a component supporting research integrity.¹¹ According to the abstract, the main emphasis was placed on tools such as Statcheck and the GRIM test, which demonstrate moderate effectiveness in identifying statistical errors and may accelerate the peer review process, although they require high‑quality data and continuous human oversight. The authors also point to potential benefits and limitations of using AI, including the risk of incorrect interpretation of results, algorithmic bias, ethical concerns, and the need to preserve the decisive role of editors and reviewers. At the same time, as indicated in the abstract, the article focuses primarily on specialized tools for detecting statistical errors and does not include a review of studies evaluating the use of generative language models and other AI tools in conducting statistical analyses and statistical peer review. The absence of such a synthetic overview constituted one of the premises for writing this review.
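
To illustrate the kind of check that such tools automate, the following is a minimal Python sketch of the idea behind the GRIM (granularity‑related inconsistency of means) test: when the underlying data are integers, a mean reported for a sample of size n must correspond to some integer sum divided by n. The function name grim_consistent and the example values are illustrative assumptions, and this is not the implementation used in any of the cited tools. The sketch assumes integer‑valued observations (for example, single Likert items) and accepts either half‑up or half‑even rounding.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Check whether a reported mean can arise from n integer observations.

    reported_mean is passed as a string so that the number of reported
    decimal places (e.g. "3.50" has two) is preserved.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    mean = Decimal(reported_mean)
    decimals = max(0, -mean.as_tuple().exponent)
    quantum = Decimal(1).scaleb(-decimals)  # e.g. 0.01 for two decimals

    # Only integer sums close to mean * n can round back to the reported mean.
    start = int(mean * n) - 1
    for total in range(start, start + 4):
        exact = Decimal(total) / Decimal(n)
        for mode in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if exact.quantize(quantum, rounding=mode) == mean:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical reported values, used only to demonstrate the check.
    for mean_text, size in [("5.19", 28), ("5.18", 28), ("3.50", 4)]:
        print(mean_text, size, grim_consistent(mean_text, size))
```

A mean flagged by this check is a prompt for closer inspection rather than proof of an error, because the check does not apply to non‑integer data or to composite scores averaged over several items.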

The aim of this narrative review was to comprehensively summarize the available evidence regarding the use of AI tools in statistical analysis and statistical peer review in medical research, to discuss the limitations of these approaches, and to assess the implications of their use for the integrity of medical research and the scientific peer review process, with particular emphasis on clinical studies published in journals of internal medicine. Although the available evidence is cross‑sectional in nature and not limited to a single clinical specialty, it relates to universal analytical and peer review practices that are also present in studies published in journals of internal medicine. Therefore, the conclusions of this review are of direct relevance to authors, reviewers, and editors of internal medicine journals, in which the reliability of statistical analyses constitutes one of the key conditions for the credibility of clinical conclusions.

Methods

This narrative review covers literature published between 2023 and 2025. This period was chosen because it corresponds to the widespread public availability and rapid development of LLMs, which enabled the use of natural‑language interfaces for statistical analysis, result interpretation, and methodological review. Earlier applications of AI in data analysis were predominantly algorithmic or computation‑automation tools and did not encompass generative capabilities or language‑based evaluation of statistical analyses. A comprehensive literature search was conducted in the PubMed, Web of Science, and Scopus databases. The search strategy employed combinations of key words related to AI and statistical practice, including but not limited to: “artificial intelligence,” “generative AI,” “large language models,” “statistical analysis,” “statistical review,” “statistical reporting,” “statistical errors,” “methodological errors,” and “research integrity,” as well as the names of specific AI tools evaluated in the literature. Eligible studies were English‑language publications addressing the use of AI tools in statistical analysis, statistical review, data extraction, methodological assessment, or detection of statistical errors within medical research. The review included original research articles, validation studies, methodological studies, systematic reviews and systematic review–related studies, editorials, perspectives, and tutorials, reflecting the heterogeneous and rapidly evolving nature of this research area. No restrictions were imposed on study design, provided that the publication directly evaluated or discussed AI‑assisted statistical analysis or statistical review in a medical research context. Study selection was based on relevance to predefined thematic areas, including AI‑assisted data analysis, statistical test selection, statistical reporting and guideline adherence, statistical peer review, detection of methodological or statistical errors, and implications for research integrity.

For clarity of synthesis and to reflect the rapid evolution of AI tools, the included studies were grouped according to the year of publication (2023, 2024, and 2025). This temporal stratification was adopted to capture changes in the scope, maturity, and methodological focus of AI‑assisted statistical analysis and statistical review over time. Within each yearly stratum, evidence was synthesized narratively, emphasizing study objectives, types of AI tools evaluated, analytical tasks addressed, and key findings, rather than presenting isolated study summaries. This approach allowed for identification of emerging trends, shifts in application domains, and recurring limitations across successive stages of AI development.

Artificial intelligence–assisted statistical analysis and review: empirical evidence (2023–2025)

An overview of the evolution of AI‑assisted statistical analysis and statistical review in medical research between 2023 and 2025 is summarized in Table 1.

Table 1. Evolution of artificial intelligence–assisted statistical analysis and statistical review in medical research (2023–2025)

Year	Dominant application domain	AI tools evaluated	Typical analytical tasks	Main findings	Key limitations identified
2023	Exploratory use and feasibility	ChatGPT, ChatGPT‑4, Bard, Llama	Test selection, concept explanation, basic analyses, educational tasks	High variability of responses, acceptable performance in simple tasks only	Ignored assumptions, unstable answers, high error risk without supervision
2024	Validation and comparison	ChatGPT‑4, Bard, Bing, Perplexity	Comparison with SPSS/R/SAS, test recommendation, epidemiologic analyses	High agreement for simple procedures, usability advantages	Discrepancies in complex analyses, unreliable method selection
2025	Review, QC, and research integrity	ChatGPT‑4o, Gemini, Elicit	Statistical review, guideline compliance, error detection, data extraction	Effective support for screening and standard checks	Contextual errors, limited assessment of methodological nuance
Abbreviations: AI, artificial intelligence; QC, quality control

Early empirical evaluation of generative artificial intelligence for statistical analysis (2023)

Studies published in 2023 focused on the empirical evaluation of the correctness of responses generated by generative language models in the area of statistical analysis and the selection of analytical methods. In one of these studies, responses provided by ChatGPT to questions concerning general statistical concepts, the conduct of analyses, and the choice of statistical tests were assessed using examples from publications in the field of allergology. It was shown that a substantial proportion of the responses were incomplete and that the model frequently failed to take into account key assumptions required for the application of specific statistical tests. In addition, significant variability was observed in responses to identically formulated questions, indicating a risk of inappropriate method selection and incorrect interpretation of results when the model is used without adequate expert supervision.¹² At the same time, comparative studies were conducted to evaluate different AI chatbots with respect to their usefulness in statistics and teaching of quantitative methods. An analysis including ChatGPT, ChatGPT‑4, Bard, and Llama showed that newer models, particularly ChatGPT‑4, achieved better performance in solving tasks related to statistics and differential calculus. At the same time, the authors emphasized that even the highest‑rated models were not free from errors and required critical evaluation of the generated responses. These findings suggest that AI chatbots may serve as supportive tools in the educational process but do not constitute a standalone solution for statistical analysis.¹³ Editorial publications addressing the practical consequences of using generative AI in statistical analyses published in medical journals appeared in 2023. These papers highlighted the need to verify, in each case, the assumptions underlying statistical tests proposed by language models and the necessity of confirming AI recommendations in the methodological literature. 
It was also emphasized that the growing use of AI‑based tools should be associated with strengthening, rather than reduction, of the role of statistical peer review and editorial oversight.¹⁴ Subsequent studies assessed the effectiveness of ChatGPT as a tool for solving practical biostatistical problems used in medical education. The tests included tasks related, among others, to significance testing, analysis of variance, confidence interval estimation, and selection of appropriate analytical methods. It was demonstrated that the performance of the model was limited, especially in initial attempts, and that correct solutions often appeared only after iterative interactions and additional guidance. Although newer versions of the model achieved better results, the authors emphasized that the risk of generating incorrect answers remains a significant limitation of its practical use.¹⁵ In the same year, a case study was published illustrating the use of ChatGPT to support real‑world biostatistical data analysis using data from the National Health and Nutrition Examination Survey. The model was used for data preparation, proposing logistic regression models, and interpreting results related to epidemiologic trends. A key element of this approach was the independent verification of all AI‑generated recommendations using standard statistical methods. The results indicated that language models may facilitate the analytical process, but only under conditions of continuous control and validation by a user with methodological competence.¹⁶ A separate line of research in 2023 addressed the application of generative AI in the area of statistical process control. Analyses showed that ChatGPT performed well on structured tasks, such as explaining basic concepts or translating code, while it encountered difficulties with more complex problems requiring the development of new analytical solutions. 
The risk of generating misleading or incorrect results in tasks extending beyond standard patterns was also highlighted. These observations confirm the need for result validation and cautious use of generative AI in statistical analyses.¹⁷

In summary, the publications from 2023 discussed above indicated that early applications of generative AI in statistical analysis focused on evaluating the correctness and stability of responses and the selection of analytical methods. At the same time, it was consistently noted that the lack of expert oversight and independent verification leads to a substantial risk of methodological errors, limiting the possibility of treating AI as an autonomous analytical tool.
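
The response variability described above can be quantified in a simple way. The sketch below, written in Python, summarizes how often repeated submissions of one identical prompt yield the same test recommendation. The function name, the normalization step, and the example answers are hypothetical and are not taken from the cited studies, which used their own assessment procedures.

```python
from collections import Counter


def response_stability(answers):
    """Summarize agreement among answers to one repeatedly submitted prompt.

    Returns the most frequent (modal) answer, the share of responses that
    match it, and the number of distinct answers observed.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    # Assumption: answers are compared after trimming and lowercasing only.
    normalized = [answer.strip().lower() for answer in answers]
    counts = Counter(normalized)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "modal_answer": modal_answer,
        "modal_share": modal_count / len(normalized),
        "distinct_answers": len(counts),
    }


if __name__ == "__main__":
    # Invented recommendations returned for the same question asked 5 times.
    repeated = [
        "Independent t test",
        "Mann-Whitney U test",
        "independent t test",
        "Welch t test",
        "Independent t test",
    ]
    print(response_stability(repeated))
```

A modal share well below 1 indicates that a single query cannot be taken as the model's settled recommendation, which is consistent with the call for expert verification.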

Comparative validation against classical statistical tools (2024)

Studies published in 2024 shifted the focus from a general evaluation of responses provided by generative language models to direct comparisons of their results with those obtained using classical statistical tools, and to the assessment of their usefulness in specific analytical tasks. In one study, the results of statistical analyses obtained using ChatGPT‑4 were compared with those produced by the SPSS package for tests commonly used in medical and dental research. Full agreement of results was demonstrated for simple analyses, such as t tests and linear regression, whereas in more complex procedures, including post hoc analyses, confidence interval estimation, and some nonparametric tests, substantial discrepancies between the tools were observed. These findings indicate that although ChatGPT‑4 can correctly reproduce simple statistical analyses, its use in more advanced procedures requires careful validation and comparison with established statistical software.¹⁸ In 2024, extended analyses were also published addressing the quality of statistical advice generated by ChatGPT in the context of real scientific publications. In one editorial analysis, statistical recommendations generated by the model on the basis of an accepted article in allergology were evaluated, with particular attention to the appropriateness of test selection and consideration of methodological assumptions. It was shown that ChatGPT consistently recommended analyses that were inadequate for the nature of the data, including methods sensitive to outliers and tests with low power for small sample sizes. 
Repeatedly asking the same question did not lead to a correct recommendation of the nonparametric method used in the original study, highlighting the limited reliability of the model in the selection of statistical tests without expert supervision.¹⁹ Further studies published in 2024 focused on comparing the capabilities of ChatGPT‑4 with classical biostatistical packages, such as SAS, SPSS, and R, in the analysis of epidemiological data. The analyses included descriptive statistics, between‑group comparisons, and correlation analyses, and the results were evaluated in terms of consistency, analytical efficiency, and user usefulness. High agreement of results was demonstrated for descriptive statistics, along with a clear advantage of ChatGPT‑4 in ease of use, while minor discrepancies were observed in more complex analyses. The authors indicated that this tool may lower the entry threshold for epidemiologic data analysis, although its application in more advanced analyses remains limited.²⁰ In 2024, in some studies, the ability of LLMs to recommend appropriate statistical tests based on descriptions of research scenarios was systematically evaluated. In a study comparing ChatGPT‑3.5, Google Bard, Microsoft Bing Chat, and Perplexity, high agreement between model recommendations and expert responses was demonstrated, particularly in the assessment of the acceptability of the proposed tests. At the same time, moderate agreement among the models themselves and substantial variability in response stability across repeated queries were observed. These results suggest that LLMs may serve as tools supporting methodological decision‑making but do not eliminate the need for verification by an experienced statistician.²¹ An important new research direction in 2024 was the evaluation of the use of LLMs in the context of statistical consultations and educational support. 
In one project, the usefulness, effectiveness, and satisfaction associated with the use of LLMs in statistical consultations were assessed from both the consultant and user perspective. The project included qualitative and quantitative components as well as an evaluation of a training module, indicating the potential of LLMs as tools supporting the consultation process, while emphasizing the importance of user competence and awareness of model limitations.²² At the same time, studies were conducted to evaluate the nature and types of errors made by language models in complex tasks requiring clinical and analytical reasoning. In a study comparing ChatGPT‑4o and Claude 3 with resident physicians, the models showed lower effectiveness in situations of diagnostic uncertainty. Analysis of error structure indicated that logical and informational errors predominated in model responses, whereas errors classified as statistical constituted a smaller proportion of all incorrect answers. These findings suggest that even when formal analytical procedures are applied, language models may have difficulty correctly integrating available information, which has important implications for the interpretation of results generated by AI.²³ In 2024, a study was also published evaluating the performance of ChatGPT‑4 in a formal professional examination related to audiological qualifications. High overall performance of the model was demonstrated in multiple choice tests, while indicating that the dominant source of errors consisted of incorrect or incomplete information rather than logical or statistical errors. The authors emphasized that improving the quality of information sources may increase model performance, although at present its use remains limited to supportive functions.²⁴

In summary, publications from 2024 indicated clear progress in AI‑based tools, as compared with 2023, particularly in terms of direct comparisons with classical statistical software and the evaluation of usefulness in specific analytical tasks. At the same time, these studies consistently demonstrated that result discrepancies, instability of recommendations, and limitations in more complex statistical analyses necessitate expert oversight and independent validation of results generated by AI.
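
Independent validation of the kind recommended in these studies can be as simple as recomputing a reported statistic from the raw data. The following Python sketch calculates Welch's t statistic and the Welch–Satterthwaite degrees of freedom with the standard library and compares them with a value reported by a chatbot. The data, the reported values, the tolerance, and the function names are illustrative assumptions; in practice, the reference result would come from established statistical software.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance is the sample variance (denominator n - 1).
    term_a = variance(group_a) / n_a
    term_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(term_a + term_b)
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t_stat, df


def matches_reference(reported, reference, rel_tol=0.01):
    """Flag whether an AI-reported value agrees with the recomputed one."""
    return math.isclose(reported, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    # Invented measurements for two independent groups.
    treated = [5.1, 4.9, 6.2, 5.8, 6.0]
    control = [4.2, 4.8, 5.0, 4.1, 4.6]
    t_stat, df = welch_t(treated, control)
    print(f"recomputed t = {t_stat:.3f}, df = {df:.2f}")
    # Hypothetical values that a chatbot might have reported.
    for reported_t in (3.45, 2.98):
        print(reported_t, matches_reference(reported_t, t_stat))
```

The tolerance should reflect the number of decimals reported; a mismatch beyond rounding error signals that the AI output needs to be checked before it is used.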

Expansion to statistical review, quality control, and integrity‑focused applications (2025)

Studies published in 2025 clearly expanded the scope of AI applications from single statistical analyses to support quality control processes, statistical peer review, systematic reviews, and the assessment of research integrity. In one study, the ability of the Gemini Advanced 2.0 Flash model to verify compliance of oncology publications with the Statistical Analyses and Methods in the Published Literature guidelines was evaluated, demonstrating high agreement between AI assessments and those of a statistical editor, while difficulties were observed in identifying errors related to multiple comparison correction. These results indicate that AI can effectively support statistical peer review but does not eliminate the need for expert oversight in more complex aspects of reporting.²⁵ In 2025, the potential use of AI tools to automate data extraction in systematic reviews was also evaluated. Analyses comparing Elicit and ChatGPT with double data extraction performed by humans showed high agreement for standardized variables, while confabulations and errors were observed for review‑specific variables. The authors proposed a hybrid model in which AI replaces the second extractor and the human focuses on resolving discrepancies.²⁶ At the same time, review and perspective papers were published analyzing the role of AI in detecting methodological and statistical errors and unethical publication practices. It was indicated that AI tools can support the identification of statistical errors, image manipulation, and inappropriate citations, but their effectiveness remains limited, and susceptibility to circumvention and contextual errors prevents their independent use in manuscript evaluation. 
At the same time, the growing potential of AI as an element of scalable quality control in the publishing process was emphasized.²⁷ Methodological studies analyzing the risks associated with AI‑assisted data analysis showed that generative language models frequently make errors in descriptive and inferential analyses, particularly in more complex statistical procedures. Better results were obtained when AI was used to generate analytical syntax for classical statistical software, although the effectiveness of this approach was strongly dependent on user competence. The authors indicated that the more complex the analysis, the greater the risk of errors and the stronger the need for result verification.²⁸ In 2025, attention was drawn in Nature to the growing movement of using AI tools to detect errors in scientific publications, accompanied by warnings regarding their limitations and the risk of false alarms.⁹ At the same time, a record number of scientific retractions was reported, further highlighting the scale of problems related to research reliability and the importance of effective quality control mechanisms.²⁹ A study based on case analyses in immunology assessed whether AI‑assisted statistical peer review could help identify errors leading to article retraction. The ChatGPT‑4o and Gemini Advanced 2.0 models detected numerous methodological problems, such as absence of power analysis, unjustified parametric tests, and lack of correction for multiple comparisons, but missed some important shortcomings requiring contextual expert assessment. These results confirm that AI can support the process of error identification but does not replace statistical peer review conducted by a specialist.³⁰ Further studies examined the use of ChatGPT as a tool supporting quantitative analyses in the social sciences and biomedical research. 
It was shown that the model can facilitate replication of simpler analyses but has substantial difficulties in tasks requiring complex methodological reasoning, leading to repetitive responses and incorrect conclusions. The authors emphasized the need for caution when using AI in more advanced research procedures.³¹ Other studies published in 2025 compared the ability of different language models to assess the quality of systematic reviews and meta‑analyses. Moderate agreement between AI assessments and expert evaluations was demonstrated when using ROBIS and AMSTAR 2 tools, along with a tendency of some models to underestimate the risk of bias. These findings indicate the potential of AI for preliminary quality assessment of secondary research, with limited reliability in evaluating methodological nuances.³² In the area of statistical education and data analysis, it was shown that generative AI can facilitate the conduct of analyses by individuals with limited statistical training, mainly through code generation and procedural support. At the same time, it was emphasized that a lack of deep understanding of statistical methods leads to partial or incorrect application of analytical procedures. The authors pointed to the need to shift the emphasis in statistical education from software operation to understanding methodological concepts.³³ In 2025, some studies analyzed the use of ChatGPT as an analytical tool in medical research, indicating its usefulness in exploratory data analysis, data preparation, and selection of statistical tests when prompts are appropriately formulated. Analytical performance increased significantly with greater prompt precision, as demonstrated by the marked increase in inferential statistics accuracy from 32.5% with basic prompts to 81.3% and 92.5% with intermediate and advanced prompts, respectively. 
This substantial difference indicates that performance variability is strongly associated with prompt specificity and suggests that prompt engineering is a critical methodological factor influencing AI‑based analytical outcomes. However, even in the most advanced scenarios, verification of results by a human was required. These findings confirm that AI can increase research efficiency but does not eliminate the need for quality control.³⁴,³⁵ An important research direction in 2025 was also the evaluation of the potential use of AI to assess the trustworthiness of clinical research. It was shown that ChatGPT can effectively support the completion of checklists such as the Trustworthiness in Randomised Clinical Trials checklist and automate the extraction of data tables, significantly accelerating the assessment of randomized controlled trials. At the same time, the authors emphasized the need for repeated prompting and user corrections, which limits full automation of this process.³⁶ Further studies addressed the use of language models for reviewing statistical analysis plans and pharmacokinetic and pharmacodynamic components in clinical trial protocols. Good ability of AI to identify key elements of study designs and compliance with regulatory guidelines was demonstrated, along with limitations in assessing clinical context. 
These results suggest the usefulness of AI as a supportive rather than a decision‑making tool.³⁷ In radiology‑related research, the ability of LLMs to generate statistical solutions, data visualizations, and code for deep learning–based tasks was evaluated.³⁸ It was shown that the models can provide useful baseline code and support the user in iterative resolution of execution errors, but they require validation and correction by a person with methodological expertise, confirming their role as a starting point for analyses rather than autonomous tools.³⁹ In 2025, it was also demonstrated that contemporary language models achieve very high accuracy in the selection of statistical tests in standardized scenarios, while differences were observed in the quality of justifications and statistical reasoning. Although test selection accuracy was high, the quality of explanations and identification of statistical assumptions remained variable across models. These findings indicate their potential as consultative tools, particularly in simpler applications.⁴⁰ At the same time, tutorial‑type papers were published presenting practical applications of ChatGPT in the work of a biostatistician. Both successful applications and significant errors were highlighted, emphasizing the need for critical evaluation of AI‑generated results. The authors clearly indicated that language models do not replace statistical knowledge but can support routine tasks.⁴¹ Experimental studies assessed the competencies of students and researchers in using ChatGPT for epidemiologic analyses. No significant differences in result quality were observed between analyses conducted with integrated AI support and those using a classical approach, whereas analysis completion time was shorter in the group using AI tools. 
These findings suggest that AI can increase work efficiency while maintaining the need for expert oversight.⁴² Other studies published in 2025 addressed the use of AI in data extraction and the conduct of systematic reviews and meta‑analyses. Moderate effectiveness of AI was demonstrated in study selection and high agreement in meta‑analytical calculations, along with limitations at the screening and interpretation stages. These findings confirmed that AI can effectively support selected stages of systematic reviews but does not replace the work of experienced researchers.^43,44

In summary, publications from 2025 show a clear shift in AI applications from simple analytical support toward tools assisting statistical peer review, quality control, and assessment of research integrity. In highly structured, textbook‑like scenarios, such as statistical test selection based on clearly defined assumptions, performance has been reported as near‑perfect in some evaluations. At the same time, it was consistently demonstrated that the effectiveness of these tools is limited in tasks requiring contextual reasoning and advanced methodological assessment, which clearly confirms the need to maintain the decisive role of humans in statistical analysis and statistical peer review.

Implications of artificial intelligence for statistical analysis and statistical review in internal medicine

The literature reviewed above focuses on the use of AI in statistical analysis and statistical review within medical research. These methodological processes are common across clinical disciplines and become particularly consequential in areas of medicine where clinical evidence is predominantly derived from heterogeneous observational studies and pragmatic clinical trials. Within this broader methodological context, internal medicine serves as a clinically relevant setting in which the implications of AI‑assisted statistical analysis and statistical review can be discussed, without implying that the reviewed evidence is specific to this field.

Key implications of AI‑assisted statistical analysis and statistical review for internal medicine are summarized in Table 2.

Table 2. Implications of artificial intelligence–assisted statistical analysis and statistical review for internal medicine

Area	Potential role of AI	Key limitations and risks	Required human oversight
Statistical analysis	Support for code generation and preliminary method selection	Ignoring assumptions, inappropriate default methods	Verification of assumptions, final model selection
Complex clinical data	Facilitation of routine analyses in large datasets	Limited handling of multimorbidity and complex end points	Contextual clinical interpretation
Statistical peer review	Screening for reporting deficiencies and formal errors	Inability to assess clinical plausibility	Expert statistical and editorial judgment
Interpretation of results	Drafting narrative summaries of findings	Risk of overinterpretation of significance	Critical appraisal of uncertainty and limitations
Research integrity	Support for data extraction and consistency checks	Susceptibility to contextual and conceptual errors	Accountability of authors, reviewers, and editors

Abbreviations: see Table 1

First, the reviewed evidence consistently indicates that generative AI tools should not be treated as fully autonomous instruments for statistical analysis in clinical research. Although recent agentic systems demonstrate increasing operational autonomy, their performance remains dependent on human‑defined objectives, prompt design, and expert oversight. Across studies published between 2023 and 2025, models such as ChatGPT demonstrated substantial variability in responses, incomplete consideration of statistical assumptions, and a tendency to recommend inappropriate analytical methods, particularly in more complex or nonstandard scenarios.^12,15,19,28 In light of these findings, the use of such tools in the context of research published in internal medicine journals raises specific concerns, as studies in this field commonly rely on heterogeneous observational datasets, multiple clinically driven end points, and analytical decisions that require careful consideration of confounding, model assumptions, and clinical plausibility. Under these conditions, reliance on AI‑generated recommendations without expert verification may compound, rather than mitigate, the risk of methodological error identified in the reviewed literature.

Second, the findings suggest that the most robust role of AI lies in augmenting, rather than replacing, statistical expertise. Studies comparing AI outputs with classical statistical software showed high concordance for simple analyses, while discrepancies emerged in advanced procedures requiring contextual judgment and methodological nuance.^18,20 Similarly, evidence from consultation‑based and educational settings indicates that AI can support routine tasks, such as code generation or preliminary method selection, but its effectiveness remains strongly dependent on the user’s statistical competence.^22,33 In the context of research published in internal medicine journals, where analyses frequently involve complex clinical variables, multiple interrelated outcomes, and nonrandomized study designs, these findings imply that AI‑assisted tools may facilitate preliminary or repetitive analytical tasks, while responsibility for methodological choices, model validation, and interpretation of results must remain with experienced statisticians and clinically informed reviewers.

Third, the expansion of AI applications into statistical review and quality control has direct implications for editorial and peer review practices in internal medicine journals. Empirical studies published in 2025 show that AI‑assisted tools are capable of identifying selected reporting deficiencies, inconsistencies with statistical reporting guidelines, and certain formal statistical errors with moderate‑to‑high agreement, as compared with human reviewers.^25,30,32 However, as highlighted across the reviewed evidence, these tools have limited capacity to evaluate context‑dependent methodological decisions that are common in internal medicine research, including analyses involving multimorbidity, complex exposure definitions, composite or competing outcomes, and clinically motivated subgroup analyses. Consequently, while AI may support initial screening or standardization of statistical review, the evaluation of methodological adequacy and clinical plausibility of statistical choices in internal medicine manuscripts must remain the responsibility of experienced human reviewers.
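Agreement between AI‑assisted and human statistical review, as referred to above, is commonly quantified with chance‑corrected measures such as Cohen's kappa. The following is a minimal sketch in Python, not the method of any cited study; the function name `cohens_kappa` and the manuscript flags are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies,
    # summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data (illustration only): 1 = statistical error flagged,
# 0 = not flagged, for 10 manuscripts.
ai_flags = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
human_flags = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
kappa = cohens_kappa(ai_flags, human_flags)
print(f"kappa = {kappa:.2f}")
```

In this invented example, the two sets of flags agree on 8 of 10 manuscripts, and half of that agreement would be expected by chance alone, which is why raw percentage agreement tends to overstate concordance between AI and human reviewers.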

Fourth, the reviewed literature highlights potential risks associated with AI‑assisted statistical analyses that are relevant when such tools are applied to the interpretation of clinical research in internal medicine. These risks include masking of methodological flaws through fluent narrative explanations, reinforcement of inappropriate default analytical choices, and overconfidence in statistically significant findings without adequate consideration of assumptions, uncertainty, or study limitations.^14,27,31 When transferred to the context of internal medicine, where statistical results often inform broad diagnostic and therapeutic decision‑making across diverse patient populations, these risks may affect not only methodological rigor but also the clinical interpretability and credibility of published evidence.

Finally, the reviewed evidence supports a model in which AI contributes primarily to scalability and efficiency in statistical workflows, while responsibility for analytical validity remains with human experts. Across the analyzed studies, hybrid approaches in which AI assists with preliminary screening, data extraction, or generation of analytical syntax, and statisticians retain responsibility for validation and interpretation, appear most consistent with the strengths and limitations identified for current AI tools.^16,26,36 When such models are considered in relation to research published in internal medicine journals, this implies that any integration of AI into statistical analysis or statistical review should be explicitly framed as supportive rather than substitutive, and accompanied by clear editorial policies regarding disclosure, acceptable use, and the continued primacy of expert statistical judgment. In the editorial phase, AI‑assisted tools may play an increasingly prominent role in automated screening for reporting guideline compliance (eg, CONSORT or STROBE), verification of citation formatting, and detection of incomplete methodological reporting. Such applications are well‑suited to structured, rule‑based checks, and may enhance consistency and efficiency in manuscript processing. However, their integration should complement, rather than replace, expert editorial and statistical oversight.⁴⁵
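A structured, rule‑based check of the kind mentioned above can be as simple as recomputing a p value from a reported test statistic and comparing it with the reported value. The sketch below, written in Python, covers only a two‑sided test based on a standard normal (z) statistic; the function names and the rounding tolerance are assumptions made for illustration and do not describe any specific tool evaluated in the reviewed studies.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p value for a standard normal test statistic."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def p_value_is_consistent(z, reported_p, decimals=3):
    """True if reported_p matches the recomputed p value, allowing for
    rounding of the reported value to `decimals` decimal places."""
    recomputed = two_sided_p_from_z(z)
    tolerance = 0.5 * 10 ** (-decimals)
    # A tiny epsilon guards against floating-point noise at the boundary.
    return abs(recomputed - reported_p) <= tolerance + 1e-12


# Hypothetical reported results (illustration only).
print(p_value_is_consistent(1.96, 0.05))  # statistic and p value agree
print(p_value_is_consistent(1.96, 0.01))  # p value does not match z
```

A flag raised by such a check indicates only a formal inconsistency; whether the test itself was appropriate for the data and the clinical question remains a matter for expert statistical judgment.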

Given the rapid evolution of LLMs, including the emergence of reasoning‑oriented (“thinking”) systems and agentic workflows, findings from earlier studies should be interpreted within the temporal context of the specific model version evaluated. Continuous reassessment will therefore be necessary as AI capabilities further develop.

Article information

Acknowledgments: None.

Funding: None.

Conflict of interest: None declared.

AI statement: ChatGPT (OpenAI, San Francisco, California, United States) was used to improve the grammar, clarity, and style of the English language in this manuscript. All scientific content is solely the responsibility of the author.

References

Thiese MS, Arnold ZC, Walker SD. The misuse and abuse of statistics in biomedical research. Biochem Med (Zagreb). 2015; 25: 5‑11.
Diong J, Butler AA, Gandevia SC, et al. Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice. PLoS One. 2018; 13: e0202121.
Günel Karadeniz P, Uzabacı E, Atış Kuyuk S, et al. Statistical errors in articles published in radiology journals. Diagn Interv Radiol. 2019; 25: 102‑108.
Ordak M. Statistical reviews in journals of the World Association of Medical Editors. Pol Arch Intern Med. 2024; 134: 16778.
Takita H, Kabata D, Walston SL, et al. A systematic review and meta‑analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med. 2025; 8: 175.