Introduction

Applications of artificial intelligence (AI) in medicine have been gaining increasing attention recently. Bibliometric analyses have demonstrated an explosive growth in AI-related published studies since 2019.1 Medical journals2,3 and journal sections4 devoted to AI in medicine have emerged. Hospitals are adopting large language models (LLMs) at an unprecedented pace for a variety of tasks.5 Neurology is no exception. Historically, neuroscience is considered to have inspired the invention of artificial neural networks,6 and the neurological community understands the need to adopt AI in daily practice while minding ethical, safety, and equity risks.7 To facilitate navigation in the complex glossary of terms and techniques used in the AI field, several neurological journals have published “primers” on AI.8,9 Others have published narrative7,10 or scoping11 reviews. To systematically analyze the current impact of AI on neurology, we performed an umbrella review on this topic.

Methods

The review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines12 (Figure 1) and methodology established specifically for umbrella reviews,13 as opposed to any other type of reviews (scoping or narrative), to identify and synthesize systematic reviews.

Figure 1. Study flowchart

Search strategy and eligibility criteria

The authors independently searched the PubMed database for relevant studies up to May 31, 2025, using the following search terms: (“artificial intelligence”[MeSH Terms] OR (“artificial”[All Fields] AND “intelligence”[All Fields]) OR “artificial intelligence”[All Fields]) AND (“neurology”[MeSH Terms] OR “neurology”[All Fields] OR “neurology s”[All Fields]) AND “review”[Publication Type]; and the Scopus database using the search term: “TITLE-ABS-KEY (artificial AND intelligence AND neurology) AND ALL (review).” Forward and backward citation searching of the included studies was performed to identify additional reviews not captured during the database searches. Only English-language articles were included. The titles and abstracts were screened using the following inclusion criteria: systematic reviews and meta-analyses, as well as articles related to adult neurology and its subspecialties, including ischemic and hemorrhagic stroke, epilepsy, multiple sclerosis (MS), dementia, movement disorders (including Parkinson disease [PD]), neuromuscular diseases, headache, neurocritical care, and neuro-oncology. Systematic reviews were defined as reviews in which specific search criteria were applied. Narrative and scoping reviews (without specified systematic search criteria), original papers, bibliometric analyses, book chapters or conference papers, and articles in languages other than English were excluded.

Assessment of study quality

The assessment of study quality was performed using A Measurement Tool to Assess Systematic Reviews, version 2 (AMSTAR 2) guidelines.14 This tool was selected because it is specifically designed for the critical appraisal of systematic reviews that include nonrandomized studies.14 The methodological quality of the studies was rated as high, moderate, low, or critically low, based on the identification of critical and noncritical flaws.

Data extraction

The manuscripts were manually reviewed, and pertinent information was extracted and summarized. We focused on reporting: 1) sensitivity: the ratio of true positive predictions to all positive instances; 2) specificity: the ratio of true negative predictions to all negative instances; 3) accuracy: the ratio of a machine learning (ML) model’s correct predictions to all predictions; 4) area under the curve (AUC): the area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate across different thresholds; and 5) F-score: the harmonic mean of precision (positive predictive value) and recall (sensitivity). Where appropriate, other indicators, such as the intraclass correlation coefficient (ICC) or Dice coefficient, were reported. It should be noted that metrics such as accuracy can be misleading in the context of class imbalance, which is why we prioritized reporting imbalance-robust metrics, such as AUC, sensitivity, and specificity, where available. We did not plan a quantitative analysis of the manuscripts due to the expected high heterogeneity of the data. We also extracted reference lists from all manuscripts and cross-checked them to determine the citation overlap of primary studies, using the corrected covered area (CCA) metric.15 Methodological papers (reporting guidelines, quality assessment tools) were excluded from the overlap calculations.
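For illustration, the threshold-based metrics defined above can be computed from the 4 cells of a confusion matrix. The following is a minimal sketch and not part of the review methodology; the class name ConfusionCounts and the cohort (50 affected and 950 unaffected individuals) are hypothetical and chosen only to show how accuracy can look favorable under class imbalance while sensitivity remains poor.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def sensitivity(c: ConfusionCounts) -> float:
    # True positives divided by all positive instances (recall).
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # True negatives divided by all negative instances.
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Correct predictions divided by all predictions.
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def precision(c: ConfusionCounts) -> float:
    # Positive predictive value.
    return c.tp / (c.tp + c.fp)


def f_score(c: ConfusionCounts) -> float:
    # Harmonic mean of precision and recall (F1).
    p, r = precision(c), sensitivity(c)
    return 2 * p * r / (p + r)


# Hypothetical imbalanced cohort: 50 affected, 950 unaffected individuals.
counts = ConfusionCounts(tp=10, fp=5, tn=945, fn=40)

print(f"accuracy    = {accuracy(counts):.3f}")
print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
print(f"F-score     = {f_score(counts):.3f}")
```

In this made-up example, accuracy exceeds 95% although the model misses 40 of the 50 affected individuals, which is the situation the imbalance caveat above refers to. The sketch does not guard against empty classes, in which case the divisions would fail.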

Results

A total of 58 studies were eligible for further analysis. The most relevant findings are summarized in Table 1. There was a slight citation overlap (CCA, 1.86%). Of the 3432 primary studies (unique citations), 258 (7.5%) appeared in at least 2 reviews.
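As an illustration of how citation overlap can be quantified, the sketch below assumes the commonly used definition CCA = (N − r) / (r × c − r), where N is the total number of primary-study citations across reviews (counting repeats), r is the number of unique primary studies, and c is the number of reviews. The function name and the toy reference lists are hypothetical; they are not the data of this review and do not reproduce the reported value.

```python
def corrected_covered_area(review_citations: dict[str, set[str]]) -> float:
    """Corrected covered area for a mapping of review id -> cited primary studies.

    Assumes CCA = (N - r) / (r * c - r); see the lead-in for the symbols.
    """
    c = len(review_citations)  # number of reviews
    if c < 2:
        raise ValueError("at least 2 reviews are needed to measure overlap")
    unique_studies = set().union(*review_citations.values())
    r = len(unique_studies)  # number of unique primary studies
    if r == 0:
        raise ValueError("no primary studies were cited")
    n = sum(len(cited) for cited in review_citations.values())  # with repeats
    return (n - r) / (r * c - r)


# Toy example: 3 reviews citing 6 unique primary studies, 2 of them twice.
toy_reviews = {
    "review_A": {"s1", "s2", "s3"},
    "review_B": {"s3", "s4"},
    "review_C": {"s1", "s5", "s6"},
}
cca = corrected_covered_area(toy_reviews)
print(f"CCA = {cca:.1%}")
```

A value of 0% means that no primary study is shared between reviews, whereas 100% means that every review cites every primary study.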

Table 1. Summary of the most relevant findings

Neurological subspecialty

Relevant artificial intelligence applications

Stroke

  • Detection of early ischemic lesions on CT16;
  • Detection of LVO19;
  • CT perfusion assessment17;
  • Detection of stroke18 and DWI/FLAIR mismatch19 on MRI;
  • ICH detection26;
  • Prediction of stroke hemorrhagic transformation,24 cerebral edema,27 and clinical outcomes30

Dementia

  • Prediction of conversion of MCI to AD34;
  • Distinguishing AD from healthy controls or MCI patients40;
  • Differentiating dementias, particularly FTLD43;
  • Aiding the caregivers in their daily tasks41

Movement disorders

  • Differentiating PD patients from healthy controls44 and atypical parkinsonian syndromes45;
  • Video-based assessment of movement disorders46;
  • Cognitive impairment prediction in PD47;
  • STN detection before DBS48;
  • Classification of hyperkinetic movement disorders49

Neuro-oncology

  • Interpretation of histopathological slides50;
  • Differentiation of glioma, lymphoma, and metastasis51;
  • Prediction of cognitive decline after radiation55

Epilepsy

  • Epileptiform discharge detection56;
  • NLP-based extraction of EHR data57;
  • Prediction of antiseizure medication response58 and surgical outcome59

MS

  • Diagnosis of MS63;
  • MR lesion segmentation63;
  • MS classification61;
  • Prediction of conversion of CIS to MS, cognitive outcome, and disability64

Neuromuscular disorders

  • ALS diagnosis, classification,66 and prognosis67;
  • EMG signal classification68;
  • Muscle segmentation and classification of myopathies based on muscle ultrasound and MRI69

Headache

  • Extraction of data from EHRs70;
  • Headache diagnosis and classification71;
  • Incident headache prediction70

Neurocritical care

  • Prediction of neurological outcome following cardiac arrest73

Abbreviations: AD, Alzheimer disease; ALS, amyotrophic lateral sclerosis; CIS, clinically isolated syndrome; CT, computed tomography; DBS, deep brain stimulation; DWI, diffusion-weighted imaging; EHR, electronic health record; EMG, electromyography; FLAIR, fluid-attenuated inversion recovery; FTLD, frontotemporal lobar degeneration; ICH, intracranial hemorrhage; LVO, large-vessel occlusion; MCI, mild cognitive impairment; MRI, magnetic resonance imaging; MS, multiple sclerosis; NLP, natural language processing; PD, Parkinson disease; STN, subthalamic nucleus

Stroke

As many as 17 systematic reviews were related to stroke. Of those, 6 discussed ischemic stroke detection on imaging,16-21 4 hemorrhagic transformation prediction,22-25 1 intracranial hemorrhage (ICH) detection,26 2 cerebral edema prediction,27,28 3 stroke outcome prediction,29-31 and 1 identification of time from symptom onset.32

In terms of article quality, 4 works were considered of high, 1 moderate, 6 low, and 6 critically low quality (Supplementary material, Table S1).

The classical approach to stroke detection on computed tomography (CT) images is the automated Alberta Stroke Program Early CT Score (ASPECTS).33 This method showed moderate (ICC, 0.54) and good (ICC, 0.72) reliability between automated and expert readings, and between automated predictions and the reference standard, respectively.16 This translated into a mean sensitivity of 68% (range, 45%–98%) and a mean specificity of 81% (range, 57%–95%).17 A more novel approach includes AI-based analysis of CT perfusion scans, with accuracy above 80%,17 as well as the analysis of magnetic resonance imaging (MRI), which demonstrated a pooled sensitivity and specificity both amounting to 93%, with half of the studies showing a low risk of bias.18 Interestingly, the time from stroke symptom onset could be inferred based on imaging with 79% accuracy.32 In turn, stroke with unknown time from onset (diffusion-weighted imaging / fluid-attenuated inversion recovery mismatch) was detected by AI with sensitivity and specificity of 85% and 84%, respectively.19 In another task relevant to mechanical thrombectomy, automatic detection of large-vessel occlusion, ML models demonstrated up to 85% sensitivity.17 Notably, the Viz.ai model (Viz.ai Inc., San Francisco, California, United States) showed 96% specificity in this task, and its use significantly improved all workflow metrics; however, it did not have an impact on patient outcomes.20 Regarding the more difficult problem of occlusion of the M2 segment of the middle cerebral artery, AI platforms were equally specific (97%), but not very sensitive (64%) across 8 heterogeneous studies.21

Concerning stroke hemorrhagic transformation, ML outperformed traditional models, demonstrating an overall median AUC of up to 0.91,22,23 and 0.95 in patients undergoing thrombolysis.24 For automated ICH detection, the accuracy ranged from 81% to an almost perfect 99%.26 In turn, cerebral edema was predicted by ML models with an AUC of 0.84,27 whereas malignant edema was predicted with an AUC of 0.94.28

With respect to clinical outcome prediction, the performance of AI models was good, with the AUC reaching 0.92 for algorithms using radiomics-based features.30 More specifically, the outcome after mechanical thrombectomy was predicted with a slightly lower pooled AUC of 0.85.31
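To aid interpretation of the AUC values quoted above, the AUC can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name, labels, and scores are hypothetical and do not come from any of the reviewed studies.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive case ranked above negative case
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for 3 patients with and 4 without the outcome.
example_labels = [1, 1, 1, 0, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(example_labels, example_scores)
print(f"AUC = {auc:.3f}")
```

Because this quantity depends only on how cases are ranked, it is unaffected by the choice of a classification threshold, unlike sensitivity, specificity, and accuracy. The quadratic loop is intended for clarity and small samples only.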

Dementia

We identified 11 systematic reviews related to dementia. Of those, 1 was of general scope,34 3 focused on progression from mild cognitive impairment (MCI) to Alzheimer disease (AD),35-37 4 dealt with neuroimaging,38-41 1 covered the assistance to caregivers,42 1 discussed neuropsychiatric symptoms,43 and 1 was related to frontotemporal lobar degeneration (FTLD).44

In terms of article quality, none were deemed high quality. One was deemed moderate, 4 low, and 6 critically low (Supplementary material, Table S1).

The majority of the studies were based on the Alzheimer’s Disease Neuroimaging Initiative dataset45; only about half of them used a hold-out test set, and only 17 out of 92 articles performed an external validation.34 About 67% of the studies used imaging alone, whereas almost all of the remaining ones used imaging in conjunction with other parameters, such as demographics, comorbidities, laboratory, genetic, neurophysiological, neuropsychological, and ophthalmological examinations, as well as acoustic and semantic speech parameters. Imaging was performed mostly with MRI, either feeding the whole image into the model (here 3-dimensional rather than 2-dimensional data) or extracting features from voxel-based (volume) or vertex-wise (surface morphometry) analysis. Interpretability of the models by clinicians was achieved by ranking the features or visualizing the brain regions contributing to the output (class-activation mapping). Not unexpectedly, the brain region consistently reported as most informative in classifying patients with AD vs healthy controls was the hippocampus.34

Regarding the progression of MCI to AD, the use of 18F-fluorodeoxyglucose positron emission tomography (18F-FDG-PET) or cognitive measures was the most important factor improving model performance. Interestingly, neither the type of algorithm (most frequently support vector machine [SVM]) nor the dataset size influenced model performance.35 The accuracy of published models was in the range of 66.1%–96.3%. However, most studies had a high risk of bias.37

With regard to discrimination between AD vs healthy controls and AD vs MCI patients based on neuroimaging, accuracies of up to 91% and (balanced) 83%, respectively, have been reported.41 A more recent analysis of the same task using vision transformers demonstrated similar performance (pooled AUC of 0.92), although with substantial heterogeneity across the studies.39 The task of differentiating AD from other dementias was performed with an accuracy of up to 97%. Specifically, Wu et al44 found that FTLD could be distinguished from healthy controls and patients with AD with pooled sensitivity of 86% and 84%, and pooled specificity of 89% and 81%, respectively.

Of equal importance is the use of AI technology to aid the caregivers of patients with dementia. Such tools include social or assistive robots that facilitate social interaction and help with daily tasks, smart home environments that ensure safety, and educational programs that provide cognitive stimulation. There are also models that may predict falls or detect incorrect dressing events with accuracies ranging from 23% to 98%.42 Unfortunately, the majority of the studies included in this review were qualitative, and the major identified gap was the lack of systematic design and evaluation of new technologies in the everyday life of patients with AD.

Movement disorders

In this subspecialty, we identified 6 systematic reviews, of which 2 discussed PD diagnosis and parkinsonian syndrome differentiation,46,47 1 video analysis of movement disorders,48 1 cognitive impairment prediction in PD,49 1 subthalamic nucleus (STN) localization for deep brain stimulation (DBS) procedures,50 and 1 focused specifically on hyperkinetic movement disorders.51

In terms of article quality, 2 were considered of moderate, 2 low, and 2 critically low quality (Supplementary material, Table S1).

With respect to PD diagnosis, most studies focused on differentiating between PD and healthy controls. In this task, AI models achieved up to 100% accuracy; however, an alarming 80% of the studies failed to meet minimal quality standards of AI reporting. The major reasons for that were: circular reasoning (including in the model a modality that had been used to stratify the patients), data leakage, data imbalance, and a lack of feature importance reporting or external validation.46 Concerning the differentiation between PD and parkinsonian syndromes, 18F-FDG-PET seemed the most promising, with an AUC of up to 0.98.47 Moreover, video-based assessment of parkinsonian symptoms including tremor, gait (also freezing of gait), dyskinesia, and hypomimia achieved moderate or good results.48 With reference to cognitive impairment, ML models commonly utilized both clinical and neuroimaging features, attaining an AUC of 0.83.49 Interestingly, various ML models have been used to detect the STN before DBS, with the hidden Markov model achieving the best result (diagnostic odds ratio, 838).50 Finally, the classification of hyperkinetic movement disorders, including ataxia, dystonia, or chorea, using ML (with features including accelerometer, imaging, video, and electrophysiology data) has also been the subject of a systematic review.51 The accuracies of detection ranged from 54% to 100%; however, there were no studies with external validation, and only 5 out of 55 had a low risk of bias.
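One of the flaws listed above, data leakage, typically arises when several recordings from the same patient end up in both the training and the test set. The following minimal sketch shows a patient-level split that prevents this. It is an illustration under stated assumptions rather than a method used by any of the reviewed studies; the function name, the record layout (a "patient_id" field), and the toy data are hypothetical.

```python
import random


def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split records so that all recordings of a patient stay on the same side."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


# Toy data: 10 hypothetical patients with 4 recordings each.
records = [{"patient_id": f"P{i // 4}", "recording": i} for i in range(40)]
train, test = patient_level_split(records)

train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
print(len(train), len(test), train_patients.isdisjoint(test_patients))
```

Splitting at the level of individual recordings instead would allow a model to recognize patients rather than disease, which inflates the reported accuracy.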

Neuro-oncology

We identified 6 systematic review articles devoted to neuro-oncology, mainly concerning classification and grading of brain tumors. One was related exclusively to histopathological diagnosis,52 2 examined models combining imaging and histopathology,53,54 2 discussed imaging only,55,56 and 1 was related to prediction of cognitive functioning after brain radiation.57

In terms of article quality, 5 were of low and 1 of critically low quality (Supplementary material, Table S1).

In the assessment of histopathological slides, ML models were trained to detect specific pathology, such as microvascular proliferation (AUC, 0.99), quantify immunohistochemical staining (accuracy of up to 97%), or provide tumor classification (accuracy, 85%–100%). However, all examined studies displayed a high risk of bias.52 Interestingly, after pooling the results of histopathological and imaging studies, ML models achieved excellent metrics in differentiating glioma from lymphoma (AUC, 0.99), low- from high-grade glioma (AUC, 0.89), and primary from metastatic tumors (sensitivity, 0.89; specificity, 0.87).53 Regarding radiation-induced cognitive decline, ML models had a high risk of bias and only modest performance (AUC, 0.78).57

Epilepsy

In the field of epilepsy, we identified 5 systematic reviews. Surprisingly, only 1 article focused directly on epileptiform discharge detection,58 whereas others discussed natural language processing (NLP)-based data extraction for epilepsy research,59 aid in predicting antiseizure medication (ASM) response,60 and general patient management.61 One article reviewed unsupervised ML, which is characterized by the lack of external guidance or “labels,” letting the model learn from the data.62

In terms of article quality, 1 was of moderate, 1 low, and 3 critically low quality (Supplementary material, Table S1).

Epileptiform discharge detection accuracy ranged from 74% to 97% at the level of a single electroencephalography (EEG) window, depending on the architecture used. On the patient level, accuracies were lower (85%–90%).58 Regarding other AI applications in epilepsy, an NLP-based approach has been shown to be feasible in extracting various information from electronic health records (EHRs), including the type of epilepsy (F-score of up to 0.86), the presence of psychogenic nonepileptic seizures (0.67–0.96), sudden unexpected death in epilepsy risk (0.86), and identification of surgical candidates (0.94).59 In ASM response prediction, the models included clinical, imaging, and EEG-extracted features, and presented an AUC of 0.45–0.97.60 Drug-resistant epilepsy was predicted with an AUC in the range of 0.76–0.83, whereas surgical outcome was determined with up to 96% accuracy.61 Unsupervised models also performed very well in seizure detection, prediction, signal propagation, as well as seizure localization and classification, with accuracies of over 90% for all but the last task (for which the accuracy ranged from 80% to 90%).62

Multiple sclerosis

We identified 5 studies related to multiple sclerosis. Two articles discussed general ML applications in MS,63,64 1 the diagnosis,65 1 prognosis prediction,64 and 1 biomarkers other than neuroimaging.67

In terms of article quality, 1 was of low and 4 critically low quality (Supplementary material, Table S1).

The main applications of ML in the field of MS involve establishing the diagnosis, classifying disease subtypes, and predicting the outcome.63 The ML models used are mainly decision trees and SVMs, and involve MRI-based features, followed by optical coherence tomography, blood and cerebrospinal fluid biomarkers, as well as neurophysiological studies.65 Using this multimodal approach, the accuracy in diagnosing MS using ML ranged from 81% to 100%.63 It is worth noting that MS lesion segmentation on MRI scans has been a topic of extensive research; however, as of the date of the last systematic review, it still appears to be a challenging task, in which ML models perform worse than human experts.65 Using non-MRI–based biomarkers, the accuracy of MS diagnosis was slightly lower, but still above 90%.67 The accuracy in subtype classification ranged from 71% to 96%.63 In contrast, conversion from clinically isolated syndrome (CIS) to MS could be predicted with the accuracy of 65%–92% (cognitive outcome, 72%–82%; disability, 42%–79%).64

Neuromuscular disorders

In this subspecialty, we identified 4 articles. Two examined amyotrophic lateral sclerosis (ALS),68,69 1 classification of electromyographic (EMG) signals,70 and 1 was of general scope.71

In terms of article quality, 3 were of low and 1 of critically low quality (Supplementary material, Table S1).

ML models used in ALS detection and classification were based on gait, EMG, and MRI data. In the tasks of ALS detection and classification, the pooled sensitivities were as high as 94.3% and 90%, respectively, and specificities, 98.9% and 92.3%, respectively.68 However, there were concerns as to the methodological quality of the studies. In terms of ALS prognosis, only 1 out of 16 ML models reported an AUC of 0.78 in a model that utilized clinical characteristics to predict survival without tracheostomy or mechanical ventilation.69 Regarding EMG signals, most studies relied on ML, with only 8% incorporating deep learning (DL) algorithms. Only 2 out of 51 studies classified signals at rest. Regrettably, although the reported accuracies ranged up to 100%, methodological limitations rendered the existing models unsuitable for incorporation into clinical practice.70 Other applications of ML in the field of neuromuscular disease included muscle ultrasound segmentation with accuracies of up to 88%,71 and myopathy classification based on muscle ultrasound texture parameters (accuracy of 76%).71 In muscle MRI, ML models were used to estimate water and fat fraction from conventional MRI sequences, segment muscle tissue (Dice coefficient of up to 0.88), and classify dystrophinopathies (accuracy of 91%–96%).69
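For readers less familiar with segmentation metrics, the Dice coefficient mentioned above measures the overlap between a predicted and a reference mask as twice the number of shared pixels divided by the total number of pixels in both masks. The sketch below uses flattened binary masks; the function name and the 8-pixel toy masks are hypothetical, and returning 1.0 for two empty masks is a convention assumed here, not a universal rule.

```python
def dice_coefficient(pred_mask: list[int], ref_mask: list[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred_mask) != len(ref_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, r in zip(pred_mask, ref_mask) if p and r)
    total = sum(1 for p in pred_mask if p) + sum(1 for r in ref_mask if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


# Toy masks: 1 marks a pixel labeled as muscle, 0 marks background.
predicted = [0, 1, 1, 1, 0, 0, 1, 0]
reference = [0, 1, 1, 0, 0, 1, 1, 0]
dice = dice_coefficient(predicted, reference)
print(f"Dice = {dice:.2f}")
```

A value of 1 indicates identical masks and 0 indicates no overlap, so a Dice coefficient of 0.88 corresponds to substantial, though not complete, agreement with the reference segmentation.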

Headache

In this field, 3 systematic reviews were identified. One was of general scope,72 1 was related to diagnostic tools,73 and 1 focused on headache classification.74

In terms of article quality, 1 was of low and 2 critically low quality (Supplementary material, Table S1).

As in epilepsy, one of the applications of AI (NLP) to headache is the extraction of data from EHRs, including headache frequency, with some further potential to differentiate migraine from cluster headache based on patients’ self-reported narratives.72 The most extensively studied applications of AI in the headache field are classification and diagnosis. Many digital tools are currently available, and their performance reaches 87%–90% in terms of concordance, sensitivity, and specificity.73 Most classify different primary and secondary headaches, relying on questionnaires and diagnostic criteria (which might be problematic due to circular reasoning); however, some utilize data from MRI, magnetoencephalography, or EEG.72 Interestingly, AI has also been used to predict incident headaches, with a modest AUC of 0.62, as well as to forecast treatment response, with the AUC ranging from 0.62 to 0.98.72

Neurocritical care

In this field, only 1 critically low-quality systematic review was identified,75 in which EEG together with EHR data were used to predict neurological outcome following cardiac arrest. The most commonly used ML technique was random forest, with an AUC in the range of 0.8–0.97, whereas the most commonly used DL technique was a convolutional neural network, with an AUC of 0.7–0.92.75

Discussion

In this study, we identified systematic reviews on the applications of AI in neurological subspecialties, such as stroke, dementia, movement disorders, neuro-oncology, epilepsy, MS, neuromuscular disorders, headache, and neurocritical care. We reviewed these applications, which included not only diagnosis, prognosis, imaging, and signal interpretation, but also data extraction from EHRs and caregiver support. Finally, we recognized the main obstacles in AI implementation: the lack of external validation and of diverse datasets, which collectively compromise the generalizability of the models.

Given the rapid development and increased use of AI, there have already been some attempts to narratively summarize AI applications in neurology. Some authors described the most popular tools,7 while others delved into more technical76 or regulatory11 details. In this manuscript, we adopted a different approach consisting of a methodical study of all systematic reviews on AI applications in neurology. Understandably, by doing so, we might have omitted emerging tools or solutions described in original studies, which is a limitation of this study. However, we have probably better captured the well-established trends that have made their way into reviews.

Stroke is the subspecialty with the greatest number of high-quality systematic reviews. Most of them focused on imaging, and some described models used for many years by clinicians and incorporated into the stroke guidelines (such as the ASPECTS score). Importantly, several tools have already received regulatory clearance (eg, Food and Drug Administration’s approval of RapidAI [San Mateo, California, United States] or Viz.ai). Together with the widespread hospital adoption, this reflects a maturity of the field unmatched in other neurological subspecialties, which cannot claim the same number of systematic reviews and, consequently, AI applications. Perhaps EEG signal analysis is getting close to being incorporated into clinical practice; however, this was not captured by our umbrella review due to the novelty of this information.77 Some of the described applications, vital to the neurological field, such as conversion from MCI to AD or CIS to MS, ALS prognosis, headache diagnosis, or PD differentiation, still remain in the research realm. The reason for this is most likely the fact that, although the reported accuracies in some tasks reached 90%–100%, most studies had a high risk of bias and inherent flaws, such as the lack of external validation or diverse datasets. Interestingly, even some systematic reviews that identified limitations in the reviewed studies were of low or critically low quality. This highlights the fact that the strength of AI models lies not merely in achieving the highest metrics, but in comprehensive validation across diverse datasets using rigorous methodology. This cannot be achieved without collaboration between clinicians, AI engineers, and policymakers, grounded in patient needs and guided by ethical frameworks that balance innovation with safety.

An interesting finding is the application of NLP techniques in neurology, which has been greatly eased by the advent of LLMs. NLP has been applied in data extraction from EHRs in a variety of fields,78 and in neurology, its use seems highly justified for several reasons. Firstly, in some areas, such as headache or epilepsy, the diagnosis relies heavily on anamnesis; therefore, extraction and analysis of the patient’s history would often contribute the most to the diagnosis (eg, whether the patient fulfills migraine criteria or not). In other subspecialties relying more on neurological examination, such as stroke, NLP can help extract information about stroke severity from clinical notes, achieving near-registry-level agreement.79 Secondly, NLP models could assess the patient’s speech (after speech processing techniques have been applied), looking for grammatical errors, anomia, paraphasias, or other errors that might suggest cognitive dysfunction. Such emerging applications include AD prediction using speech80 and patient monitoring using multimodal data integration.81 Finally, neurological patients often require caregiver support as well as social, occupational, and rehabilitation arrangements. NLP models might better “understand” their functional status and needs, possibly facilitating further care or transitions in care.

Our umbrella review comprehensively identified and analyzed current systematically reviewed AI applications in neurology. As AI is gaining traction, novel applications emerge, such as smartphone and wearable-based telemonitoring systems for movement disorders,82 wearable seizure detection devices,83 AI-driven radiogenomics for noninvasive molecular characterization and intraoperative guidance systems,84 as well as uncertainty quantification, synthetic imaging,85 and federated learning86 in the field of MS. Another example of a novel, though not yet systematically reviewed, technique in neuromuscular disorders is computer vision for muscle strength quantification.87 In neurocritical care, externally validated tools for intracranial pressure prediction have emerged, although their widespread implementation is hindered by integration challenges, data fragmentation, and limited clinician trust.88 The absence of systematic reviews on these applications likely reflects their recent emergence rather than a lack of clinical promise.

A limitation of the current review is that, although numerical citation overlap was low, we were unable to detect the overlap of populations included in the primary studies. This might have led to biased synthesis and overstatement of AI generalizability in neurology. Exclusion of non–English-language articles and narrative reviews might have caused publication and topic bias; however, we attempted to mitigate this by stratifying the findings by application type (diagnostic, prognostic, or interventional), neurological subspecialty, and data modality (imaging, neurophysiology, and laboratory data). Of note, some reviews reported only raw accuracy, which is error-prone in imbalanced datasets and can lead to erroneous conclusions when comparing ML models’ performance. Also, our reliance on systematic reviews, while methodologically rigorous, might underestimate the real-world impact of recently deployed AI technologies, given the rapidly evolving nature of the field.

Conclusions

To conclude, AI applications encompass the entire range of neurological subspecialties, but are most comprehensively utilized in the field of stroke. The use of AI in other subspecialties involves establishing the diagnosis, classifying disease subtypes, prognosticating, and analyzing imaging and signals. Although the reported model metrics seem promising, most of the studies carry methodological limitations and bias. To facilitate real-world AI adoption in neurology, we propose the following framework: 1) standardization of reporting, including patient-level data partitioning, preregistration, and public availability of protocols; 2) external validation requirements on at least 1 independent external dataset; and 3) clinical outcome reporting including real-world metrics, time- and cost-effectiveness, and patient-centered outcomes.