Clinical evaluation of an automated Alberta Stroke Program Early Computed Tomography Score (ASPECTS)-scoring system
Introduction
The Alberta Stroke Program Early Computed Tomography Score (ASPECTS) is a well-established and widely used tool for the assessment of early ischemic changes on non-contrast computed tomography (NCCT) in patients with acute ischemic stroke (1). By providing a standardized method for quantifying the extent of ischemic involvement in the middle cerebral artery (MCA) territory, ASPECTS plays a critical role in both diagnosis and treatment decision-making, including patient selection for reperfusion therapies such as intravenous thrombolysis or mechanical thrombectomy. Despite its clinical value, ASPECTS scoring is inherently observer-dependent, with substantial variability reported across raters of different experience levels (2). To support and standardize human interpretation, a growing number of automated ASPECTS solutions have been developed, both in commercial settings and academic research (2-6). These tools aim to improve objectivity, reproducibility, and availability of ASPECTS interpretation, particularly in emergency settings or in institutions lacking on-site neuroradiology expertise.
In this study, we evaluated an automated ASPECTS scoring system based on NCCT. Beyond assessing its agreement with expert human ratings, we specifically investigated whether the integration of such automated software into the clinical workflow may influence human decision-making, i.e., whether it is associated with a systematic shift in human scoring. Using two reader conditions—blinded unassisted reads and software-assisted reads—we evaluated the reliability of automated scoring and its potential impact on clinical assessment. We present this article in accordance with the STROBE reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2336/rc).
Methods
Data acquisition
We retrospectively sampled cases from our stroke registry across the full ASPECTS spectrum (0–10) to ensure coverage of stroke severity. In addition, 87 cases without vessel occlusion (ASPECTS =10) were included, resulting in a total dataset of 224 cases. Occlusion site was determined from CTA images and characterized as internal carotid artery (ICA), M1, M2, anterior cerebral artery (ACA) and posterior circulation [basilar artery and posterior cerebral artery (PCA) P1]; tandem/multiple occlusions were allowed. NCCT examinations were acquired on four computed tomography (CT) scanner models from Siemens Healthineers (Erlangen, Germany): SOMATOM Force (n=152), SOMATOM Definition AS (n=28), SOMATOM Definition Flash (n=23), and SOMATOM go.Top (n=21). All scans were acquired in helical mode with a slice thickness ranging from 1.5 to 5 mm, reflecting heterogeneous real-world clinical acquisition protocols. Cases were not excluded a priori based on image quality; the cohort reflects routine clinical NCCT acquisitions and may include motion, beam-hardening, or metallic artifacts.
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of University of Freiburg (No. EK 20/1047). Owing to the retrospective design and use of de-identified data, the requirement for written informed consent was waived.
Image analysis
The software “VEOcore ASPECTS” (VEObrain GmbH, Freiburg, Germany) was used for automated ASPECTS scoring. The processing pipeline operates as follows: NCCT scans are first mapped into a normalized space using non-rigid coregistration to a standardized template. The image is then intensity-calibrated by scaling the peak of the Hounsfield unit (HU) histogram to 33 HU. To isolate brain tissue, voxels corresponding to bone and cerebrospinal fluid (CSF) are excluded by selecting only those with intensities between 18 and 65 HU. Morphological operations are applied subsequently to refine the resulting brain mask.
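The calibration and masking steps can be illustrated with a minimal sketch. It is not the vendor's implementation: the 1-HU bin width, the `peak_window` search range, and the function name `calibrate_and_mask` are assumptions introduced here, and registration and morphological refinement are omitted.

```python
from collections import Counter


def calibrate_and_mask(hu_values, target_peak=33.0, lo=18.0, hi=65.0,
                       peak_window=(1, 100)):
    """Scale HU values so the histogram peak sits at target_peak, then keep lo..hi HU.

    peak_window is a hypothetical search range that keeps air and bone from
    being picked as the peak; the text does not say how the peak is located.
    """
    # 1-HU histogram bins (an assumed bin width), restricted to the search window.
    counts = Counter(round(v) for v in hu_values
                     if peak_window[0] <= v <= peak_window[1])
    if not counts:
        raise ValueError("no voxels inside the peak search window")
    peak = max(counts, key=counts.get)
    scale = target_peak / peak
    calibrated = [v * scale for v in hu_values]
    # Voxels outside 18..65 HU are treated as CSF or bone and dropped.
    return [v for v in calibrated if lo <= v <= hi]
```

With a toy input whose most frequent value is 30 HU, all values are multiplied by 33/30 before the 18 to 65 HU window is applied.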
An ASPECTS atlas is overlaid onto the normalized image, and HU histograms are computed for each anatomical region. A weighted mean is then derived using the first moment of the histogram raised to the third power. For each region, the hemisphere with the higher mean intensity is used as reference to compute the relative HU difference in percent, referred to as “HU shift”. This metric corresponds to the concept of net water uptake (NWU), defined as (1 − HUaffected / HUreference) * 100 (7-10).
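As a worked illustration of the HU shift, the following sketch applies the formula above to a pair of regional means. The function name `hu_shift` and the returned side label are assumptions for illustration; the weighted-mean step that produces the regional values is not reproduced here.

```python
def hu_shift(mean_left, mean_right):
    """Return (shift_percent, darker_side) for one ASPECTS region.

    The hemisphere with the higher mean is the reference, so the shift is
    (1 - HU_affected / HU_reference) * 100 and is never negative.
    """
    reference = max(mean_left, mean_right)
    affected = min(mean_left, mean_right)
    if reference <= 0:
        raise ValueError("reference mean must be positive")
    shift = (1.0 - affected / reference) * 100.0
    if mean_left == mean_right:
        side = None  # perfectly symmetric region, no darker side
    else:
        side = "left" if mean_left < mean_right else "right"
    return shift, side
```

For example, regional means of 30 HU (left) and 33 HU (right) give a shift of about 9.1% with the left side as the darker hemisphere.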
A region is marked as affected if the HU shift exceeds a predefined threshold. Based on empirical optimization, a threshold of NWU >6.6% was chosen. This value corresponds to an absolute shift of ΔHU >2.25 with respect to the calibration of whole brain values to 33 HU (8,9).
For visualization, a report is generated that displays the NCCT alongside the ASPECTS region overlay in a rotationally aligned native space—not the normalized template space—ensuring that patient-specific anatomy (e.g., ventricles) is preserved. To assist with interpretation, HU differences are displayed using directional bar indicators, which show the left-right asymmetry in each region. This approach allows the reader to assess even sub-threshold shifts and compare them with visual impressions from the scan, aiding in the identification and interpretation of false positives. The final ASPECTS score is computed as 10 minus one point for each affected region and is also shown in the report.
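The thresholding and scoring rule described above can be sketched as follows. This is a simplified illustration that assumes one HU-shift value per region for the hemisphere being scored; the names `ASPECTS_REGIONS` and `aspects_score` are hypothetical, and the default threshold of 6.6% is the value reported in the text.

```python
# The ten ASPECTS regions: caudate, lentiform, internal capsule, insula, M1-M6.
ASPECTS_REGIONS = ("C", "L", "IC", "I", "M1", "M2", "M3", "M4", "M5", "M6")


def aspects_score(region_shifts, threshold=6.6):
    """Return (score, affected_regions) from per-region HU shifts in percent.

    A region counts as affected only if its shift strictly exceeds the
    threshold; regions missing from the input are treated as unaffected.
    """
    affected = [r for r in ASPECTS_REGIONS
                if region_shifts.get(r, 0.0) > threshold]
    return 10 - len(affected), affected
```

A case with shifts of 7.0% in M1, 6.6% in M2, and 12.3% in the insula would score 8, because the M2 value does not exceed the threshold.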
Visual rating
Human assessment was performed by six radiologists. Two expert neuroradiologists performed unassisted readings (U1, U2; 8 and 9 years of neuroimaging experience) and served as the reference reader group. Four radiologists performed software-assisted readings: A1 (neuroradiologist, 9 years), A2 (neuroradiologist, 4 years), A3 (radiologist, 4 years), and A4 (radiologist, 7 years). Unassisted readers were blinded to the software output, whereas assisted readers had access to the automated ASPECTS results during assessment. For comparisons involving reader groups (e.g., automated vs. unassisted), pairwise metrics were computed between the automated score and each reader and then averaged within the respective reader group.
Statistical analysis
All analyses were performed using MATLAB (MathWorks, Natick, MA, USA). Agreement was assessed using the intraclass correlation coefficient (ICC), mean difference (MD), and mean absolute difference (MAD). Additionally, the proportion of cases with absolute score differences of 0, 1, 2, and ≥3 points was calculated.
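For readers who wish to reproduce the descriptive agreement metrics, a minimal Python sketch is given below. The original analysis was done in MATLAB; the function name `agreement_summary` and the output keys are assumptions, and the ICC is left out because the text does not state which ICC form was used.

```python
def agreement_summary(scores_a, scores_b):
    """Mean difference, mean absolute difference, and difference proportions.

    Differences are computed as a - b for paired ASPECTS scores.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    abs_diffs = [abs(d) for d in diffs]
    return {
        "MD": sum(diffs) / n,
        "MAD": sum(abs_diffs) / n,
        "delta_0": sum(d == 0 for d in abs_diffs) / n,
        "delta_1": sum(d == 1 for d in abs_diffs) / n,
        "delta_2": sum(d == 2 for d in abs_diffs) / n,
        "delta_ge_3": sum(d >= 3 for d in abs_diffs) / n,
    }
```

A negative MD from `agreement_summary(software, reader)` would indicate that the first set of scores is lower on average than the second.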
We also assessed the proportion of score comparisons in which the two scores fell on opposite sides of a clinically relevant ASPECTS threshold (≤5 vs. ≥6), referred to as threshold crossings.
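A threshold crossing can be counted as in the following sketch, which assumes paired integer ASPECTS scores; the function name `threshold_crossing_rate` is hypothetical.

```python
def threshold_crossing_rate(scores_a, scores_b, cutoff=5):
    """Fraction of paired scores on opposite sides of the cutoff.

    A pair crosses when exactly one of the two scores is <= cutoff,
    i.e. one score is <= 5 and the other is >= 6 for the default cutoff.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("inputs must be non-empty and of equal length")
    crossings = sum((a <= cutoff) != (b <= cutoff)
                    for a, b in zip(scores_a, scores_b))
    return crossings / len(scores_a)
```

Pairs such as (5, 6) or (7, 5) count as crossings, whereas (4, 3) and (6, 6) do not.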
Differences between unassisted and assisted ratings were tested using a Wilcoxon rank-sum test (Mann-Whitney U test). To account for cohort composition, all metrics were computed under two bootstrapped stratification schemes (1,000 resampling runs each): (I) uniform stratification, sampling approximately equal numbers of cases per ASPECTS score and including additional healthy controls with ASPECTS 10; and (II) clinical stratification, approximating a typical gamma-like ASPECTS distribution with most cases around 6–8 and fewer very low scores. For each run, metrics were calculated for all rater pairs and then averaged across runs and within reader groups (unassisted and assisted). ICC values were interpreted as follows: <0.50 poor, 0.50 to <0.75 moderate, 0.75–0.90 good, and >0.90 excellent agreement.
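The uniform stratification scheme can be approximated by the sketch below. It is an assumption-laden illustration rather than the study code: strata are defined by a single reference score per case, each stratum is resampled with replacement to the same size, and the handling of the additional healthy controls and of the clinical (gamma-like) scheme is not reproduced. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def uniform_stratified_bootstrap(case_scores, per_stratum, rng):
    """Draw per_stratum case IDs with replacement from each ASPECTS stratum.

    case_scores maps a case ID to the score used for stratification.
    """
    by_score = defaultdict(list)
    for case_id, score in case_scores.items():
        by_score[score].append(case_id)
    sample = []
    for score in sorted(by_score):
        sample.extend(rng.choices(by_score[score], k=per_stratum))
    return sample


def bootstrap_mean(case_scores, metric, per_stratum, runs=1000, seed=0):
    """Average a metric over repeated stratified resamples.

    metric is any function that takes a list of case IDs and returns a number.
    """
    rng = random.Random(seed)
    values = [metric(uniform_stratified_bootstrap(case_scores, per_stratum, rng))
              for _ in range(runs)]
    return sum(values) / runs
```

In practice, `metric` would look up the paired ratings for the sampled cases and return, for example, the MAD for one rater pair.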
Finally, as an exploratory post-hoc analysis based on the software’s regional outputs, we compared the frequency of cortical and subcortical ASPECTS region involvement between cases with larger versus smaller reader-software score discrepancies.
Results
The cohort comprised 224 cases with a median age of 78 years [interquartile range (IQR), 66–86 years]; 103 were male and 121 female. Occlusion locations (multiple occlusions possible) were: no occlusion 87 (39%), ICA 41 (18%), M1 83 (37%), M2 24 (11%), ACA 4 (2%), and posterior circulation (basilar artery and PCA P1) 5 (2%).
Correlation analyses between the automated scores, the unassisted reference reader group, and software-assisted ratings are summarized in Figure 1, and agreement metrics are reported in Table 1. Illustrative examples of agreement and typical discrepancy patterns are shown in Figure 2. Stratification distributions and the NWU-threshold sensitivity analysis are summarized in Figure 3.
Table 1
| Stratification | Comparison | ICC | MD | MAD | Δ=0 | Δ=1 | Δ=2 | Δ≥3 |
|---|---|---|---|---|---|---|---|---|
| Uniform | U vs. U | 0.96 | +0.03 | 0.65 | 52% | 32% | 16% | 1% |
| Uniform | A vs. A | 0.95 | +0.08 | 0.66 | 55% | 30% | 10% | 5% |
| Uniform | A vs. U | 0.96 | −0.02 | 0.60 | 58% | 29% | 10% | 3% |
| Uniform | VC vs. U | 0.89 | −0.22 | 1.01 | 42% | 33% | 14% | 11% |
| Uniform | VC vs. A | 0.89 | −0.20 | 0.97 | 43% | 34% | 15% | 9% |
| Clinical | U vs. U | 0.91 | +0.10 | 0.83 | 40% | 38% | 21% | 1% |
| Clinical | A vs. A | 0.88 | +0.09 | 0.89 | 40% | 39% | 14% | 7% |
| Clinical | A vs. U | 0.90 | −0.05 | 0.80 | 45% | 35% | 15% | 5% |
| Clinical | VC vs. U | 0.76 | −0.21 | 1.27 | 32% | 36% | 16% | 16% |
| Clinical | VC vs. A | 0.77 | −0.16 | 1.18 | 34% | 37% | 18% | 12% |
Δ, absolute ASPECTS score difference; A, assisted readers; ASPECTS, Alberta Stroke Program Early Computed Tomography Score; ICC, intraclass correlation coefficient; MAD, mean absolute difference; MD, mean difference; U, unassisted readers; VC, VEOcore.
Under the uniform stratification, inter-rater agreement within both reader groups was excellent (unassisted: ICC 0.96, MAD 0.65; assisted: ICC 0.95, MAD 0.66). Agreement between assisted and unassisted readings was similarly high (ICC 0.96, MAD 0.60; MD −0.02), and there was no significant difference between unassisted and assisted score distributions (Wilcoxon P=0.82). Under the clinically representative stratification, agreement decreased slightly but remained good to excellent (unassisted: ICC 0.91, MAD 0.83; assisted: ICC 0.88, MAD 0.89; assisted vs. unassisted: ICC 0.90, MAD 0.80), again without significant score differences (Wilcoxon P=0.61).
Agreement between the automated system and human ratings was good in the uniform stratification (automated vs. unassisted: ICC 0.89, MAD 1.01, MD −0.22; automated vs. assisted: ICC 0.89, MAD 0.97, MD −0.20). In the clinically representative stratification, automated–human agreement was lower but still in the good range (automated vs. unassisted: ICC 0.76, MAD 1.27, MD −0.21; automated vs. assisted: ICC 0.77, MAD 1.18, MD −0.16). Negative MD values indicate that automated scores were slightly lower than human ratings. Exact matches (Δ=0) occurred in 42–43% (uniform) and 32–34% (clinical) of comparisons, while larger discrepancies (Δ≥3) occurred in 9–11% (uniform) and 12–16% (clinical) of comparisons (Table 1).
In the threshold analysis (≤5 vs. ≥6), crossings occurred in 5%, 8%, and 7% of score comparisons (unassisted-unassisted, assisted-assisted, assisted-unassisted) under uniform stratification and 6%, 11%, and 9% under clinical stratification, indicating somewhat more frequent crossings among assisted readers near the decision boundary.
In a post-hoc exploratory analysis, larger discrepancies between unassisted readers and the software (ASPECTS difference >2) were more often associated with cortical MCA involvement (M1–6) (47.5% vs. 33.9%), whereas subcortical regions (C, L, IC, I) showed no enrichment (28.7% vs. 32.7%). As region-wise human annotations were not available, this analysis is exploratory and should be interpreted cautiously.
Sensitivity analysis (Figure 3C,3D) showed an optimal plateau around an NWU threshold of approximately 7%, based on a favorable trade-off between ICC and MAD. This supports the threshold used for defining affected ASPECTS regions in the automated scoring pipeline.
Discussion
In this study, we compared human ASPECTS ratings with those generated by an automated software tool. The agreement metrics between human raters and the software were found to be in line with previously published studies, supporting the general validity of automated approaches for ASPECTS scoring (2-6,11,12). Interestingly, the inter-rater agreement between the two neuroradiologists was slightly higher than the agreement between human raters and the automated system. These findings suggest that, despite its overall strong performance, the automated method is subject to limitations and may produce occasional discrepancies. Importantly, our study evaluates agreement/reliability rather than diagnostic accuracy, because no independent tissue-based reference standard such as DWI or follow-up infarct imaging was used. Accordingly, discordant cases between software and readers cannot be adjudicated as true or false within this dataset.
To further evaluate the role of automation in clinical interpretation, we compared two reader conditions: unassisted reads performed without software input and software-assisted reads performed with access to the automated ASPECTS output. This allowed us to assess whether software assistance was associated with a systematic shift in human scoring or whether it aided in identifying and correcting potential errors. Our findings suggest that software assistance was not associated with a systematic group-level shift in ASPECTS scoring and may serve as a useful adjunct in clinical practice when used appropriately. However, the absence of a significant Wilcoxon difference does not exclude case-level anchoring or more subtle cognitive effects.
One central feature of the evaluated software is its reliance on left-right (L/R) hemispheric comparison to detect regional hypodensity. This approach offers several advantages: it inherently adjusts for individual anatomical and radiodensity variability by using each patient’s contralateral hemisphere as an internal reference. As such, it does not require external templates or population-based, region-wise thresholds and can be more robust in detecting subtle asymmetries in acute ischemia. However, this intrinsic normalization strategy also comes with potential limitations. In cases of bilateral pathology, such as diffuse metabolic injury or toxic encephalopathy, both hemispheres may be globally abnormal, albeit not perfectly symmetrically. If one side is only marginally less impaired, the algorithm may incorrectly label it as “normal”, while the relatively worse side is marked as pathological, even though both are abnormal in absolute terms (13). This limitation becomes particularly relevant when no truly healthy reference region exists. Therefore, while L/R comparison is efficient and generally reliable, it may fail in settings characterized by global or bilateral changes (13). Nevertheless, recent multicenter and outcome-focused studies further support the clinical utility and increasing adoption of automated ASPECTS and AI-derived imaging metrics in acute stroke care (14,15).
Because absolute HU values can vary across scanners and acquisition/reconstruction settings, using the contralateral hemisphere as an internal reference provides an intra-patient normalization. The intended role of the tool is to support the reader with interpretable quantitative cues (overlay and HU-shift indicators) rather than to replace clinical judgment. Compared with end-to-end deep-learning approaches, this design provides a more transparent rationale for each regional decision, while remaining susceptible to artifacts and to bilateral/global hypodensity where no truly healthy reference exists.
Despite the overall satisfactory agreement, three illustrative cases from Figure 2 revealed critical discrepancies. In Figure 2D, a unilateral metallic artifact, caused by an implanted device, prompted a false-positive labeling of normal tissue, demonstrating the algorithm’s vulnerability to artifact-induced imaging distortions. In Figure 2E, diffuse bilateral hypodensity due to CO intoxication was misinterpreted as unilateral ischemia: the L/R comparison strategy chose the slightly less affected hemisphere as a reference, thus flagging only the contralateral side. In Figure 2F, human raters identified subtle cortical ischemia in M5/M6, whereas the software did not flag these regions because the interhemispheric HU difference remained just below the predefined threshold (ΔHU ≤2.25); note that M5/M6 lie above the displayed axial level. Together, these examples highlight the need for adaptive thresholding and robust artifact suppression to improve performance in complex or borderline scenarios.
Our study has several limitations. First, it used a single-center design: although the dataset reflects heterogeneous real-world clinical CT protocols, all scans originated from a single institution and a single vendor environment, which may limit generalizability to other settings. Second, the number of readers was limited (two unassisted and four assisted), and the assisted ratings may have been influenced by the software overlay and therefore do not represent fully independent reads; larger multi-reader studies would further improve robustness. Third, the study did not employ a cross-over (within-reader) design; therefore, within-reader effects of software assistance and potential case-level cognitive influences cannot be disentangled from inter-individual reader variability. Fourth, the number of cases was modest, although we attempted to mitigate this by applying a stratified sampling strategy to ensure balanced representation across the entire ASPECTS score spectrum. Fifth, scans were not excluded a priori based on image quality, and artifacts or foreign material may disproportionately affect L/R comparison-based algorithms; the presence and type of implanted/foreign material were not available as structured variables in this retrospective cohort, precluding an exact count. Finally, our analysis focused on the ASPECTS sum score rather than individual ASPECTS regions and did not assess region-wise agreement, which may provide additional insights into systematic patterns of disagreement.
Conclusions
In summary, the automated system demonstrated good agreement with manual ASPECTS ratings and was not associated with a systematic group-level shift in human ASPECTS scoring. These findings support the use of automated ASPECTS scoring as an adjunct to human interpretation, while emphasizing careful review in the presence of artifacts, bilateral/global hypodensity, and borderline hypoattenuation.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2336/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2336/dss
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2336/coif). A.R. reports support from the Berta-Ottenstein-Programme for Advanced Clinician Scientists, Faculty of Medicine, University of Freiburg, outside the submitted work. E.K. is a shareholder of and received payments from VEObrain GmbH. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of University of Freiburg (No. EK 20/1047). Owing to the retrospective design and use of de-identified data, the requirement for written informed consent was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
1. Barber PA, Demchuk AM, Zhang J, Buchan AM. Validity and reliability of a quantitative computed tomography score in predicting outcome of hyperacute stroke before thrombolytic therapy. ASPECTS Study Group. Alberta Stroke Programme Early CT Score. Lancet 2000;355:1670-4.
2. Lambert J, Demeestere J, Dewachter B, Cockmartin L, Wouters A, Symons R, Boomgaert L, Vandewalle L, Scheldeman L, Demaerel P, Lemmens R. Performance of Automated ASPECTS Software and Value as a Computer-Aided Detection Tool. AJNR Am J Neuroradiol 2023;44:894-900.
3. Chiang PL, Lin SY, Chen MH, Chen YS, Wang CK, Wu MC, Huang YT, Lee MY, Chen YS, Lin WC. Deep Learning-Based Automatic Detection of ASPECTS in Acute Ischemic Stroke: Improving Stroke Assessment on CT Scans. J Clin Med 2022;11:5159.
4. Mair G, White P, Bath PM, Muir KW, Al-Shahi Salman R, Martin C, Dye D, Chappell FM, Vacek A, von Kummer R, Macleod M, Sprigg N, Wardlaw JM. External Validation of e-ASPECTS Software for Interpreting Brain CT in Stroke. Ann Neurol 2022;92:943-57.
5. Adamou A, Beltsios ET, Bania A, Gkana A, Kastrup A, Chatziioannou A, Politi M, Papanagiotou P. Artificial intelligence-driven ASPECTS for the detection of early stroke changes in non-contrast CT: a systematic review and meta-analysis. J Neurointerv Surg 2023;15:e298-304.
6. Goebel J, Stenzel E, Guberina N, Wanke I, Koehrmann M, Kleinschnitz C, Umutlu L, Forsting M, Moenninghoff C, Radbruch A. Automated ASPECT rating: comparison between the Frontier ASPECT Score software and the Brainomix software. Neuroradiology 2018;60:1267-72.
7. Trofimov A, Agarkova D, Trofimova K, Lidji-Goryaev C, Atochin D, Bragin D. On Net Water Uptake in Posttraumatic Ischemia Foci. Adv Exp Med Biol 2023;1425:629-34.
8. Broocks G, Meyer L, Elsayed S, McDonough R, Bechstein M, Faizy TD, et al. Association Between Net Water Uptake and Functional Outcome in Patients With Low ASPECTS Brain Lesions: Results From the I-LAST Study. Neurology 2023;100:e954-63.
9. Broocks G, Flottmann F, Ernst M, Faizy TD, Minnerup J, Siemonsen S, Fiehler J, Kemmling A. Computed Tomography-Based Imaging of Voxel-Wise Lesion Water Uptake in Ischemic Brain: Relationship Between Density and Direct Volumetry. Invest Radiol 2018;53:207-13.
10. Fu B, Qi S, Tao L, Xu H, Kang Y, Yao Y, Yang B, Duan Y, Chen H. Image Patch-Based Net Water Uptake and Radiomics Models Predict Malignant Cerebral Edema After Ischemic Stroke. Front Neurol 2020;11:609747.
11. Austein F, Wodarg F, Jürgensen N, Huhndorf M, Meyne J, Lindner T, Jansen O, Larsen N, Riedel C. Automated versus manual imaging assessment of early ischemic changes in acute stroke: comparison of two software packages and expert consensus. Eur Radiol 2019;29:6285-92.
12. Neuhaus A, Seyedsaadat SM, Mihal D, Benson JC, Mark I, Kallmes DF, Brinjikji W. Region-specific agreement in ASPECTS estimation between neuroradiologists and e-ASPECTS software. J Neurointerv Surg 2020;12:720-3.
13. Rau A, Reisert M, Taschner CA, Demerath T, Elsheikh S, Frank B, Köhrmann M, Urbach H, Kellner E. Reducing False-Positives in CT Perfusion Infarct Core Segmentation Using Contralateral Local Normalization. AJNR Am J Neuroradiol 2024;45:277-83.
14. Wei J, Shang K, Wei X, Zhu Y, Yuan Y, Wang M, Ding C, Dai L, Sun Z, Mao X, Yu F, Hu C, Chen D, Lu J, Li Y. Deep learning-based automatic ASPECTS calculation can improve diagnosis efficiency in patients with acute ischemic stroke: a multicenter study. Eur Radiol 2025;35:627-39.
15. Nguyen QA, Vu DL, Nguyen TT, Le QT, Nguyen HA, Nguyen VH, Tran AT, Nguyen QV, Tran TH, Pierot L. Visual Alberta stroke program early computed tomography score versus RAPID-AI perfusion in predicting outcome after late-window thrombectomy. Interv Neuroradiol 2025; Epub ahead of print.

