Original Article

Evaluating convolutional neural network models for automated abdominal aortic calcification scoring in chronic kidney disease patients across multiple centers

Zhenhong Shao1,2#, Mingrui Song1#, Beibei He3, Hupo Yi4, Aijing Li5, Ge Chen1, Zhehao Zhang1, Yuning Pan1

1Department of Radiology, The First Affiliated Hospital of Ningbo University, Ningbo, China; 2Department of Radiology, Cixi People’s Hospital, Wenzhou Medical University, Cixi, China; 3Department of Radiology, Yinzhou District Second Hospital, Ningbo, China; 4Department of Radiology, Wenzhou People’s Hospital, Wenzhou, China; 5Department of Radiology, Ningbo No. 2 Hospital, Ningbo, China

Contributions: (I) Conception and design: Y Pan, Z Shao, Z Zhang, M Song; (II) Administrative support: Y Pan, Z Shao; (III) Provision of study materials or patients: Z Shao, M Song, B He, H Yi, A Li, G Chen; (IV) Collection and assembly of data: Z Shao, M Song, B He, H Yi, A Li, G Chen; (V) Data analysis and interpretation: Y Pan, Z Shao, Z Zhang, M Song; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Yuning Pan, MD; Zhehao Zhang, MS. Department of Radiology, The First Affiliated Hospital of Ningbo University, 59 Liuting Road, Haishu District, Ningbo 315010, China. Email: fyypanyuning@nbu.edu.cn; Zhangzh9605@163.com.

Background: Abdominal aortic calcification (AAC) is a condition particularly common in patients with chronic kidney disease (CKD). This study aims to assess the accuracy and generalizability of convolutional neural networks (CNNs) in automating AAC scoring in CKD patients.

Methods: Abdominal aortic X-ray images from five hospitals were employed to evaluate three CNN architectures (DenseNet, RegNet, and ResNet) using an internal hold-out set and five-fold cross-validation. Performance metrics included the intraclass correlation coefficient (ICC), R-squared value (R2), and root mean square error (RMSE), along with standard binary classification measures, across the independent training, validation, internal test, and external test sets.

Results: The present study analyzed 2,853 X-ray images. DenseNet outperformed ResNet and RegNet on the internal test set, with an RMSE of 2.141, a Pearson correlation of 0.926, a Spearman’s rank correlation of 0.908, and an ICC of 0.919. On the external test sets, the performance of DenseNet was slightly better than that of ResNet. For AAC severity classification, all models performed comparably in the “none or mild” and “moderate” categories, with accuracies above 80% and 70%, respectively, and sensitivities above 60% and 65%, respectively. In the “severe” category, the sensitivity of DenseNet and ResNet was higher (77.30% and 75.50%, respectively), while that of RegNet lagged at 67.30%, although the other metrics exceeded 75% for all models.

Conclusions: The DenseNet and ResNet models exhibit strong performance and reliability in AAC scoring for CKD patients, suggesting their viability for clinical use. Their effectiveness in differentiating AAC severity highlights the potential for enhancing diagnostic accuracy in medical settings.

Keywords: Deep learning; abdominal aortic calcification (AAC); automatic scoring; convolutional neural networks (CNNs)


Submitted Dec 05, 2025. Accepted for publication Mar 05, 2026. Published online Apr 08, 2026.

doi: 10.21037/qims-2025-1-2612


Introduction

Abdominal aortic calcification (AAC) is characterized by the deposition of calcium and phosphorus on the arterial walls, and is particularly common in patients with chronic kidney disease (CKD) (1,2). Studies have consistently revealed a strong positive relationship between AAC severity in CKD patients and an increased risk of major adverse cardiovascular events (MACEs) and overall mortality (3-6). Following the guidelines of Kidney Disease: Improving Global Outcomes (KDIGO), regular assessment of AAC through scoring is recommended to monitor cardiovascular risk in CKD patients, helping clinicians effectively adjust medical approaches (7). Similar findings have been observed in other chronic conditions, such as type 2 diabetes (8,9), and in broader population studies (10-12), in which higher AAC scores correlated with a greater risk of MACEs and mortality when compared to lower scores.

The Kauppila method, which is a widely utilized X-ray-based AAC scoring technique (13), involves semi-quantitative assessments on plain X-ray films. Despite its prevalence, the method’s manual execution is marred by inefficiencies, poor reproducibility, and issues with accuracy, which are often influenced by the scorer’s expertise and variability in equipment (14,15). The advent of artificial intelligence (AI), particularly deep learning, has sparked innovations in imaging diagnostics, enhancing feature extraction and image analysis capabilities (16-18). Recent initiatives have aimed to adapt the Kauppila scoring into automated systems using machine learning or deep learning (14,15,19,20). However, the clinical utility and generalizability of these automated models require further validation, as existing studies are often limited by single-center data (15), or reliance on a single model architecture (14,15,19).

To address these gaps, we collated data from various medical centers, and leveraged multiple convolutional neural network (CNN) architectures to develop and test models that automate AAC scoring. These models were benchmarked against the evaluations of experienced radiologists to assess their accuracy and generalizability. The present study aimed to advance the application of CNNs in automating AAC scoring, in order to establish a foundation for developing robust deep learning solutions in this domain. We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2612/rc).


Methods

Data collection and image processing

Abdominal aorta X-ray images were collected from five medical centers across Zhejiang Province (Table 1). The inclusion criteria were as follows: (I) images taken from the abdominal lateral or lumbar lateral view; (II) images that encompass the 12th thoracic vertebra down to the 1st sacral vertebra, with the anterior soft-tissue thickness of the lumbar region exceeding the anteroposterior diameter of the vertebral bodies. The exclusion criteria were as follows: (I) incomplete or significantly damaged lumbar vertebral structures, with an anterior height of any L1–L4 vertebra of <2 cm; (II) lumbar vertebral structures that could not be distinctly identified; (III) prominent high-density overlapping shadows in the lumbar spine or anterior area of the lumbar soft tissues.

Table 1

Baseline patient characteristics and X-ray image acquisition conditions

Hospitals A and B constitute the training, validation, and internal test sets; hospitals C, D, and E constitute the external test sets.

| Characteristics | Hospital A | Hospital B | Hospital C | Hospital D | Hospital E | Total | P value | f |
|---|---|---|---|---|---|---|---|---|
| Gender |  |  |  |  |  |  | <0.001 | 13.36 |
|    Female | 700 (49.90) | 409 (56.60) | 68 (40.50) | 47 (29.20) | 187 (46.60) | 1,411 (49.40) |  |  |
|    Male | 701 (50.10) | 313 (43.40) | 100 (59.50) | 114 (70.80) | 214 (53.40) | 1,442 (50.60) |  |  |
| Age (years) | 70.30±10.90 [23–94] | 68.50±11.00 [21–94] | 61.90±13.20 [21–94] | 65.50±11.10 [37–89] | 62.10±11.20 [32–89] | 68.00±11.60 [21–94] | <0.001 | 56.03 |
| CKD patients | 289 | 161 | 168 | 161 | 401 | 1,180 |  |  |
| Lumbar lateral | 1,112 (39.00) | 561 (19.70) |  |  |  | 1,673 (58.60) |  |  |
|    Company | Philips | Shimadzu |  |  |  |  |  |  |
|    Tube current (mA) | 400 [400–517] | 400 [400–630] |  |  |  |  |  |  |
|    Tube voltage (kV) | 85 [85–95] | 95 [80–95] |  |  |  |  |  |  |
| Abdominal lateral | 289 (10.10) | 161 (5.70) | 168 (5.90) | 161 (5.70) | 401 (14.00) | 1,180 (41.40) |  |  |
|    Company | Philips | Shimadzu | Philips | Philips | United film |  |  |  |
|    Tube current (mA) | 320 [320–320] | 250 [250–400] | 320 [320–320] | 320 [320–320] | 400 [400–508] |  |  |  |
|    Tube voltage (kV) | 85 [85–85] | 80 [80–100] | 85 [85–85] | 85 [85–85] | 85 [85–85] |  |  |  |
| AAC score |  |  |  |  |  |  |  |  |
|    No/low | 3 [0, 4] | 2 [0, 4] | 1 [0, 4] | 2 [0, 4] | 2 [0, 4] | 2 [0, 4] | <0.001 | 10.83 |
|    Moderate | 9 [5, 15] | 8 [5, 15] | 9 [5, 15] | 7 [5, 15] | 10 [5, 15] | 9 [5, 15] | <0.001 | 6.63 |
|    High | 18 [16, 24] | 18 [16, 24] | 16 [16, 19] | 17 [16, 23] | 19 [16, 24] | 18 [16, 24] | 0.026 | 2.81 |
| AAC severity |  |  |  |  |  |  |  |  |
|    No/low | 234 | 243 | 107 | 55 | 88 | 727 |  |  |
|    Moderate | 992 | 408 | 51 | 86 | 221 | 1,758 |  |  |
|    High | 175 | 71 | 10 | 20 | 92 | 368 |  |  |

Data are presented as mean ± standard deviation, median [range], or n (%), as appropriate. A, The First Affiliated Hospital of Ningbo University; B, Cixi City People’s Hospital; C, Ningbo No. 2 Hospital; D, Wenzhou People’s Hospital; E, Yinzhou District Second Hospital. AAC, abdominal aortic calcification; CKD, chronic kidney disease.

The X-ray images were retrieved from the PACS system of the respective hospitals, and converted into DICOM format. All personal identifiers were meticulously removed to uphold patient privacy, but essential clinical data were preserved for the subsequent analysis. In order to ensure compliance and accountability, a comprehensive documentation system was established. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of The First Affiliated Hospital of Ningbo University (Ethical Review Research 2025 No. 040A). The requirement for informed consent was waived by the approving ethics committee due to the retrospective nature of the study and the use of anonymized data. All participating hospitals were informed and agreed to the study.

AAC scoring and consensus

The present study used the Kauppila scoring system (13) to evaluate AAC by examining calcifications on both the anterior and posterior walls of the abdominal aorta across four segments aligned with the L1–L4 vertebral regions. Each segment boundary was determined at the midpoint of the superior and inferior intervertebral spaces. The AAC score for each segment ranged from 0 to 3, based on the extent of calcification: a score of 0 indicated no observable AAC; a score of 1 indicated that the calcification covered less than one-third of the longitudinal aortic wall; a score of 2 indicated that the calcification extended over one-third, but less than two-thirds, of the wall; a score of 3 indicated that the calcification encompassed more than two-thirds of the aortic wall.

The composite AAC scores ranged from 0 to 24, allowing the AAC to be classified into three levels, based on the standard in previous studies (15,21): no or mild AAC (total score: 0–4), moderate AAC (total score: 5–15), and severe AAC (total score: 16–24).
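To make the scoring arithmetic above concrete, the following minimal Python sketch (with hypothetical helper names; this is not the study’s code) sums the per-segment anterior and posterior wall scores into the 0–24 total and maps that total onto the three severity categories.

```python
from typing import Dict

SEGMENTS = ("L1", "L2", "L3", "L4")  # aortic segments aligned with the L1-L4 vertebrae

def total_aac_score(anterior: Dict[str, int], posterior: Dict[str, int]) -> int:
    """Kauppila AAC-24: sum of the 0-3 wall scores over four segments and two walls."""
    for wall in (anterior, posterior):
        for seg in SEGMENTS:
            if not 0 <= wall[seg] <= 3:
                raise ValueError(f"score for {seg} must be between 0 and 3")
    return sum(anterior[s] + posterior[s] for s in SEGMENTS)

def severity(total: int) -> str:
    """Map the total score onto the categories used in this study."""
    if total <= 4:
        return "none or mild"   # 0-4
    if total <= 15:
        return "moderate"       # 5-15
    return "severe"             # 16-24

# Example: calcification confined to the L3-L4 segments on both walls
anterior = {"L1": 0, "L2": 0, "L3": 2, "L4": 3}
posterior = {"L1": 0, "L2": 0, "L3": 3, "L4": 3}
total = total_aac_score(anterior, posterior)
print(total, severity(total))   # 11, "moderate"
```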

To ensure scoring accuracy, a double-blind review process was used: two radiologists with 3 and 4 years of experience, respectively, independently evaluated all radiographs. Prior to the formal assessment, both readers underwent rigorous calibration in the Kauppila scoring system, reviewing 451 annotated training cases to standardize interpretation. Inter-observer agreement between these two initial readers was assessed using the intraclass correlation coefficient (ICC; two-way random-effects model for absolute agreement) for the total AAC-24 score. Discrepancies in the initial assessments led to a secondary review by a senior radiologist with 10 years of experience, who determined the final consensus score. This approach ensured rigor and reduced bias in the scoring process.

Model design and training process

In the present study, the model design and training process were structured to enhance the accuracy and consistency of AAC analysis. As a preliminary step, image standardization involved adjustments to key visual parameters, such as contrast, brightness, and resolution, to prepare the images for effective model training. Three CNNs were employed for AAC scoring: DenseNet201, RegNetX_16GF, and ResNet152, each of which is known for robust performance in image recognition tasks (22-24).
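The paper does not provide implementation code; the sketch below shows one plausible way, assuming torchvision is used, to instantiate the three backbones and replace each classification head with a single linear output so that the networks regress the continuous AAC-24 score.

```python
import torch.nn as nn
from torchvision import models

def build_backbone(name: str) -> nn.Module:
    """Return a backbone whose final layer outputs one value (the predicted AAC-24 score)."""
    if name == "densenet201":
        m = models.densenet201(weights=None)
        m.classifier = nn.Linear(m.classifier.in_features, 1)
    elif name == "regnet_x_16gf":
        m = models.regnet_x_16gf(weights=None)
        m.fc = nn.Linear(m.fc.in_features, 1)
    elif name == "resnet152":
        m = models.resnet152(weights=None)
        m.fc = nn.Linear(m.fc.in_features, 1)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return m

nets = {n: build_backbone(n) for n in ("densenet201", "regnet_x_16gf", "resnet152")}
```

Whether the networks were initialized from ImageNet-pretrained weights is not stated in the paper, so the sketch builds them from random initialization.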

A hybrid validation approach combining an internal hold-out set with five-fold cross-validation was adopted to optimize the training and testing of the present models (25,26). First, for model development, data from the two primary centers (The First Affiliated Hospital of Ningbo University and Cixi People’s Hospital) were pooled and then randomly split into a training-validation set (90%) and an internal test set (10%). This split was stratified by AAC severity level (no/mild, moderate, severe) to maintain class distribution. Subsequently, for external validation, data from the remaining three hospitals (Ningbo No. 2 Hospital, Wenzhou People’s Hospital, and Yinzhou District Second Hospital) were kept entirely separate as independent external test sets. Imaging protocols across the five facilities varied in scanner vendor, tube current, and tube voltage (Table 1). This design simulates real-world deployment on unseen data from different institutions. The model structure is illustrated in Figure 1.

Figure 1 The three model structures (DenseNet201, RegNetX_16GF, and ResNet152) and mapping relations. AAC, abdominal aortic calcification; ICC, intraclass correlation coefficient; R, Pearson correlation coefficient; R2, coefficient of determination; RMSE, root mean square error; ρ, Spearman’s rank.

The training process was as follows: (I) five-fold cross-validation: the training-validation dataset was randomly segmented into five equal parts; (II) training cycle: one segment was selected as the validation set, and the other four segments were used for training; (III) after training, the model was evaluated on the internal test set, the performance on the three external test sets was then independently assessed, and all test results were recorded; (IV) iteration: steps II and III were repeated four more times, rotating the validation set each time, so that each data segment served as the validation set once.
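The splitting scheme can be expressed compactly; the sketch below assumes a scikit-learn implementation with placeholder counts and seeds (none of which are reported in the paper).

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Placeholder stand-ins for the pooled hospital A + B images:
# severity labels 0 = none/mild, 1 = moderate, 2 = severe.
rng = np.random.default_rng(0)
indices = np.arange(2000)                          # illustrative count only
labels = rng.integers(0, 3, size=indices.size)

# 90/10 hold-out split, stratified by severity (the 10% internal test set is kept aside).
trainval_idx, internal_test_idx = train_test_split(
    indices, test_size=0.10, stratify=labels, random_state=42)

# Five-fold cross-validation on the remaining 90%.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(kf.split(trainval_idx), start=1):
    train_ids, val_ids = trainval_idx[tr], trainval_idx[va]
    # Train on train_ids, tune on val_ids, then evaluate on internal_test_idx and
    # on the three external hospital sets, which never enter this loop.
    print(f"fold {fold}: {len(train_ids)} training / {len(val_ids)} validation images")
```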

The models were developed and trained using PyTorch on an NVIDIA RTX 4090 GPU, with each model undergoing 90 training epochs. The initial learning rate was set at 0.001 and decreased by a factor of 10 every 30 epochs, based on the performance metrics. Image augmentation techniques, such as flipping and brightness/contrast adjustments, were employed alongside normalization procedures to ensure consistency across images obtained from different sources. These steps were critical in managing the variations inherent in X-ray imaging across multiple equipment settings.
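A condensed sketch of this training configuration is shown below; the optimizer, loss function, augmentation parameters, normalization constants, and dummy data are assumptions, while the 90 epochs, the 0.001 initial learning rate, and the tenfold decay every 30 epochs follow the text.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models, transforms

# Augmentations of the kind described (flips, brightness/contrast jitter); exact values are assumed.
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.densenet201(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 1)   # single regression output
model = model.to(device)

criterion = nn.MSELoss()                                         # assumed regression loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)              # initial learning rate 0.001
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # /10 every 30 epochs

# Random tensors stand in for the real X-ray dataset so the loop runs as written;
# in the real pipeline, train_tf would be applied inside the Dataset.
dummy = TensorDataset(torch.randn(8, 3, 224, 224), torch.rand(8) * 24)
train_loader = DataLoader(dummy, batch_size=4, shuffle=True)

for epoch in range(90):
    model.train()
    for images, scores in train_loader:
        images, scores = images.to(device), scores.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), scores)
        loss.backward()
        optimizer.step()
    scheduler.step()
```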

Data analysis and statistical evaluation

All statistical evaluations were carried out using SPSS (version 27.0). Continuous variables that followed a normal distribution were presented as mean ± standard deviation (SD); variables that did not follow a normal distribution were presented as median and range. Categorical variables were expressed as counts and percentages.

For the analytical process, the total AAC score was treated as a continuous variable. Pearson and Spearman’s rank correlation coefficients were applied to assess linear and monotonic correlations, respectively. The accuracy of the predictive models for continuous data was gauged using the root mean square error (RMSE) and its 95% confidence interval. In addition, the agreement between structured clinical reports and model predictions was evaluated using the ICC and Bland-Altman plots, and the threshold for statistical significance was set at P<0.05.
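The study computed these statistics in SPSS; for illustration, the sketch below re-implements the same quantities in Python (an assumed re-implementation), including an ICC(2,1) for absolute agreement and Bland-Altman limits of agreement.

```python
import numpy as np
from scipy import stats

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x has shape (n_subjects, k_raters)."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows, ms_cols = ss_rows / (n - 1), ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Toy example: consensus AAC-24 scores vs. model predictions.
y_true = np.array([0, 3, 8, 12, 17, 21, 5, 9], dtype=float)
y_pred = np.array([1, 2, 7, 13, 16, 20, 6, 10], dtype=float)

rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
r, _ = stats.pearsonr(y_true, y_pred)       # linear correlation
rho, _ = stats.spearmanr(y_true, y_pred)    # monotonic (rank) correlation
icc = icc_2_1(np.column_stack([y_true, y_pred]))

# Bland-Altman: mean difference and 95% limits of agreement.
diff = y_pred - y_true
loa = (diff.mean() - 1.96 * diff.std(ddof=1), diff.mean() + 1.96 * diff.std(ddof=1))
print(f"RMSE={rmse:.3f} R={r:.3f} rho={rho:.3f} ICC={icc:.3f} LoA=({loa[0]:.2f}, {loa[1]:.2f})")
```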

In adherence to the five-fold cross-validation principle, the average performance metrics across the five instances of each network model type were computed. To formally compare predictive accuracy among the three CNN architectures, paired-samples t-tests were conducted on their RMSE values obtained from all external test iterations.
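This paired comparison reduces to a paired-samples t-test over matched RMSE values; a minimal sketch assuming SciPy is shown below, with placeholder numbers rather than the study’s results.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder RMSEs: 5 folds x 3 external sets = 15 paired evaluations per model.
rmse_densenet = np.array([2.66, 2.42, 2.46, 2.61, 2.40, 2.52, 2.70, 2.38, 2.49, 2.55,
                          2.63, 2.44, 2.47, 2.58, 2.51])
rmse_regnet   = np.array([2.67, 2.91, 2.69, 2.75, 2.88, 2.72, 2.80, 2.95, 2.70, 2.78,
                          2.84, 2.90, 2.73, 2.81, 2.76])
t_stat, p_value = stats.ttest_rel(rmse_densenet, rmse_regnet)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```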

For the subclassification of AAC severity, five-fold cross-validation was employed to train five models, and the model with the lowest RMSE was selected as the best performer for validation on the external datasets. The AAC scores were segmented into three severity categories (21): none or mild AAC (0≤ AAC score ≤4), moderate AAC (4< AAC score ≤15), and severe AAC (15< AAC score ≤24). Five key metrics were used for the evaluation: accuracy, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV).
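The sketch below (an assumed implementation with toy labels) maps predicted continuous scores onto the three categories and derives the five one-vs-rest metrics for each category from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def to_category(score: float) -> int:
    """0 = none or mild (<=4), 1 = moderate (5-15), 2 = severe (16-24)."""
    return 0 if score <= 4 else (1 if score <= 15 else 2)

# Toy reference categories and predicted continuous scores.
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 1, 2, 0])
pred_scores = np.array([2.8, 7.1, 18.5, 13.0, 5.2, 14.9, 6.3, 9.8, 21.0, 1.1])
y_pred = np.array([to_category(s) for s in pred_scores])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for c, name in enumerate(("none or mild", "moderate", "severe")):
    tp = cm[c, c]
    fn = cm[c].sum() - tp
    fp = cm[:, c].sum() - tp
    tn = cm.sum() - tp - fn - fp
    acc = (tp + tn) / cm.sum()
    se, sp = tp / (tp + fn), tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    print(f"{name}: ACC={acc:.2f} SE={se:.2f} SP={sp:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```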

In order to enhance the transparency and understanding of the model’s decision-making, gradient-weighted class activation mapping (Grad-CAM) (27) was employed to visualize the activation maps within the model’s final convolutional layer, providing a graphical representation of which areas most influenced the outcome predictions.
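Grad-CAM can be reproduced with forward and backward hooks on the last convolutional block; the sketch below is an assumed PyTorch implementation on a plain ResNet backbone (the specific Grad-CAM tooling used in the study is not stated). For the study’s regression networks, the single predicted score, rather than a class logit, would be back-propagated.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet152(weights=None).eval()
target_layer = model.layer4[-1]                      # last residual block

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(value=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(value=go[0].detach()))

x = torch.randn(1, 3, 224, 224)                      # placeholder image tensor
score = model(x)[0].max()                            # demo target; a regression head outputs one score
model.zero_grad()
score.backward()

# Channel weights = global-average-pooled gradients; weighted sum -> ReLU -> upsample to image size.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map in [0, 1]
```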


Results

Standardized manual scoring outcomes

In the present study, 2,967 abdominal aortic X-ray images were gathered from five different hospitals. Following evaluation of image quality, images that did not conform to the study standards were removed, leaving 2,853 images eligible for inclusion in the present analysis. Table 1 presents the fundamental clinical data, imaging parameters, and manual scoring distribution across these hospitals.

Table 2 summarizes the data partitioning scheme used for the five-fold cross-validation, including the distribution of AAC severity levels across each fold. Two experienced radiologists independently evaluated the images. The inter-observer agreement between the two initial radiologists was excellent, with an ICC of 0.981 [95% confidence interval (CI): 0.978 to 0.983] for the total AAC-24 score. Based on the established scoring criteria, the images were categorized into three groups: the no or mild group (0–4 points) comprised 727 cases, the moderate group (5–15 points) comprised 1,758 cases, and the severe group (16–24 points) comprised 368 cases. The detailed workflow of the experimental design is presented in Figure 2.

Table 2

Data splitting configuration of the five-fold cross-validation

| Variables | Fold-1 | Fold-2 | Fold-3 | Fold-4 | Fold-5 | Total | P value | f |
|---|---|---|---|---|---|---|---|---|
| Number of patients | 383 | 382 | 382 | 382 | 382 | 1,911 |  |  |
| Gender (F/M) | 193/190 | 193/189 | 203/179 | 191/191 | 198/184 | 978/933 | 0.95 | 0.18 |
| Age (years) | 69.18±10.62 [21–94] | 70.84±10.53 [39–94] | 69.20±11.01 [23–97] | 69.32±11.16 [22–94] | 69.67±11.26 [28–94] | 69.64±10.92 [21–97] | 0.17 | 1.61 |
| AAC score |  |  |  |  |  |  |  |  |
|    No/low | 3 [0, 4] | 3 [0, 4] | 2 [0, 4] | 3 [0, 4] | 3 [0, 4] | 3 [0, 4] | 0.45 | 0.93 |
|    Moderate | 9 [5, 15] | 8 [5, 15] | 8 [5, 15] | 9 [5, 15] | 10 [5, 15] | 9 [5, 15] | 0.11 | 1.89 |
|    High | 17 [16, 24] | 18 [16, 24] | 17 [16, 23] | 18 [16, 24] | 18 [16, 23] | 18 [16, 24] | 0.23 | 1.40 |
|    Total | 8 [0, 24] | 8 [0, 24] | 8 [0, 23] | 8 [0, 24] | 9 [0, 23] | 8 [0, 24] | 0.19 | 1.54 |

Data are presented as mean ± standard deviation, median [range], or n, as appropriate. F, female; M, male.

Figure 2 Schematic diagram of the experimental workflow design. AAC, abdominal aortic calcification.

Model consistency and generalization assessment

Three types of models were developed, with five models within each category, resulting in a total of 15 models. Throughout the evaluation phase, all models of the three types exhibited robust consistency and precision. Notably, the Pearson and Spearman’s rank correlation coefficients were consistently above 0.75, and the ICC consistently exceeded 0.80. The average performance metrics for each model type are presented in Table 3, and the full details are presented in Table S1. Among the internally tested models, the DenseNet model excelled, surpassing the others in terms of consistency and accuracy: the average RMSE was 2.141, with a 95% CI for the mean error that ranged from −3.642 to 4.985; the average Pearson correlation coefficient was 0.926, the Spearman’s rank correlation coefficient averaged 0.908, and the ICC was 0.919. To strengthen these conclusions, paired-samples t-tests were performed. The results indicated no statistically significant difference in RMSE between DenseNet and ResNet (P=0.135), while both models exhibited significantly lower RMSE than RegNet (P≤0.001).

Table 3

Experimental average results for the five-fold cross-validation

| Model | Test | RMSE | Error interval (95%) | R | ρ | R2 | ICC |
|---|---|---|---|---|---|---|---|
| DenseNet | Internal | 2.141 | −3.642, 4.958 | 0.926 | 0.908 | 0.856 | 0.919 |
|  | External 1 | 2.658 | −6.215, 4.920 | 0.866 | 0.760 | 0.749 | 0.838 |
|  | External 2 | 2.420 | −4.932, 3.735 | 0.908 | 0.894 | 0.823 | 0.889 |
|  | External 3 | 2.457 | −4.665, 4.747 | 0.915 | 0.921 | 0.837 | 0.907 |
| RegNet | Internal | 2.299 | −4.196, 5.262 | 0.913 | 0.898 | 0.834 | 0.905 |
|  | External 1 | 2.674 | −4.838, 5.973 | 0.870 | 0.772 | 0.757 | 0.821 |
|  | External 2 | 2.911 | −6.031, 3.819 | 0.868 | 0.875 | 0.752 | 0.841 |
|  | External 3 | 2.689 | −4.496, 6.014 | 0.896 | 0.900 | 0.803 | 0.890 |
| ResNet | Internal | 2.234 | −3.912, 5.072 | 0.919 | 0.906 | 0.843 | 0.911 |
|  | External 1 | 2.592 | −5.137, 5.696 | 0.871 | 0.774 | 0.757 | 0.842 |
|  | External 2 | 2.619 | −5.696, 3.507 | 0.897 | 0.897 | 0.802 | 0.871 |
|  | External 3 | 2.506 | −4.596, 5.404 | 0.910 | 0.915 | 0.828 | 0.904 |

ICC, intraclass correlation coefficient; R, Pearson correlation coefficient; R2, coefficient of determination; RMSE, root mean square error; ρ, Spearman’s rank.

Of the five models produced by the five-fold cross-validation training, the one with the lowest RMSE was selected as the representative for presentation. Figure 3 presents the correlation and consistency between the predictions of the selected model and the manual standard scoring. Complete documentation of all cross-validation models is provided in Figures S1,S2.

Figure 3 The correlation and consistency between the predictions of the model with the lowest RMSE, and the manual standard scoring. (A,B) The DenseNet model; (C,D) the RegNet model; (E,F) the ResNet model. RMSE, root mean square error.

These findings show that the evaluated models effectively discerned patterns from the data, demonstrating strong linear and monotonic relationships and high reliability across diverse testing environments.

Evaluation of subgroup classification performance

Figure 4 demonstrates the robust classification performance of the selected optimal model (lowest RMSE) across clinically defined subgroups. In the subgroup classification test, both the DenseNet and ResNet models performed marginally better than the RegNet model. The breakdown of the performance across the severity groups is as follows. In the “no or mild” group, the performance metrics for DenseNet, RegNet, and ResNet were close and all demonstrated strong outcomes; however, the sensitivity was notably lower: 61.60% for DenseNet, 64.00% for RegNet, and 62.80% for ResNet. In the “moderate” group, the performance was similarly consistent across the models, although the specificity was slightly lower: 66.40% for DenseNet, 65.00% for RegNet, and 66.70% for ResNet. In the “severe” group, the models generally maintained high performance across most indicators; DenseNet and ResNet exhibited superior sensitivity (77.30% and 75.50%, respectively) compared with RegNet (67.30%). However, the PPV and sensitivity still had room for improvement: DenseNet had a PPV of 79.40% with a sensitivity of 77.30%, RegNet had a PPV of 76.30% with a sensitivity of 67.30%, and ResNet had a PPV of 76.90% with a sensitivity of 75.50%.

Figure 4 The classification performance of the selected optimal model (lowest RMSE). (A,D) The DenseNet model; (B,E) the RegNet model; (C,F) the ResNet model. ACC, accuracy; NPV, negative predictive value; PPV, positive predictive value; RMSE, root mean square error; SE, sensitivity; SP, specificity.

These results indicate the generally high level of accuracy and reliability across the models, with particular strengths and weaknesses noted in specific metrics and groups.

Utilization of Grad-CAM for activation mapping

In the present study, Grad-CAM was employed as a pivotal visualization technique to illustrate the activation within the CNN models. This method helped to visually assess how the models identified the regions of interest (ROIs) within the abdominal aorta, and detected the calcifications. Figure 5 presents the Grad-CAM visualization for the ResNet model. For instance, in Figure 5C, the total manual score for the depicted sections of the abdominal aorta at vertebrae L3 and L4 was 11 points (5 points for the L3 segment and 6 points for the L4 segment). Remarkably, the ResNet model accurately focused on these specific segments during the analysis, mirroring the manual assessment with a predictive score of 15 points. However, a small subset of cases exhibited scoring discrepancies greater than 2 points, which may have been influenced by confounding factors, such as bowel gas interference (Figure 5B), high-density intestinal content (Figure 5D), and the superimposition of aortic calcifications over the lumbar spine (Figure 5F).

Figure 5 Visualization of neural network activation maps in the ResNet model: (A) human-labelled AAC score: 0, ResNet-predicted score: 0.30; (B) human-labelled AAC score: 1, ResNet-predicted score: 3.50; (C) human-labelled AAC score: 15, ResNet-predicted score: 15.00; (D) human-labelled AAC score: 17, ResNet-predicted score: 13.10; (E) human-labelled AAC score: 19, ResNet-predicted score: 19.20; (F) human-labelled AAC score: 24, ResNet-predicted score: 19.70. AAC, abdominal aortic calcification.

Discussion

The Kauppila scoring system has established prognostic value for key clinical outcomes, including all-cause mortality and fracture risk (13,22,28). Recent advances in deep learning, particularly the application of CNNs, have enabled highly accurate and fully automated AAC-24 scoring of lateral spine images. Notably, the landmark multicenter study by Sharif et al. (28) provided robust external validation of the clinical utility of CNN-derived AAC-24 scores, reinforcing their relevance for cardiovascular risk stratification and establishing a benchmark for future comparative studies in this domain. The present study implemented three distinct CNN architectures and adopted both an internal hold-out set and five-fold cross-validation to develop a total of 15 models. This approach allowed a thorough evaluation of different neural networks for scoring AAC on X-rays, providing a comparative analysis that many previous studies, which often relied on a single model architecture, lack (14,15,19,20,28).

Throughout the data acquisition phase, the present study collected two types of X-ray images suitable for AAC diagnosis from five distinct centers: abdominal lateral and lumbar lateral films. The images obtained from each center exhibited unique characteristics, underscoring the benefits of multicenter research and contributing to sample diversity (Table 1). In contrast, similar studies in the field, such as that of Wang et al. (15), typically sourced data from only a single center, amassing 1,359 cases, with both internal and external validations relying on this singular dataset.

Conversely, the present study obtained data from five centers, comprising 2,853 cases. This not only ensured a robust dataset that approximated a normal distribution, but also adequately fulfilled the sample size requirements for effective model training. Critically, the present methodology extended beyond internal validation to include external tests across multiple centers. This expansive and diverse data collection strategy not only increased the volume of data, particularly for the moderate group, but also significantly bolstered the generalization capabilities of the models, demonstrating their efficacy across varied clinical environments.

In the accuracy and consistency assessments, the tested models effectively identified patterns within the data, excelling in both linear relationships and monotonicity, and remained stable across various experimental conditions. Notably, both the DenseNet and ResNet models exhibited outstanding and statistically comparable performance on the external test sets, as indicated by the paired t-test (P=0.135), suggesting that either architecture is a robust choice for this task. Both models achieved Pearson and Spearman’s correlation coefficients above 0.90 on the internal test set, highlighting their accuracy and reliability.

Compared to previously reported traditional machine learning models (14,19,20) and deep learning algorithms (15) for AAC scoring, the three CNN architectures introduced in the present study demonstrated markedly enhanced performance. For instance, the 2023 study by Wang et al. (15) presented an AAC scoring model based on the U-Net architecture, achieving an adjusted R2 of 0.82. In contrast, the DenseNet, RegNet, and ResNet models in the present study attained average adjusted R2 values of 0.856, 0.834, and 0.843, respectively, on the internal test set. These findings highlight the superior accuracy and effectiveness of CNNs in AAC scoring model development.

Furthermore, compared to previously proposed network architectures (14,15,19,20,28), the models in the present study offered stronger feature extraction capabilities, higher training efficiency, and better convergence. In addition, the flexible structure enabled the models to adapt more effectively to the complex recognition tasks required for plain X-ray images. Another key factor that contributed to this success was the distribution of training data. Unlike earlier studies, in which the “no or mild group (AAC score: 0–4)” images predominated, the present study included a more balanced representation of “moderate group (AAC score: 5–15)” images, potentially making the models better suited for comprehensive training, and improving their generalization.

In the AAC subgroup classification experiment, the three representative models exhibited different strengths and weaknesses across the subgroups. DenseNet and ResNet demonstrated comparable performance, with the overall effectiveness surpassing that of the RegNet model. In order to further enhance the classification accuracy across varying disease severity levels, future research should focus on optimizing the network structures of the DenseNet and ResNet models.

To contextualize the statistical results, we considered whether the model’s error could lead to clinical misclassification. The best model achieved an RMSE of 2.141 points and excellent agreement (ICC: 0.919) with the radiologists. This error is substantially smaller than the score gaps between clinical risk categories, suggesting a low likelihood of misclassification. We emphasize that this is a technical validation study; prospective evaluation is needed to confirm its clinical utility.

However, there are several limitations in this study that should be considered. First, as a foundational exploratory effort, the primary focus of the present study was to develop and validate the scoring algorithm, rather than to conduct clinical endpoint analyses. Nevertheless, the findings reported by Sharif et al. (28) provide a compelling rationale for future work, specifically the application of the present segment-specific AAC scores to predict CKD-related outcomes, such as dialysis-associated MACEs, using clinically validated protocols.

Second, although the present analysis concentrated on the total AAC-24 score using regression models, the differential contribution of the individual abdominal aortic segments, or of anterior vs. posterior wall involvement, was not investigated. With the integration of the newly developed segmentation module in the present study, future studies would be positioned to conduct more granular, segment-specific analyses, potentially offering deeper insights into the spatial distribution and progression of vascular calcification.

In addition, although the incorporation of two types of data increased dataset diversity, variations in imaging conditions resulted in inconsistent model performance across different image types. In order to mitigate this issue, future research should prioritize enhancing the model’s generalization and robustness, ensuring better adaptability to varying imaging conditions.

Finally, during the visualization of activation maps, it was observed that the key recognition areas in some images were imprecise or overly broad. In order to further refine the model performance, future work should focus on developing a dedicated abdominal aorta segmentation module to more precisely constrain the ROIs, and minimize off-target influences. In addition, the implementation of tiered safety thresholds near clinically actionable cut-offs should be explored to enhance the reliability in real-world deployment. Together, this dual-strategy approach would proactively mitigate risks for critical decision thresholds, thereby improving both the precision and clinical utility of automated AAC scoring.


Conclusions

In summary, the AAC scoring models based on the DenseNet and ResNet deep learning frameworks exhibited outstanding technical performance, characterized by high precision and strong generalization capabilities. These two architectures hold significant promise for developing robust deep learning models for AAC quantification. Moving forward, future research should focus on further optimizing model structures, in order to enhance scoring efficiency and accuracy, and ensure greater applicability to real-world clinical and diagnostic needs.


Acknowledgments

We would like to thank Medjaden Inc. for scientific editing of this manuscript.


Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2612/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2612/dss

Funding: This work was supported by the Key Research and Development Project of Ningbo (Nos. 2024Z220 and 2022S133), the Natural Science Joint Foundation of Zhejiang Province (No. LKLY25H200010), the Ningbo Clinical Research Center for Medical Imaging (No. 2022LY-KEYB03), the HwaMei Research Foundation of Ningbo No. 2 Hospital (No. 2023HMKY50), and the Public Welfare Science and Technology Program Project of Cixi (No. CN2024007).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2612/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of The First Affiliated Hospital of Ningbo University (Ethical Review Research 2025 No. 040A). The requirement for informed consent was waived by the approving ethics committee due to the retrospective nature of the study and the use of anonymized data. All participating hospitals were informed and agreed to the study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Lee SJ, Lee IK, Jeon JH. Vascular Calcification-New Insights Into Its Mechanism. Int J Mol Sci 2020;21:2685. [Crossref] [PubMed]
  2. Johnson RC, Leopold JA, Loscalzo J. Vascular calcification: pathobiological mechanisms and clinical implications. Circ Res 2006;99:1044-59. [Crossref] [PubMed]
  3. Zhang H, Li G, Yu X, Yang J, Jiang A, Cheng H, et al. Progression of Vascular Calcification and Clinical Outcomes in Patients Receiving Maintenance Dialysis. JAMA Netw Open 2023;6:e2310909. [Crossref] [PubMed]
  4. Jensky NE, Criqui MH, Wright MC, Wassel CL, Brody SA, Allison MA. Blood pressure and vascular calcification. Hypertension 2010;55:990-7. [Crossref] [PubMed]
  5. Parkkila K, Kiviniemi A, Tulppo M, Perkiömäki J, Kesäniemi YA, Ukkola O. Abdominal aorta plaques are better in predicting future cardiovascular events compared to carotid intima-media thickness: A 20-year prospective study. Atherosclerosis 2021;330:36-42. [Crossref] [PubMed]
  6. Rantasalo V, Gunn J, Kiviniemi T, Hirvonen J, Saarenpää I, Kivelev J, Rahi M, Lassila E, Rinne J, Laukka D. Intracranial aneurysm is predicted by abdominal aortic calcification index: A retrospective case-control study. Atherosclerosis 2021;334:30-8. [Crossref] [PubMed]
  7. Global, regional, and national burden of chronic kidney disease, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2020;395:709-33.
  8. Cheng S, Cohen KS, Shaw SY, Larson MG, Hwang SJ, McCabe EL, Martin RP, Klein RJ, Hashmi B, Hoffmann U, Fox CS, Vasan RS, O'Donnell CJ, Wang TJ. Association of colony-forming units with coronary artery and abdominal aortic calcification. Circulation 2010;122:1176-82. [Crossref] [PubMed]
  9. Reaven PD, Sacks J. Investigators for the Veterans Affairs Cooperative Study of Glycemic Control and Complications in Diabetes Mellitus Type 2. Reduced coronary artery and abdominal aortic calcification in Hispanics with type 2 diabetes. Diabetes Care 2004;27:1115-20.
  10. Estublier C, Chapurlat R, Szulc P. Association of severe disc degeneration with all-cause mortality and abdominal aortic calcification assessed prospectively in older men: findings of a single-center prospective study of osteoporosis in men. Arthritis Rheumatol 2015;67:1295-304. [Crossref] [PubMed]
  11. Bastos Gonçalves F, Voûte MT, Hoeks SE, Chonchol MB, Boersma EE, Stolker RJ, Verhagen HJ. Calcification of the abdominal aorta as an independent predictor of cardiovascular events: a meta-analysis. Heart 2012;98:988-94. [Crossref] [PubMed]
  12. Sethi A, Taylor DL, Ruby JG, Venkataraman J, Sorokin E, Cule M, Melamud E. Calcification of the abdominal aorta is an under-appreciated cardiovascular disease risk factor in the general population. Front Cardiovasc Med 2022;9:1003246. [Crossref] [PubMed]
  13. Kauppila LI, Polak JF, Cupples LA, Hannan MT, Kiel DP, Wilson PW. New indices to classify location, severity and progression of calcific lesions in the abdominal aorta: a 25-year follow-up study. Atherosclerosis 1997;132:245-50. [Crossref] [PubMed]
  14. Chaplin L, Cootes TF. Automated scoring of aortic calcification in vertebral fracture assessment images. Medical Imaging 2019 Computer-Aided Diagnosis SPIE. 2019;10950:811-9.
  15. Wang K, Wang X, Xi Z, Li J, Zhang X, Wang R. Automatic Segmentation and Quantification of Abdominal Aortic Calcification in Lateral Lumbar Radiographs Based on Deep-Learning-Based Algorithms. Bioengineering (Basel) 2023;10:1164. [Crossref] [PubMed]
  16. Yasaka K, Abe O. Deep learning and artificial intelligence in radiology: Current applications and future directions. PLoS Med 2018;15:e1002707. [Crossref] [PubMed]
  17. Shaffer K. Deep Learning and Lung Cancer: AI to Extract Information Hidden in Routine CT Scans. Radiology 2020;296:225-6. [Crossref] [PubMed]
  18. Häggström I, Leithner D, Alvén J, Campanella G, Abusamra M, Zhang H, Chhabra S, Beer L, Haug A, Salles G, Raderer M, Staber PB, Becker A, Hricak H, Fuchs TJ, Schöder H, Mayerhoefer ME. Deep learning for [(18)F]fluorodeoxyglucose-PET-CT classification in patients with lymphoma: a dual-centre retrospective analysis. Lancet Digit Health 2024;6:e114-e125.
  19. Reid S, Schousboe JT, Kimelman D, Monchka BA, Jafari Jozani M, Leslie WD. Machine learning for automated abdominal aortic calcification scoring of DXA vertebral fracture assessment images: A pilot study. Bone 2021;148:115943. [Crossref] [PubMed]
  20. Elmasri KM, Hicks Y, Yang X, Sun X, Pettit R, Evans W. Automatic detection and quantification of abdominal aortic calcification in dual energy x-ray absorptiometry. Procedia Comput Sci 2016;96:1011-21.
  21. Verbeke F, Van Biesen W, Honkanen E, Wikström B, Jensen PB, Krzesinski JM, Rasmussen M, Vanholder R, Rensma PLCORD Study Investigators. Prognostic value of aortic stiffness and calcification for cardiovascular events and mortality in dialysis patients: outcome of the calcification outcome in renal disease (CORD) study. Clin J Am Soc Nephrol 2011;6:153-9. [Crossref] [PubMed]
  22. Riasatian A, Babaie M, Maleki D, Kalra S, Valipour M, Hemati S, et al. Fine-Tuning and training of densenet for histopathology image representation using TCGA diagnostic slides. Med Image Anal 2021;70:102032. [Crossref] [PubMed]
  23. Xu J, Pan Y, Pan X, Hoi S, Yi Z, Xu Z. RegNet: Self-Regulated Network for Image Classification. IEEE Trans Neural Netw Learn Syst 2023;34:9562-7. [Crossref] [PubMed]
  24. Duan Z, Lu M, Ma J, Huang Y, Ma Z, Zhu F. QARV: Quantization-Aware ResNet VAE for Lossy Image Compression. IEEE Trans Pattern Anal Mach Intell 2024;46:436-50. [Crossref] [PubMed]
  25. Moreno-Torres JG, Saez JA, Herrera F. Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Trans Neural Netw Learn Syst 2012;23:1304-12. [Crossref] [PubMed]
  26. Mondal A, Shrivastava VK. A novel Parametric Flatten-p Mish activation function based deep CNN model for brain tumor classification. Comput Biol Med 2022;150:106183. [Crossref] [PubMed]
  27. Huang Z, Li W, Xia XG, Tao R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans Image Process 2022;31:1895-910. [Crossref] [PubMed]
  28. Sharif N, Gilani SZ, Suter D, Reid S, Szulc P, Kimelman D, Monchka BA, Jozani MJ, Hodgson JM, Sim M, Zhu K, Harvey NC, Kiel DP, Prince RL, Schousboe JT, Leslie WD, Lewis JR. Machine learning for abdominal aortic calcification assessment from bone density machine-derived lateral spine images. EBioMedicine 2023;94:104676. [Crossref] [PubMed]
Cite this article as: Shao Z, Song M, He B, Yi H, Li A, Chen G, Zhang Z, Pan Y. Evaluating convolutional neural network models for automated abdominal aortic calcification scoring in chronic kidney disease patients across multiple centers. Quant Imaging Med Surg 2026;16(5):347. doi: 10.21037/qims-2025-1-2612
