Feasibility of fully automatic assessment of cervical canal stenosis using MRI via deep learning

Xiaochen Feng; Yaying Zhang; Minming Lu; Chao Ma; Xiaoqiang Miao; Jiacheng Yang; Lina Lin; Yueyi Zhang; Kai Zhang; Ning Zhang; Yan Kang; Yu Luo; Kai Cao

doi:10.21037/qims-2025-67

Original Article

Feasibility of fully automatic assessment of cervical canal stenosis using MRI via deep learning

Xiaochen Feng^1# , Yaying Zhang^2#, Minming Lu^3#, Chao Ma^1#, Xiaoqiang Miao^4,5, Jiacheng Yang¹, Lina Lin¹, Yueyi Zhang¹, Kai Zhang¹, Ning Zhang¹, Yan Kang^4,5, Yu Luo², Kai Cao¹

¹Department of Diagnostic Radiology, Changhai Hospital, Shanghai, China; ²Department of Diagnostic Radiology, the Fourth People’s Hospital of Shanghai, Shanghai, China; ³Department of Spine, Changzheng Hospital, Shanghai, China; ⁴College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen, China; ⁵College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China

Contributions: (I) Conception and design: X Feng, M Lu, Yaying Zhang; (II) Administrative support: Y Kang, Y Luo, K Cao; (III) Provision of study materials or patients: C Ma, X Miao, J Yang, L Lin, Yueyi Zhang; (IV) Collection and assembly of data: K Zhang, N Zhang; (V) Data analysis and interpretation: X Feng, M Lu, Yaying Zhang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work.

Correspondence to: Kai Cao, MD. Department of Diagnostic Radiology, Changhai Hospital, No. 168 Changhai Road, Yangpu District, Shanghai 200433, China. Email: mdkaicao163@163.com; Yan Kang, DE. College of Health Science and Environmental Engineering, Shenzhen Technology University, 3002 Lantian Road, Pingshan District, Shenzhen 518118, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China. Email: kangyan@sztu.edu.cn; Yu Luo, MD. Department of Diagnostic Radiology, the Fourth People’s Hospital of Shanghai, No. 1279 Sanmen Road, Hongkou Distict, Shanghai 200434, China. Email: andy_luo@tongji.edu.cn.

Background: Currently, there is no fully automated tool available for evaluating the degree of cervical spinal stenosis. The aim of this study was to develop and validate the use of artificial intelligence (AI) algorithms for the assessment of cervical spinal stenosis.

Methods: In this retrospective multi-center study, cervical spine magnetic resonance imaging (MRI) scans obtained from July 2020 to June 2023 were included. Studies of patients with spinal instrumentation or studies with suboptimal image quality were excluded. Sagittal T2-weighted images were used. The training data from the Fourth People’s Hospital of Shanghai (Hos. 1) and Shanghai Changzheng Hospital (Hos. 2) were annotated by two musculoskeletal (MSK) radiologists following Kang’s system as the standard reference. First, a convolutional neural network (CNN) was trained to detect the region of interest (ROI), with a second Transformer for classification. The performance of the deep learning (DL) model was assessed on an internal test set from Hos. 2 and an external test set from Shanghai Changhai Hospital (Hos. 3), and compared among six readers. Metrics such as detection precision, interrater agreement, sensitivity (SEN), and specificity (SPE) were calculated.

Results: Overall, 795 patients were analyzed (mean age ± standard deviation, 55±14 years; 346 female), with 589 in the training (75%) and validation (25%) sets, 206 in the internal test set, and 95 in the external test set. Four tasks with different clinical application scenarios were trained, and their accuracy (ACC) ranged from 0.8993 to 0.9532. When using a Kang system score of ≥2 as a threshold for diagnosing central cervical canal stenosis in the internal test set, both the algorithm and six readers achieved similar areas under the receiver operating characteristic curve (AUCs) of 0.936 [95% confidence interval (CI): 0.916–0.955], with a SEN of 90.3% and SPE of 93.8%; the AUC of the DL model was 0.931 (95% CI: 0.917–0.946), with a SEN in the external test set of 100%, and a SPE of 86.3%. Correlation analysis comparing the DL method, the six readers, and MRI reports between the reference standard showed a moderate correlation, with R values ranging from 0.589 to 0.668. The DL model produced approximately the same upgrades (9.2%) and downgrades (5.1%) as the six readers.

Conclusions: The DL model could fully automatically and reliably assess cervical canal stenosis using MRI scans.

Keywords: Cervical spine; central canal stenosis; magnetic resonance imaging (MRI); artificial intelligence (AI); deep learning (DL)

Submitted Jan 09, 2025. Accepted for publication May 14, 2025. Published online Aug 19, 2025.

doi: 10.21037/qims-2025-67

Introduction

Degenerative cervical myelopathy (DCM) is a commonly occurring disorder that is often linked to degenerative spondylosis. This includes the formation of osteophytes, herniated intervertebral discs, or ossification of the posterior longitudinal ligament (1). These factors can cause a gradual compression of the spinal cord, which may result in spinal cord ischemia and subsequent histopathological changes in the cervical spinal cord (2). DCM has emerged as a significant cause of spinal cord dysfunction among the aging population worldwide, leading to severe physical disability.

Magnetic resonance imaging (MRI) is a valuable tool in evaluating DCM as it not only shows the anatomical structures and location of cervical spine compression but also reveals changes in signal intensity within the cervical cord, indicating myelopathic pathology (2,3). The extent of stenosis in each region plays a crucial role in determining appropriate treatment approaches (4). However, providing detailed descriptions of this information in imaging reports can be repetitive and time-consuming. Additionally, although MRI is considered a routine means of diagnosing spinal stenosis, there are currently no widely accepted radiological standards or parameters to characterize cervical spinal canal stenosis. Kang et al. introduced a practical grading system, known as Kang’s grading system, for central canal and neural foraminal stenosis in the cervical and lumbar spine. This grading system has been widely adopted in clinical practice, including web-based radiology resources, and has been cited in numerous studies (1,5).

Deep learning (DL) has recently shown great promise in accurately performing medical imaging diagnostic tasks. In recent years, convolutional neural networks (CNNs) in medical image analysis and Transformers (6,7) in medical image analysis and processing (8) have made progress (9). In 2020, Merali et al. (10) trained a CNN model to evaluate the degree of spinal cord compression on cervical spine MRI; however, this model utilized only binary classification (present or absent) and was limited to T2 axial images. Zhang et al. (11) employed the Faster R-CNN algorithm to develop a DL model for assessing the degree of cervical spinal canal and neural foraminal stenosis on axial T2-weighted MRI. The model demonstrated performance comparable to that of subspecialist radiologists. Li et al. (12) utilized DL models for clinical decision-making in cervical spine diseases; however, their evaluation criteria were relatively subjective. Lee et al. (13) developed a DL model for the automatic diagnosis of degenerative cervical spine disease; however, they assessed spinal canal stenosis and spine cord T2 hyperintensity independently. To the best of our knowledge, no end-to-end DL model based on Kang’s system for the classification of cervical spinal canal stenosis has been developed to date.

In the present study, our aim was to develop a novel DL model with four predefined classes, utilizing both CNN and Transformers, to fully automate the detection and classification of cervical spinal cord compression in patients using T2-weighted sagittal MRI scans. The diagnostic capabilities of this model were compared with those of radiologists and visually assessed using gradient-weighted class activation mapping (Grad-CAM). We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-67/rc).

Methods

Study design and participants

This retrospective study was approved by the Ethics Committee of each participating hospital (Local Ethics Committee of Shanghai Changhai Hospital, Naval Medical University, No. CHEC20230610, Local Ethics Committee of Shanghai Changzheng Hospital, Naval Medical University, No. 2021SL044, Local Ethics Committee of the Fourth People’s Hospital of Shanghai, No. 2023098-001). The requirement for individual consent for this retrospective analysis was waived. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. All datasets were independent of each other. Clinical data were reviewed from medical records, and MRI images were acquired from the picture archiving and communication system (PACS).

Our data were collected from two clinical settings: physical examinations and outpatient visits. Patients who underwent cervical spine MRI scans at the Fourth People’s Hospital of Shanghai (Hos. 1) and Shanghai Changzheng Hospital (Hos. 2) were included. Patients were excluded if they met any of the following criteria: Patients with instrumentation, severe scoliosis, active infection, suspicious neoplastic disease, trauma, syringomyelia, suboptimal image quality. Finally, 795 patients’ cervical spine T2-weighted sagittal imaging were recruited. In the internal training and validation stage, 442 consecutive patients at Hos. 1 from July 2020 to December 2022 and 147 patients from September 2022 to June 2023 at Hos. 2 were collected for training a DL model. During the internal testing phase, we continuously collected 206 patients from Hos. 2 between January 2021 and December 2021 to evaluate the model’s performance. In the external testing phase, we collected 95 patients from Shanghai Changhai Hospital (Hos. 3) in June 2023 to further assess the model’s performance (Figure 1).

Figure 1 Flowchart of eligibility criteria and selection of the (A) internal dataset (training validation and test set) and (B) external test set. ROI, region of interest; T2W, T2-weighted.

Magnetic resonance (MR) parameters

All patients were examined on different centers’ MRI scanners [United Imaging, Shanghai, China; General Electric (GE) Healthcare, Chicago, IL, USA; Siemens, Erlangen, Germany; and Philips, Amsterdam, Netherlands; 1.5- and 3.0-T platforms] for MRI cervical spine studies. Sagittal T2-weighted and axial T2-weighted Digital Imaging and Communications in Medicine (DICOM) images were used for evaluation and reference, respectively. Table S1 provides details on the MRI scanners and sequences.

Image analysis

The training and test data including totally 890 cases (both of internal and external test sets) were labeled by two musculoskeletal (MSK) radiologists (K.C., 15 years of experience; Panpan Yang, 12 years of experience) with use of an open-source annotation software (14); bounding boxes were drawn to segment the region of interest (ROI) (C2/3 to C6/7 disc level area) at each cervical spine MRI. When drawing each bounding box, the annotating radiologist classified the stenosis as one of four categories (normal, mild, moderate, or severe) following Kang’s system, with the final results regarded as the standard reference. Reviews in each case required a consensus of the radiologists.

MR images were independently interpreted by two radiology residents (Xuying Cai, 2 years of experience; Qinghui Yu, 3 years of experience), two subspecialist neuroradiologists (Y.Z., 8 years of experience; X.F., 7 years of experience), and two clinicians (M.L., 10 years of spine surgery experience; Zhishen Niu, 5 years of spine surgery experience). All six readers familiarized themselves with the Kang system (Figure S1) through an article and underwent a 2-hour training session. All readers were trained on 10 cases to ensure standardization of case interpretation and recording of results. The training cases were not included in the reader study case set. Each reader was blinded to patient demographics and clinical history prior to assessment. The images were presented to the readers in anonymized and randomized batches of approximately 50 cases each. All readers read the images in the same randomized order coordinated by a research assistant. All of above steps were completed by the readers online, who then saved the image in DICOM format and uploaded it (the specific format is shown in Figure S2). The interpreters assessed the degree of stenosis in the cervical spine from the C2–3 to C6–7 levels. In cases where stenosis was unclear from sagittal images alone, the interpreters could refer to T2-weighted axial images for clarification. Any disagreements in assessment among the three groups were resolved through discussion until a consensus was reached. Each reader only had access to individual cervical spine level patch images. There were no special requirements for interpreting image time.

MRI report quantification

To evaluate the correlation between MRI reports and readers, we proposed a quantification method for reports of cervical canal stenosis. By analyzing the writing habits of radiologists in MRI reports, we rated disc levels in four grades based on the description of C2/3 to C6/7 in the reports, which were grade 0 (not be described), grade 1 (keywords: mild disc herniation, disc bulge, etc.), grade 2 (keywords: moderate stenosis of spina canal, etc.), and grade 3 (severe stenosis of spina canal, T2 signal change, spinal canal T2 hyperintensity, etc.). The MRI report was an overall evaluation of disc level, whereas the Kang system evaluates layer by layer. Therefore, we combined the grade of each layer of each disc level by taking the maximum value as the total result (Figure 2).

Figure 2 Pipelines of MRI report quantification for conducting correlation analysis with reader’s grading result. (A) The writing structure of general MRI report (keywords for quantification are highlighted by different colors, with green representing grade 1, yellow representing grade 2, and red representing grade 3). (B) An example of quantification process. (C) Method of combining grading results from each layer into results at entire disc level (take the maximum grade of all layers). MRI, magnetic resonance imaging.

DL model development

A two-component procedure was used to train the models (Figure 3). First, detection transformer (DETR) (15) with ResNet50 as backbone was trained to detect the ROIs (C2/3 to C6/7 disc level area, Figure 3A). The disc level detection dataset was manually labeled by two MSK radiologists. With the use of an annotation software (14), bounding boxes were drawn to segment the ROIs (C2/3 to C6/7 disc level area, Figure 3A). Secondly, a Transformer-based classification model Swin-Transformer was trained for ROI classification (Figure 3B). The Swin-Transformer classified cervical canal stenosis into four grades based on the Kang system (16). The specific process is reflected in Appendix 1.

Figure 3 Diagrams shows two-component DL model development and the details of statistical analysis. (A) First, the model performed ROI detection (C2/3 to C6/7disc level area shown) on T2-weighted sagittal images, (B) followed by classification of each ROI into different degrees of cervical central canal stenosis. (C) The calculation of inter-reader agreement and correlation coefficients was performed during the statistical analysis stage. CNN, convolutional neural network; DL, deep learning; FFN, feed-forward network; ROI, region of interest; T2W, T2-weighted.

Grad-CAM

Explaining the decision-making process of DL models remains a challenge, but there are ways to explain the decision-making process of models (17). Grad-CAM can be used to visualize the decision-making process of the model (18). Grad-CAM uses the output and gradient of the last attention block of Swin-Transformer to calculate the contribution of each token to the classification result, and then maps these contributions back to the spatial position of the original image to form a heat map (18). We generated Grad-CAM for ROI. The pytorch-gradcam package in Python 3.9 was used to generate the Grad-Cam (19) (Figure 4).

Figure 4 Sagittal T2-weighted MRI scan (left) is shown a 76-year-old male with cervical pain. Integrated gradients technique was applied to the DL model and highlights the regions of the MRI scan responsible for the prediction as a heatmap (right). In general, the higher the intensity value, the more important the feature for prediction. The heatmaps show that the activation of the DL model (red color) is mainly at the most severe areas of spinal stenosis. DL, deep learning; MRI, magnetic resonance imaging.

Statistical analysis

We compared the performance of DL, MRI reports, and six readers in the diagnosis of stenosis in the cervical spine using the interpretation results of two MSK radiologists as the standard reference. The area under the receiver operating characteristic curve (AUC) and confusion matrix criteria, including accuracy (ACC), sensitivity (SEN), and specificity (SPE), were calculated. Furthermore, we adopted kappa statistics to assess inter-reader agreement between four groups (DL method, radiology residents, subspecialist neuroradiologists, clinicians).

The correlation coefficient (R) was assessed by Spearman correlation statistic and the relationship between MRI reports and manual and automatic grading. A P value ≤0.05 was considered statistically significant. All statistical analyses were performed using the software SPSS 26.0 (IBM Corp., Armonk, NY, USA).

Results

Patient characteristics in datasets

A total of 2,412 images taken from middle of the T2-weighted sagittal imaging of 795 patients (2–3 images per patient) were used in this study. The patient group comprised 449 males [age: mean ± standard deviation (SD), 54±14 years; range, 14–90 years] and 346 females (age: mean ± SD, 55±14 years; range, 13–93 years). Overall, the mean age ± SD of all 795 patients was 55±14 years (range, 13–93 years).

The internal training data set of 589 patients was randomly split into 412 patients for training (75%) and 177 patients for validation (25%). The patient group comprised 316 males (age: mean ± SD, 56±15 years; range, 14–90 years) and 273 females (age: mean ± SD, 56±14 years; range, 13–93 years). Overall, the mean age ± SD of all 589 patients was 56±15 years (range, 13–93 years).

A series of 206 patients from Hos. 2 were selected for internal testing. The patient group comprised 133 males (age: mean ± SD, 51±9 years; range, 31–77 years) and 73 females (age: mean ± SD, 51±12 years; range, 21–86 years). Overall, the mean age ± SD of all 206 patients was 51±10 years (range, 21–86 years).

A series of 95 patients from Hos. 3 were selected for external testing. The patient group comprised 51 males (age: mean ± SD, 50±18 years; range, 15–86 years) and 44 females (age: mean ± SD, 47±15 years; range, 17–73 years). Overall, the mean age ± SD of all 95 patients was 49±16 years (range, 15–86 years) (Table 1).

Table 1

Demographics of patients

Characteristics	Training/validation (n=589)	Internal test (n=206)	External test (n=95)
Age (years)	56±15 [13–93]	51±10 [21–86]	49±16 [15–86]
Male	316 [54]	133 [65]	51 [54]
Weight (kg)	77±9 [45–99]	76±7 [60–90]	44 [46]
Hospital
Hos. 1	442 [75]	–	–
Hos. 2	147 [25]	206 [100]	–
Hos. 3	–	–	95 [100]
ROIs amount	8,007	2,852	1,294

Data are presented as mean ± SD [range] or n [%], unless otherwise stated. Hos. 1, the Fourth People’s Hospital of Shanghai; Hos. 2, Shanghai Changzheng Hospital; Hos. 3, Shanghai Changhai Hospital. ROI, region of interest; SD, standard deviation.

Diagnostic performance of the DL model

To validate the performance of the DL model, we divided the training images into training and validation sets at a ratio of 7.5:2.5 on a per-patient basis. According to the clinical diagnosis requirements, as well as the Kang system, we combined the four categories of labels for different target tasks, which was for diagnosis (grades 0–1 vs. grade 2 vs. grade 3), discovering the presence of cervical canal stenosis (grade 0 vs. grades 1–3), and distinguishing significant stenosis (grades 0–1 vs. grades 2–3). We trained four different DL models, and tested the performance on the validation set. The overall precision of each task ranged from 0.899 to 0.953 (Table 2).

Table 2

Deeping learning model validation performance under different categories grouping

Grouping	Precision	SEN	SPE	AUC	Support
Grading
0	0.944	0.954	0.955	0.991	1,076
1	0.884	0.862	0.956	0.978	683
2	0.880	0.887	0.967	0.988	512
3	0.694	0.709	0.981	0.974	141
Overall	0.899				2,412
Diagnosis (grades 0–1 vs. grade 2 vs. grade 3)
0–1	0.976	0.965	0.937	0.991	1,759
2	0.872	0.908	0.964	0.988	512
3	0.721	0.716	0.983	0.978	141
Overall	0.939				2,412
Stenosis (grade 0 vs. grades 1–3)
0	0.917	0.973	0.929	0.990	1,076
1–3	0.977	0.929	0.973	0.990	1,336
Overall	0.949				2,412
Significance (grades 0–1 vs. grades 2–3)
0–1	0.982	0.953	0.953	0.991	1,759
2–3	0.884	0.953	0.953	0.991	653
Overall	0.953				2,412

The five DL models in the table are trained using the same training set but different category groupings. Data in the “Support” columns are presented as number of support images. AUC, area under the receiver operating characteristic curve; DL, deep learning; SEN, sensitivity; SPE, specificity.

Kang’s system was employed for grading cervical canal stenosis. Each reader assessed the images based on T2-weighted sagittal images. All data were annotated by two MSK radiologists (K.C. and Panpan Yang), serving as the reference standard. The distribution of cervical stenosis for each grade evaluated by the DL model and the six readers is illustrated in Figure 5 and Table S2. A head-to-head comparison between the DL model and the six readers is presented in Figure 5B,5C.

Figure 5 Distribution of cervical canal stenosis for each grade evaluated by DL methods and six readers. Two MSK radiologists served as the reference standard. (A) The histogram shows the distribution of diagnosis results of cervical disc stenosis by the DL model and seven doctors. (B) Compared with the ground truth given by two MSK radiologists, seven sets of diagnostic results are displayed using a confusion matrix. (C) Results of DL model and reference standards in external validation set. (D) The plots of downgrades vs. upgrades between DL model and the six readers using the two test sets. Bars with values shaded in orange indicate the number of lesions upgraded, and bars shaded in blue indicate the number of lesions downgraded. Specifically, downgrades to score 1–2 (deep blue) and upgrades to score 4–5 (deep orange) are compared between the two methods. C1, clinician 1; C2, clinician 2; DL, deep learning; MSK, musculoskeletal; R1, subspecialist neuroradiologist; R2, subspecialist neuroradiologist 2; R3, radiology resident 1; R4, radiology resident 2.

Using a Kang system score ≥2 as the threshold for diagnosing DCM, the performance of the DL model, subspecialist neuroradiologists, clinicians, and radiology residents was compared (Figure 6). In the internal validation set, the DL model achieved an AUC of 0.964 [95% confidence interval (CI): 0.956–0.973], with a SEN of 93%, and SPE of 97% (Figure 6A). In the internal test set, radiology resident 1 achieved the highest AUC (0.937; 95% CI: 0.920–0.954), with a SEN of 93.7%, and SPE of 86.8%. The DL model achieved a similar AUC (0.936, 95% CI: 0.916–0.955), with a SEN of 90.3%, and SPE of 93.8% (Figure 6B). In the external test set, the DL model achieved an AUC of 0.931 (95% CI: 0.917–0.946), with a SEN of 100% and SPE of 86.3% (Figure 6D,6E).

Figure 6 Comparison performance of AI, radiologists, subspecialists, and radiology residents for the diagnosis of cervical spinal stenosis. ROC curves of the DL algorithm and doctors in identification of cervical canal stenosis in the (A) internal validation set and (B) internal test set. Diagnostic results of seven groups of cervical spinal canal stenosis, including AUC (95% CI), SEN, and SPE (C). (D) ROC curve of DL model in diagnosing cervical spinal stenosis in external validation set (E). AI, artificial intelligence; AUC, area under the receiver operating characteristic curve; CI, confidence interval; DL, deep learning; ROC, receiver operating characteristic; SEN, sensitivity; SPE, specificity.

Table 3 and Table S3 depict the inter-reader reliability evaluated by kappa statistics. All readers had substantial agreements with the reference standard (0.638≤ κ value ≤0.795; the DL method achieved an average level of agreement with a κ value of 0.736). Correlation analysis between MRI reports and grade for the DL model, six general readers, and the reference standard suggested moderate correlation, with R ranging from 0.590 to 0.669; notably, the DL model had the highest correlation (Table 4). In addition, the DL model had similar correlation (R=0.685; Table S4) at the external test set.

Table 3

Inter-reader reliability

Readers	κ (95% CI)	Standard error
DL vs. R1	0.741 (0.717–0.765)	0.012
DL vs. R2	0.710 (0.687–0.732)	0.011
DL vs. C1	0.688 (0.659–0.708)	0.012
DL vs. C2	0.654 (0.663–0.712)	0.014
DL vs. R3	0.760 (0.738–0.782)	0.011
DL vs. R4	0.712 (0.686–0.737)	0.013

κ was characterized as follows: poor (κ<0.1), slight (0.1≤κ≤0.2), fair (0.2<κ≤0.4), moderate (0.4<κ≤0.6), substantial (0.6<κ≤0.8), and almost perfect (0.8<κ≤1). κ, inter-reader agreement; C1, clinician 1; C2, clinician 2; CI, confidence interval; DL, deep learning; R1, subspecialist neuroradiologist; R2, subspecialist neuroradiologist 2; R3, radiology resident 1; R4, radiology resident 2.

Table 4

Correlation coefficients between stenosis grades and MRI reports according to the DL model and six readers in the internal test set

Readers	R	P value
DL	0.669	<0.001
R1	0.668	<0.001
R2	0.589	<0.001
C1	0.610	<0.001
C2	0.650	<0.001
R3	0.634	<0.001
R4	0.590	<0.001

The level of correlation significance was 0.01. The R was characterized as follows: weak correlation (0.1<R≤0.3), moderate correlation (0.3<R≤0.7), relatively high correlation (0.7<R≤0.9), very high correlation (R>0.9). C1, clinician 1; C2, clinician 2; DL, deep learning; MRI, magnetic resonance imaging; R, correlation coefficient; R1, subspecialist neuroradiologist; R2, subspecialist neuroradiologist 2; R3, radiology resident 1; R4, radiology resident 2.

Clinical evaluation of DL model

The ACC of the DL model vs. that of the six readers were tested with the data from Hos. 1, Hos. 2 (internal set), and Hos. 3 (external set), separately (Figure 5D).

In the internal test set, the overall agreement of the DL model with the reference standard was 79.2% (2,247/2,838). This included an agreement of 88.4% (1,332/1,506) for grade 0, 67.7% (804/1,188) for grade 1, 76.3% (103/135) for grade 2, and 88.9% (8/9) for grade 3. Radiology resident 2 exhibited the highest upgrading rate (overestimation of the degree of spinal stenosis) at 8.1% and the lowest downgrading rate (underestimation of the degree of spinal stenosis) at 2.2%. Conversely, subspecialist neuroradiologist 2 had the lowest upgrading rate at 0.7% among the human readers. Notably, the downgrading rate of our model was generally lower than that of the human readers, recorded at 5.1%, whereas the upgrading rate was slightly higher than that of the radiology resident, at 9.2%. In the external test set, the DL model had lower downgrading rate of 1.5%. The DL model had a relatively low downgrading rate of 1.5% and a high upgrading rate of 12.9%.

Discussion

In this study, we trained and tested a DL model on a dataset of 890 patients to detect spinal canal stenosis in cervical MRI scans. Our aim was to demonstrate the feasibility of applying existing DL models in an end-to-end manner for medical image classification tasks. We adopted Kang’s system for grading cervical spinal canal stenosis, with the data annotated by two senior MSK physicians serving as the reference standard. Notably, our model performed well across different scanner types. Internal testing yielded an AUC of 0.936 (95% CI: 0.916–0.955), with a SEN of 90.3% and SPE of 93.8%. In the external test set, the model achieved an AUC of 0.931 (95% CI: 0.917–0.946), SEN of 100%, and SPE of 86.3%. Inter-reader reliability, assessed via kappa statistics, showed substantial agreement (κ=0.736; P<0.001). Correlation analysis between MRI reports and the DL model indicated a moderate correlation (R=0.669; P<0.001).

According to the clinical diagnosis requirements, we combined the four categories of labels for various target tasks, which all achieved high precision (range, 0.899–0.953). Patients classified as grade 0 or 1 typically do not require treatment unless experiencing severe symptoms such as intense pain or numbness, which are rare. In contrast, grades 2 and 3 generally necessitate specific clinical management, including potential surgical intervention for severe cases. Therefore, it is advisable to apply distinct models based on the clinical scenario. Clinically, an incorrect classification within a one-point margin (e.g., grade 0–1 or grade 2–3) is often acceptable. Our DL model exhibited a slightly higher upgrading rate (9.2%) and a slightly lower downgrading rate (5.1%) compared to human readers. Additionally, when comparing the DL model and six radiologists against original MRI reports, the model showed a moderate correlation (R=0.669; P<0.001), surpassing human readers. The model’s stability and efficiency are equivalent to superior to those of radiologists, whose work is often arduous, repetitive, and time-consuming. These results suggest that our model could be a valuable tool in clinical practice.

The application of DL in medical diagnosis is expanding, yet its use in automated imaging interpretation for spinal disorders remains nascent (19-21). For instance, Merali et al. (10) trained a CNN model to detect spinal cord compression in cervical spine MRI scans using 201 training and 88 testing images, achieving an AUC of 0.95; however, they only used a dichotomous ‘compressed’ or ‘noncompressed’ grading system. Another study by Li et al. (12) developed a DL model that can provide clinical recommendations based on cervical spine MRI. The model demonstrated a high degree of consistency in clinical decision-making with spinal surgeons (Cohen’s κ>0.8). However, the labels for the data did not have specific standards and relied on subjective judgments from spine surgeons. Lee et al. (13) developed a model to detect and classify spinal canal and neural foraminal stenosis using axial and sagittal MRI sequences. The model achieved good consistency with MSK radiologists in the classification of the spinal canal (κ=0.76) and neural foramina (κ=0.66). Meanwhile, the grading criteria for spinal canal stenosis that they used did not include T2 hyperintensity.

Our study revealed that junior doctors’ diagnostic performance is comparable to that of senior doctors, with AUC values of 0.937 and 0.914, respectively. However, junior doctors exhibited higher overdiagnosis rates (6.0% and 8.1%); the reasons for this problem may be diverse. Firstly, MRI artifacts, such as Gibbs (truncation) artifacts (20), can significantly contribute to diagnostic errors by mimicking spinal cord lesions through the creation of high signal areas, which complicates diagnosis. Secondly, imaging features, such as the degree of spinal canal stenosis, may not align with clinical symptoms, leading to ambiguous diagnostic criteria. Thirdly, common degenerative processes, including intervertebral disc degeneration, osteophyte formation, and ligament hypertrophy (21), can result in imaging abnormalities. Furthermore, inexperienced physicians may over-interpret complex cases. Additionally, the resolution and imaging parameters of MRI equipment can significantly affect diagnostic ACC. The limited number of grade 3 cases (n=141) poses a challenge, with DL diagnostic ACC for grade 3 patients at 0.694, SEN at 70.9%, and SPE at 98.1%. Nevertheless, radiologists can manually identify T2 hyperintensity with relative ease, especially when combined with significant clinical symptoms such as severe low-neck pain, limb radiating pain, and numbness. Consequently, while our model currently does not outperform radiologists, it effectively augments their diagnostic ACC and efficiency. The primary goal of our DL model is to enhance, rather than replace, radiologists’ capabilities, ensuring seamless integration of human expertise and machine intelligence.

Our study has several limitations. Firstly, it is retrospective, which may have introduced selection bias. Secondly, our dataset predominantly includes patients with mild-to-moderate spinal stenosis or normal MRI scans, potentially limiting diagnostic ACC for severe cases. However, this population reflects real-world scenarios where severe diagnoses also consider clinical symptoms alongside imaging data. Using a surgical database in future work would improve the number of severe stenosis and spinal cord compression. Currently, our model is limited to analyzing T2-weighted sagittal images. In instances where stenosis is not clearly visible in the sagittal images, interpreters may refer to T2-weighted axial images for further clarification. However, the DL model is unable to perform this function. We are actively enhancing our diagnostic capabilities by integrating both T2 sagittal and axial images, which is an ongoing area of research.

Conclusions

We demonstrated that our DL model is reliable and may be used to fully automatically assess cervical spinal stenosis at MRI. In clinical practice, it still relies on the subjective opinion of the radiologist. Our DL model could provide more consistent and objective reporting under the supervision of a radiologist. Moving forward, it is important to focus on developing larger datasets and involving a consensus panel of international experts to reduce any labeling errors and biases.

Acknowledgments

We would like to sincerely thank our colleagues Qinghui Yu, Xuying Cai, and Yihao Tang for their assistance in data collection and the language editing of this manuscript.

Footnote

Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-67/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-67/dss

Funding: This study was supported by the National Natural Science Foundation of China (No. 82372045), the Shanghai Natural Science Foundation of China (No. 23ZR1478400), and the Medical Research Project of Shanghai Hongkou District Health Commission (No. Hong-2023-12).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-67/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This retrospective study was approved by the Ethics Committee of each participating hospital (Local Ethics Committee of Shanghai Changhai Hospital, Naval Medical University, No. CHEC20230610; Local Ethics Committee of Shanghai Changzheng Hospital, Naval Medical University, No. 2021SL044; Local Ethics Committee of the Fourth People’s Hospital of Shanghai, No. 2023098-001). The requirement for individual consent for this retrospective analysis was waived. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Kang Y, Lee JW, Koh YH, Hur S, Kim SJ, Chai JW, Kang HS. New MRI grading system for the cervical canal stenosis. AJR Am J Roentgenol 2011;197:W134-40. [Crossref] [PubMed]
Bernhardt M, Hynes RA, Blume HW, White AA 3rd. Cervical spondylotic myelopathy. J Bone Joint Surg Am 1993;75:119-28. [Crossref] [PubMed]
Martin AR, Tadokoro N, Tetreault L, Arocho-Quinones EV, Budde MD, Kurpad SN, Fehlings MG. Imaging Evaluation of Degenerative Cervical Myelopathy: Current State of the Art and Future Directions. Neurosurg Clin N Am 2018;29:33-45. [Crossref] [PubMed]
Harrop JS, Naroji S, Maltenfort M, Anderson DG, Albert T, Ratliff JK, Ponnappan RK, Rihn JA, Smith HE, Hilibrand A, Sharan AD, Vaccaro A. Cervical myelopathy: a clinical and radiographic evaluation and correlation to cervical spondylotic myelopathy. Spine (Phila Pa 1976) 2010;35:620-4. [Crossref] [PubMed]
Park HJ, Kim SS, Chung EC, Lee SY, Park NH, Rho MH, Choi SH. Clinical correlation of a new practical MRI method for assessing cervical spinal canal compression. AJR Am J Roentgenol 2012;199:W197-201. [Crossref] [PubMed]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 [Preprint]. 2020. Available online: https://arxiv.org/pdf/2010.11929/1000
Cao K, Xia Y, Yao J, Han X, Lambert L, Zhang T, et al. Large-scale pancreatic cancer detection via non-contrast CT and deep learning. Nat Med 2023;29:3033-43. [Crossref] [PubMed]
Lee A, Ong W, Makmur A, Ting YH, Tan WC, Lim SWD, Low XZ, Tan JJH, Kumar N, Hallinan JTPD. Applications of Artificial Intelligence and Machine Learning in Spine MRI. Bioengineering (Basel) 2024.
Merali Z, Wang JZ, Badhiwala JH, Witiw CD, Wilson JR, Fehlings MG. A deep learning model for detection of cervical spinal cord compression in MRI scans. Sci Rep 2021;11:10473. [Crossref] [PubMed]
Zhang E, Yao M, Li Y, Wang Q, Song X, Chen Y, Liu K, Zhao W, Xing X, Zhou Y, Meng F, Ouyang H, Chen G, Jiang L, Lang N, Jiang S, Yuan H. Deep learning model for the automated detection and classification of central canal and neural foraminal stenosis upon cervical spine magnetic resonance imaging. BMC Med Imaging 2024;24:320. [Crossref] [PubMed]
Li KY, Lu ZY, Tian YH, Liu XP, Zhang YK, Qiu JW, Li HL, Zhang YL, Huang JW, Ye HB, Tian NF. Deep learning models for MRI-based clinical decision support in cervical spine degenerative diseases. Front Neurosci 2024;18:1501972. [Crossref] [PubMed]
Lee A, Wu J, Liu C, Makmur A, Ting YH, Muhamat Nor FE, et al. Deep learning model for automated diagnosis of degenerative cervical spondylosis and altered spinal cord signal on MRI. Spine J 2025;25:255-64. [Crossref] [PubMed]
Tzutalin. LabelImg: A graphical image annotation tool. GitHub 2015. Available online: https://github.com/saicoco/object_labelImg
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Vedaldi A, Bischof H, Brox T, Frahm JM. editors. Computer Vision – ECCV 2020. Cham: Springer International Publishing; 2020:213-29.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:10012-22.
Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell 2019;1:206-15. [Crossref] [PubMed]
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. 2017:618-26.
Gildenblat J. PyTorch library for CAM methods. GitHub 2021. Available online: https://github.com/jacobgil/pytorch-grad-cam
Phillips C, Bagley B, McDonald MA, Schuster NM. Gibbs or Truncation Artifact on MRI Mimicking Degenerative Cervical Myelopathy. Pain Med 2022;23:857-9. [Crossref] [PubMed]
Moll LT, Kindt MW, Stapelfeldt CM, Jensen TS. Degenerative findings on MRI of the cervical spine: an inter- and intra-rater reliability study. Chiropr Man Therap 2018;26:43. [Crossref] [PubMed]

Cite this article as: Feng X, Zhang Y, Lu M, Ma C, Miao X, Yang J, Lin L, Zhang Y, Zhang K, Zhang N, Kang Y, Luo Y, Cao K. Feasibility of fully automatic assessment of cervical canal stenosis using MRI via deep learning. Quant Imaging Med Surg 2025;15(9):8457-8470. doi: 10.21037/qims-2025-67

Feasibility of fully automatic assessment of cervical canal stenosis using MRI via deep learning

Introduction

Methods

Study design and participants

Magnetic resonance (MR) parameters

Image analysis

MRI report quantification

DL model development

Grad-CAM

Statistical analysis

Results

Patient characteristics in datasets

Table 1

Diagnostic performance of the DL model

Table 2

Table 3

Table 4

Clinical evaluation of DL model

Discussion

Conclusions

Acknowledgments

Footnote

References

Article Options

Download Citation

Share