Artificial intelligence aids doctors in diagnosing necrotizing enterocolitis and predicting surgery using abdominal radiographs: a multicenter study
Introduction
Neonatal necrotizing enterocolitis (NEC) is a condition characterized by diffuse or localized necrosis of the intestinal mucosa. In severe instances, it may involve the entire wall of the small intestine and colon. NEC is the leading cause of death from gastrointestinal diseases in premature infants, with a mortality rate ranging from 15% to 45% (1,2). The disease typically manifests as abrupt alterations in feeding tolerance, accompanied by nonspecific systemic signs such as apnea, respiratory failure, poor feeding, lethargy, temperature instability, and abdominal symptoms including distension, bilious gastric retention and/or vomiting, tenderness, rectal bleeding, and diarrhea (3). Initial management of NEC primarily involves conservative therapy, which includes bowel rest, total parenteral nutrition, and the judicious use of antibiotics. If the condition progresses, an emergency laparotomy may be necessary, potentially involving resection of the affected intestinal segment and the creation of an ileostomy. Abdominal radiography (AR) is integral to the diagnosis of NEC and is frequently employed to monitor disease progression. However, the sensitivity and specificity of AR are not sufficiently high, so neonatologists and pediatric surgeons must integrate clinical findings with other diagnostic results to reach an accurate diagnosis and make informed therapeutic decisions. Despite these limitations, AR remains the most widely used diagnostic imaging tool for assessing infants with NEC (4).
One of the significant challenges in interpreting ARs is that the radiological manifestations of NEC are often subtle, making them difficult to identify and interpret. Different clinicians may have varying judgments and speculations about specific imaging features, which can delay accurate diagnosis and affect timely surgical intervention, particularly for less experienced doctors (5). Rehan et al. (6) found significant discrepancies among observers in their interpretations of both early signs (such as intestinal dilatation, air-fluid levels, and bowel wall thickening) and late signs (including portal venous gas and pneumoperitoneum). The study also noted that inter-observer agreement was notably higher among trained observers compared to those still in training. Consequently, clinicians with limited experience may benefit from consistent and high-accuracy support in the interpretation of ARs.
Convolutional neural networks (CNNs), a class of deep learning models within artificial intelligence (AI), have achieved significant success in computer vision. In recent years, CNNs have been widely used for X-ray image analysis. Researchers have applied CNNs to detect conditions such as pneumoperitoneum, intestinal obstruction, intussusception, vertebral fractures, and even catheters within the body on chest radiographs or ARs, with promising results (7-11). In an internal test set, Gao et al. (12) used squeeze-and-excitation networks (SENet) deep features to distinguish NEC from non-NEC, achieving an area under the receiver operating characteristic (ROC) curve (AUC) of 0.876, and an AUC of 0.820 for differentiating surgical NEC from medical NEC. The gold-standard diagnosis in their study was established by senior pediatricians and pediatric surgeons according to the Bell staging criteria modified by Walsh and Kliegman, with radiological signs as primary criteria and clinical parameters as secondary support; a diagnosis of NEC or surgical NEC required at least one primary and one secondary criterion. The study also reported that the trained models had prediction accuracy comparable to that of doctors with 10–20 years of clinical experience.
To date, the potential of CNNs to assist clinicians in the timely identification of NEC on ARs, and in recognizing newborns with NEC who require surgical intervention, has not been fully explored. This study aims to develop an AI model using multi-center AR data through a CNN-based algorithm and validate its efficacy using an external test set. Additionally, we visualize the interpretability of the developed AI model and investigate the impact of AI-assisted diagnosis on enhancing clinicians’ diagnostic capabilities. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2867/rc).
Methods
This retrospective study was conducted from January to June 2024 at three hospitals: Zhujiang Hospital of Southern Medical University, Boai Hospital of Zhongshan, and the Third Affiliated Hospital of Guangzhou Medical University. ARs of preterm infants were collected for analysis. This study was approved by the Ethics Review Committee of Zhujiang Hospital of Southern Medical University, Guangzhou, Guangdong, China (approval No. 2024-KY-123-01). The requirement of written informed consent was waived because of the retrospective nature of the study. The other participating institutions were informed of and agreed to the study. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Data collection
Clinical data were collected from preterm infants at Zhujiang Hospital of Southern Medical University (set A, October 2010 to December 2023), Boai Hospital of Zhongshan (set B, August 2020 to December 2023), and the Third Affiliated Hospital of Guangzhou Medical University (set C, October 2018 to July 2022). The inclusion criteria were as follows: (I) gestational age less than 37 weeks; (II) availability of ARs taken during hospitalization and complete clinical documentation; (III) for non-NEC preterm infants: no diagnosis of NEC, presenting only with feeding intolerance or a completely normal abdomen; (IV) for NEC preterm infants (including medical-NEC and surgical-NEC), diagnosis established by senior clinicians using the modified Bell’s criteria, requiring at least one primary radiological and one secondary clinical criterion (13). Detailed characteristics of medical-NEC and surgical-NEC are provided in Table S1. For the medical-NEC group, ARs were selected at the time of diagnosis. For the surgical-NEC group, the last AR prior to surgery was selected. Two pediatric surgeons with over 10 years of experience reviewed the data of all enrolled cases. The exclusion criteria were as follows: (I) congenital intestinal malformation; (II) poor image quality: large areas of foreign objects obscuring the abdomen. The inclusion and exclusion process followed that of Gao et al. (12) and is depicted in Figure 1.
Image preprocessing
First, the open-source image annotation tool “LabelImg” (version: 1.8.6, https://github.com/tzutalin/labelImg) was used to manually annotate the abdominal region of each AR. The annotated region was a rectangle extending from the highest point of the diaphragm to the lower edge of the ilium. The abdominal rectangular region of each AR was then cropped using Python (version 3.9) and saved in portable network graphics (PNG) format for model training and testing. The images were further preprocessed using the deep learning framework PyTorch (version 2.2.0). The images were resized to a resolution of 512×512 pixels to ensure compatibility with the CNN. Image normalization was applied to minimize the impact of variations in imaging devices and parameters. During the training phase, data augmentation techniques, including geometric and color augmentations, were employed to expand the training dataset and enhance model generalization (14). Geometric augmentations included rotations and translations, while color augmentations involved adjustments to brightness and contrast. For simplicity, the three types of ARs (non-NEC, medical-NEC, and surgical-NEC) were labeled as 0, 1, and 2, respectively.
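For illustration, a minimal sketch of this preprocessing pipeline using torchvision transforms is shown below. The augmentation magnitudes and the use of ImageNet normalization statistics are our assumptions for illustration; the exact parameter values are not reported above.

```python
import torchvision.transforms as T

# Training-time pipeline: resize to the CNN input size, apply geometric and color
# augmentation, then normalize. Assumes the cropped PNGs are loaded as 3-channel
# PIL images, e.g., Image.open(path).convert("RGB"). Parameter values illustrative.
train_transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),  # rotations and translations
    T.ColorJitter(brightness=0.2, contrast=0.2),         # brightness/contrast changes
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],              # ImageNet statistics, to
                std=[0.229, 0.224, 0.225]),              # match pre-trained backbones
])

# Test-time pipeline: no augmentation, only resizing and normalization.
test_transform = T.Compose([
    T.Resize((512, 512)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```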
Pre-training and transfer learning
Training and testing were conducted on an Nvidia GeForce RTX 4060 GPU with 8 GB of VRAM and CUDA 11.8 (Nvidia Corporation, Santa Clara, CA, USA). First, six commonly used pre-trained models—EfficientNet, Inception_v3, visual geometry group (VGG), ResNet, SqueezeNet, and DenseNet—were selected for training. These models were pre-trained on the ImageNet dataset and have been widely used as backbones for deep learning analysis of X-ray images (15). Next, transfer learning was performed on the preprocessed ARs. The final classification layer of each model was modified to output three categories (non-NEC, medical-NEC, and surgical-NEC). All weight parameters of the pre-trained model layers were fine-tuned using the Adam optimizer, with the network weights updated via the cross-entropy loss function. A learning rate scheduler was employed to optimize the convergence process of the model. The initial learning rate was set at 0.001 and was multiplied by 0.8 every eight epochs during training. Each model was trained for 50 epochs. The models were evaluated using four key metrics: accuracy, precision, recall, and F1-score. The epoch with the highest accuracy was selected to obtain the best model for internal testing.
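A minimal sketch of this fine-tuning setup is shown below, with EfficientNet-b0 as a representative backbone; `train_loader` is a hypothetical PyTorch DataLoader over the preprocessed ARs, and the code is an illustration of the described configuration rather than the study’s exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# `train_loader` is a hypothetical DataLoader yielding (image, label) batches,
# with labels 0 = non-NEC, 1 = medical-NEC, 2 = surgical-NEC.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load ImageNet-pre-trained weights and replace the final layer with a 3-class head.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # all layers fine-tuned
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.8)

for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate multiplied by 0.8 every eight epochs
```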
Model training and validation
First, the images in set A were randomly divided into a training set (80%) and an internal test set (20%) for model training and internal testing. To ensure the randomness and reproducibility of the data splits, we used the random module in the Python programming language to shuffle the datasets. Specifically, we set a fixed random seed so that the shuffling process could be consistently reproduced across different runs. The same procedure was applied to the data in set B and the combined dataset of sets A and B. The best-performing models from these three internal validations were selected to perform external testing on set C. Model performance was evaluated by comparing the AUC of the model predictions for each classification. Additionally, six pediatric surgeons with varying levels of experience were invited to perform the three-category predictions on set C. This group included two residents (1–2 years of training experience), two fellows (5 years of seniority), and two attending physicians (10–15 years of seniority). The doctors had no prior exposure to the preprocessed ARs in set C and were given no other clinical information about the infants during this task.
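A minimal sketch of the seeded 80/20 split is shown below; the seed value and function name are illustrative, as the actual seed is not reported.

```python
import random

def split_dataset(image_paths, train_frac=0.8, seed=42):
    """Reproducible 80/20 split; the seed value (42) is illustrative."""
    paths = list(image_paths)
    random.seed(seed)        # fixed seed so the shuffle is reproducible across runs
    random.shuffle(paths)
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]  # (training set, internal test set)
```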
Model explainability and AI assistance
To intuitively explain the model’s focus, we utilized the algorithm libraries PyTorch GradCAM and TorchCAM (version 0.4.1) to perform interpretability analysis on the model. Gradient-weighted class activation mapping (Grad-CAM) is a method for explaining and visualizing the decision-making process of a CNN (16). Grad-CAM generates heatmaps by computing the gradients of the specific class output with respect to the feature maps of the last convolutional layer. These heatmaps highlight the regions of the input image that the model focuses on during the classification process. Specifically, the gradients are averaged over the spatial dimensions of each feature map to produce a single importance weight per channel; the feature maps are then combined using these weights and passed through a ReLU to form a coarse localization map. This map is upsampled to the input image size, creating a heatmap that visually shows the areas most relevant to the model’s prediction.
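The sketch below shows how such heatmaps can be generated with the open-source pytorch-grad-cam package; the choice of target layer, the file name, and the target class are illustrative, and `model`/`input_tensor` are assumed from the training sketch above.

```python
import numpy as np
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model.eval()
# For torchvision's EfficientNet, the last convolutional block is
# model.features[-1] (an illustrative choice of target layer).
cam = GradCAM(model=model, target_layers=[model.features[-1]])

# `input_tensor` is a preprocessed 1x3x512x512 batch; class 2 = surgical-NEC.
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(2)])[0]  # HxW map in [0, 1]

# Overlay the heatmap on the resized, [0, 1]-scaled RGB image for inspection.
rgb = np.asarray(Image.open("cropped_ar.png")   # hypothetical cropped AR file
                 .convert("RGB").resize((512, 512))).astype(np.float32) / 255.0
heatmap = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)
```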
Additionally, we employed a complementary explanation method called deep feature factorization (DFF) (17). DFF uses non-negative matrix factorization to decompose the model’s activations into distinct concepts and calculates the correspondence between each pixel and these concepts. While Grad-CAM answers the question “where does the model see its predicted result in the image?”, DFF provides more detailed insight: it generates heatmaps that show all the distinct concepts found in the image and how each is classified.
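A corresponding DFF sketch, again based on the pytorch-grad-cam package, is given below; exact signatures may vary across package versions, and the number of components is illustrative. `model`, `input_tensor`, and `rgb` are assumed from the sketches above.

```python
import numpy as np
import torch
from pytorch_grad_cam import DeepFeatureFactorization
from pytorch_grad_cam.utils.image import show_factorization_on_image

# Factorize the last convolutional block's activations into a few non-negative
# concepts, and classify each concept with the model's own 3-class head.
dff = DeepFeatureFactorization(model=model,
                               target_layer=model.features[-1],
                               computation_on_concepts=model.classifier)
concepts, batch_explanations, concept_outputs = dff(input_tensor, n_components=3)

# Per-concept probabilities over (non-NEC, medical-NEC, surgical-NEC) provide the
# confidence levels shown in the DFF heatmaps.
concept_probs = torch.softmax(torch.as_tensor(concept_outputs), dim=-1).numpy()

# Overlay the per-pixel concept memberships on the [0, 1]-scaled RGB image.
visualization = show_factorization_on_image(rgb, batch_explanations[0],
                                            image_weight=0.3)
```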
For each image in set C, the corresponding Grad-CAM and DFF heatmaps were generated based on the predictions of the best-performing model during the internal testing phase. The model’s predicted category for each image was also labeled. Invited doctors used this information as AI-assisted cues to perform a second round of readings of the set C ARs. The prediction accuracies from both rounds of readings were then compared to assess the impact of AI assistance on the doctors’ diagnostic performance.
Statistical analysis
All data were analyzed using SPSS 26.0 (IBM Corp., Armonk, NY, USA) for basic statistical analysis. The Kolmogorov-Smirnov test was used to verify whether continuous variables conformed to a normal distribution. Categorical variables were reported as frequencies and percentages. Continuous variables were reported as medians and interquartile ranges (IQRs) for non-normal distributions, and as means with standard deviations or 95% confidence intervals for normal distributions. Between-group comparisons were performed using the Mann-Whitney U test for non-normally distributed continuous variables and the Chi-squared test for categorical variables. AUC comparisons were performed using DeLong’s test. The performance of the models and doctors on the external test was presented using confusion matrices and accuracy. A two-tailed P value <0.05 was considered statistically significant.
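Although the analyses were run in SPSS, the two group-comparison tests can be reproduced in Python with SciPy as a cross-check; the sketch below uses the sex counts from Table 1 (set A), while the gestational-age values are illustrative placeholders, not the study’s data.

```python
from scipy import stats

# Mann-Whitney U test for a non-normally distributed continuous variable.
# The gestational-age values below are illustrative only.
ga_non_nec = [34.6, 35.1, 33.9, 32.7, 35.9]
ga_nec = [32.3, 29.4, 34.3, 30.1, 31.0]
u_stat, p_mwu = stats.mannwhitneyu(ga_non_nec, ga_nec, alternative="two-sided")

# Chi-squared test for a categorical variable, using the sex counts from
# Table 1, set A: [male, female] in the non-NEC and NEC groups.
table = [[103, 79],
         [105, 60]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Mann-Whitney P = {p_mwu:.3f}, Chi-squared P = {p_chi2:.3f}")
```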
Results
Baseline information
A total of 738 images from three datasets were collected, including 295 non-NEC, 307 medical-NEC, and 136 surgical-NEC cases. The baseline information of the enrolled preterm infants is presented in Tables 1,2. In set A, both the gestational age and birth weight in the non-NEC group were significantly higher than those in the NEC (medical + surgical) group (P<0.001). Additionally, the gestational age was significantly higher in the medical-NEC group than in the surgical-NEC group (P<0.001). No such differences were observed in sets B and C, which had smaller sample sizes (P>0.05).
Table 1
| Characteristics | Non-NEC | NEC | P value |
|---|---|---|---|
| Set A | |||
| Images | 182 | 165 | |
| Participants | 182 | 165 | |
| Sex, male | 103 (56.6) | 105 (63.6) | 0.181 |
| Gestational age (weeks) | 34.64 (32.68, 35.86) | 32.29 (29.43, 34.29) | <0.001 |
| Birth weight (kg) | 2.105 (1.700, 2.423) | 1.550 (1.200, 2.100) | <0.001 |
| Set B | |||
| Images | 80 | 212 | |
| Participants | 37 | 93 | |
| Sex, male | 19 (52.8) | 55 (59.1) | 0.512 |
| Gestational age (weeks) | 32.57 (28.21, 34.36) | 32.00 (30.14, 35.43) | 0.412 |
| Birth weight (kg) | 1.750 (1.100, 2.100) | 1.480 (1.190, 2.240) | 0.842 |
| Set C | |||
| Images | 33 | 66 | |
| Participants | 33 | 66 | |
| Sex, male | 20 (60.6) | 36 (54.5) | 0.566 |
| Gestational age (weeks) | 30.14 (28.36, 33.86) | 29.86 (28.00, 32.04) | 0.485 |
| Birth weight (kg) | 1.280 (1.000, 2.025) | 1.210 (0.900, 1.710) | 0.242 |
Data are presented as n, n (%), or median (IQR). IQR, interquartile range; NEC, necrotizing enterocolitis.
Table 2
| Characteristics | Medical-NEC | Surgical-NEC | P value |
|---|---|---|---|
| Set A | |||
| Images | 113 | 52 | |
| Participants | 113 | 52 | |
| Sex, male | 68 (60.2) | 37 (71.2) | 0.173 |
| Gestational age (weeks) | 33.00 (30.21, 34.86) | 30.21 (28.00, 33.25) | <0.001 |
| Birth weight (kg) | 1.600 (1.240, 2.155) | 1.450 (1.131, 1.895) | 0.055 |
| Set B | |||
| Images | 162 | 50 | |
| Participants | 72 | 21 | |
| Sex, male | 40 (55.6) | 15 (71.4) | 0.193 |
| Gestational age (weeks) | 32.00 (30.00, 36.14) | 31.71 (30.25, 35.00) | 0.786 |
| Birth weight (kg) | 1.460 (1.180, 2.230) | 1.490 (1.255, 2.285) | 0.622 |
| Set C | |||
| Images | 32 | 34 | |
| Participants | 32 | 34 | |
| Sex, male | 15 (46.9) | 21 (61.8) | 0.225 |
| Gestational age (weeks) | 29.71 (28.07, 34.50) | 29.93 (27.93, 31.29) | 0.400 |
| Birth weight (kg) | 1.365 (0.940, 2.053) | 1.090 (0.818, 1.598) | 0.083 |
Data are presented as n, n (%), or median (IQR). IQR, interquartile range; NEC, necrotizing enterocolitis.
Training and internal testing
Using the datasets from set A, set B, and the merged set (A + B), we conducted 50 epochs of training and internal testing for each. The performance metrics of each model type, including the highest accuracy, precision, recall, and F1-score, are presented in Table 3. The best model trained on set A was Efficientnet-b3, the best model trained on set B was Efficientnet-b2, and the best model trained on the merged set (A + B) was Efficientnet-b0. Figure 2 depicts the training process, showing that the training and testing losses for all three models exhibited a rapid decline during the initial phase of training, followed by a period of stability. The epoch corresponding to the highest accuracy was selected to obtain the final model. The ROC curves for each category predicted by these three models in the internal test are shown in Figure 3.
Table 3
| Models | Data set | Highest accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Efficientnet-b3 | A | 0.882 | 0.847 | 0.836 | 0.840 |
| Inception_v3 | A | 0.794 | 0.732 | 0.727 | 0.729 |
| VGG11 | A | 0.838 | 0.819 | 0.730 | 0.751 |
| Resnet18 | A | 0.838 | 0.803 | 0.766 | 0.781 |
| Squeezenet1_0 | A | 0.824 | 0.814 | 0.715 | 0.741 |
| Densenet121 | A | 0.838 | 0.827 | 0.760 | 0.778 |
| Efficientnet-b2 | B | 0.914 | 0.922 | 0.925 | 0.918 |
| Inception_v3 | B | 0.845 | 0.832 | 0.806 | 0.818 |
| VGG11 | B | 0.931 | 0.936 | 0.935 | 0.932 |
| Resnet18 | B | 0.862 | 0.913 | 0.806 | 0.846 |
| Squeezenet1_1 | B | 0.621 | 0.722 | 0.494 | 0.512 |
| Densenet121 | B | 0.810 | 0.798 | 0.796 | 0.795 |
| Efficientnet-b0 | A + B | 0.921 | 0.914 | 0.897 | 0.904 |
| Inception_v3 | A + B | 0.701 | 0.467 | 0.555 | 0.508 |
| VGG11 | A + B | 0.835 | 0.826 | 0.787 | 0.801 |
| Resnet34 | A + B | 0.843 | 0.834 | 0.823 | 0.828 |
| Squeezenet1_1 | A + B | 0.740 | 0.718 | 0.637 | 0.646 |
| Densenet121 | A + B | 0.803 | 0.844 | 0.750 | 0.778 |
External testing
The ROC curves of these three models in predicting the external test set C are shown in Figure 3. The AUC values for predicting the three categories and the weighted average AUC were calculated separately, as detailed in Table 4. Among them, the Efficientnet-b0 model trained on the combined dataset (A + B) achieved an AUC of 0.883 for non-NEC, 0.640 for medical-NEC, and 0.837 for surgical-NEC. In the DeLong test, the AUC of the Efficientnet-b0 model was significantly higher than that of the Efficientnet-b3 model in predicting non-NEC and surgical-NEC (P<0.05). The AUC of the Efficientnet-b0 model was also significantly higher than that of the Efficientnet-b2 model in predicting non-NEC (P<0.05). Therefore, compared with single-center training data, multi-center training data enabled better prediction on new datasets.
Table 4
| Metrics | Efficientnet-b3 | P value 1 | Efficientnet-b0 | P value 2 | Efficientnet-b2 |
|---|---|---|---|---|---|
| Accuracy (weighted avg) | 0.455 | | 0.566 | | 0.566 |
| AUC (weighted avg) | 0.687 | | 0.789 | | 0.744 |
| AUC (non-NEC) | 0.740 | 0.002 | 0.883 | 0.012 | 0.749 |
| AUC (medical-NEC) | 0.570 | 0.242 | 0.640 | 0.773 | 0.659 |
| AUC (surgical-NEC) | 0.747 | 0.026 | 0.837 | 0.695 | 0.820 |
P value 1 represents the P value from the DeLong test comparing the AUCs of the Efficientnet-b0 and Efficientnet-b3 models. P value 2 represents the P value from the DeLong test comparing the AUCs of the Efficientnet-b0 and Efficientnet-b2 models. AUC, area under the curve; NEC, necrotizing enterocolitis.
Interpretability
The heatmaps based on Grad-CAM and DFF visualize how the model interprets the images. Figure 4 shows four randomly selected cases. The red and yellow regions in the Grad-CAM heatmap highlight the areas on which the model focused when assigning the prediction category. On the right, the DFF heatmap shows the class predictions made by the neural network for broader regions and their corresponding confidence levels. The first case was a medical-NEC patient. The Grad-CAM heatmap illustrated that the model focused on the right abdomen, while the DFF heatmap indicated that the corresponding abdominal region was predicted as medical-NEC (purple) with a confidence level of 1.00. The second case was a surgical-NEC patient. The preoperative AR showed free air. During surgery, a perforation of the cecum and multiple segments of bowel necrosis were found. The Grad-CAM heatmap showed that the model focused on three areas, while the corresponding DFF heatmap indicated that most of the abdominal region was predicted as surgical-NEC. The third case was also a surgical-NEC patient, with a preoperative AR showing free air. During surgery, a perforation of the small intestine was found 30 cm from the ileocecal valve, along with a segment of necrotic bowel tissue measuring about 10 cm in length nearby. The Grad-CAM heatmap showed that the model focused on bilateral subdiaphragmatic areas, while the DFF heatmap indicated that the corresponding red region was predicted as surgical-NEC. The fourth case was a surgical-NEC patient. During surgery, a perforation at the base of the appendix and necrosis of the cecum and part of the ascending colon were found. The Grad-CAM heatmap suggested that part of the model’s attention was distracted by a tube, while another highlighted area corresponded to the purple region in the DFF heatmap, which was incorrectly predicted as medical-NEC.
AI-assisted doctors
The confusion matrices show the reading results of doctors with varying years of experience before and after AI assistance (Figure 5). The bar chart in Figure 6 displays their weighted average accuracy. The results indicated that, when provided only with ARs and no other clinical information, the prediction accuracy of the Efficientnet-b0 model exceeded that of all six doctors and was significantly higher than that of the 1-to-2-year residents. When the doctors were given the AI-generated Grad-CAM and DFF heatmaps for each AR, their second-round predictions showed improved weighted average accuracy, particularly among the junior doctors.
Discussion
In the radiological assessment of NEC, AR and ultrasound are the most commonly used modalities. Ultrasound has important supplementary diagnostic value in specific clinical scenarios, particularly in cases with diagnostic uncertainty or complex presentations. Although practices vary across institutions and clinicians, current evidence supports its use when ARs are inconclusive, especially when bowel perforation is clinically suspected but not confirmed, in which case ultrasound is the most frequently used adjunctive imaging modality (18). Despite advancements in abdominal ultrasound for assessing changes such as intestinal blood flow or ascites, AR remains the preferred diagnostic imaging method for neonates suspected of having NEC (19). Early and accurate identification of NEC using ARs, combined with other clinical signs and indicators, and timely prediction of the need for surgery are essential to improve the survival of neonates with NEC, potentially minimizing further intestinal damage.
A real-world issue is that an AR of a patient with suspected NEC taken at night is often first read by an on-call junior resident in neonatology or radiology, who may not be sensitive to some of the signs of NEC on ARs. A study assessing 463 radiology trainees found that only about 28% of them scored more than 7 out of 10 on a reading test, a threshold considered indicative of successful NEC diagnosis (20). The study also found that observer error was the dominant source of mistakes, meaning that most trainees failed to recognize enough of the imaging manifestations of NEC and thus risked missing opportunities for early clinical decision-making.
CNNs provide a cutting-edge method for enhancing early radiological diagnosis and surgical prediction of NEC (21). However, the ability of CNNs to identify the signs of NEC, and the extent to which these models can assist doctors in recognizing those signs in a timely manner, have not been fully explored. Therefore, a CNN-based AI model was developed to identify NEC and predict the need for surgery using a dataset containing 738 ARs. According to the study results, the trained Efficientnet-b0 model achieved the highest accuracy, with AUCs of 0.979, 0.924, and 0.966 for predicting non-NEC, medical-NEC, and surgical-NEC in the internal test sets, respectively, and AUCs of 0.883, 0.640, and 0.837 for the same categories in the external test set. The model was more skilled at identifying non-NEC and surgical-NEC than medical-NEC. This may be due to the lack of imaging abnormalities in stage I NEC patients, which makes differential diagnosis challenging. In recent single-center studies, Wu et al. (22) trained a Resnet18 classifier on ARs from 263 NEC cases, achieving AUCs of 0.886 and 0.876 in the validation and testing sets, respectively. Another study used a ResNet50 classifier trained on 494 ARs for NEC identification, achieving an AUC of 0.918 (23). These results indicate that CNNs are effective in AR analysis for NEC detection, performing well in image classification tasks and showing great potential.
However, due to their complex structure and the characteristics of deep learning algorithms, CNNs are often regarded as “black box” models. That is, while a model can provide highly accurate predictions, it is difficult to explain how it works internally (15). This may be one of the major trust barriers to applying CNN technology in clinical practice. In studies involving ARs of NEC, researchers typically use Grad-CAM visualization techniques to highlight the image regions the model focuses on when making classification decisions. Given that NEC ARs may exhibit multiple different imaging manifestations in various areas, such as pneumatosis intestinalis, portal venous gas, or pneumoperitoneum, clinicians need to carefully distinguish all signs. Our study further utilized DFF to visualize the model’s classification decisions and confidence levels for each region of the AR. By combining Grad-CAM and DFF heatmaps, clinicians can see not only the most salient regions but also the model’s predictions for secondary regions. With this visualized AI assistance, junior doctors can verify the imaging manifestations of each identified region step by step, reducing errors caused by a lack of reading experience. Our study showed that the diagnostic accuracy of the doctors increased by 2.0–26.2% compared with their initial readings of the original ARs. AI assistance was particularly beneficial for less experienced doctors with lower initial accuracy rates.
Due to the rarity of NEC cases, training data for models are often derived from single centers, which raises concerns about generalization ability. It is difficult to verify the robustness of a model by testing it only on datasets from a single center (24). Our study compared models trained on data from two centers with those trained on single-center data and found that the multi-center model demonstrated better generalization on external testing. This improvement may be attributed to several factors. First, single-center data may contain biases that affect the fairness and accuracy of the model; multi-center data can mitigate these biases by balancing them across sources, making the model fairer and more broadly applicable. Second, medical teams at different hospitals may follow different surgical timing strategies for NEC. The optimal timing for surgery is when ischemic necrosis of the bowel is present but before perforation or further deterioration occurs. Accurately pinpointing this moment is challenging for clinicians, who rely on clinical symptoms and signs, laboratory indicators, and imaging findings, and whose judgments vary with experience, introducing potential bias into surgical timing decisions. Training on data from multiple centers exposes the model to this variability in labeling and may therefore yield more robust surgical predictions.
There are several limitations in this study. First, the sample size is relatively limited; incorporating more ARs from additional medical centers will enhance the model’s accuracy and generalizability. Second, the cohort primarily includes preterm infants, excluding term and late-preterm newborns. Although preterm infants constitute the majority of NEC cases, disease presentation in term infants may differ, potentially limiting the model’s applicability across the full neonatal population. Third, congenital intestinal malformations were excluded based on the final clinical diagnosis, as their imaging features can mimic NEC and may confound model training. Therefore, this AI model is intended for use in patients for whom clinicians have already considered and excluded major congenital gastrointestinal anomalies through comprehensive clinical evaluation. Future work will expand the dataset to include term infants and develop multi-class models capable of differential diagnosis, improving real-world clinical utility.
Conclusions
The CNN-based AI model demonstrates an effective capability to differentiate NEC ARs and identify cases requiring surgical intervention. With the aid of visualized heatmaps, clinicians—particularly junior doctors—can substantially enhance their diagnostic accuracy when interpreting images.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2867/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2867/dss
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2867/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was approved by the Ethics Review Committee of Zhujiang Hospital of Southern Medical University, Guangzhou, Guangdong, China (approval No. 2024-KY-123-01). The requirement of written informed consent was waived because of the retrospective nature of the study. The other participating institutions were informed of and agreed to the study. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Clark RH, Gordon P, Walker WM, Laughon M, Smith PB, Spitzer AR. Characteristics of patients who die of necrotizing enterocolitis. J Perinatol 2012;32:199-204. [Crossref] [PubMed]
- Niño DF, Sodhi CP, Hackam DJ. Necrotizing enterocolitis: new insights into pathogenesis and mechanisms. Nat Rev Gastroenterol Hepatol 2016;13:590-600. [Crossref] [PubMed]
- Neu J, Walker WA. Necrotizing enterocolitis. N Engl J Med 2011;364:255-64. [Crossref] [PubMed]
- Epelman M, Daneman A, Navarro OM, Morag I, Moore AM, Kim JH, Faingold R, Taylor G, Gerstle JT. Necrotizing enterocolitis: review of state-of-the-art imaging findings with pathologic correlation. Radiographics 2007;27:285-305. [Crossref] [PubMed]
- Hollingsworth CL, Rice HE. The Duke Abdominal Assessment Scale: initial experience. Expert Rev Gastroenterol Hepatol 2010;4:569-74. [Crossref] [PubMed]
- Rehan VK, Seshia MM, Johnston B, Reed M, Wilmot D, Cook V. Observer variability in interpretation of abdominal radiographs of infants with suspected necrotizing enterocolitis. Clin Pediatr (Phila) 1999;38:637-43. [Crossref] [PubMed]
- Su CY, Tsai TY, Tseng CY, Liu KH, Lee CW. A Deep Learning Method for Alerting Emergency Physicians about the Presence of Subphrenic Free Air on Chest Radiographs. J Clin Med 2021;10:254. [Crossref] [PubMed]
- Cheng PM, Tran KN, Whang G, Tejura TK. Refining Convolutional Neural Network Detection of Small-Bowel Obstruction in Conventional Radiography. AJR Am J Roentgenol 2019;212:342-50. [Crossref] [PubMed]
- Kwon G, Ryu J, Oh J, Lim J, Kang BK, Ahn C, Bae J, Lee DK. Deep learning algorithms for detecting and visualising intussusception on plain abdominal radiography in children: a retrospective multicenter study. Sci Rep 2020;10:17582. [Crossref] [PubMed]
- Chen HY, Hsu BW, Yin YK, Lin FH, Yang TH, Yang RS, Lee CK, Tseng VS. Application of deep learning algorithm to detect and visualize vertebral fractures on plain frontal radiographs. PLoS One 2021;16:e0245992. [Crossref] [PubMed]
- Henderson RDE, Yi X, Adams SJ, Babyn P. Automatic Detection and Classification of Multiple Catheters in Neonatal Radiographs with Deep Learning. J Digit Imaging 2021;34:888-97. [Crossref] [PubMed]
- Gao W, Pei Y, Liang H, Lv J, Chen J, Zhong W. Multimodal AI System for the Rapid Diagnosis and Surgical Prediction of Necrotizing Enterocolitis. IEEE Access 2021;9:51050-64.
- Walsh MC, Kliegman RM. Necrotizing enterocolitis: treatment based on staging criteria. Pediatr Clin North Am 1986;33:179-201. [Crossref] [PubMed]
- Nowak F, Yung KW, Sivaraj J, De Coppi P, Stoyanov D, Loukogeorgakis S, Mazomenos EB. An investigation into augmentation and preprocessing for optimising X-ray classification in limited datasets: a case study on necrotising enterocolitis. Int J Comput Assist Radiol Surg 2024;19:1223-31. [Crossref] [PubMed]
- Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88. [Crossref] [PubMed]
- Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int J Comput Vis 2019;128:336-59.
- Collins E, Achanta R, Susstrunk S. Deep feature factorization for concept discovery. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, editors. Computer Vision - ECCV 2018: Proceedings of the 15th European Conference on Computer Vision; 2018 Sep 8-14; Munich, Germany. Berlin: Springer; 2018. p. 336-52.
- Ahle M, Ringertz HG, Rubesova E. The role of imaging in the management of necrotising enterocolitis: a multispecialist survey and a review of the literature. Eur Radiol 2018;28:3621-31. [Crossref] [PubMed]
- Hwang M, Tierradentro-García LO, Dennis RA, Anupindi SA. The role of ultrasound in necrotizing enterocolitis. Pediatr Radiol 2022;52:702-15. [Crossref] [PubMed]
- Sharma PG, Rajderkar DA, Sistrom CL, Slater RM, Mancuso AA. Bubbles in the belly: How well do radiology trainees recognize pneumatosis in pediatric patients on plain film? Br J Radiol 2022;95:20211101. [Crossref] [PubMed]
- van Druten J, Sharif MS, Chan SS, Cong C, Abdalla H. A Deep Learning Based Suggested Model to Detect Necrotising Enterocolitis in Abdominal Radiography Images. 2019 International Conference on Computing, Electronics & Communications Engineering (iCCECE); 2019 Aug; London, UK. 2019:118-23. doi:10.1109/iCCECE46942.2019.8941615.
- Wu Z, Zhuo R, Liu X, Wu B, Wang J. Enhancing surgical decision-making in NEC with ResNet18: a deep learning approach to predict the need for surgery through x-ray image analysis. Front Pediatr 2024;12:1405780. [Crossref] [PubMed]
- Weller JH, Scheese D, Tragesser C, Yi PH, Alaish SM, Hackam DJ. Artificial Intelligence vs. Doctors: Diagnosing Necrotizing Enterocolitis on Abdominal Radiographs. J Pediatr Surg 2024;59:161592. [Crossref] [PubMed]
- Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.

