Original Article

Artificial intelligence models assisting physicians in quantifying pancreatic necrosis in acute pancreatitis

Cheng-Xiang Lu1#, Jiali Zhou2,3#, Yong-Chang Feng4#, Si-Jun Meng5#, Xue-Ling Guo1, Wen-Song Su1, Tue Ngo4, Tse Hao Hsu4, Peng Lin4, James Huang4, Si-Tong Liu4, Manuel L. B. Palacio4, Wei-Lin Change4, Glen Qin4, Yi-Qun Hu2,3, Ling-Hui Zhan1

1Department of Intensive Care Unit, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, China; 2Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, China; 3National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China; 4California Science and Technology University, California, CA, USA; 5Jiying Technology Co., Ltd., Hong Kong, China

Contributions: (I) Conception and design: CX Lu, LH Zhan, J Zhou; (II) Administrative support: YC Feng, T Ngo, TH Hsu, P Lin; (III) Provision of study materials or patients: XL Guo, WS Su; (IV) Collection and assembly of data: CX Lu, J Zhou, YQ Hu, LH Zhan, SJ Meng; (V) Data analysis and interpretation: J Huang, ST Liu, MLB Palacio, WL Change; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Glen Qin, PhD. California Science and Technology University, 1601 McCarthy Boulevard, Milpitas, California, CA 95035, USA. Email: glen.qin@cstu.org; Yi-Qun Hu, MD, PhD. Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, 201 Hubin South Road, Xiamen 361004, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China. Email: hyq0826@xmu.edu.cn; Ling-Hui Zhan, MSc. Department of Intensive Care Unit, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, 201 Hubin South Road, Xiamen 361004, China. Email: oriental_power00@163.com.

Background: Acute pancreatitis (AP) is a potentially life-threatening condition characterized by inflammation of the pancreas, which can lead to complications such as pancreatic necrosis. The modified computed tomography severity index (MCTSI) is a widely used tool for assessing the severity of AP, particularly the extent of pancreatic necrosis. The accurate and timely assessment of the necrosis volume is crucial in guiding treatment decisions and improving patient outcomes. However, the current diagnostic process relies heavily on the manual interpretation of computed tomography (CT) scans, which can be subjective and prone to variability among clinicians. This study aimed to develop a deep-learning network model to assist clinicians in diagnosing the volume ratio of pancreatic necrosis based on the MCTSI for AP.

Methods: The datasets comprised retrospectively collected plain and contrast-enhanced CT scans from 144 patients (6 with scores of 0 points, 42 with scores of 2 points, and 65 with scores of 4 points) and National Institutes of Health contrast-enhanced CT scans from 45 patients with scores of 0 points. An improved V-Net (a fully convolutional neural network for volumetric medical image segmentation) model was developed to segment the pancreatic volumes (i.e., the whole pancreas, necrotic pancreatic tissue, and non-necrotic pancreatic tissue) and to quantify the corresponding volume ratios. The improvements included a three-stage up- and down-sampling structure adapted to the AP segmentation task, together with the selection of segmentation objects, loss function, and smoothing coefficients. The model interpretations were compared with those of clinicians with different levels of experience. The reference standard was manually segmented by a pancreatic radiologist. Accuracy, macro recall, and macro specificity were employed to compare the diagnostic efficacy of the model and the clinicians.

Results: In total, 144 patients (mean age: 44±13 years; 40 females, 104 males) were included in the study. Optimal training results were obtained using the necrotic pancreatic tissue and the whole pancreas as the input objects, and the combined dice loss with a smoothing coefficient of 500 as the loss function for training. The dice coefficient for the whole pancreas was 0.811, and that for the necrotic pancreatic tissue was 0.761. The performance of the artificial intelligence model was compared with that of the clinicians. The accuracy, macro recall, and macro specificity of the improved V-Net were 0.854, 0.850, and 0.923, respectively, which were all significantly higher than those of the senior and junior clinicians (P<0.05).

Conclusions: Our proposed model could improve the effectiveness of clinicians in diagnosing pancreatic necrosis volume ratios in clinical settings.

Keywords: Artificial intelligence (AI); computed tomography (CT); deep learning; convolutional neural network (CNN); pancreatic necrosis


Submitted Apr 25, 2024. Accepted for publication Nov 11, 2024. Published online Dec 24, 2024.

doi: 10.21037/qims-24-841


Introduction

Under the revised Atlanta classification (1), acute pancreatitis (AP) is classified as mild, moderately severe, and severe. While most cases are mild, 8.8% of cases progress to severe acute pancreatitis (SAP) (1). SAP often leads to peripheral pancreatic tissue necrosis and multiple organ failure (2), and has a mortality rate that can be as high as 28% in cases of persistent organ failure (1). This condition is typically associated with a poor prognosis (3). The early diagnosis of SAP and corresponding care and treatment are critical to effectively prevent poor patient outcomes (4).

The modified computed tomography severity index (MCTSI) is an effective tool for evaluating the severity of AP (5), with higher MCTSI scores indicating a higher incidence of complications (6,7). The volume of pancreatic necrosis is assessed by the MCTSI, and classified as follows: 0% (0 points), less than 30% (2 points), and more than 30% (4 points) (8). It is challenging to distinguish necrotic from non-necrotic pancreatic tissue, and such assessments need to be made by well-trained experts in the specialty. However, even for the experts, the assessment typically yields only qualitative information, often based on visual analysis from medical imaging, such as CT scans (9).

Deep-learning systems can independently extract features for large-scale operations (10), and accurately and efficiently analyze images through image registration technology without the guidance of experts. However, the shape of the pancreas is irregular and highly variable. Further, computed tomography (CT) scans of the pancreas lack sharp contrast, and do not normally have clear and smooth borders. Thus, the segmentation of this organ is difficult (11-13).

Previous studies have attempted to improve the effectiveness of deep-learning systems in pancreas segmentation. Farag et al. used a deep convolutional random-forest network based on empirical probability statistics to improve pancreatic segmentation efficiency (14). Karasawa et al. used a vascular structure map to assist in the location of pancreatic tissues and organs (11). Li et al. employed the sliding-window algorithm for pancreatic segmentation to obtain more semantic features by highlighting single sagittal imaging data before CT three-dimensional (3D) reconstruction (12). To address the smooth continuity issues between different sagittal, coronal, and transverse planes, Cai et al. used recurrent neural networks to integrate adjacent sections and improve image segmentation accuracy (15). However, these studies only focused on the segmentation of the pancreas and surrounding tissues.

In recent years, research on the pancreas has shifted toward clinical practice, and 3D convolutional segmentation networks have become a focal point of this research (16,17). To analyze the relationship between pancreatic volume and age, gender, height, and weight, Cai et al. (18) developed a pancreatic segmentation model that applied two 3D U-Nets in cascade, which had an average dice similarity coefficient of 0.94 in the test data set. Weston et al. used convolutional neural networks (CNNs) to improve the speed and accuracy of radiotherapy dose prediction for organs at risk, and established a 3D U-Net CNN capable of segmenting the whole abdomen (19). The accuracy of their method in the segmentation of the pancreas was 0.79 (19).

Unlike traditional two-dimensional (2D) segmentation networks, the V-Net operates on 3D volumes, enabling it to capture spatial relationships across slices, which is essential for the accurate assessment of pancreatic necrosis volume. The use of residual connections in the V-Net helps mitigate the vanishing gradient problem and makes it more suitable for training deeper networks. This feature was particularly beneficial in our context, where the precise segmentation of necrotic and non-necrotic tissue was necessary. Further, unlike other architectures, such as the U-Net or 2D convolutional networks, the V-Net can efficiently handle variations in organ size and shape, which is a common challenge in pancreatic segmentation.

In this study, we proposed an automated CT pancreatic tissue segmentation method for assessing the severity of AP, which applies an improved V-net (20) to the diagnostic process. First, plain CT was used to obtain the initial pancreatic imaging features of the early-stage AP patient. Contrast-enhanced CT was then performed 72 hours later to acquire the pancreatic imaging features that can reflect the severity of the disease (21). The improved V-net was then used to segment the pancreatic imaging features over the two time points. After segmentation, the volume was calculated using the distance between pixels and the number of pixels. Finally, the severity of the AP patient was assessed by quantitatively comparing the pancreatic tissues from the two time points. Experiments on clinical AP data showed the effectiveness of the proposed method. The assessment results of the automated CT pancreatic tissue segmentation method were compared with the external test set assessments of three clinicians with 15 years of experience and two clinicians with 3 years of experience to provide more reference for the model’s application in clinical practice. We present this article in accordance with the TRIPOD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-841/rc).


Methods

This study was approved by the Ethics Committee of Zhongshan Hospital of Xiamen University (No. 2022–137), and was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The requirement of informed consent from individual patients was waived because of the retrospective design of the study. Figure 1 summarizes the process used to construct the data sets. The following end-to-end workflow was implemented to analyze the CT images (Figure 2): image preprocessing; CNN segmentation of the pancreas (i.e., the whole pancreas, necrotic pancreatic tissue, and non-necrotic pancreatic tissue); and determination of AP severity based on the ratio of the total number of pixels in the three categories obtained by segmentation.

Figure 1 Flowchart showing inclusion and exclusion for the local data sets and NIH pancreas-CT sets. AP, acute pancreatitis; CT, computed tomography; NIH, National Institutes of Health; 3D, three-dimensional.
Figure 2 CT scan image (A) before and (B) after equalization. CT, computed tomography.

Dataset

Local data sets

Non-contrast-enhanced CT scans and enhanced intravenous CT scans (obtained ∼70 s after intravenous contrast injection in the portal-venous phase) of 144 patients with AP diagnosed between June 2015 and April 2022 were extracted from the imaging archive of Zhongshan Hospital of Xiamen University. To be eligible for inclusion in the study, the patients had to meet at least two of the following inclusion criteria: (I) have typical clinical symptoms of AP and persistent abdominal pain; (II) have serum amylase and/or lipase levels greater than three times the upper limit of normal; and (III) have abdominal ultrasound and/or CT images showing changes characteristic of pancreatitis (22). Additionally, the patients also had to meet the following inclusion criteria: (I) be experiencing the first onset of AP; and (II) be aged ≥18 years. The CT scan parameters used in the study were as follows: a section thickness of 1.5 to 2.5 mm, a pitch of 1.2 mm, and an image resolution of 512×512 pixels.

NIH data set

To balance the sample distribution and increase the sample size, CT scan images of 45 patients from a National Institutes of Health (NIH) dataset were also added to our dataset. The NIH Pancreas-CT set (23) was provided by the NIH Clinical Center. The NIH Pancreas-CT set comprised 82 abdominal contrast-enhanced 3D CT scans (obtained ~70 s after intravenous contrast injection in the portal-venous phase) from 53 male and 27 female subjects. The CT scans had a resolution of 512×512 pixels, with pixel sizes that varied depending on the specific scan, and a slice thickness ranging from 1.5–2.5 mm.

Data set partitioning

As Figure 1 shows, the local dataset, comprising local patient data, was partitioned in a ratio of 7:3 to create both the internal training/validation/test sets and the external test set. Specifically, the internal sets consisted of data from 103 AP patients and 45 subjects from the NIH Pancreas-CT dataset, divided in an 8:1:1 ratio into training, validation, and internal test sets, while the external test set consisted of 41 AP patients.
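A minimal sketch of this patient-level partitioning is given below; the random seed and the rounding behavior are assumptions, as the text does not specify them.

```python
import random

def split_patients(local_ids, nih_ids, seed=42):
    """Patient-level partitioning sketch (seed value is an assumption).

    Local cases are split ~7:3 into an internal pool and an external test
    set; the internal pool plus the NIH cases is then split 8:1:1 into
    training, validation, and internal test sets.
    """
    rng = random.Random(seed)
    local = local_ids[:]
    rng.shuffle(local)

    n_internal = round(0.7 * len(local))
    internal_pool = local[:n_internal] + list(nih_ids)
    external_test = local[n_internal:]

    rng.shuffle(internal_pool)
    n = len(internal_pool)
    n_train, n_val = round(0.8 * n), round(0.1 * n)
    return {
        "train": internal_pool[:n_train],
        "val": internal_pool[n_train:n_train + n_val],
        "internal_test": internal_pool[n_train + n_val:],
        "external_test": external_test,
    }
```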

Ground-truth labeling

The local data were manually labeled by two readers (L.H.Z and C.X.L., each with 15 years of experience in interpreting CT scans of the pancreas) using ITK-SNAP (24), and verified by an experienced radiologist. The following three areas of the pancreas were segmented: the whole pancreas, necrotic tissue in the pancreas, and non-necrotic tissue in the pancreas. For each patient, a non-necrotic pancreatic tissue mask outlined by CT enhancement was combined with a necrotic pancreatic tissue mask outlined by CT enhancement in the same file; and a whole pancreas mask of the plain CT scan was combined with a necrotic pancreatic tissue mask outlined by CT enhancement in the same file. Each patient file contained these two file masks, which were used to create a four-dimensional (4D) mask in Neuroimaging Informatics Technology Initiative format.
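As an illustration of how the two per-patient mask volumes might be stacked into one 4D NIfTI file with nibabel, a minimal sketch is shown below; the file names, pairing, and stacking order are hypothetical, not the authors' exact convention.

```python
import numpy as np
import nibabel as nib

def build_4d_mask(mask_path_a, mask_path_b, out_path):
    """Stack two 3D binary masks into one 4D NIfTI volume.

    mask_path_a / mask_path_b: paths to the two per-patient mask files
    (e.g., a whole-pancreas mask and a necrotic-tissue mask); the naming
    here is illustrative only.
    """
    a = nib.load(mask_path_a)
    b = nib.load(mask_path_b)
    vol_a = a.get_fdata().astype(np.uint8)
    vol_b = b.get_fdata().astype(np.uint8)
    assert vol_a.shape == vol_b.shape, "masks must share the same grid"

    # Stack along a fourth dimension: shape (X, Y, Z, 2).
    combined = np.stack([vol_a, vol_b], axis=-1)

    # Reuse the affine of the first mask so spatial metadata is preserved.
    nib.save(nib.Nifti1Image(combined, a.affine), out_path)

# Example (hypothetical file names):
# build_4d_mask("pat001_whole.nii.gz", "pat001_necrosis.nii.gz", "pat001_mask4d.nii.gz")
```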

Data preprocessing

Before the CT scan images were fed into the model, each image was equalized using the equalizeHist function in OpenCV (Intel). The histogram equalization was based on the following cumulative distribution function (CDF) (25) transformation:

h(v) = \mathrm{round}\left( \frac{\mathrm{cdf}(v) - \mathrm{cdf}_{\min}}{(M \times N) - \mathrm{cdf}_{\min}} \right) \times (L - 1)

where v is the old gray-scale value that is mapped into the new gray-scale value h; cdfmin is the minimum non-zero value of the CDF; M × N is the image size; and L is the number of new gray levels. Figure 2 compares a CT scan image before and after equalization. We found that applying image equalization improved the model performance significantly, as it made images from various sources more comparable.
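A minimal sketch of this per-slice equalization with OpenCV is shown below; cv2.equalizeHist requires 8-bit single-channel input, so the CT slices are first rescaled with an assumed soft-tissue window that the paper does not specify.

```python
import cv2
import numpy as np

def equalize_ct_slice(hu_slice, lo=-150.0, hi=250.0):
    """Histogram-equalize one CT slice.

    hu_slice: 2D array of Hounsfield units.
    lo/hi: illustrative soft-tissue window used to map the slice to 8-bit
    before cv2.equalizeHist applies the CDF-based remapping shown above.
    """
    clipped = np.clip(hu_slice, lo, hi)
    scaled = ((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)
    return cv2.equalizeHist(scaled)

def equalize_volume(volume_hu):
    """Apply slice-wise equalization to a 3D CT volume of shape (Z, Y, X)."""
    return np.stack([equalize_ct_slice(s) for s in volume_hu], axis=0)
```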

Various augmentation techniques, including random rotation, shift, and zoom in/out, all within the slice plane, were applied during the training of the model. These techniques were implemented using the ImageDataGenerator from TensorFlow (Google). To preserve the relative locations of the organ structures, we did not apply shear or flip transforms. In one experiment, we cropped the image by about 50%, which still included the whole pancreas, and used the cropped image for training. In addition to the above 2D augmentations, an augmentation along the direction perpendicular to the slice plane was also applied: in each training step, a random contiguous section of data planes was selected for each patient (e.g., 32 of 64 data planes). Specifically, for each patient's images, a z-direction offset was first randomly selected, and a predetermined number of consecutive images was then extracted.
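The augmentation scheme could be sketched as follows with TensorFlow's ImageDataGenerator; the parameter ranges are illustrative assumptions, the 32-of-64 slab size is taken from the example above, and applying one shared in-plane transform per volume (to keep slices consistent) is also an assumption.

```python
import numpy as np
import tensorflow as tf

# In-plane augmentation: random rotation, shift, and zoom only.
# Shear and flips are deliberately excluded to preserve the relative
# positions of abdominal organs. Parameter ranges are illustrative.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.1,
)

def augment_volume(volume, n_slices=32):
    """Augment one patient volume of shape (Z, H, W).

    1. Randomly pick a z-offset and take a contiguous slab of n_slices
       planes (e.g., 32 of 64), the through-plane augmentation above.
    2. Apply one shared in-plane transform to every slice in the slab.
    """
    z = volume.shape[0]
    z0 = np.random.randint(0, z - n_slices + 1)
    slab = volume[z0:z0 + n_slices]

    # Draw one transform per volume so all slices stay mutually consistent.
    params = augmenter.get_random_transform(slab.shape[1:] + (1,))
    return np.stack(
        [augmenter.apply_transform(s[..., None], params)[..., 0] for s in slab],
        axis=0,
    )
```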

Model

Our models were based on the V-Net architecture. As Figure 3 shows, our final model used non-contrast-enhanced CT scan images with a resolution of 256×256×64 and a batch size of 3 as the input, and a mask with a resolution of 256×256×64×2 as the output. A few model design variables had to be optimized for the specific task of differentiating necrotic from non-necrotic pancreatic tissue. This was guided by our intuitive understanding of how necrotic regions can be distinguished from non-necrotic regions in the pancreas. We conjectured that low-level features, such as intensity and local morphology, were more important than high-level features, such as overall shape, for two reasons. First, as necrotic areas in the pancreas may liquefy, the non-necrotic areas can have any shape (26). Second, on contrast-enhanced CT, necrotic areas of the pancreas have a different contrast compared to non-necrotic areas. This approach is very different from that adopted in some common segmentation tasks; nevertheless, it improved our model and thus validated our conjecture. However, this does not mean that pancreas segmentation should be based on local information only. As discussed below, the overall picture, including the relative positions of the different organs, may also be very important.

Figure 3 Workflow diagram of the model. The model comprised three sub-sampling and three up-sampling steps. In the result processing phase, masks of the two channels were reconstructed, after which the ratio of pixels of necrotic tissue to healthy tissue was calculated.

Optimizing model structures

The first model design variable to be optimized in the V-Net model was the number of stages. We focused more on the low-level information (i.e., more neurons in the shallow stages). However, as Figure 4A,4B show, increasing the number of stages from 4 to 5 did not improve the model performance. The next variable to be optimized was the kernel size for the CNN within each stage of the V-Net model. Kernels such as 7×7×7, 5×5×5, 3×3×3, and 1×3×3 (z/y/x) were explored. The model with the smallest kernel (1×3×3) performed the best, as the small kernel allowed the network to remain at high resolution. For the rest of the study, this parameter was fixed at 1×3×3, given that it demonstrated the best performance. The stride for this CNN was 1×1×1 to maintain the high resolution and enable the residual network of the V-Net model to better capture and preserve spatial information throughout the layers. Conversely, the kernel for the CNN between stages was either 2×2×2 or 1×2×2, and the performances of both kernel sizes were found to be similar. For the rest of the study, this parameter was fixed at 1×2×2, with the stride set equal to the kernel size. Notably, in the "z" direction, the kernel size and stride were always "1" for the whole model. This means that there was no convolution in the "z" direction, and our model was a pseudo-2D model. Another variable explored was the number of CNNs in each stage. The number of feature maps (filters) was doubled from one stage to the next. The last variable explored was the number of feature maps for the first stage. We explored a range from 8 to 64, and finally settled on 16 feature maps, as increasing the number beyond 16 made the network heavy and did not improve the dice coefficient.

Figure 4 Dice coefficient vs. epochs. (A) Four stages; (B) five stages.
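The stage configuration described above can be sketched roughly as follows in PyTorch (the framework used in this study); the number of in-stage convolutions shown, the activation choice, and the initial 1→16-channel input convolution mentioned in the comments are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One encoder stage of the pseudo-2D V-Net variant described above.

    In-stage convolutions use 1x3x3 kernels with stride 1 (no mixing along
    z), a residual connection adds the stage input back to its output, and
    down-sampling between stages uses a 1x2x2 convolution with stride 1x2x2
    that also doubles the number of feature maps.
    """
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                      stride=1, padding=(0, 1, 1)),
            nn.PReLU(),
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                      stride=1, padding=(0, 1, 1)),
            nn.PReLU(),
        )
        # Down-sampling: in-plane resolution halves, z is untouched.
        self.down = nn.Conv3d(channels, channels * 2,
                              kernel_size=(1, 2, 2), stride=(1, 2, 2))

    def forward(self, x):
        skip = self.convs(x) + x          # residual connection
        return self.down(skip), skip      # skip feeds the matching decoder stage

# The first stage starts with 16 feature maps (an initial convolution, not
# shown, would map the single-channel CT input to 16 channels), and each
# deeper stage doubles them: Stage(16), Stage(32), Stage(64).
```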

Other optimizing strategies

As 3D segmentation can be treated as voxel-wise classification, early studies used binary cross entropy as the loss function (27). However, this can introduce bias toward the more dominant class, especially in medical image segmentation, where the object to be detected is often much smaller than the background, which leads to under-detection. Alternative approaches have been proposed, such as loss functions based on the dice coefficient (20). In this study, various loss functions based on dice loss (DL) were explored. If there is only one class of object to be segmented from the background, the DL can be defined as:

\mathrm{DL} = 1 - \frac{2|X \cap Y| + s}{|X| + |Y| + s}

where |X| is the number of voxels that are labeled as part of the to-be-detected object; |Y| is the number of voxels that are predicted to be part of the object; |X ∩ Y| is the number of voxels that are both labeled and predicted as part of the object; and s is a small smoothing term used to avoid division by zero. As this study sought to separate two classes (non-necrotic pancreatic tissue and necrotic pancreatic tissue; or necrotic pancreatic tissue and the whole pancreas), the loss function had to include information about both classes. Thus, we used a combined DL (28) function that we defined as:

\mathrm{Combined\ DL} = 1 - \frac{2\left(|X_1 \cap Y_1| + |X_2 \cap Y_2|\right) + s}{|X_1| + |Y_1| + |X_2| + |Y_2| + s}

where |X_1| is the number of voxels labeled as class 1 (e.g., the non-necrotic region); |Y_1| is the number of voxels predicted as class 1; and |X_1 ∩ Y_1| is the number of voxels that are both labeled and predicted as class 1; while |X_2|, |Y_2|, and |X_2 ∩ Y_2| are the number of voxels labeled as class 2 (e.g., the whole pancreas region), the number of voxels predicted as class 2, and the number of voxels that are both labeled and predicted as class 2, respectively. If the object sizes of the two classes are very different, the combined DL can introduce bias toward the larger object class. Another loss function that can include information about both classes is the average DL (29), which we defined as:

\mathrm{Average\ DL} = 1 - 0.5\left\{ \frac{2|X_1 \cap Y_1| + s}{|X_1| + |Y_1| + s} + \frac{2|X_2 \cap Y_2| + s}{|X_2| + |Y_2| + s} \right\}

where |X_1|, |Y_1|, |X_1 ∩ Y_1|, |X_2|, |Y_2|, and |X_2 ∩ Y_2| have the same definitions as those provided above; and s represents the smoothing coefficient.
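For illustration, the combined and average DL formulas above could be implemented in PyTorch roughly as follows; the tensor shapes, channel assignment, and default smoothing value are assumptions made for the sketch.

```python
import torch

def combined_dice_loss(pred, target, s=500.0):
    """Combined DL over two object channels, as in the formula above.

    pred, target: tensors of shape (B, 2, D, H, W); channel 0 = object 1,
    channel 1 = object 2 (e.g., necrotic tissue and whole pancreas).
    s: smoothing coefficient.
    """
    dims = (0, 2, 3, 4)
    inter = (pred * target).sum(dim=dims)               # |X1∩Y1|, |X2∩Y2|
    sizes = pred.sum(dim=dims) + target.sum(dim=dims)   # |X1|+|Y1|, |X2|+|Y2|
    return 1.0 - (2.0 * inter.sum() + s) / (sizes.sum() + s)

def average_dice_loss(pred, target, s=500.0):
    """Average DL: the mean of the two per-class dice terms."""
    dims = (0, 2, 3, 4)
    inter = (pred * target).sum(dim=dims)
    sizes = pred.sum(dim=dims) + target.sum(dim=dims)
    per_class_dice = (2.0 * inter + s) / (sizes + s)
    return 1.0 - per_class_dice.mean()
```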

In this study, we found that the smoothing parameter s played an important role in the training process; thus, its value had to be selected carefully, especially when the detected object could be "missing"; for example, if the pancreas of a patient with severe necrotizing pancreatitis was completely liquefied without any part remaining. In such cases, if the smoothing parameter is too small, the loss function cannot provide a meaningful gradient for training the model (30). Conversely, if the smoothing value is too large, the DL function loses sensitivity once the number of prediction errors is already relatively low. Figure 5 plots the DL against the prediction errors for various smoothing values. A novel strategy for building the loss function is to use two dice coefficients calculated with two different smoothing values; with this strategy, the loss function has a meaningful gradient over a wider range of errors. At a resolution of 256×256×64, the volume of a pancreas can range from 10,000 to 100,000 voxels. To ensure a fair comparison between models, this study used the 4D mask file of necrotic and non-necrotic pancreatic tissue and the contrast-enhanced CT image file of each patient for training and for calculating the dice score for evaluation.

Figure 5 Dice loss vs. prediction errors with different smoothing parameters.

For each patient, the whole pancreas was separated into non-necrotic and necrotic parts. In earlier attempts, the machine-learning model tried to identify the non-necrotic and necrotic pancreatic tissue directly. An alternative strategy is to identify the whole pancreas and the necrotic tissue: the whole pancreas comprises both the non-necrotic and necrotic tissue, so the non-necrotic tissue can be determined by subtracting the necrotic tissue from the whole pancreas. In this study, we used the whole pancreas as one object and the necrotic pancreatic tissue as the other object in the DL calculation. To ensure a fair comparison between the different strategies, when calculating the dice coefficient for evaluating the model performance, the smoothing parameter was fixed at 500, which was less than 5% of the pancreas volume.
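As a rough illustration of this object strategy, the sketch below derives the non-necrotic tissue by subtraction and maps the necrosis volume ratio to the MCTSI 0/2/4-point categories; the handling of the exact 30% boundary and the function name are assumptions.

```python
import numpy as np

def necrosis_ratio_and_score(whole_mask, necrosis_mask):
    """Derive non-necrotic tissue and the MCTSI necrosis sub-score.

    whole_mask, necrosis_mask: binary 3D arrays predicted by the model
    (whole pancreas and necrotic tissue). Non-necrotic tissue is obtained
    by subtraction, and the ratio is necrotic voxels / whole-pancreas
    voxels. Score mapping follows the MCTSI: 0% -> 0 points, <30% -> 2,
    otherwise 4 (placement of exactly 30% is assumed here).
    """
    whole = whole_mask.astype(bool)
    necrosis = necrosis_mask.astype(bool) & whole   # keep necrosis inside the pancreas
    non_necrotic = whole & ~necrosis                # subtraction strategy

    ratio = necrosis.sum() / max(whole.sum(), 1)
    if ratio == 0:
        score = 0
    elif ratio < 0.30:
        score = 2
    else:
        score = 4
    return non_necrotic, ratio, score
```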

Statistical analysis

Experimental environment

The hardware for this study was configured as follows: experimental platform: Ubuntu 20.04.1 LTS; processor: Intel Xeon E5-2620 (version 4), 2.10 GHz with 4 NVIDIA TITAN Xp-12GB; development environment: Python 3.7.5; and deep-learning framework: PyTorch 1.12.1, using CUDA 11.3/CUDNN 8.2 for image acceleration.

SPSS statistical software (version 26.0; SPSS, Chicago, IL, USA) was used to conduct the statistical analysis of the participants' information. The age data are expressed as the mean ± standard deviation. In addition, the Chi-square test was used to examine the differences between the groups. A P value <0.05 was considered statistically significant.

Model performance

To evaluate the segmentation performance, the combined dice similarity coefficient, the average dice similarity coefficient, and the percentage difference in volume (V) were calculated. The automatically segmented region-of-interest (ROI) voxels from the output label map and the ground-truth label map, scaled by the voxel size v_i, were used to calculate the volume as follows:

V = \sum_{i \in \mathrm{ROI}} v_i
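A minimal sketch of this volume calculation using nibabel is shown below, assuming the segmentations are stored as NIfTI files; the millilitre conversion and function names are illustrative.

```python
import numpy as np
import nibabel as nib

def roi_volume_ml(mask_path):
    """Physical ROI volume per V = sum of voxel volumes over the ROI.

    mask_path: binary segmentation in NIfTI format (file naming is
    illustrative). Voxel volume = product of the voxel spacings (mm^3);
    the result is converted to millilitres.
    """
    img = nib.load(mask_path)
    mask = img.get_fdata() > 0
    voxel_mm3 = float(np.prod(img.header.get_zooms()[:3]))
    return mask.sum() * voxel_mm3 / 1000.0

def volume_percentage_difference(v_pred, v_true):
    """Percentage difference in volume between prediction and ground truth."""
    return 100.0 * (v_pred - v_true) / v_true
```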

Artificial intelligence (AI)-based measures compared with clinical assessments

The performance of our AI model was compared with that of six clinicians (denoted as C1–C6) using the external test set. C1 and C2 had 3–5 years of experience each, while C3, C4, and C5 had 12–15 years of experience each. Of the six clinicians, five were clinical doctors, and one was a radiologist who generated the ground truth. Each clinician independently labeled each patient in the external test set (by selecting categorical options), did not have access to the results predicted by the AI model, and was blind to the other clinicians’ interpretations. Their results were compared with the ground truth to evaluate the reliability of the AI-based explanations. The volume ratio and Bland-Altman consistency evaluation were used to measure the reliability of the AI-based explanations. The outcome measures included accuracy, macro sensitivity, and macro specificity. SPSS statistical software was used for the statistical analysis. The data are expressed as the mean ± standard deviation. The data of the junior clinicians, senior clinicians, and the AI-based system were compared. Other data collected included the time taken to evaluate 927 images in 41 patients.
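For reference, the outcome measures could be computed from the categorical MCTSI necrosis labels roughly as follows with scikit-learn; macro specificity is derived from the confusion matrix because scikit-learn has no direct function for it, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

def rater_metrics(y_true, y_pred, labels=(0, 2, 4)):
    """Accuracy, macro recall (sensitivity), and macro specificity.

    y_true, y_pred: per-patient necrosis scores (0, 2, or 4 points),
    for one rater or for the model.
    """
    acc = accuracy_score(y_true, y_pred)
    macro_recall = recall_score(y_true, y_pred, labels=list(labels),
                                average="macro", zero_division=0)

    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    specs = []
    for k in range(len(labels)):
        # One-vs-rest specificity: TN / (TN + FP) for class k.
        tn = cm.sum() - cm[k, :].sum() - cm[:, k].sum() + cm[k, k]
        fp = cm[:, k].sum() - cm[k, k]
        specs.append(tn / (tn + fp) if (tn + fp) else 0.0)
    return acc, macro_recall, float(np.mean(specs))
```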


Results

Data set information

The total data set comprised 5,706 image pairs, of which 2,952 were plain CT scan images, and 2,754 were enhanced CT scans, from 144 patients (mean age: 44±13 years; 40 females, 104 males) (Table 1). The training set comprised 3,902 image pairs from 72 patients (mean age: 44±13 years; 18 females, 54 males); the validation set comprised 502 image pairs from 11 patients (mean age: 44±12 years; 3 females, 8 males); and the internal test set comprised 1,302 image pairs from 20 patients (mean age: 47±13 years; 6 females, 14 males). The external test set comprised 927 image pairs from 41 patients (mean age: 43±13 years; 13 females, 28 males; 6 mild, 15 moderate, and 20 severe cases of AP).

Table 1

Characteristics of included AP patients

Information | Training set (n=72) | Validation set (n=11) | Internal test set (n=20) | External test set (n=41) | P value
Age (years), mean ± SD [range] | 44±13 [23–82] | 44±12 [26–62] | 47±13 [25–77] | 43±13 [19–70] | 0.771
Sex, N; age mean ± SD [range]
   Male | 54; 43±14 [23–82] | 8; 39±9 [26–54] | 14; 46±14 [35–68] | 28; 41±13 [23–70] | 0.587
   Female | 18; 46±13 [24–68] | 3; 57±4 [54–62] | 6; 48±12 [35–68] | 13; 46±13 [29–62] | 0.526

AP, acute pancreatitis; SD, standard deviation.

AI model validation

Validation of different strategies

The model performance results are summarized in Table 2. Notably, both the dice coefficient of the necrotic pancreatic tissue and the dice coefficient of the whole pancreas were significantly improved compared to those achieved when the non-necrotic pancreatic tissue and necrotic pancreatic tissue were used as the two objects in the loss function.

Table 2

Model performance training and testing results

Dice coefficient by loss function (smoothing) | Training: whole pancreas | Training: necrotic tissue | Training: non-necrotic tissue | Testing: whole pancreas | Testing: necrotic tissue | Testing: non-necrotic tissue
Non-necrotic + necrotic tissue as segmentation objects
   Combined dice loss (smooth =1) | – | 0.694 | 0.708 | – | 0.580 | 0.184
   Combined dice loss (smooth =500) | – | 0.688 | 0.733 | – | 0.551 | 0.285
   Combined dice loss (smooth =50/5,000) | – | 0.674 | 0.717 | – | 0.570 | 0.262
   Average dice loss (smooth =1) | – | 0.729 | 0.017 | – | 0.781 | 0.008
   Average dice loss (smooth =500) | – | 0.679 | 0.895 | – | 0.535 | 0.407
   Average dice loss (smooth =50/5,000) | – | 0.681 | 0.878 | – | 0.515 | 0.428
Whole pancreas + necrotic tissue as segmentation objects
   Combined dice loss (smooth =500) | 0.946 | 0.961 | – | 0.811 | 0.761 | –
   Average dice loss (smooth =500) | 0.939 | 0.962 | – | 0.773 | 0.778 | –

–, the tissue was not used as a segmentation object in that strategy.

The performance of the model’s best strategy in the task

For the model in which the necrotic pancreatic tissue and the whole pancreas were used as the input objects, and the combined DL was used as the loss function, the volume ratio of the necrotic pancreatic tissue to the whole pancreas is shown in Figure 6. For the training dataset, the predictions matched the ground truth closely. For the validation and testing datasets, the model predictions and ground truths matched well in patients with severely necrotic pancreatic tissue and in patients with entirely non-necrotic pancreatic tissue. Notably, many patients had entirely non-necrotic pancreatic tissue, and they were all diagnosed correctly by the model. Conversely, for the two patients whose pancreas was only slightly necrotic (who received scores of 2 points), the prediction errors were relatively higher.

Figure 6 The performance of the model’s best strategy in the task. (A) Volume ratio of the necrotic pancreatic tissue to the whole pancreas in the model, with the horizontal axis representing the true necrosis volume ratio and the vertical axis representing the predicted necrosis volume ratio. (B) The predicted performance for a patient whose pancreas was slightly necrotic in validation and testing.

AI-based measures compared with clinical assessments

AI generated explanations that often aligned with clinicians’ explanations

The correlation of the necrotic volume ratios between the model predictions and the ground truth was high (Figure 7A). The Bland-Altman analyses showed that the model interpretations were in good agreement with the clinicians' interpretations (Figure 7B).

Figure 7 The artificial intelligence model generated explanations that often aligned with the clinicians’ explanations. (A) Volume ratio of the necrotic pancreatic tissue to the whole pancreas with improved V-net for 41 CT scans in patients. (B) Bland-Altman plots for 41 CT scans in patients with the volume of pancreatic necrosis. CT, computed tomography; Ioa, index of agreement.
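For readers reproducing the Bland-Altman analysis above, a minimal sketch for paired necrosis volume ratios is given below; the bias and 1.96-SD limits of agreement are the standard quantities plotted, and the function name is illustrative.

```python
import numpy as np

def bland_altman(pred_ratios, ref_ratios):
    """Bland-Altman statistics for paired necrosis volume ratios.

    Returns the mean difference (bias) and the 95% limits of agreement
    (bias ± 1.96 × SD of the differences).
    """
    pred = np.asarray(pred_ratios, dtype=float)
    ref = np.asarray(ref_ratios, dtype=float)
    diff = pred - ref
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```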

Comparing AI to clinicians in the diagnosis of volume of pancreatic necrosis

As the confusion matrices show (Figure 8), the accuracy, macro recall, and macro specificity of the two clinicians with 3–5 years of experience were 0.497±0.048, 0.55±0.016, and 0.75±0.002, respectively, and those of the three clinicians with 12–15 years of experience were 0.586±0.042, 0.542±0.031, and 0.767±0.028, respectively. The corresponding values for the improved V-Net were 0.854, 0.850, and 0.923. The results revealed a statistically significant difference in macro recall and macro specificity between the junior clinicians and the AI model (P<0.05). Additionally, the differences between the senior clinicians and the AI model were all significant (P<0.05). However, none of the differences between the junior and senior clinicians reached statistical significance. Comparing the time taken to complete the image review, the average diagnostic time for the five doctors who reviewed the 927 images was 2,196 s, while that of the AI model was only 13 s.

Figure 8 Comparison of the manual evaluation approach and the artificial intelligence-based assessment approach using the confusion matrix of five clinicians [two with 3–5 years of experience (D,E) and three with 12–15 years of experience (A-C)] and the improved V-net (F), respectively.

Discussion

The MCTSI is a useful scale for assessing the severity of AP; however, its use is limited by the experience of observers. In this study, we developed an automated CT pancreatic segmentation AI model to assess the severity of AP. This model may facilitate quantitative studies on the use of the MCTSI in medical imaging to assess the severity of AP.

We found that the necrotic pancreatic tissue dice coefficient and whole pancreas dice coefficient were significantly improved compared to those obtained when using the non-necrotic and necrotic pancreatic tissue as the two objects in the loss function. Properly framing the segmentation task and clearly defining the target tissues were critical in achieving these improvements (31). Separating necrotic pancreatic tissue from the whole pancreas is much easier than separating non-necrotic tissue from necrotic tissue. It is still not clear why the model is better able to distinguish between necrotic pancreatic tissue and the whole pancreas than necrotic and non-necrotic pancreatic tissue. It may be that a deep learning model based on the whole pancreas is more consistent, as all samples include the pancreas, but not all samples include non-necrotic tissue in the pancreas.

In clinical terms, the early identification and management of AP of varying levels of severity is crucial to control the progression of the disease. However, the reproducible and accurate diagnosis of the severity of AP remains a challenge in clinical diagnosis and treatment. Our study showed that the AI model had higher accuracy, macro recall, and macro specificity than the clinicians in evaluating pancreatic necrosis volume ratios, indicating that the AI model could assist clinicians in determining the severity of pancreatic necrosis.

In addition to achieving high segmentation precision and specificity, our 3D V-net deep-learning model was also able to assess the pancreatic necrosis volume ratios significantly faster than the clinicians. On average, the model took 13 s to complete this task, while the clinicians took 2,196 s. These results highlight the potential of deep-learning models to assist physicians in clinical decision-making tasks, allowing for more efficient and accurate patient care. Further, the application of AI could address the issue of high resource consumption and reduce the economic costs associated with the long-term training of clinicians and imaging experts. AI also has the advantages of objectivity, accuracy, and strong consistency, and thus could assist clinicians with different levels of training and experience to accurately judge and identify the severity of AP.

The assessment of the severity of AP relies not only on common imaging evaluations but also on a range of clinical scoring systems and biomarkers, all of which hold significant value in clinical practice. For example, clinical scoring systems, such as the Ranson score, Acute Physiology and Chronic Health Evaluation II score, and Bedside Index of Severity in Acute Pancreatitis score, enable the rapid determination of disease severity based on the patient’s clinical presentation and laboratory data. These scoring methods are widely used in clinical practice due to their ability to provide timely and relatively accurate assessments, guiding early intervention and treatment strategies.

Moreover, the measurement of biomarkers, such as C-reactive protein (CRP), offers strong support in evaluating the severity of AP. CRP, an acute-phase reactant, is closely associated with the severity of inflammation, particularly in the early stages of AP. Changes in CRP levels can assist clinicians to quickly assess disease progression.

However, imaging evaluation has irreplaceable advantages. It enables the direct visualization of lesions, clarifies the severity of the disease, and permits the early detection of potentially severe complications, which can significantly improve treatment outcomes and reduce mortality. Imaging provides a clear depiction of a lesion's evolution, aiding in the assessment of treatment effectiveness and adjustments to the therapeutic approach (32). Systems like the MCTSI quantify the extent of pancreatic inflammation and necrosis, providing an objective metric for prognostic evaluation. This quantitative assessment helps predict disease progression and patient outcomes more accurately, offering a scientific basis for personalized treatment planning. Despite the higher costs and the need for specialized equipment and personnel, imaging evaluation is crucial in major decision-making processes. Therefore, imaging assessment remains indispensable in clinical workflows. The integration of our algorithms into clinical practice could potentially enhance the accuracy and speed of diagnosis while significantly reducing operational costs, which may increase the adoption of AI-assisted imaging systems among physicians.

This study had several limitations. First, it is still uncertain whether the explanations of our AI model can enhance skill development in clinical trainees; we intend to carry out a prospective trial among medical students in a controlled training environment to explore this matter. Second, the effectiveness of segmenting samples from under-represented classes was seriously affected by data imbalance. Third, part of our dataset was obtained from external sources, and the data lacked multicenter diversity.


Conclusions

In this study, we developed an automated CT pancreas segmentation AI model designed to assess the severity of AP by accurately diagnosing the pancreatic necrosis volume ratio. This model could have several potential benefits in clinical practice. First, the implementation of this AI model in a real-world clinical setting could significantly reduce the variability and subjectivity inherent in the manual interpretation of CT scans, thereby facilitating more consistent and reliable diagnoses. Such consistency could enhance the overall quality of healthcare delivery by facilitating more accurate treatment planning and improving patient outcomes. Further, the study revealed that the average diagnostic time for the five clinicians who examined 927 images was 2,196 s, while that of the AI system was only 13 s. The implementation of this paradigm could facilitate the diagnostic process, and reduce the time and effort required for clinicians to evaluate complex AP cases. By automating essential components of the diagnostic process, clinicians can redirect their attention to other aspects of patient care, potentially enhancing efficiency in a demanding healthcare setting.


Acknowledgments

The authors would like to thank He-Song Qiu, Jannifer Jia, and Eugene Chang, for their assistance with the artificial intelligence technology, and the participants for their time.

Funding: This study was supported by the Xiamen Science and Technology Bureau (Nos. 3502Z20199021 and 3502Z20214ZD1044).


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-24-841/rc

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-841/coif). S.J.M. is an employee of Jiying Technology Co., Ltd. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by the Ethics Committee of Zhongshan Hospital of Xiamen University (No. 2022-137). The requirement of informed consent from individual patients was waived because of the retrospective design of this study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Banks PA, Bollen TL, Dervenis C, Gooszen HG, Johnson CD, Sarr MG, Tsiotos GG, Vege SS; Acute Pancreatitis Classification Working Group. Classification of acute pancreatitis--2012: revision of the Atlanta classification and definitions by international consensus. Gut 2013;62:102-11. [Crossref] [PubMed]
  2. Gompertz M, Lara I, Fernández L, Miranda JP, Mancilla C, Watkins G, Palavecino P, Berger Z. Mortality of acute pancreatitis in a 20 years period. Rev Med Chil 2013;141:562-7. [Crossref] [PubMed]
  3. Zerem E. Treatment of severe acute pancreatitis and its complications. World J Gastroenterol 2014;20:13879-92. [Crossref] [PubMed]
  4. Beger HG, Rau BM. Severe acute pancreatitis: Clinical course and management. World J Gastroenterol 2007;13:5043-51. [Crossref] [PubMed]
  5. Alberti P, Pando E, Mata R, Vidal L, Roson N, Mast R, Armario D, Merino X, Dopazo C, Blanco L, Caralt M, Gomez C, Balsells J, Charco R. Evaluation of the modified computed tomography severity index (MCTSI) and computed tomography severity index (CTSI) in predicting severity and clinical outcomes in acute pancreatitis. J Dig Dis 2021;22:41-8. [Crossref] [PubMed]
  6. Mohey N, Hassan TA. Correlation between modified CT severity index and retroperitoneal extension using the interfascial planes in the grading of clinically suspected acute severe pancreatitis. Egyptian Journal of Radiology and Nuclear Medicine 2020;51:1-10. [Crossref]
  7. Du J, Zhang J, Zhang X, Jiang R, Fu Q, Yang G, Fan H, Tang M, Chen T, Li X, Zhang X. Computed tomography characteristics of acute pancreatitis based on different etiologies at different onset times: a retrospective cross-sectional study. Quant Imaging Med Surg 2022;12:4448-61. [Crossref] [PubMed]
  8. Mortele KJ, Wiesner W, Intriere L, Shanker S, Zou KH, Kalantari BN. A modified CT severity index for evaluating acute pancreatitis. AJR Am J Roentgenol. 2004;183:1261-5. [Crossref] [PubMed]
  9. Banday IA, Gattoo I, Khan AM, Javeed J, Gupta G, Latief M. Modified Computed Tomography Severity Index for Evaluation of Acute Pancreatitis and its Correlation with Clinical Outcome: A Tertiary Care Hospital Based Observational Study. J Clin Diagn Res 2015;9:TC01-5. [Crossref] [PubMed]
  10. Harshit Kumar A, Singh Griwan M. A comparison of APACHE II, BISAP, Ranson's score and modified CTSI in predicting the severity of acute pancreatitis based on the 2012 revised Atlanta Classification. Gastroenterol Rep (Oxf) 2018;6:127-31. [Crossref] [PubMed]
  11. Karasawa K, Oda M, Kitasaka T, Misawa K, Fujiwara M, Chu C, Zheng G, Rueckert D, Mori K. Multi-atlas pancreas segmentation: Atlas selection based on vessel structure. Med Image Anal 2017;39:18-28. [Crossref] [PubMed]
  12. Li H, Li J, Lin X, Qian X. A Model-Driven Stack-Based Fully Convolutional Network for Pancreas Segmentation. 2020 5th International Conference on Communication, Image and Signal Processing (CCISP); 2020: IEEE.
  13. Antwi K, Wiesner P, Merkle EM, Zech CJ, Boll DT, Wild D, Christ E, Heye T. Investigating difficult to detect pancreatic lesions: Characterization of benign pancreatic islet cell tumors using multiparametric pancreatic 3-T MRI. PLoS One 2021;16:e0253078. [Crossref] [PubMed]
  14. Farag A, Lu L, Roth HR, Liu J, Turkbey E, Summers RM. A Bottom-Up Approach for Pancreas Segmentation Using Cascaded Superpixels and (Deep) Image Patch Labeling. IEEE Trans Image Process 2017;26:386-99. [Crossref] [PubMed]
  15. Cai J, Lu L, Xing F, Yang L. Pancreas segmentation in CT and MRI images via domain specific network designing and recurrent neural contextual learning. arXiv preprint arXiv:180311303. 2018.
  16. Lim SH, Kim YJ, Park YH, Kim D, Kim KG, Lee DH. Automated pancreas segmentation and volumetry using deep neural network on computed tomography. Sci Rep 2022;12:4075. [Crossref] [PubMed]
  17. Si K, Xue Y, Yu X, Zhu X, Li Q, Gong W, Liang T, Duan S. Fully end-to-end deep-learning-based diagnosis of pancreatic tumors. Theranostics 2021;11:1982-90. [Crossref] [PubMed]
  18. Cai J, Guo X, Wang K, Zhang Y, Zhang D, Zhang X, Wang X. Automatic quantitative evaluation of normal pancreas based on deep learning in a Chinese adult population. Abdom Radiol (NY) 2022;47:1082-90. [Crossref] [PubMed]
  19. Weston AD, Korfiatis P, Philbrick KA, Conte GM, Kostandy P, Sakinis T, Zeinoddini A, Boonrod A, Moynagh M, Takahashi N, Erickson BJ. Complete abdomen and pelvis segmentation using U-net variant architecture. Med Phys 2020;47:5609-18. [Crossref] [PubMed]
  20. Milletari F, Navab N, Ahmadi S-A, editors. V-net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 fourth international conference on 3D vision (3DV); 2016: IEEE.
  21. Li F, Cai S, Cao F, Chen R, Fu D, Ge C, et al. Guidelines for the diagnosis and treatment of acute pancreatitis in China (2021). J Pancreatol 2021;4:67-75. [Crossref]
  22. Boxhoorn L, Voermans RP, Bouwense SA, Bruno MJ, Verdonk RC, Boermeester MA, van Santvoort HC, Besselink MG. Acute pancreatitis. Lancet 2020;396:726-34. [Crossref] [PubMed]
  23. Roth HR, Farag A, Turkbey E, Lu L, Liu J, Summers RM. Data from Pancreas-CT. The Cancer Imaging Archive; 2016.
  24. Yushkevich PA, Gao Y, Gerig G. ITK-SNAP: An interactive tool for semi-automatic segmentation of multi-modality biomedical images. Annu Int Conf IEEE Eng Med Biol Soc 2016;2016:3342-5. [Crossref] [PubMed]
  25. Patel O, Maravi YP, Sharma S. A comparative study of histogram equalization based image enhancement techniques for brightness preservation and contrast enhancement. arXiv preprint arXiv:13114033. 2013.
  26. Cao F, Mei W, Li F. Timing, approach, and treatment strategies for infected pancreatic necrosis: a narrative review. J Pancreatol 2022;5:159-63. [Crossref]
  27. Ronneberger O, Fischer P, Brox T, editors. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18; 2015: Springer.
  28. Zhang Y, Liu S, Li C, Wang J. Rethinking the dice loss for deep learning lesion segmentation in medical images. Journal of Shanghai Jiaotong University (Science) 2021;26:93-102. [Crossref]
  29. Prencipe B, Altini N, Cascarano GD, Brunetti A, Guerriero A, Bevilacqua V. Focal dice loss-based V-Net for liver segments classification. Applied Sciences 2022;12:3247. [Crossref]
  30. Taghanaki SA, Zheng Y, Kevin Zhou S, Georgescu B, Sharma P, Xu D, Comaniciu D, Hamarneh G. Combo loss: Handling input and output imbalance in multi-organ segmentation. Comput Med Imaging Graph 2019;75:24-33. [Crossref] [PubMed]
  31. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci U S A 2019;116:22071-80. [Crossref] [PubMed]
  32. Zhao Y, Wei J, Xiao B, Wang L, Jiang X, Zhu Y, He W. Early prediction of acute pancreatitis severity based on changes in pancreatic and peripancreatic computed tomography radiomics nomogram. Quant Imaging Med Surg 2023;13:1927-36. [Crossref] [PubMed]
Cite this article as: Lu CX, Zhou J, Feng YC, Meng SJ, Guo XL, Su WS, Ngo T, Hsu TH, Lin P, Huang J, Liu ST, Palacio MLB, Change WL, Qin G, Hu YQ, Zhan LH. Artificial intelligence models assisting physicians in quantifying pancreatic necrosis in acute pancreatitis. Quant Imaging Med Surg 2025;15(1):135-148. doi: 10.21037/qims-24-841
